A team from the Faculty of Medicine and Health Sciences and the Institute of Neurosciences at the University of Barcelona (UBneuro) has applied advanced artificial intelligence techniques to better understand why Huntington's disease can begin at very different ages in patients. This hereditary neurodegenerative condition, which leads to motor, cognitive, and psychiatric impairments, is caused by a mutation in the HTT gene, which encodes the huntingtin protein.
The mutation is an expansion of CAG repeats in this gene that alters the properties and functionality of the huntingtin protein in the brain. Although the length of the CAG repeats in the HTT gene influences the age at which the first symptoms appear, this factor does not fully explain the wide variability observed in disease onset among patients. The study analyzes which additional genetic factors might play an important role in determining when the disease begins in affected individuals.
The article has been published in the proceedings of the 20th Machine Learning in Computational Biology meeting (MLCB, 2025), one of the most internationally recognized scientific forums exploring the frontiers of knowledge between machine learning and computational biology. The study was led by the research group of Ramón y Cajal investigator Jordi Abante (UB-UBneuro), with master's student Caterina Fuses as the first author of the paper.
The study represents a pioneering application of an artificial intelligence language model to genomic information for multimodal genotype-to-phenotype prediction. It outlines a new framework for investigating complex genetic diseases and demonstrates how multimodal machine learning can help uncover biologically meaningful patterns that are difficult to detect with conventional methods.
Beyond traditional statistical techniques
In the study, the researchers used nonlinear machine learning models, such as tree-based models and graph neural networks (GNNs), to identify genetic modifiers, that is, genes that can delay or accelerate disease onset depending on a patient's genetic background. Unlike traditional statistical approaches, these models can detect complex interactions between genes and reveal effects that depend on the length of the CAG triplet expansion.
To make the analysis more efficient and interpretable, the team also developed a method to compress genetic information using gene-specific neural networks, which reduced computational costs without losing predictive power. Additionally, the team incorporated changes in gene expression predicted by a state-of-the-art genomic language model. This innovation allowed them to link regulatory DNA variants with changes in gene activity in the brain regions affected by the disease.
As part of the study, the researchers analyzed genetic data from more than 9,000 patients with Huntington's disease. This enabled the team to identify both previously known modifier factors related to DNA repair and new candidate genes involved in processes such as transcriptional regulation and cellular metabolism. Notably, the results show for the first time that different biological mechanisms can influence disease onset in patients with shorter versus longer CAG expansions, thus revealing the context-dependent nature of these genetic effects.
"This study shows that the genetic factors modifying Huntington's disease are not universal, but largely depend on genetic context. Using nonlinear and multimodal machine learning, we can uncover interactions that were essentially invisible with traditional approaches."
Jordi Abante, principal investigator of the study, professor at the UB's Faculty of Medicine and Health Sciences, and member of UBneuro
"This approach could also be applied to other hereditary and neurodegenerative disorders, potentially opening new avenues for research and, in the future, more personalized therapeutic strategies," Abante concludes.