- Letter to the Editor
- Open access
Letter: the limitations of gene set-based predictive models: a critical assessment
Journal of Translational Medicine volume 23, Article number: 521 (2025)
Predefined gene set-based predictive models are increasingly applied in translational bioinformatics [1]. Many studies have developed models based on gene sets associated with specific biological processes, such as ferroptosis (728 genes) [2] and lactylation (374 genes) [3]. This approach has led to a surge in related research, with more than 1,000 such studies reported in 2024 alone. However, the biological specificity and validity of many gene set models have not been rigorously established, raising concerns about their scientific and clinical value.
To critically evaluate the robustness of such models, we analyzed the three cancers with the highest mortality, namely lung, colorectal, and liver cancer [4], using data from The Cancer Genome Atlas (TCGA). Univariate Cox regression for each cancer type identified genes significantly associated with patient survival: 2,011 for colorectal cancer, 2,211 for lung cancer, and 6,324 for liver cancer. We then randomly selected gene sets ranging in size from 20 to 500 genes, repeating the process 100 times for each cancer. Nearly all randomly selected gene sets successfully stratified patients by survival outcome. Moreover, model performance improved with gene set size: larger gene sets yielded higher AUC values, larger hazard ratios, and more significant log-rank p-values across the three cancer types (see Fig. 1). These results suggest that the success of many gene set-based models may stem more from statistical chance or model flexibility than from real biological mechanisms.
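The random gene set experiment described above can be illustrated with a minimal, self-contained simulation. The sketch below is not the analysis code behind this letter. It uses a synthetic cohort rather than TCGA data, assumes that random sets are drawn from a pool of genes that each weakly track a single latent risk (standing in for genes that pass univariate Cox filtering), scores patients by summed expression with a median split, and reports only the log-rank test, not AUC or hazard ratios. All function names, the cohort size, the loading of 0.15, and the censoring rate are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk, n1_at_risk = len(order), sum(groups)
    obs_minus_exp = variance = 0.0
    i = 0
    while i < len(order):
        j, d, d1, left1 = i, 0, 0, 0
        # Gather every patient sharing this observed time (handles ties).
        while j < len(order) and times[order[j]] == times[order[i]]:
            k = order[j]
            d += events[k]
            d1 += events[k] * groups[k]
            left1 += groups[k]
            j += 1
        if d > 0 and n_at_risk > 1:
            p1 = n1_at_risk / n_at_risk
            obs_minus_exp += d1 - d * p1
            variance += d * p1 * (1 - p1) * (n_at_risk - d) / (n_at_risk - 1)
        n_at_risk -= j - i
        n1_at_risk -= left1
        i = j
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def simulate_cohort(n_patients, n_genes, rng, loading=0.15):
    """Synthetic cohort in which every gene weakly tracks one latent risk."""
    risk = [rng.gauss(0, 1) for _ in range(n_patients)]
    expr = [[loading * r + rng.gauss(0, 1) for _ in range(n_genes)] for r in risk]
    times, events = [], []
    for r in risk:
        t_event = rng.expovariate(math.exp(r))  # hazard rises with latent risk
        t_censor = rng.expovariate(0.3)         # independent censoring (assumed rate)
        times.append(min(t_event, t_censor))
        events.append(int(t_event <= t_censor))
    return expr, times, events


def run_experiment(sizes=(20, 100, 300), n_repeats=50, seed=0):
    """Score random gene sets of several sizes on one synthetic cohort."""
    rng = random.Random(seed)
    expr, times, events = simulate_cohort(150, 600, rng)
    n_genes = len(expr[0])
    results = {}
    for size in sizes:
        chi2_values = []
        for _ in range(n_repeats):
            gene_set = rng.sample(range(n_genes), size)
            scores = [sum(row[g] for g in gene_set) for row in expr]
            cutoff = statistics.median(scores)
            groups = [int(s > cutoff) for s in scores]  # 1 = high-score group
            chi2_values.append(logrank_chi2(times, events, groups))
        n_significant = sum(chi2_1df_pvalue(c) < 0.05 for c in chi2_values)
        results[size] = {
            "frac_significant": n_significant / n_repeats,
            "median_chi2": statistics.median(chi2_values),
        }
    return results


results = run_experiment()
for size, summary in results.items():
    print(size, summary)
```

Under these assumptions, one would expect most random sets to separate the two groups and the log-rank statistic to grow with set size, because summing more weakly informative genes recovers the latent risk more precisely, with no pathway-specific biology involved.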
Fig. 1 The relationship between gene set size and prognostic model performance across three cancer types. Correlation analysis was used to examine this relationship, with all analyses repeated 100 times. For colorectal cancer, panel (A) shows the relationship between gene set size and AUC, (B) the relationship between gene set size and log-rank p-value, and (C) the relationship between gene set size and hazard ratio (HR). Panels (D), (E), and (F) show the corresponding relationships for lung cancer, and panels (G), (H), and (I) those for liver cancer.
Our analysis reveals that arbitrary gene sets can easily achieve significant survival stratification, highlighting a major issue with gene set models. While many studies include some experimental validation of their models, the biological specificity of these models is often insufficiently explored. Statistically significant models may not necessarily reflect the true disease mechanisms. In practice, predefined gene sets are often used to construct predictive models without rigorous functional validation, raising concerns that their performance may be driven by statistical artifacts, cohort-specific biases, or overfitting, rather than true disease pathology. This issue is not limited to prognostic models, but also affects classification models used for disease diagnosis, immunotherapy response prediction, and chemo-radiotherapy sensitivity assessment. Even if these models show statistical significance, they may lack real clinical relevance. Despite these shortcomings, gene set models can still serve as valuable exploratory tools in some contexts, such as generating hypotheses about disease-related pathways or guiding basic research.
Another significant limitation of current predictive modeling is the lack of normal (healthy) control data for comparison. In TCGA, most tumor datasets lack matched normal tissue samples. The absence of normal controls introduces systematic biases and makes it difficult to distinguish disease-specific molecular signals from background noise, further reducing the biological relevance of computational models. To improve the reliability and interpretability of gene set-based predictions, future studies should include diverse patient cohorts and ensure proper matching with normal control samples.
Even if a gene set model performs well statistically, it must undergo comprehensive biological validation to be considered clinically meaningful. Many computational biomarkers that initially seemed promising have failed to translate into clinical practice because they lacked mechanistic validation or functional assays. To avoid this pitfall, future research must validate predictive models across independent cohorts and confirm findings through laboratory experiments. Integrating multi-omics data (such as genomics, transcriptomics, proteomics, and metabolomics) can provide a more holistic understanding of disease biology, help identify real molecular interactions, reduce false positives, and refine disease subtype classification. Direct experimental validation, such as functional assays and single-cell transcriptomic analysis, can verify whether the genes in a model truly affect disease biology. Moreover, stricter statistical methods, such as permutation testing, cross-validation, and independent dataset validation, should be applied to prevent overfitting and to ensure that the model’s predictive power generalizes beyond the original cohort.
The widespread application of gene set-based predictive models highlights the urgent need for more rigorous standards in biomarker discovery and validation.
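One simple way to put the stricter statistical checks mentioned above into practice is to benchmark a candidate gene set against random gene sets of the same size, a close relative of permutation testing. The sketch below assumes a user-supplied scoring function (hypothetical here; in practice it might return a cross-validated C-index or AUC for a model built from the given genes) and returns an empirical p-value with the usual add-one correction. The gene names and per-gene strengths are invented solely to make the example runnable.

```python
import random


def random_set_benchmark(candidate, gene_pool, score_fn, n_random=1000, seed=0):
    """Compare a candidate gene set with size-matched random gene sets.

    Returns the candidate's score and an empirical p-value: the share of
    random sets scoring at least as well, with an add-one correction.
    """
    rng = random.Random(seed)
    observed = score_fn(candidate)
    null_scores = [
        score_fn(rng.sample(gene_pool, len(candidate))) for _ in range(n_random)
    ]
    n_as_good = sum(score >= observed for score in null_scores)
    return observed, (n_as_good + 1) / (n_random + 1)


# Toy data (hypothetical): one made-up "prognostic strength" per gene.
toy_rng = random.Random(42)
gene_pool = [f"gene_{i}" for i in range(200)]
strength = {gene: abs(toy_rng.gauss(0, 1)) for gene in gene_pool}


def toy_score(gene_set):
    """Stand-in for a real performance metric such as a cross-validated AUC."""
    return sum(strength[gene] for gene in gene_set)


ranked = sorted(gene_pool, key=strength.get, reverse=True)
best_set, arbitrary_set = ranked[:20], gene_pool[:20]

for name, genes in [("top-ranked", best_set), ("arbitrary", arbitrary_set)]:
    score, p_value = random_set_benchmark(genes, gene_pool, toy_score)
    print(f"{name}: score={score:.2f}, empirical p={p_value:.3f}")
```

A gene set whose performance is indistinguishable from that of random sets of the same size provides little evidence of biological specificity, however significant its nominal p-value.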
Researchers should prioritize biologically interpretable and mechanistically grounded findings over mere statistical significance. Journals and funding agencies also play a crucial role in enforcing higher requirements, such as ensuring computational models are supported by experimental or clinical validation. By moving beyond purely statistical associations and focusing on biological relevance, we can ensure that new predictive models truly advance disease biology and ultimately improve patient outcomes.
Data availability
Not applicable.
References
1. Zhou W, Bi W, Zhao Z, Dey KK, Jagadeesh KA, Karczewski KJ, et al. SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat Genet. 2022;54(10):1466–9.
2. Zhou N, Yuan X, Du Q, Zhang Z, Shi X, Bao J, et al. FerrDb V2: update of the manually curated database of ferroptosis regulators and ferroptosis-disease associations. Nucleic Acids Res. 2022;51(D1):D571–82.
3. Cheng Z, Huang H, Li M, Liang X, Tan Y, Chen Y. Lactylation-related gene signature effectively predicts prognosis and treatment responsiveness in hepatocellular carcinoma. Pharmaceuticals (Basel). 2023;16(5).
4. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74(3):229–63.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
Xianqiang Liu: Conceptualization, Methodology, Validation, Writing - Original Draft. Wenbo Zhao: Software, Validation, Data Curation, Visualization. Lijing Yang: Investigation, Writing - Review & Editing. Min Jiang: Investigation, Writing - Review & Editing.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Ethical approval was not applicable for this study as publicly available data were used for the analysis.
Consent for publication
Not applicable.
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, L., Zhao, W., Jiang, M. et al. Letter: the limitations of gene set-based predictive models: a critical assessment. J Transl Med 23, 521 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12967-025-06476-5
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12967-025-06476-5