Accepted papers of the 9th International Biomedical Congress
Machine Learning-Driven Discovery of Differentially Expressed Genes as Potential Breast Cancer Biomarkers
Mohamadamir Kakaee,1,*
1. Shahid Beheshti University of Medical Sciences
Introduction: Breast cancer remains a leading cause of mortality worldwide, with its molecular heterogeneity posing significant challenges for early detection and personalized treatment. RNA-sequencing (RNA-seq) technology provides comprehensive gene expression profiles, enabling the identification of differentially expressed genes (DEGs) as potential biomarkers. Despite advances in genomic profiling, the integration of machine learning with RNA-seq data for biomarker discovery remains underexplored, representing a critical gap in current research. This study leverages the GSE96058 dataset, containing RNA-seq data from 343 samples (152 breast cancer and 191 control), to uncover novel DEGs with high discriminatory power. By employing differential expression analysis and a Random Forest model, we aim to address this gap, offering insights into breast cancer molecular mechanisms and potential diagnostic tools. The need for precise biomarkers is urgent, as early detection can significantly improve patient outcomes. Our approach combines computational strategies with genomic data, contributing to the development of personalized medicine by identifying candidate biomarkers for further clinical validation. This work builds on previous studies that have highlighted the role of genes like ESR1 and GATA3 in breast cancer progression, but seeks to enhance these findings through advanced machine learning techniques.
Methods: RNA-seq data were sourced from the GSE96058 dataset, available through the Gene Expression Omnibus (GEO), comprising 343 samples (152 cancerous and 191 healthy). The analysis was conducted in Python 3.9, using the scanpy library for preprocessing and differential expression analysis. Data were first normalized to a target sum of 10,000 counts per sample and then log-transformed to stabilize variance. We selected the 2,000 most highly variable genes to focus on the most informative features. Differential expression analysis was performed using the Wilcoxon rank-sum test, identifying DEGs at a significance threshold of p < 0.05. A Random Forest classifier was then trained on 80% of the data, with the remaining 20% reserved for testing, to rank genes by their contribution to classifying cancerous versus healthy samples. Model performance was evaluated by accuracy, which reached approximately 90%. Visualizations, including volcano plots and heatmaps, were generated to highlight significant DEGs. The Random Forest model prioritized key genes, such as ESR1 and GATA3, which are biologically relevant to breast cancer. All code and results are documented for reproducibility, ensuring transparency in the computational pipeline. This methodology integrates RNA-seq’s high-resolution data with machine learning’s predictive power, offering a robust framework for biomarker discovery.
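The steps described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the study's code. It runs on synthetic Poisson counts rather than GSE96058, and uses plain NumPy, SciPy, and scikit-learn instead of scanpy so that it stays self-contained. The signal genes, the variance-based gene selection (a simplification of highly-variable-gene selection), the number of retained genes, the stratified split, and the forest size are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x genes count matrix; NOT the GSE96058 data.
n_cancer, n_control, n_genes = 152, 191, 2000
labels = np.array([1] * n_cancer + [0] * n_control)
lam = np.full((n_cancer + n_control, n_genes), 20.0)
lam[labels == 1, :10] = 60.0  # hypothetical signal: genes 0-9 are higher in "cancer"
counts = rng.poisson(lam).astype(float)

# 1) Scale every sample to a total of 10,000 counts, then log-transform.
totals = counts.sum(axis=1, keepdims=True)
log_expr = np.log1p(counts / totals * 1e4)

# 2) Keep the most variable genes (plain variance ranking; an assumption,
#    simpler than scanpy's highly-variable-gene procedure).
n_top = 200
top_idx = np.argsort(log_expr.var(axis=0))[::-1][:n_top]
X = log_expr[:, top_idx]

# 3) Wilcoxon rank-sum test per gene, cancer versus control, at p < 0.05.
pvals = np.array([
    ranksums(X[labels == 1, j], X[labels == 0, j]).pvalue
    for j in range(n_top)
])
deg_mask = pvals < 0.05

# 4) Random Forest on an 80/20 split (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# 5) Rank the retained genes by feature importance (original gene indices).
ranked = top_idx[np.argsort(clf.feature_importances_)[::-1]]

print(f"DEGs at p < 0.05: {int(deg_mask.sum())} of {n_top}")
print(f"Test accuracy: {accuracy:.2f}")
print("Top-ranked gene indices:", ranked[:5].tolist())
```

Because the synthetic signal is strong, the accuracy here says nothing about real data. On a real dataset the raw p-values would usually also be adjusted for multiple testing before genes are called significant.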
Results: The differential expression analysis identified 142 DEGs with p < 0.05, providing a rich dataset for further exploration. The Random Forest model achieved a classification accuracy of 90%, demonstrating its effectiveness in distinguishing cancerous from healthy samples. Among the top DEGs, ESR1 (p = 3.2e-10, Log2FoldChange = 2.85), GATA3 (p = 7.1e-09, Log2FoldChange = 2.63), and FOXA1 (p = 1.4e-08, Log2FoldChange = -1.92) emerged as potential biomarkers, consistent with their known roles in breast cancer biology. These genes were also supported by their feature importance scores, with ESR1 and GATA3 showing upregulation and FOXA1 exhibiting downregulation in cancerous samples. Volcano plots revealed a clear separation of significant genes, with upregulated DEGs clustered in the top-right quadrant and downregulated ones in the top-left. A heatmap of the top 20 DEGs further illustrated expression patterns across the 343 samples, highlighting variability and potential diagnostic signatures. These findings underscore the power of integrating RNA-seq with machine learning, identifying biologically relevant candidates that could enhance early detection and personalized treatment strategies for breast cancer.
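As a small worked example of how the reported values map onto a volcano plot, the sketch below places the three genes at x = log2 fold change and y = -log10(p). The p-values and fold changes are those stated in the Results. The function name and labels are illustrative, and the rule (p < 0.05 plus the sign of the fold change) is an assumption, since the abstract gives no fold-change cutoff.

```python
import math

# Values as reported in the Results: gene -> (p-value, log2 fold change).
reported = {
    "ESR1": (3.2e-10, 2.85),
    "GATA3": (7.1e-09, 2.63),
    "FOXA1": (1.4e-08, -1.92),
}

P_THRESHOLD = 0.05  # significance threshold stated in the Methods


def volcano_point(p_value, log2_fc):
    """Return (x, y, label) for one gene on a volcano plot."""
    y = -math.log10(p_value)  # smaller p-values sit higher on the plot
    if p_value >= P_THRESHOLD:
        label = "not significant"
    elif log2_fc > 0:
        label = "upregulated (top-right)"
    else:
        label = "downregulated (top-left)"
    return log2_fc, y, label


points = {gene: volcano_point(p, fc) for gene, (p, fc) in reported.items()}

for gene, (x, y, label) in points.items():
    # 2 ** |log2FC| converts the log2 value back to a linear fold change.
    print(f"{gene}: log2FC={x:+.2f}, -log10(p)={y:.2f}, "
          f"linear fold={2 ** abs(x):.2f}, {label}")
```

A log2 fold change of 1 corresponds to a doubling, so positive values fall to the right of the vertical axis and negative values to the left.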
Conclusion: This study demonstrates the efficacy of combining RNA-seq data with machine learning to identify candidate breast cancer biomarkers. The identification of 142 DEGs, with ESR1, GATA3, and FOXA1 as top candidates, highlights their potential for diagnostic applications, pending validation in independent cohorts. The Random Forest model’s 90% accuracy supports the robustness of this approach, offering a promising avenue for improving early detection and prognosis. However, future work should incorporate multi-omics data (e.g., proteomics or epigenomics) to enhance biomarker discovery and validate these findings clinically. The integration of computational and genomic techniques addresses a critical need in personalized medicine, paving the way for tailored treatment strategies. While this study provides a solid foundation, larger datasets and longitudinal studies are recommended to confirm the clinical utility of these biomarkers, ultimately contributing to better outcomes for breast cancer patients.