• Deep Learning for Cortical Lesion Detection in Ultra-High-Field MRI of Multiple Sclerosis Patients: A Systematic Review of Annotation Strategies and Performance Metrics
  • Alireza Pourrahim,1,* Mohammad Mahdi Pourrahim,2 Abdollah Karimi,3 Alireza Vasiee,4 Omid Raiesi,5
    1. Student Research Committee, Faculty of Medicine, Ilam University of Medical Sciences, Ilam, Iran.
    2. Student Research Committee, Faculty of Medicine, Ilam University of Medical Sciences, Ilam, Iran.
    3. Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran.
    4. Department of Nursing, Faculty of Nursing and Midwifery, Ilam University of Medical Sciences, Ilam, Iran.
    5. Department of Parasitology, School of Allied Medical Sciences, Ilam University of Medical Sciences, Ilam, Iran.


  • Introduction: Cortical lesions (CLs) in multiple sclerosis (MS) are increasingly recognized as diagnostically and prognostically important, yet they are small, sparse, and frequently occult on conventional MRI. Ultra-high-field (7T) MRI and advanced contrasts (e.g., MP2RAGE, FLAWS, synthetic multi-contrast) increase CL conspicuity, enabling the development of deep learning (DL) methods for automated detection and segmentation. However, heterogeneity in annotation strategies (single-rater vs consensus; multi-contrast vs single-contrast), evaluation metrics (detection rate, F1-score, DSC), minimum lesion-size thresholds, and validation paradigms (internal cross-validation vs external/out-of-domain testing) impedes cross-study comparison and clinical translation. We systematically reviewed DL studies applied to ultra-high-field and advanced-contrast MRI for MS cortical lesion detection, focusing on annotation approaches and aggregated performance metrics.
  • Methods: A comprehensive literature search was performed in accordance with PRISMA guidelines. PubMed/MEDLINE, Embase, Scopus, and Web of Science were searched through May 2025. The search strategy combined keywords and corresponding MeSH terms, including “cortical lesion”, “ultra-high field”, “7T”, “MP2RAGE”, “FLAWS”, “deep learning”, “convolutional neural network”, “nnU-Net”, “segmentation”, and “multiple sclerosis”, using Boolean operators. Two reviewers independently screened titles/abstracts and extracted data in duplicate. Extracted fields included sample size, MRI contrast(s), annotation strategy, minimum lesion volume, validation paradigm, and primary performance metrics (lesion-wise true positive rate/detection rate, false-positive rate, F1-score, and Dice similarity coefficient where reported). Study quality and risk of bias for diagnostic AI tasks were assessed using a modified QUADAS-2 framework emphasizing annotation reliability and external validation; key sources of bias recorded were single-rater ground truth, limited external testing, and heterogeneous lesion-size thresholds.
  • Results: The review included five studies (aggregate reported sample counts, excluding one large trial dataset, reached ≥1,000 scans/patients where reported). Aggregating lesion-wise detection and benchmark metrics across studies reporting comparable measures yields the following central estimates: mean lesion-wise detection (true positive) rate ≈ 74.5% (range 67%–86%), mean false-positive rate ≈ 26.8% (range 8.4%–42%), mean in-domain F1-score ≈ 0.654 (from the studies reporting in-domain F1: 0.64 and 0.667), and mean out-of-domain F1 ≈ 0.525 (from reported values of 0.50 and 0.55). Reported lesion-wise detection depended strongly on the minimum lesion-size threshold (higher detection and lower FPR for thresholds ≥6 μL than for thresholds as low as 0.75 μL) and on annotation strategy: consensus multi-rater annotations and multi-contrast labeling (e.g., MP2RAGE + FLAWS or MMCLE) consistently produced higher median F1 and improved generalization compared with single-rater or single-contrast training. Domain adaptation and multi-site/out-of-distribution testing produced a measurable drop in performance: for example, an in-domain F1 of 0.64 declined to 0.50 out-of-domain (an ≈21.9% relative reduction). Transformer- and nnU-Net–based architectures yielded comparable segmentation capacity when combined with robust multi-contrast inputs and consensus labels. Notably, studies that used multi-contrast synthetic enhancement (MMCLE, FLAWS) or MP2RAGE×4 reported the highest lesion-wise detection (up to 86% TPR) and the lowest FPR (≈8.4%) in their best configurations. Reporting of segmentation Dice scores and uncertainty quantification was inconsistent, precluding pooled DSC estimates.
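The pooled F1 estimates and the relative out-of-domain decline quoted above follow from simple arithmetic over the study-level values given in the text; a minimal sketch, using only those reported numbers (no additional data assumed):

```python
# Reproduce the pooled estimates from the study-level values reported
# in the Results: in-domain F1 scores, out-of-domain F1 scores, and the
# relative decline for the study reporting both.

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

f1_in = [0.64, 0.667]   # in-domain F1 scores, as reported
f1_out = [0.50, 0.55]   # out-of-domain F1 scores, as reported

mean_f1_in = mean(f1_in)    # ~0.654
mean_f1_out = mean(f1_out)  # 0.525

# Relative out-of-domain decline: (0.64 - 0.50) / 0.64 = 0.21875, i.e. ~21.9%
rel_drop = (0.64 - 0.50) / 0.64

print(f"mean in-domain F1:     {mean_f1_in:.3f}")
print(f"mean out-of-domain F1: {mean_f1_out:.3f}")
print(f"relative F1 decline:   {rel_drop:.1%}")
```

The detection-rate and FPR means (≈74.5% and ≈26.8%) would be pooled the same way, but the per-study values behind them are not itemized in the text, so they are not recomputed here.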
  • Conclusion: Across contemporary DL approaches for cortical lesion detection in advanced MRI, pooled lesion-wise detection approximates 75%, with a substantial, study-dependent false-positive burden (mean ≈27%). Best performance is observed when models are trained on multi-contrast inputs with consensus multi-rater annotations and when domain adaptation or external validation is incorporated. Performance degrades out-of-domain (≈22% relative F1 decline), underscoring the need for standardized annotation protocols, minimum lesion-size reporting, and multi-site external validation. Future work should prioritize harmonized annotation standards, lesion-size thresholds, uncertainty estimation, and open benchmarks to accelerate clinical translation.
  • Keywords: Multiple sclerosis; cortical lesions; 7T MRI; deep learning; nnU-Net; F1-score