• Applications of Artificial Intelligence in Next-Generation Sequencing
  • Behdokht Fathi Dizaji,1,*


  • Introduction: Next-generation sequencing (NGS) technologies have revolutionized our understanding of the genome by enabling high-throughput analysis of DNA. NGS produces massive datasets, such as The Cancer Genome Atlas (TCGA). However, the volume and complexity of NGS data, spanning raw signal processing, alignment, variant detection, and interpretation, pose substantial difficulties for conventional computational approaches. Artificial intelligence (AI), especially machine learning (ML) and deep learning (DL), has emerged as a powerful tool for addressing these issues. This review discusses the key stages of the NGS workflow in which AI can be applied.
  • Methods: The ScienceDirect and PubMed databases were searched for literature to provide an overview of recent advancements in AI applications for NGS of the human genome.
  • Results: The initial step in NGS is base calling, in which raw electrical or optical signals are converted into nucleotide sequences (A, T, C, G). Conventional methods often struggle with noise and errors, particularly in long-read technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). AI, especially DL models, can improve this process by learning patterns from large datasets to increase accuracy. The AI tools Bonito and Guppy have used convolutional neural networks (CNNs) in conjunction with connectionist temporal classification (CTC) decoders on ONT data to achieve higher consensus accuracy. DeepConsensus, a transformer-based encoder, handles gaps in PacBio sequences and improves sequence quality in human genome studies. A 2023 benchmark study compared DL models for nanopore base calling and demonstrated error reductions of up to 20% over conventional approaches. These advancements are crucial for human genome projects, where precise base calling is an essential foundation for downstream analyses. Read alignment is the mapping of sequenced reads to the human reference genome. Conventional aligners such as BWA or Bowtie use heuristic methods, while AI-enhanced hybrid pipelines, including PEPPER-Margin-DeepVariant, use CNNs for consensus-based alignment that is particularly effective in resolving repetitive regions of the human genome. Tools like DAVI use CNNs and recurrent neural networks (RNNs) to handle alignments in variant-rich regions, which reduces mismatches in human exome data. Such approaches provided faster processing of large datasets, as indicated in the UK Biobank study of 500,000 individuals, although computational intensity remains a challenge because AI models typically require GPUs. Variant detection identifies genetic variations, such as single-nucleotide variants (SNVs), insertions/deletions (InDels), and structural variants, which are essential for understanding diseases.
AI helps distinguish true variants from sequencing artifacts, especially in noisy long-read data. DeepVariant, a CNN-based tool, treats read pileups as images and is reported to achieve 99.9% accuracy for SNVs in Illumina and PacBio data. DeepTrio supports rare disease diagnosis by incorporating familial information to detect de novo mutations. Clair3, a hybrid CNN-RNN model, detects variants in low-coverage long-read data, reaching 99.9% SNP accuracy. DNAscope combines conventional haplotype calling with ML genotyping and runs without GPUs, yielding efficient analysis. For genome assembly, AI assists polishing and error correction. Tools like polishCLR integrate PacBio and Illumina data for high-quality human assemblies, while Medaka uses neural networks for haploid calling, particularly for ONT data. AI tools often outperform classic tools such as GATK, which rely on statistical models; they reach F1 scores above 99% on high-coverage data, although ONT InDels remain challenging, with accuracy ranging from 76% to 99%. AI also interprets NGS data to uncover biological meaning, such as gene expression patterns and epigenetic modifications, which is crucial for clinical genetics. SpliceAI uses residual neural networks to predict splicing from pre-mRNA sequences and to annotate variants. Epigenetic analysis employs DeepHistone to predict histone modifications and DeepCpG to predict methylation, enabling the identification of functional SNPs in human genomes. ChromoGen predicts chromatin conformations in single cells. In clinical settings, AI refines cancer management through NGS analysis. DeepVariant and AlphaMissense classify variants in tumor genomes, achieving 91-98% sensitivity in liquid biopsies for early detection. Tools like Face2Gene use DL for phenotype analysis and integrate with NGS to diagnose rare genetic disorders. Studies emphasize AI's role in precision medicine, including the prediction of drug responses and tumor subtypes; however, integration with multi-omics data remains challenging.
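The CTC decoding step mentioned above for nanopore base callers can be illustrated with a minimal sketch of greedy (best-path) decoding. This is not the implementation used by Bonito or Guppy; the label layout (index 0 as the blank symbol, followed by A, C, G, T), the function name `ctc_greedy_decode`, and the toy per-frame probabilities are assumptions made for illustration only. In a real base caller, the per-frame probabilities would come from the neural network's output for each signal window.

```python
from typing import List, Sequence

# Assumption: label 0 is the CTC "blank"; labels 1-4 map to A, C, G, T.
BLANK = 0
ALPHABET = "-ACGT"


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]]) -> str:
    """Greedy CTC decoding: pick the most likely label per frame,
    collapse consecutive repeats, then drop blanks."""
    # Most probable label index for every time frame.
    best_path: List[int] = [
        max(range(len(probs)), key=probs.__getitem__) for probs in frame_probs
    ]
    bases: List[str] = []
    previous = BLANK
    for label in best_path:
        # Emit a base only when the label is not blank and differs from
        # the previous frame's label (a blank in between resets this).
        if label != BLANK and label != previous:
            bases.append(ALPHABET[label])
        previous = label
    return "".join(bases)


# Toy example (hypothetical probabilities): A, A, blank, A, G, T
example_frames = [
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.10, 0.70],
]
print(ctc_greedy_decode(example_frames))  # AAGT
```

The blank label is what allows a genuine homopolymer such as "AA" to be distinguished from a single base that spans several signal frames: the first two frames collapse into one A, while the A after the blank is emitted as a second base. Production base callers typically use more elaborate decoders (for example beam search), so this sketch only conveys the basic idea.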
  • Conclusion: The integration of AI into NGS has transformed human genome analysis from an overload of data into actionable insights, with tools enhancing the workflow from base calling to clinical decision-making. While AI improves accuracy and personalization in NGS, interpretability and equity remain challenging. Future studies need to focus on diverse datasets and interpretable AI to fully leverage AI's capabilities in genomics.
  • Keywords: Artificial Intelligence, Next-Generation Sequencing, Machine Learning, Deep Learning