The Human Genome Project generated the first reference sequence of the human genome, representing approximately 3.1–3.2 billion base pairs. The project began in 1990, produced a draft genome in 2001, and by 2003 had generated a high-quality reference covering approximately 92% of the genome. Most of the remaining gaps were highly repetitive regions that were difficult to sequence. In 2022, the first gap-free human genome assembly was completed, marking the culmination of more than three decades of work.
Today, thanks to next-generation sequencing, or NGS, an entire human genome can be sequenced in about a day, although the complete laboratory workflow—including sample preparation, sequencing, and data analysis—typically takes longer. This dramatic improvement in speed is possible because NGS sequences millions to billions of DNA fragments simultaneously. In contrast, Sanger sequencing analyzes individual DNA fragments in separate reactions, making it far less suitable for large-scale genome sequencing.
Many NGS workflows align sequencing reads to an existing reference genome, such as the human reference genome produced through the Human Genome Project. However, NGS can also be used for de novo genome assembly when no reference genome exists. The basic principle behind NGS is that DNA is fragmented into many short pieces, each of which is sequenced independently. The resulting sequences, known as reads, are then computationally assembled or aligned to reconstruct the original DNA sequence. NGS can be used to sequence both DNA and RNA.
The workflow begins with sample collection, followed by purification of DNA or RNA. The nucleic acids are then assessed to ensure they are of sufficient quality and are not degraded. For most RNA sequencing applications, RNA is first converted into complementary DNA, or cDNA, through reverse transcription before library preparation.
A sequencing library is then prepared from the DNA or cDNA. A library consists of many short DNA fragments derived from a longer DNA molecule. These fragments are generated by mechanically shearing the DNA using high-frequency sound waves or by enzymatic fragmentation.
Specialized DNA sequences called adapters are then ligated to both ends of each DNA fragment. These adapters contain the sequences required for binding to the flow cell, sequencing primer binding sites, sample indices or barcodes for multiplexing, and, in some workflows, unique molecular identifiers. After ligation, excess unligated adapters are removed.
Depending on the application, the library may undergo PCR amplification to increase the amount of DNA available for sequencing. Many modern whole-genome sequencing workflows are PCR-free to minimize amplification bias. Before sequencing, the library is assessed to confirm that the fragment size distribution and DNA concentration meet the instrument's requirements.
One of the most widely used NGS technologies is Illumina sequencing, which uses a method known as sequencing by synthesis.
Sequencing takes place on a glass flow cell coated with millions of short DNA oligonucleotides. These oligonucleotides are complementary to the adapter sequences attached to the library fragments.
First, the double-stranded library is denatured to produce single-stranded DNA molecules. These strands bind to complementary oligonucleotides on the flow cell surface. DNA polymerase then synthesizes the complementary strand, after which the original template strand is removed, leaving a single DNA strand attached to the flow cell.
At this stage, the fluorescent signal from a single DNA molecule would be too weak for reliable detection. Therefore, each bound DNA fragment undergoes bridge amplification, a form of solid-phase amplification, to generate a cluster of identical DNA molecules.
During bridge amplification, the attached DNA strand bends over and hybridizes with a nearby oligonucleotide on the flow cell, forming a bridge. DNA polymerase synthesizes the complementary strand, creating a double-stranded bridge. The strands are then denatured, and the process repeats multiple times. This produces a localized cluster containing thousands of identical DNA copies. One strand of each cluster is then removed, leaving single-stranded templates ready for sequencing.
A sequencing primer is added, followed by DNA polymerase and four fluorescently labeled reversible terminator nucleotides corresponding to A, T, G, and C. During each sequencing cycle, only one nucleotide is incorporated because the reversible terminator temporarily blocks further DNA synthesis.
After nucleotide incorporation, high-resolution imaging captures the fluorescent signal emitted by each cluster, identifying the base that was added. The fluorescent label and blocking group are then chemically removed, allowing the next sequencing cycle to begin. This process is repeated for the programmed read length.
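The imaging step in each cycle can be pictured with a small sketch. It assumes one intensity value per base channel for a single cluster and simply calls the brightest channel. The channel order, the intensity values, and the name `call_bases` are illustrative assumptions; real base callers apply platform-specific signal corrections that are not shown here.

```python
# Toy base caller for one cluster: one intensity per base channel per cycle.
# Channel order and intensity values are illustrative assumptions.
CHANNELS = "ACGT"

def call_bases(cycles):
    """Return the called sequence for a list of per-cycle intensity tuples."""
    read = []
    for intensities in cycles:
        if len(intensities) != len(CHANNELS):
            raise ValueError("expected one intensity per base channel")
        # The reversible terminator allows one incorporation per cycle,
        # so the brightest channel is taken as the incorporated base.
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        read.append(CHANNELS[brightest])
    return "".join(read)

cycles = [
    (0.05, 0.02, 0.90, 0.03),  # cycle 1: G channel brightest
    (0.88, 0.04, 0.05, 0.03),  # cycle 2: A channel brightest
    (0.03, 0.06, 0.02, 0.91),  # cycle 3: T channel brightest
    (0.04, 0.85, 0.06, 0.05),  # cycle 4: C channel brightest
]
print(call_bases(cycles))  # GATC
```

The number of tuples in `cycles` plays the role of the programmed read length: one called base per cycle.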
After completion of the first sequencing read, index reads are generated to identify the sample from which each DNA fragment originated. In paired-end sequencing, additional chemistry regenerates the complementary strand so that the opposite end of the original DNA fragment can also be sequenced. Dual indexing, in which each fragment carries two index sequences, increases multiplexing capacity, and unique dual indexing additionally allows reads misassigned through index hopping to be detected and discarded. The maximum number of samples that can be pooled depends on the indexing kit and sequencing platform.
Once sequencing is complete, image analysis software converts the fluorescent signals into DNA sequences. Low-quality reads are filtered out. On patterned flow cells, overlapping clusters are largely eliminated, although polyclonal clusters—where more than one library fragment occupies the same nanowell—may still occur and are also removed during quality filtering.
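The text does not say how low-quality reads are identified. One common convention, assumed here for illustration, is the Phred quality score, where a score Q corresponds to an estimated error probability of 10^(-Q/10) and is stored as one character per base (ASCII code minus 33 in standard FASTQ files). The mean-quality threshold of 20 and the function names below are hypothetical choices, not the instrument's own filter.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def passes_filter(quality_string, min_mean_quality=20):
    """Keep a read if its mean Phred score reaches the (assumed) threshold."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_quality

# (sequence, quality string) pairs; 'I' encodes Q40 and '#' encodes Q2.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # mean Q40
    ("ACGTTTGA", "II##II##"),  # mean Q21
    ("GGCATTAC", "########"),  # mean Q2
]
kept = [seq for seq, qual in reads if passes_filter(qual)]
print(kept)  # ['ACGTACGT', 'ACGTTTGA']
```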
The remaining high-quality reads are then demultiplexed using their index sequences to assign each read to its original sample.
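Demultiplexing amounts to a lookup from index sequence to sample. The sketch below assumes a single index per read, a hypothetical two-sample sheet, and a tolerance of one mismatch. All of these are illustrative choices, and production tools handle dual indices and many more edge cases.

```python
from collections import defaultdict

# Hypothetical sample sheet: index sequence -> sample name.
SAMPLE_INDICES = {
    "ACGTAC": "sample_A",
    "TGCATG": "sample_B",
}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, indices=SAMPLE_INDICES, max_mismatches=1):
    """Group (index_read, insert_read) pairs by the sample whose index matches."""
    by_sample = defaultdict(list)
    for index_read, insert_read in reads:
        matches = [
            sample
            for index, sample in indices.items()
            if len(index) == len(index_read)
            and hamming(index, index_read) <= max_mismatches
        ]
        # Assign only when exactly one sample matches; otherwise undetermined.
        sample = matches[0] if len(matches) == 1 else "undetermined"
        by_sample[sample].append(insert_read)
    return dict(by_sample)

reads = [
    ("ACGTAC", "TTGACCA"),  # exact match to sample_A
    ("TGCATC", "GGATTCA"),  # one mismatch from sample_B
    ("GGGGGG", "CATCATC"),  # matches no sample
]
result = demultiplex(reads)
print(result)
```

Allowing a mismatch rescues reads with a single sequencing error in the index, which only works when the indices in a pool differ from one another at several positions.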
Depending on the application, the reads are either aligned to a reference genome or assembled de novo. During alignment, each read is placed at the position in the reference that it matches best, and the overlapping reads at each position are then combined to infer the sequence of the sample. In paired-end sequencing, the alignment software uses the fact that the two reads originate from opposite ends of the same DNA fragment, which improves alignment accuracy, particularly across repetitive or structurally complex genomic regions.
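A toy example shows why the second read of a pair helps in repetitive regions. It uses exact string matching on a made-up reference that contains the same 7-base repeat twice. The reference, the reads, and the 25-base fragment limit are assumptions for illustration. Real aligners use indexed data structures and tolerate mismatches and gaps.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_all(reference, read):
    """Return every 0-based start where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

def place_pair(reference, read1, read2, max_fragment=25):
    """Return (start, end) placements consistent with both reads of a pair."""
    # Read 2 comes from the opposite strand, so match its reverse complement.
    mate = reverse_complement(read2)
    placements = []
    for p1 in find_all(reference, read1):
        for p2 in find_all(reference, mate):
            end = p2 + len(mate)
            # Keep placements where the mate lies downstream within the limit.
            if p2 >= p1 and end - p1 <= max_fragment:
                placements.append((p1, end))
    return placements

# Made-up reference with the repeat GATTACA at positions 0 and 18.
reference = "GATTACA" + "CCGTAGGCTTA" + "GATTACA" + "TTTGCGCAACGGT"
read1 = "GATTACA"    # ambiguous on its own
read2 = "CGTTGCGC"   # reverse complement of GCGCAACG, which is unique

print(find_all(reference, read1))            # [0, 18]
print(place_pair(reference, read1, read2))   # [(18, 36)]
```

Read 1 alone matches two places, but only one of them lies within a plausible fragment length of its mate, so the pair resolves the ambiguity.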
An important sequencing metric is read depth, also known as sequencing depth, which refers to the number of sequencing reads covering a particular nucleotide position. Average read depth is the mean of this per-position value across the target region. Approximately 30× average depth is considered standard for whole-genome sequencing, while targeted oncology assays may use average depths of around 1,500× to detect rare somatic mutations.
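Depth is easy to compute once reads have positions. The sketch below counts reads over each position of a 10-base toy region and then estimates how many reads a 30× genome would need. The alignments are made up, the 150-base read length is an assumed value, and the genome size is taken from the rough figure of 3.1 billion base pairs quoted earlier.

```python
def per_base_depth(region_length, alignments):
    """Count reads covering each position; alignments are (start, length)."""
    depth = [0] * region_length
    for start, length in alignments:
        for pos in range(max(start, 0), min(start + length, region_length)):
            depth[pos] += 1
    return depth

def average_depth(depth):
    """Mean per-position depth across the region."""
    return sum(depth) / len(depth)

def reads_for_depth(target_depth, genome_size, read_length):
    """Reads needed so that reads * read_length / genome_size >= target_depth."""
    return (target_depth * genome_size + read_length - 1) // read_length

alignments = [(0, 5), (2, 5), (4, 5), (5, 5)]  # four 5-base reads
depth = per_base_depth(10, alignments)
print(depth)                 # [1, 1, 2, 2, 3, 3, 3, 2, 2, 1]
print(average_depth(depth))  # 2.0

# Assumed 150-base reads over a genome of about 3.1 billion bases.
print(reads_for_depth(30, 3_100_000_000, 150))  # 620000000
```

The average of 2.0 matches the shortcut of total sequenced bases divided by region length (4 reads × 5 bases / 10 positions), which is the same relationship `reads_for_depth` inverts.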
Another key metric is coverage, or breadth of coverage, which refers to the proportion of the target genome or genomic region covered by sequencing reads, often at or above a stated minimum depth. High breadth of coverage means that few or no regions are missed during sequencing.
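Breadth can be computed from the same per-position depth values used for read depth. The depth values and the minimum-depth thresholds in this sketch are made up for illustration.

```python
def breadth_of_coverage(depth, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    if not depth:
        raise ValueError("depth list is empty")
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)

# Made-up per-position depths for a 10-base region.
depth = [0, 0, 4, 6, 5, 7, 3, 0, 2, 8]

print(sum(depth) / len(depth))         # average depth: 3.5
print(breadth_of_coverage(depth))      # at least 1 read: 0.7
print(breadth_of_coverage(depth, 5))   # at least 5 reads: 0.4
```

The region has a reasonable average depth yet three positions have no reads at all, which is why depth and breadth are reported separately.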
NGS has transformed both research and clinical practice. It is widely used for diagnosing rare genetic disorders, identifying inherited and somatic variants, guiding cancer treatment, monitoring infectious diseases, studying microbial communities, and supporting research across medicine, ecology, agriculture, and evolutionary biology.
Applications include whole-genome sequencing, whole-exome sequencing, targeted gene panels, transcriptome sequencing, and sequencing of coding and non-coding RNAs such as microRNAs and long non-coding RNAs. Specialized sequencing methods also enable the analysis of cell-free DNA, single cells, DNA methylation, chromatin accessibility, and, through techniques such as ChIP-seq, protein–DNA interactions.
