Describe how the first draft of the human genome was obtained and compare this with the next generation sequencing (NGS) technologies used to sequence human genomes. (Min 2 and a half pages.)
Human genome sequencing
In 2001, the Human Genome Project international consortium published a first draft and initial analysis of the human genome sequence. At the same time, Craig Venter and colleagues working at Celera Genomics Corporation published another version of the human genome sequence. A wealth of information was obtained from the initial analysis of the human genome draft. For instance, the number of human genes was estimated to be about 30,000 (later revised to about 20,000). Researchers also reported that the DNA sequences of any two human individuals are 99.9 percent identical.
Phases of the Human Genome Project
Based on the insights gained from the yeast and worm studies, the Human Genome Project employed a two-phase approach to tackle the human genome sequence (IHGSC, 2001). The first phase, called the shotgun phase, divided human chromosomes into DNA segments of an appropriate size, which were then further subdivided into smaller, overlapping DNA fragments that were sequenced. The Human Genome Project relied upon the physical map of the human genome established earlier, which served as a platform for generating and analyzing the massive amounts of DNA sequence data that emerged from the shotgun phase. Next, the second phase of the project, called the finishing phase, involved filling in gaps and resolving DNA sequences in ambiguous areas not obtained during the shotgun phase. The amount of DNA sequence information deposited in the High-Throughput Genomic Sequences (HTGS) division of GenBank increased exponentially through the end of the shotgun phase. Indeed, the shotgun phase yielded 90% of the human genome sequence in draft form.
The shotgun phase of the Human Genome Project itself consisted of three steps:
The approach used by the members of the IHGSC was called the hierarchical shotgun method, because the team members systematically generated overlapping clones mapped to individual human chromosomes, which were individually sequenced using a shotgun approach. The clones were derived from DNA libraries made by ligating DNA fragments generated by partial restriction enzyme digestion of genomic DNA from anonymous human donors into bacterial artificial chromosome vectors, which could be propagated in bacteria.
When possible, the DNA fragments within the library vectors were mapped to chromosomal regions by screening for sequence-tagged sites (STSs), which are DNA fragments, usually less than 500 base pairs in length, of known sequence and chromosomal location that can be amplified using polymerase chain reaction (PCR). Library clones were also digested with the restriction enzyme HindIII, and the sizes of the resulting DNA fragments were determined using agarose gel electrophoresis. Each library clone exhibited a DNA fragment "fingerprint," which could be compared to that of all other library clones in order to identify overlapping clones. Fluorescence in situ hybridization (FISH) was also used to map library clones to specific chromosomal regions. Collectively, the STS, DNA fingerprint, and FISH data allowed the IHGSC to generate contigs, which consisted of multiple overlapping bacterial artificial chromosome (BAC) library clones spanning each of the 24 different human chromosomes (i.e., 22 autosomes and the X and Y chromosomes).
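The fingerprint comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the software the IHGSC used: the function names, the 2% size tolerance, the 50% shared-band threshold and the fragment sizes are all hypothetical values chosen for the example.

```python
def shared_bands(fp_a, fp_b, tolerance=0.02):
    """Count bands in fp_a that match a band in fp_b within a relative size tolerance."""
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]  # each band in fp_b may be matched only once
                break
    return shared


def likely_overlap(fp_a, fp_b, min_fraction=0.5, tolerance=0.02):
    """Flag two clones as overlap candidates when enough bands are shared."""
    smaller = min(len(fp_a), len(fp_b))
    return smaller > 0 and shared_bands(fp_a, fp_b, tolerance) / smaller >= min_fraction


# Hypothetical HindIII fragment sizes (base pairs) for three BAC clones.
clone_a = [1200, 2500, 4100, 6300, 8800]
clone_b = [2510, 4080, 6310, 9500, 11000]
clone_c = [700, 1500, 3300, 5200, 7400]

print(likely_overlap(clone_a, clone_b))  # three bands agree within tolerance
print(likely_overlap(clone_a, clone_c))  # no bands agree
```

Clones flagged in this way would only be candidates for overlap; in the project, the STS and FISH data provided independent evidence for placing them in a contig.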
Next, individual BAC clones selected for DNA sequence analysis were further fragmented, and the smaller genomic DNA fragments were subcloned into vectors to generate a BAC-derived shotgun library. The inserts were sequenced using primers matching the vector sequence flanking the genomic DNA insert, and overlapping shotgun clones were used to generate a DNA sequence spanning the entire BAC clone. The members of the IHGSC agreed that each center would obtain an average of fourfold sequence coverage, with no clone having less than threefold coverage. The term "shotgun" comes from the fact that the original BAC clone was randomly fragmented and sequenced, and the raw DNA sequence data was then subjected to computational analyses to generate an ordered set of DNA sequences that spanned the BAC clone.
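To make the computational step concrete, here is a toy greedy assembler that merges reads by their longest suffix-to-prefix overlap. It is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats) and is not the assembly software used by the IHGSC; the function names and the minimum overlap of three bases are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap_length(left, right, min_overlap)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no overlaps remain, so a gap is left between contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical reads taken from the 15-base sequence ATGGCGTACGTTAGC.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

When no overlaps remain, the loop stops and returns several contigs, which mirrors the gaps that the finishing phase was designed to close.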
Celera: Shooting at Random and Organizing Later
Before the IHGSC had completed the first phase of the Human Genome Project, a private biotechnology company called Celera Genomics also entered the race to sequence the human genome. Led by Dr. Craig Venter, Celera proclaimed that it would sequence the entire human genome within three years. Celera used two independent data sets together with two distinct computational approaches to determine the sequence of the human genome. The first data set was generated by Celera and consisted of 27.27 million DNA sequence reads, each with an average length of 543 base pairs, derived from five different individuals. The second data set was obtained from the publicly funded Human Genome Project and was derived from the BAC contigs (called bactigs); here, Celera "shredded" the Human Genome Project DNA sequence into 550-base-pair sequence reads representing a total of 16.05 million sequence reads. The company then used a whole-genome assembly method and a regional chromosome assembly method to sequence the human genome.
In the whole-genome assembly method (also called the whole-genome random shotgun method), Celera generated a massive shotgun library derived from its own DNA sequence data combined with the "shredded" Human Genome Project DNA sequence data, which together corresponded to a total of 43.32 million sequence reads. Celera used computational methods and sophisticated algorithms to identify overlapping DNA sequences and to reconstruct the human genome by generating a set of scaffolds.
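The read counts quoted above can be checked with simple arithmetic, and the same numbers give a rough estimate of sequence coverage (total bases sequenced divided by genome size). The read counts and the 543-base-pair mean read length come from the text; the genome size of 2.9 billion base pairs is an assumption made only for this sketch.

```python
CELERA_READS = 27.27e6        # Celera's own reads (from the text)
CELERA_MEAN_READ_BP = 543     # mean read length in base pairs (from the text)
SHREDDED_HGP_READS = 16.05e6  # "shredded" Human Genome Project reads (from the text)
ASSUMED_GENOME_BP = 2.9e9     # assumption: approximate genome size, not stated in the text


def fold_coverage(n_reads, mean_read_bp, genome_bp):
    """Average number of times each base is sequenced."""
    return n_reads * mean_read_bp / genome_bp


total_reads = CELERA_READS + SHREDDED_HGP_READS
celera_coverage = fold_coverage(CELERA_READS, CELERA_MEAN_READ_BP, ASSUMED_GENOME_BP)

print(f"Combined reads: {total_reads / 1e6:.2f} million")
print(f"Celera-only coverage: about {celera_coverage:.1f}-fold")
```

Under that assumed genome size, Celera's own reads work out to roughly fivefold coverage, in the same range as the fourfold average the IHGSC centers agreed on for each BAC clone.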
In contrast, with the regional chromosome assembly approach (also called the compartmentalized shotgun assembly method), Celera organized its own data and the Human Genome Project sequence data into the largest possible chromosomal segments, followed by shotgun assembly of the sequence data within each segment; this approach was similar to the hierarchical shotgun approach used by the IHGSC. The first step of the regional assembly approach involved separating Celera reads that matched Human Genome Project reads from those that were distinct from the public sequence data. Of the 27.27 million Celera reads, 21.38 million matched a Human Genome Project bactig, and 5.89 million did not match the public sequence data. These reads were assembled into Celera-specific or Human Genome Project-specific scaffolds, which were then combined and analyzed using whole-genome assembly algorithms. The resulting bactig data were again "shredded" to permit unbiased assembly of the combined sequence data.
Celera's whole-genome and regional chromosome assembly methods were independent of each other, permitting direct comparison of the data. Celera found that the regional chromosome assembly method was slightly more consistent than the whole-genome assembly method. Using these complementary approaches, Celera generated data that was in strong agreement with that of the IHGSC.
In February 2001, drafts of the human genome sequence were published simultaneously by both groups in two separate articles. Due to technical advances in DNA sequencing methods and a productive level of synergy between the two groups, they tied at the finish line, and both projects were completed ahead of schedule.
How to sequence DNA.
A) DNA polymerase binds to a single-stranded DNA template and synthesizes a complementary strand of DNA.
B) When DNA polymerase randomly incorporates a fluorescently labeled ddNTP base, synthesis terminates. This step produces a mixture of newly synthesized DNA strands that differ in length by a single nucleotide. Each strand is labeled at the 3′ end with a fluorescently labeled ddNTP base.
C) The DNA mixture is separated by electrophoresis.
D) The electropherogram results show peaks representing the color and signal intensity of each DNA band. From these data, the sequence of the newly synthesized DNA strand is determined.
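Steps A to D can be imitated with a short simulation. This is a deliberately idealized sketch: it assumes exactly one terminated strand of every possible length, perfect separation by size, and a template written 3′ to 5′ so that the new strand reads 5′ to 3′ from left to right. The function names and the template sequence are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def chain_termination_fragments(template):
    """Return (length, terminal base) for every possible terminated strand.

    A labeled ddNTP that stops synthesis at position i leaves a strand of
    length i whose final base identifies the fluorescent label.
    """
    new_strand = [COMPLEMENT[base] for base in template]
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_electropherogram(fragments):
    """Sort strands by length, as electrophoresis does, and read the labels in order."""
    return "".join(base for _, base in sorted(fragments))


template = "TACGGATC"  # hypothetical template, written 3' to 5'
fragments = chain_termination_fragments(template)
random.Random(0).shuffle(fragments)  # the reaction yields an unordered mixture
print(read_electropherogram(fragments))
```

Sorting by length is what recovers the order of bases; the mixture itself carries no positional information until it is separated.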
Next generation sequencing
There are a number of different NGS platforms using different sequencing technologies, a detailed discussion of which is beyond the scope of this article. However, all NGS platforms perform sequencing of millions of small fragments of DNA in parallel. Bioinformatics analyses are used to piece together these fragments by mapping the individual reads to the human reference genome. Each of the three billion bases in the human genome is sequenced multiple times, providing high depth to deliver accurate data and an insight into unexpected DNA variation. NGS can be used to sequence entire genomes or constrained to specific areas of interest, including all 22 000 coding genes (a whole exome) or small numbers of individual genes.
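The idea of mapping reads to a reference and counting depth can be illustrated with a toy example. This sketch uses exact string matching on a 12-base made-up reference; real read-mapping software relies on indexed searches and tolerates mismatches, so the function name, reference and reads here are purely hypothetical.

```python
def map_reads(reference, reads):
    """Map each read to its first exact match and return per-base depth and unmapped reads."""
    depth = [0] * len(reference)
    unmapped = []
    for read in reads:
        start = reference.find(read)
        if start == -1:
            unmapped.append(read)  # no exact match anywhere in the reference
            continue
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth, unmapped


reference = "ACGTTAGCCATG"
reads = ["ACGTT", "GTTAGC", "TAGCCA", "GCCATG", "TTTTT"]

depth, unmapped = map_reads(reference, reads)
print(depth)
print(unmapped)
```

Positions covered by several overlapping reads accumulate higher depth, which is what allows a base call to be confirmed by independent observations.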
Potential uses of NGS in clinical practice
Clinical genetics
There are numerous opportunities to use NGS in clinical practice to improve patient care, including:
NGS captures a broader spectrum of mutations than Sanger sequencing
The spectrum of DNA variation in a human genome comprises small base changes (substitutions), insertions and deletions of DNA, large genomic deletions of exons or whole genes and rearrangements such as inversions and translocations. Traditional Sanger sequencing is restricted to the discovery of substitutions and small insertions and deletions. For the remaining mutations, dedicated assays are frequently performed, such as fluorescence in situ hybridisation (FISH) for conventional karyotyping, or comparative genomic hybridisation (CGH) microarrays to detect submicroscopic chromosomal copy number changes such as microdeletions. However, these data can also be derived directly from NGS data, obviating the need for dedicated assays while harvesting the full spectrum of genomic variation in a single experiment. The only limitations reside in regions which sequence poorly or map erroneously due to extreme guanine/cytosine (GC) content or repeat architecture, for example, the repeat expansions underlying Fragile X syndrome or Huntington's disease.
Genomes can be interrogated without bias
Capillary sequencing depends on prior knowledge of the gene or locus under investigation. NGS, by contrast, is completely unselective and can be used to interrogate full genomes or exomes to discover entirely novel mutations and disease-causing genes. In paediatrics, this could be exploited to unravel the genetic basis of unexplained syndromes. For example, a nationwide project, Deciphering Developmental Disorders, running at the Wellcome Trust Sanger Institute in collaboration with NHS clinical genetics services, aims to unravel the genetic basis of unexplained developmental delay by sequencing affected children and their parents to uncover deleterious de novo variants. Allying these molecular data with detailed clinical phenotypic information has been successful in identifying novel genes mutated in affected children with similar clinical features.
The increased sensitivity of NGS allows detection of mosaic mutations
Mosaic mutations are acquired as a postfertilisation event and consequently they are present at variable frequency within the cells and tissues of an individual. Capillary sequencing may miss these variants because they are frequently present at levels that fall below the sensitivity of the technology. NGS provides a far more sensitive read-out and can therefore be used to identify variants which reside in just a few per cent of the cells, including mosaic variation. In addition, the sensitivity of NGS can be increased further, simply by increasing sequencing depth. This has seen NGS employed for very sensitive investigations such as interrogating foetal DNA from maternal blood or tracking the levels of tumour cells in the circulation of cancer patients.
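Why greater depth improves sensitivity can be shown with a simple binomial model. This is a sketch under stated assumptions: each read independently carries the variant with probability equal to the variant allele fraction, sequencing errors are ignored, and a variant counts as detected when at least three reads support it. The threshold, the 2% allele fraction and the depths are hypothetical choices for illustration.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_variant_reads=3):
    """P(at least `min_variant_reads` reads carry the variant) under a binomial model."""
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1 - p_fewer


# Hypothetical mosaic variant present in 2% of reads, at three sequencing depths.
for depth in (30, 100, 1000):
    print(depth, round(detection_probability(depth, 0.02), 3))
```

Under these assumptions the chance of seeing the variant rises steeply with depth, which is the reasoning behind deep sequencing for low-level mosaic or circulating tumour variants.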
Microbiology
The main utility of NGS in microbiology is to replace conventional characterisation of pathogens by morphology, staining properties and metabolic criteria with a genomic definition of pathogens. The genomes of pathogens define what they are, may harbour information about drug sensitivity and inform the relationship of different pathogens with each other, which can be used to trace sources of infection outbreaks. The latter recently received media attention when NGS was used to reveal and trace an outbreak of methicillin-resistant Staphylococcus aureus (MRSA) on a neonatal intensive care unit in the UK. NGS of the pathogens allowed precise characterisation of the MRSA isolates and revealed a protracted outbreak of MRSA which could be traced to a single member of staff.
Oncology
The fundamental premise of cancer genomics is that cancer is caused by somatically acquired mutations, and consequently it is a disease of the genome. Although capillary-based cancer sequencing has been ongoing for over a decade, these investigations were limited to relatively few samples and small numbers of candidate genes. With the advent of NGS, cancer genomes can now be systematically studied in their entirety, an endeavour ongoing via several large-scale cancer genome projects around the world, including a dedicated paediatric cancer genome project. For the child suffering from cancer this may provide many benefits, including a more precise diagnosis and classification of the disease, a more accurate prognosis, and potentially the identification of ‘drug-able’ causal mutations. Individual cancer sequencing may, therefore, provide the basis of personalised cancer management. Currently, pilot projects are underway using NGS of cancer genomes in clinical practice, mainly aiming to identify mutations in tumours that can be targeted by mutation-specific drugs.
In principle, the concepts behind Sanger sequencing and next-generation sequencing (NGS) are similar. In both NGS and Sanger sequencing (also known as dideoxy or capillary electrophoresis sequencing), DNA polymerase adds fluorescent nucleotides one by one onto a growing DNA strand that is complementary to the template. Each incorporated nucleotide is identified by its fluorescent tag.
The critical difference between Sanger sequencing and NGS is sequencing volume. While the Sanger method only sequences a single DNA fragment at a time, NGS is massively parallel, sequencing millions of fragments simultaneously per run. This high-throughput process translates into sequencing hundreds to thousands of genes at one time. NGS also offers greater discovery power to detect novel or rare variants with deep sequencing.