DNA sequencing

DNA sequencing is the process of determining the order of bases in a strand of DNA. It was traditionally performed by chemical and enzymatic processes, such as Sanger sequencing, but next-generation sequencing technologies, such as Illumina, are now generally preferred because they are massively parallel and high-throughput.

Chemical sequencing

Now obsolete, chemical sequencing (the Maxam-Gilbert method) was one of the original methods used to sequence DNA. It produced reads of up to 100 base pairs at a time and involved conducting four chemical reactions in parallel:

1. Cleavage of A and G, but A preferentially
2. Cleavage of A and G, but G preferentially
3. Cleavage of C
4. Cleavage of C and T

The DNA is radio-labelled at the 5' end and then treated, in each reaction, with a reagent that chemically modifies a specific base; the backbone is then cleaved at the modified position. This generates a series of fragments, each radio-labelled at the 5' end and cleaved base-specifically at the 3' end. These fragments can be separated by electrophoresis and then detected by autoradiography. It is important to limit the chemical reaction so that, for example, the C residues are cleaved approximately once every 100 bases rather than at every C position; otherwise only the shortest fragments would still carry the label.
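
How the four base-specific reactions give a readable gel can be sketched in a few lines of Python. This is a simplified model rather than a lab protocol: it assumes that, across the population of molecules, every target base is cleaved in at least some copies, it represents preferential cleavage as a 'strong' (2) or 'weak' (1) band, and the names `REACTIONS`, `run_lanes` and `read_gel` are hypothetical.

```python
# Band intensity produced by each reaction: 2 = strong band, 1 = weak band.
REACTIONS = {
    "A>G": {"A": 2, "G": 1},   # cleaves A and G, A preferentially
    "G>A": {"G": 2, "A": 1},   # cleaves A and G, G preferentially
    "C":   {"C": 2},           # cleaves C only
    "C+T": {"C": 2, "T": 2},   # cleaves C and T
}

def run_lanes(seq):
    """Return {reaction: {fragment_length: intensity}} for a 5'-labelled strand."""
    lanes = {name: {} for name in REACTIONS}
    for index, base in enumerate(seq):
        if index == 0:
            continue  # cleaving the first base leaves no labelled fragment
        for name, targets in REACTIONS.items():
            if base in targets:
                # The labelled fragment holds the `index` bases 5' of the cleaved base.
                lanes[name][index] = targets[base]
    return lanes

def read_gel(lanes):
    """Read bases from the shortest fragment to the longest."""
    lengths = sorted({n for lane in lanes.values() for n in lane})
    bases = []
    for n in lengths:
        if n in lanes["C"]:
            bases.append("C")          # band in both C and C+T lanes
        elif n in lanes["C+T"]:
            bases.append("T")          # band in C+T lane only
        elif lanes["A>G"].get(n, 0) > lanes["G>A"].get(n, 0):
            bases.append("A")          # stronger band in the A>G lane
        else:
            bases.append("G")          # stronger band in the G>A lane
    return "".join(bases)

print(read_gel(run_lanes("GATTACAGC")))  # ATTACAGC (the first base cannot be read)
```

In this model the band of length n reports the base at position n + 1, which is why the read starts at the second base.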

Separation and detection can be done in one of two ways:

1. Denature DNA fragments, radio-label the single strands and then size-separate by high-resolution polyacrylamide gel electrophoresis, or

2. Label both strands of the dsDNA fragments, then perform a restriction digest to remove the labelled end on the negative strand (so only the positive strand can be detected by autoradiography)

In both methods, radioactive end-labelling of DNA fragments traditionally involved using γ[32P]-labelled ATP and T4 polynucleotide kinase, which transfers the labelled phosphate to the 5' end of the DNA. If the fragments are 5' labelled, it is necessary to ‘read up’ the gel (from the shortest fragment to the longest) to determine the DNA sequence. This was done manually, so chemical sequencing was prone to reading errors.

Enzymatic methods: Sanger sequencing

The dideoxy reaction is based on the fact that DNA polymerisation requires a free 3' OH group for the addition of a 5' phosphate group from an incoming deoxyribonucleotide triphosphate (dNTP) precursor [for more on this, see: DNA replication]. DNA polymerisation can thus be terminated prematurely by the incorporation of a nucleotide that lacks a 3' OH group for a new precursor to be latched on to (i.e. such a nucleotide has only an H atom on both its 2' and 3' carbons); such a nucleotide is aptly named a di-deoxyribonucleotide triphosphate (ddNTP) precursor, because oxygen is missing on two carbons when compared to a standard ribonucleotide triphosphate.

In the Sanger method, DNA is first denatured and then mixed with DNA polymerase, all four dNTP precursors (i.e. one dNTP for each base) and one type of ddNTP at low concentration. Keeping the ddNTP at low concentration prevents termination from occurring too frequently, analogous to the need to limit cleavage in chemical sequencing. When the procedure is run, DNA strands are polymerised as normal, but terminate at different points depending on where the ddNTP is incorporated. Keeping the ddNTP:dNTP ratio low ensures that a nested set of terminated strands, covering a wide range of lengths, is formed. If the ddNTP is radiolabelled then the terminated strands can be separated by polyacrylamide gel electrophoresis and detected by autoradiography on X-ray film; however, fluorescent tagging of the ddNTPs and laser excitation is more common these days. Labelling each of the four ddNTPs with a different fluorescent dye enables all four sequencing reactions to be conducted in the same container: 'one-pot' reactions.
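
The effect of the ddNTP concentration on the spread of fragment lengths can be illustrated with a short calculation. The sketch below is an idealised model, not a description of real polymerase kinetics: it assumes that, at every position calling for the base in question, the polymerase picks a ddNTP with a fixed probability `dd_fraction`, and the function name `termination_profile` is hypothetical. Fragment lengths are counted from the first base added after the primer.

```python
def termination_profile(new_strand, base, dd_fraction):
    """Fraction of all strands that terminate at each occurrence of `base`.

    Returns a list of (fragment_length, fraction_of_strands) pairs.
    """
    still_extending = 1.0      # fraction of strands not yet terminated
    profile = []
    for length, nucleotide in enumerate(new_strand, start=1):
        if nucleotide == base:
            stopped = still_extending * dd_fraction
            profile.append((length, stopped))
            still_extending -= stopped
    return profile

strand = "ACGTACGTAC"
print(termination_profile(strand, "A", 0.5))   # [(1, 0.5), (5, 0.25), (9, 0.125)]
print(termination_profile(strand, "A", 0.01))  # a much flatter spread of lengths
```

With a high ddNTP fraction the signal halves at every occurrence of the base, so long fragments quickly become too rare to detect; with a low fraction, each position receives a similar share of the terminated strands.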

Sequencing by chain termination requires a single-stranded template against which the extension product can be synthesised. This is why, traditionally, bacteriophage M13 was considered an ideal vector for Sanger sequencing (due to its single-stranded genome). However, nowadays double-stranded DNA templates can be used, provided they are first denatured in a thermal cycler. As with all DNA extension procedures, a primer with a free 3' OH is required; in sequencing reactions this is a short synthetic DNA oligonucleotide rather than an RNA primer. A DNA polymerase is also required. Traditionally this was the Klenow fragment of E. coli DNA polymerase I, which polymerises nucleotides while lacking the 5'-3' exonuclease activity that could otherwise degrade the primer from its 5' end. Reactions run in a thermal cycler instead use a thermostable polymerase such as Taq polymerase.

For radioactive detection, either the primer is end-labelled using γ[32P]-ATP or the extension product is labelled internally with α[32P/35S/33P]-dATP (the latter two radio-labels are safer to use in the lab). Although polyacrylamide gel electrophoresis was traditionally used in Sanger sequencing, it is now more common to use capillary electrophoresis, in which a detector records the fluorescence of each terminated fragment as it passes and sends the data to a computer that shows each base as a ‘peak’. Interpreting these peaks can be difficult, however, especially where the peaks overlap.

Insert DNA can be sequenced in a plasmid using ‘universal primers’. Although the DNA sequence of the insert may not be known, the DNA sequence of the plasmid and cloning site is usually known, so a primer can be designed to anneal just outside the cloning site. This is extended to give the first stretch of sequence in the insert, the end of which can then be used to design another primer from which the next stretch of insert is sequenced. This bit-by-bit sequencing method is called ‘primer walking’.
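
The logic of primer walking can be sketched as a loop in which the end of each read supplies the next primer. The Python below is a toy model under several assumptions: each sequencing reaction is represented by a simple string slice of fixed length `read_len`, a primer is written as the sequence it matches on the strand being read (rather than as its complement), and the function name and example sequences are hypothetical.

```python
def primer_walk(template, first_primer, read_len, primer_len):
    """Recover everything downstream of `first_primer`, one read at a time."""
    if primer_len < 1 or read_len < primer_len:
        raise ValueError("each read must be long enough to design the next primer")
    assembled = ""
    primer = first_primer
    search_from = 0
    while True:
        site = template.find(primer, search_from)
        if site == -1:
            raise ValueError("primer does not anneal to the template")
        start = site + len(primer)
        read = template[start:start + read_len]   # stand-in for one sequencing run
        assembled += read
        if len(read) < read_len:                  # ran off the end of the template
            return assembled
        primer = read[-primer_len:]               # design the next primer from the new sequence
        search_from = start                       # only look downstream of known sequence

vector_flank = "TTGACGGATCC"                      # known plasmid sequence (made up)
insert = "ATGCGTACCTGAAGCTTAGGCATCGA"             # unknown insert (made up)
print(primer_walk(vector_flank + insert, "GGATCC", read_len=8, primer_len=5))
# ATGCGTACCTGAAGCTTAGGCATCGA
```

A real primer must also be unique within the template; restricting the search to sequence downstream of the previous read only partly stands in for that requirement here.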

Sequencing data can be used to build up entire genomes by generating contigs (contiguous sequences) where fragment DNA sequences appear to overlap.

Although computing power is important in modern sequencing, there is still a place for manual sequencing – in transcript mapping and DNA footprinting.
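
The idea of building a contig from overlapping fragments, mentioned above, can be shown with a minimal example. This sketch only merges two reads by their longest exact suffix-prefix overlap; real assemblers tolerate sequencing errors and handle many reads at once. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge_reads(left, right, min_overlap=3):
    """Join two reads into one contig, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]

print(merge_reads("ATGCGTACCT", "TACCTGAAGC"))  # ATGCGTACCTGAAGC
print(merge_reads("AAAA", "CCCC"))              # None
```

The shorter the reads, the more likely it is that an overlap of this kind arises by chance, particularly in repetitive sequence.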

DNA methylation represents an extra layer of genetic information – albeit transient. With chemical methods, methylation states can be 'frozen' as another substrate for DNA sequencing. Methylation of C residues is common in eukaryotic genomes; treatment with sodium bisulphite converts unmethylated C residues into U residues, while leaving methylated C residues untouched. These products can be amplified by PCR, during which each U is copied as a T, and then sequenced. The treated sequence can be compared with the untreated sequence to ascertain where a C has been methylated (i.e. a methylated C will still be read as a C, and an unmethylated C will be read as a T).
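
The comparison step can be sketched in Python as follows. This is an idealised model that assumes complete conversion of every unmethylated C and error-free sequencing of a single strand; the function names and the example sequence are hypothetical.

```python
def bisulphite_convert(seq, methylated_positions):
    """Sequence as read after bisulphite treatment and PCR.

    Unmethylated C becomes U and is read as T; methylated C stays C.
    """
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(seq)
    )

def call_methylation(reference, converted):
    """Compare untreated and treated reads; return (methylated, unmethylated) C positions."""
    methylated, unmethylated = [], []
    for index, (ref, obs) in enumerate(zip(reference, converted)):
        if ref != "C":
            continue
        if obs == "C":
            methylated.append(index)
        elif obs == "T":
            unmethylated.append(index)
    return methylated, unmethylated

reference = "ACGTCCGA"
treated = bisulphite_convert(reference, methylated_positions={1, 5})
print(treated)                               # ACGTTCGA
print(call_methylation(reference, treated))  # ([1, 5], [4])
```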

Next-generation sequencing: 454 sequencing

Next-generation sequencing, above all, saves time, because it removes the need for vector recombination, bacterial transformation and cloning: the fragmented DNA is sequenced directly as the raw material.

454 sequencing is based on pyrosequencing, a form of ‘sequencing by synthesis’. Pyrosequencing involves immobilising single-stranded DNA, which acts as a template for DNA synthesis, and then washing over it, one at a time, four solutions that each contain a single nucleotide (A, C, G or T). When a nucleotide pairs correctly with the template DNA and is incorporated, pyrophosphate (PPi) is liberated. Another enzyme, ATP sulfurylase, converts the displaced PPi into ATP in the presence of a substrate, adenosine 5' phosphosulfate (APS); the ATP produced drives the conversion of luciferin to oxyluciferin, catalysed by the enzyme luciferase. This reaction emits light in proportion to the amount of ATP present, and therefore to the number of nucleotides incorporated. Nucleotides that are not incorporated are degraded by an enzyme, apyrase, and no light is emitted. The sequence of solutions that cause light to be emitted is recorded in a ‘pyrogram’, thereby providing the DNA sequence.
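
The relationship between the order of nucleotide solutions and the recorded light signal can be modelled in a few lines of Python. The sketch below is idealised: it assumes the light signal is exactly equal to the number of bases incorporated in each flow, which real instruments only approximate for long runs of the same base. The function names, the `flow_order` parameter and the cyclic A, C, G, T order are assumptions made for illustration.

```python
def pyrogram(new_strand, flow_order="ACGT"):
    """Simulate flows over a template whose complementary (new) strand is `new_strand`.

    Returns a list of (nucleotide, signal) pairs, one per flow, where signal
    is the number of bases incorporated during that flow.
    """
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    flows = []
    position = 0
    while position < len(new_strand):
        for nucleotide in flow_order:
            run = 0
            # A run of identical bases is incorporated in a single flow.
            while position < len(new_strand) and new_strand[position] == nucleotide:
                run += 1
                position += 1
            flows.append((nucleotide, run))
            if position == len(new_strand):
                break
    return flows

def decode(flows):
    """Rebuild the synthesised sequence from the pyrogram."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)

flows = pyrogram("TTAGC")
print(flows)          # first light is seen on the fourth flow: ('T', 2)
print(decode(flows))  # TTAGC
```

Flows with a signal of zero correspond to solutions whose nucleotides were not incorporated and were degraded by apyrase.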

In 454 sequencing, larger sources of DNA, such as genomic DNA and bacterial artificial chromosomes (BACs), require shearing to 300-800 bp fragments prior to sequencing, while smaller sources such as non-coding RNA can be sequenced as they are. The template DNA fragments are ‘polished’ (given blunt ends) and then ligated to short molecules called adaptors at their 3' and 5' ends. The final single-stranded template DNA (sstDNA) fragments are each hybridised to their own DNA capture bead, and the beads are emulsified in a water and oil mixture so that each bead and template is isolated from every other. This is to avoid sequence contamination between the templates. The DNA capture beads are supplied in vast molar excess of the DNA available, in order to ensure optimum binding of the DNA to the beads.

Each emulsified bead constitutes a ‘micro-reactor’ that is approximately 10 µm in diameter and contains all of the reagents required to amplify the template by PCR. This procedure is named emulsion PCR or emPCR. Amplification of the template in this way is essential in order to provide sufficient copies for pyrosequencing. The copy number per bead is usually in the order of several million. After amplification, the beads are placed in a device called a pico-titre plate which allows only one bead per well. The reagents and enzymes needed for pyrosequencing are added, and the reaction proceeds. Light emitted by the luciferase-catalysed reaction, described above, is monitored by a charge-coupled device (CCD) camera that forms part of the apparatus. The 454 technique is sometimes referred to as ‘parallel pyrosequencing’ because multiple fragments, such as all of the fragments that constitute an individual’s genome, can be sequenced simultaneously in the pico-titre plate. This method is not only quicker and cheaper than sequencing by shotgun cloning, but it also eliminates the risk of losing genetic material, as 454 sequencing is performed entirely in vitro, whereas shotgun sequencing requires the transformation of bacterial cells.

In a 2006 study, Wicker et al. tested 454 sequencing against traditional Sanger methods on its ability to accurately sequence a complex eukaryotic genome (barley). The analysis picked up some crucial flaws in the technique. For instance, 454 requires shearing the genomic DNA into many more fragments (each fragment 300-800 bp in 454 versus 2-10 kbp in Sanger sequencing), and produces much shorter reads (100-200 bp versus 800-1000 bp in Sanger). A key flaw in the 454 method is its inability to accurately sequence highly repetitive regions such as the long terminal repeats (LTRs) of retrotransposons and di-nucleotide repeats. Essentially the method for sequencing these repetitive sequences should be identical in both 454 and Sanger; however, the shorter read length of 454-sequenced templates makes pooling the sequences into contigs a far less accurate process. Nevertheless, it is clear that for whole-genome sequencing 454 is more time- and cost-effective than Sanger sequencing.

Next-generation sequencing: the Illumina (Solexa) method

Another next-generation sequencing method is Solexa, commercialised by the biotechnology company Illumina. This method enables reads of the genome, the transcriptome and even of epigenetic modifications, at single-base resolution. The high-throughput nature of the procedure enables an entire transcriptome to be profiled within 24 hours. In principle, it is similar to 454 sequencing: PCR amplification, and then sequencing by synthesis, although it uses fluorescently labelled reversible-terminator nucleotides rather than pyrosequencing. It involves three steps: preparing libraries, generating PCR-amplified clusters from the library and then sequencing the templates in the clusters.

As with 454 sequencing, longer stretches of DNA must first be fragmented. Sheared ends are repaired and then adenylated. The added adenine overhang enables adaptor molecules to be ligated to both ends of the fragment. The adenylated and adapted molecules are then purified, ready for hybridisation in a ‘flow cell’. The flow cell contains a lawn of oligonucleotides which are secured to its surface. These oligonucleotides recognise and bind to the adaptor regions of the DNA fragments, securing them in isolation on the surface. Once secured, the templates are amplified by PCR to create hundreds of copies per cluster (solid phase amplification), and the excess DNA is removed.

Sequencing primers are then annealed to every fragment in each cluster and the templates are extended one base at a time using fluorescently labelled, reversibly terminating nucleotides (not pyrosequencing): excitation with a laser causes light to be emitted, and the colour of the light indicates which base was incorporated. Because Solexa sequencing offers sequencing at single-base resolution, it can more reliably sequence otherwise difficult regions of DNA such as di- and tri-nucleotide repeats and other micro-satellite sequences. This is essential in a medical context as a significant proportion of the variation between individuals is found in variable number tandem repeats (VNTRs), where slippage of DNA polymerase accidentally inserts or deletes repeat units.

The Solexa method involves complex software that can function almost completely automatically to perform statistical analysis of the reads produced. This is important in determining where there may have been sequencing errors, by comparing the generated reads to a given reference. In the case of sequencing patient genomes, this reference might be the human reference genome: a computer compares the sequenced reads to regions of the reference to look for areas of maximum overlap.
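
The principle of placing a read where it best matches a reference can be shown with a deliberately naive example. The sketch below slides a read along the reference and counts mismatches at each offset; it ignores insertions, deletions and base-quality scores, and it is far too slow for a real genome, where indexed aligners are used instead. It is not the algorithm used by Illumina's software, and the function names and example sequences are hypothetical.

```python
def best_alignment(read, reference):
    """Return (offset, mismatches) for the best ungapped placement of `read`.

    Returns None if the read is longer than the reference.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best

def mismatch_positions(read, reference, offset):
    """Reference coordinates at which the placed read disagrees with the reference."""
    return [offset + i for i, base in enumerate(read) if base != reference[offset + i]]

reference = "ATGCGTACCTGAAGCTTAGG"
read = "ACCTGTAGC"                      # differs from the reference at one base
offset, mismatches = best_alignment(read, reference)
print(offset, mismatches)               # 6 1
print(mismatch_positions(read, reference, offset))  # [11]
```

A mismatch seen in only one read is more likely to be a sequencing error, whereas one seen consistently across many overlapping reads is more likely to be a genuine variant; this is why statistical analysis over many reads is needed.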

Next-generation sequencing: SOLiD sequencing

A third next-generation sequencing technology is SOLiD, developed by the company Applied Biosystems (ABI). It has extremely high throughput, allowing multiple experiments to be performed simultaneously, and an accuracy rate of approximately 99.94%. SOLiD sequencing has already been used with success in analysing the exome (complete exon constitution) of patients with the autosomal recessive disorder cranioectodermal dysplasia to identify heterozygous mutations in the WDR35 gene.

The procedure itself begins in a similar way to 454 sequencing: ligating adaptor molecules (called P1 and P2) to either end of the fragments, hybridising the fragments to magnetic beads and then amplifying them by emulsion PCR. After emPCR, a so-called ‘bead enrichment’ step is used to separate beads with correctly extended templates from the beads that are not bound to templates. This is done in a glycerol gradient. Next the templates have to be sequenced. They are randomly bound to a glass slide; molecules are added to the 3' end of the templates in order to facilitate this binding. Unlike the polymerase-based sequencing by synthesis used by 454 and Solexa, SOLiD sequencing is carried out by ligation and interrogates di-nucleotide pairings rather than single bases. The primers needed for sequencing hybridise to the P1 and P2 adaptor molecules. Fluorescently labelled di-nucleotide probes compete to ligate to the sequencing primer, and when one is successful it emits a particular colour of light to indicate which di-nucleotide pairing has been sequenced. After each ligation, the fluorescent label is cleaved from the end of the extended strand and a further cycle of ligation begins. When these cycles are complete, the extended strand is stripped off and a new primer, offset by one base, is annealed. This is called ‘primer resetting’ and there may be five rounds of primer resetting and ligation for a given read.
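
How a colour can stand for a di-nucleotide pairing can be illustrated with the two-base ('colour space') encoding commonly described for SOLiD. The details here go beyond what is stated above and are assumptions for illustration: four colours are shared among the sixteen possible di-nucleotides, the colours are written as the numbers 0 to 3, and the first base of the read is taken as known because it belongs to the adaptor. The function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def to_colours(seq):
    """Encode each overlapping di-nucleotide as one of four colours (0 to 3)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def from_colours(first_base, colours):
    """Decode a colour read, given the known first base."""
    bases = [first_base]
    for colour in colours:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ colour])
    return "".join(bases)

colours = to_colours("ATGGC")
print(colours)                     # [3, 1, 0, 3]
print(from_colours("A", colours))  # ATGGC
```

Because every base contributes to two adjacent colours, a genuine single-base change alters two neighbouring colours, whereas an isolated colour change is more likely to be a measurement error.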

Nanotechnology and the future of DNA sequencing

One emerging sequencing technology is the use of nanopores. The DNA is ‘threaded’ through a nanopore, and each base, because of its different shape, induces a characteristic change in the electrical current across the pore. The attraction of this technology is that it does not require the synthesis of a new strand, just the threading of the DNA molecule.