SBI Research Feature: High-Throughput DNA Sequencing Technology at OSU

In 2007, a small group of researchers and OSU’s Center for Genome Research and Biocomputing (CGRB), with help from the College of Veterinary Medicine and the Botany and Plant Pathology Department, purchased a high-throughput DNA sequencing machine called the Illumina 1G Genome Analyzer. Its main advantages over traditional sequencing methods are lower cost, greater depth of sequencing, and speed – it can sequence the entire genome of a bacterium in three days. Jeff Chang, assistant professor of Botany and Plant Pathology, is one of the users on the OSU campus. In this short web interview, Jeff describes how the Illumina sequencer works, some of its applications, and the logistics of getting started.

Jeff Chang, Assistant Professor of Botany and Plant Pathology

Can you give an overview of how the Illumina sequencing machine works?

What the Illumina machine does is something called sequencing by synthesis. It simultaneously sequences many small fragments of DNA. It typically generates approximately 6 million reads, each 36 bases long, in about 3 days, and it’s pretty inexpensive. Depending on the size, you can sequence a genome for as little as $650. It’s parallel sequencing, and as you can imagine, it is really revolutionizing genomics.

Let’s say you want to sequence a genome. You take your genome and you shear it up into small fragments. Next, you add adaptors, specific nucleotide sequences that attach to the ends of the fragments, and then you add the fragments to a flow cell. The flow cell has a sequence complementary to the adaptors, so the DNA fragments bind to it. The flow cell has eight channels, and you can sequence a different genome in each of the channels. Then you basically do a PCR, but your product is on a solid matrix, and it creates ‘clusters’ of amplified fragments of DNA.

Next, the machine adds all four bases, each labeled with a unique fluorescent color, to the clusters. If a cluster has a “T” at that position, then an “A” will be incorporated. If another cluster has a “C”, then a “G” will be incorporated on that one. For each cycle, the machine takes four pictures of each cluster, one for each color (for example yellow, green, blue and red), and based on what fluoresces, it can say that in this cluster it was an “A”, in this cluster a “T”, and so on. So now what you’ve done is sequenced one base of each DNA fragment. That’s what I mean by sequencing by synthesis. Of course, Illumina does this for every cluster in parallel. The machine repeats this process for 36 cycles and gives you a 36-nucleotide-long read for every cluster it photographed.
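The color-to-base step described above can be sketched in a few lines of Python. This is only an illustration, not the instrument's actual base-calling software: the function names, the channel order in `BASES`, and the intensity values are all hypothetical, and real base callers also correct for effects such as overlap between dye signals.

```python
# Hypothetical channel order: one intensity value per dye, in the order A, C, G, T.
BASES = "ACGT"

def call_base(intensities):
    """Return the base whose color channel is brightest for one cluster."""
    brightest = max(range(4), key=lambda i: intensities[i])
    return BASES[brightest]

def call_reads(cycles):
    """cycles[c][k] holds the four intensities for cluster k at cycle c.

    Returns one read per cluster, one base per cycle.
    """
    n_clusters = len(cycles[0])
    return ["".join(call_base(cycle[k]) for cycle in cycles)
            for k in range(n_clusters)]

# Made-up data: three cycles, two clusters.
example_cycles = [
    [(0.9, 0.1, 0.0, 0.1), (0.1, 0.2, 0.1, 0.8)],
    [(0.0, 0.7, 0.2, 0.1), (0.1, 0.1, 0.9, 0.0)],
    [(0.1, 0.0, 0.1, 0.9), (0.8, 0.1, 0.1, 0.1)],
]
reads = call_reads(example_cycles)
print(reads)
```

With 36 cycles instead of three, the same loop would yield one 36-base read per cluster.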

You can see the power of parallel sequencing – in the span of three days, you can sequence on average 6 million individual clusters per channel. There is a Flash demonstration of how it works on the Illumina Web site.
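To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. The read count and read length come from the figures quoted above; the genome sizes are illustrative assumptions for a bacterial genome, and the function name is hypothetical.

```python
def fold_coverage(n_reads, read_length, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_length / genome_size

READS_PER_CHANNEL = 6_000_000   # average clusters per channel, as quoted
READ_LENGTH = 36                # bases per read, as quoted

bases_per_channel = READS_PER_CHANNEL * READ_LENGTH
print(bases_per_channel)        # total bases from one channel

# Assumed genome sizes, for illustration only.
for genome_size in (3_000_000, 6_000_000):
    print(genome_size, fold_coverage(READS_PER_CHANNEL, READ_LENGTH, genome_size))
```

Under these assumptions a single channel yields 216 million bases, which is 36-fold average coverage of a 6-million-base-pair genome.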

What are the limitations?

There are a number of caveats with this technology. The most important is that it determines sequences in 36-nucleotide-long fragments, and you need to figure out how the fragments fit back together to make up the genome. With traditional sequencing you get fragments that are much longer, for example ~750 base pairs long. There are several sequence assembler programs that assemble these reads into even longer sequences of DNA based on a minimum overlap of 30 or so bases between reads, while still allowing for mismatches caused by sequencing error. In contrast, if you have 36-base reads and you match 30 bases, that only extends your read by 6 bases, so you can imagine that you need extremely deep sequencing to complete a genome. However, if you decrease your overlap to something less, like 11 bases, you’ll extend your reads by more, but then the probability of that shorter sequence being repeated in more than one place in the genome goes up. So the shorter the sequence you allow for overlap, the greater the chance of assembly mistakes – reads that correspond to different parts of a genome may be assembled together. Additionally, sequencing errors can really ruin short-read assembly. Right now, Illumina is claiming less than 1.5% error on reads of 24 nucleotides in length.
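The overlap trade-off can be made concrete with a small calculation. The sketch below assumes a simplified model that is not part of the interview: a genome of random, independent, equally likely bases (real genomes are not random) and a single strand. The 6-million-base-pair genome size and the function names are illustrative assumptions.

```python
def new_bases_per_read(read_length, overlap):
    """How far one read extends an assembly when `overlap` bases must match."""
    return read_length - overlap

def expected_chance_matches(genome_size, overlap):
    """Expected number of genome positions matching a given overlap by chance.

    Assumes random, independent bases, each with probability 1/4.
    """
    return genome_size / 4 ** overlap

GENOME_SIZE = 6_000_000  # assumed bacterial genome size, for illustration

for overlap in (30, 11):
    print(overlap,
          new_bases_per_read(36, overlap),
          expected_chance_matches(GENOME_SIZE, overlap))
```

Under this model, a 30-base overlap adds only 6 new bases per read but is essentially unique in the genome, whereas an 11-base overlap adds 25 new bases but is expected to occur by chance more than once in a genome of this size.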

If you work on eukaryotes such as corn or humans, for example, they tend to have more repeated DNA, which makes their genomes harder to assemble after you’ve sequenced with the Illumina machine. Microbes, on the other hand, generally have much simpler genomes. The microbes that Dan Arp (Botany and Plant Pathology) works with, for example, have genomes of about 3 million base pairs, and the genomes I work with are 6-9 million base pairs in size. Theoretically then, these genomes are easier to assemble “de novo,” in the absence of any a priori information. However, despite these ‘simpler’ genomes, we are still not able to use Illumina to completely sequence a genome. Instead, we can only generate ‘draft’ genome sequences.

What are some of the applications for the Illumina sequencer?

Illumina is most suited for ‘re-sequencing’ – sequencing genomes of organisms that are very closely related to ones with “completed” genome sequences, something referred to as a reference genome. Illumina sequencing can also be used to explore mechanisms of evolution or causes of change to an organism, such as sequence variants of an organism with a completed genome sequence. For example, Martin Schuster (Microbiology) works on Pseudomonas aeruginosa, an opportunistic pathogen that affects cystic fibrosis and burn patients. He has identified mutant versions of P. aeruginosa, and one of his goals is to find out which genes have been altered in his mutant strains that allow them to behave differently from the wild-type strain. All he needs to do is sequence the genomes of his mutants and compare their sequences to the genome sequence of the wild-type strain. He can very quickly eliminate the majority of the genes that are identical to the wild-type genome and simply search for the few that are different.
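The comparison step can be illustrated with a minimal Python sketch. It assumes the mutant sequence has already been assembled and aligned to the reference so that both strings have the same length, and it reports only single-base substitutions; insertions and deletions are not handled. The sequences and the function name are made up for illustration.

```python
def find_differences(reference, mutant):
    """Return (position, reference_base, mutant_base) for each mismatch.

    Positions are 0-based. Both sequences must already be aligned
    and of equal length.
    """
    if len(reference) != len(mutant):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, m)
            for i, (r, m) in enumerate(zip(reference, mutant))
            if r != m]

# Made-up sequences: the mutant differs from the wild type at one position.
wild_type = "ATGGCCATTG"
mutant = "ATGGCTATTG"
differences = find_differences(wild_type, mutant)
print(differences)
```

Everything that matches the reference drops out, leaving only the handful of positions worth a closer look.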

Another use that we are exploring is sequencing bacterial transcriptomes – all the genes that are being expressed under a certain condition. When you take an organism, at any given time or in a specific environment, only a certain percentage of its genes are expressed, and it’s a big challenge to determine which genes are important for survival in that particular environment. Illumina can be used to survey the differences in gene expression of a bacterium in different conditions. You simply isolate the RNA, make complementary DNA (cDNA), and sequence the cDNA fragments using the Illumina. You then align the reads back to the genome sequence and compare between treatments to find the genes that are specifically upregulated or downregulated.
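The count-and-compare idea can be sketched as follows. This is a simplified illustration, not the analysis pipeline used at OSU: it assumes the reads have already been aligned and are represented only by their start positions, the gene coordinates and read positions are made up, and the normalization (reads scaled by the total per sample, plus a pseudocount of 1) is one simple choice among many.

```python
import math

def count_reads_per_gene(read_starts, genes):
    """Count aligned reads whose start falls inside each gene.

    genes maps a gene name to (start, end) with end exclusive.
    """
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

def log2_fold_change(count_a, count_b, total_a, total_b, pseudocount=1):
    """log2 of the expression ratio (condition B over condition A)."""
    rate_a = (count_a + pseudocount) / total_a
    rate_b = (count_b + pseudocount) / total_b
    return math.log2(rate_b / rate_a)

# Hypothetical gene coordinates and aligned read start positions.
genes = {"geneA": (0, 100), "geneB": (100, 200)}
control_reads = [5, 50, 120]
treated_reads = [10, 130, 150, 160, 199]

control = count_reads_per_gene(control_reads, genes)
treated = count_reads_per_gene(treated_reads, genes)
for name in genes:
    change = log2_fold_change(control[name], treated[name],
                              len(control_reads), len(treated_reads))
    print(name, control[name], treated[name], round(change, 2))
```

A positive value suggests a gene is upregulated in the treated sample and a negative value suggests it is downregulated; with real data you would also need replicates and a statistical test before drawing conclusions.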

Illumina has many applications. Some of the heavier users of the Illumina at OSU use it to profile small RNAs of plants, characterize RNA metabolism, empirically annotate genomes, and examine genome evolution. In each of these cases, the organism of study has a very good reference genome.

What is the focus of your research?

My group is interested in understanding the mechanisms by which bacteria establish symbiotic relationships with their plant hosts. Specifically, we focus on something called the type III secretion system and type III effectors. These effector proteins are injected directly into host cells by bacteria and reprogram the host to make a more hospitable environment for the bacterium. We are using genomic methods to identify type III effectors and to understand what they do to host cells. 

The Illumina sequencer produces so much data – what are some of the computational needs for working with it?

The data derived from Illumina sequencing is computationally demanding. The amount of data that comes out from just one flow cell is more than one terabyte – remember, the machine takes 1200 pictures per channel per cycle. We have four terabytes of storage space. Analyzing all this data also demands a lot of RAM, so we have two machines with dual quad-core processors and 16 gigs of RAM each, and I fear that may still be insufficient.

So a very important message is that using Illumina sequencing, for whatever application, requires the necessary computational hardware and, more importantly, the right personnel. Groups that are interested in using Illumina will need to either develop their own computational expertise or find a collaborator who can help them with the analyses. The CGRB maintains the computational hardware for us, and they have been an excellent resource in helping us get started.

Who is currently using the sequencer and how are the logistics set up?

Right now at OSU, there is a small core of heavy users, composed mostly of the groups that helped purchase the machine. Faculty members include Jim Carrington (Botany and Plant Pathology), Dee Denver (Zoology), Todd Mockler (Botany and Plant Pathology), Michael Freitag (Biochemistry and Biophysics), Erica Bakker (Horticulture), and me. In addition, a number of other groups at OSU are starting to explore the use of Illumina.

The CGRB maintains and operates the Illumina machine. Jim Carrington is the director of the CGRB, and the computational support is provided by Scott Givan, Chris Sullivan, and Steve Drake. They work closely with Mark Dasenko, who operates the Illumina machine. Scheduling of machine time is done through the CGRB Web site.

Each group is responsible for preparing their sample and each of the different applications requires a specific kind of preparation. Mark Dasenko then takes the sample, adds it to the flow cell, builds the clusters and then sequences it. Steve, Scott and Chris do some of the preliminary analyses on each run. After that, the individual groups are responsible for analyzing their own data. Chris and Scott are good people to talk to about hardware needs and Scott is the person to talk to if a group has computational (software) questions. There is also a high-throughput sequencing group that has a listserv. Groups that are interested in the Illumina shouldn’t hesitate to contact any of us if they have any questions.