by T. Chris Boles and Jonas Korlach




Long Reads and the Return of Reference Genomes


The availability of inexpensive, massively parallel short-read sequencers shifted the focus of high-quality reference genome production to resequencing and draft genome assemblies, emphasizing quantity over quality. Long-read sequencing, however, now makes it possible to achieve both quantity and quality.


Reference genomes were originally deemed a necessary tool to study an organism or family of organisms, but many scientists held that the production of a single reference assembly would be sufficient for any given species. As the field of genomics has matured, however, researchers have realized that these polished assemblies are critical for applications ranging from personalized medicine to comparative genomics. Current efforts to produce population-specific reference assemblies for the human genome are expected to reveal a significant amount of previously undetected natural genetic variation among individuals. Similarly, scientists are moving beyond the days of aligning, for instance, goat sequence to a bovine assembly for lack of a proper reference. They are now generating multiple reference-grade assemblies for agriculturally important plants and animals as a foundation for efforts to improve the health, hardiness and yield of these species.


These reference assemblies are providing critical new information about underlying biology, genetic mechanisms of interest and more. Scientists involved in precision medicine contend that eventually each person will be his or her own best reference: that comparing each of us to a third-party reference assembly, even a high-quality one, will be less effective for targeting treatments and understanding disease than sequencing every individual to reference quality.


The challenge with short-read sequence data lies in mapping and aligning the reads. Genomes of all sizes include highly repetitive regions, often hundreds or thousands of bases in length. Because short reads are typically shorter than 500 bases, they cannot span these challenging regions. These regions, which may have great importance for understanding disease or other phenotypes of interest, confound the process of assembling short reads and collapse into themselves in the final assembly. Pseudogenes may not be distinguishable from the genes they mirror, and even short stretches of high-identity sequence can make it impossible for reads to be mapped accurately.
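The mapping ambiguity described above can be illustrated with a toy example. The sketch below is not a real aligner: the sequences, lengths and the function name `count_placements` are made up for illustration, and only exact matches are counted. It shows that a read drawn entirely from a repeated element matches at more than one position, while a read that also covers unique flanking sequence matches at only one.

```python
def count_placements(genome: str, read: str) -> int:
    """Count the positions at which `read` matches `genome` exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


# Hypothetical 20-base element that appears twice in a toy genome.
repeat = "ACGTTGCAGGCTAAGCTTAC"
genome = "GGATCCTA" + repeat + "TTAGGCAT" + repeat + "CCGATAGC"

# A "short" read taken entirely from inside the repeated element.
short_read = repeat[5:15]

# A "long" read covering the first copy plus 4 unique bases on each side.
long_read = genome[4:32]

print(count_placements(genome, short_read))  # 2: placement is ambiguous
print(count_placements(genome, long_read))   # 1: placement is unique
```

Real genomes and real aligners tolerate mismatches and sequencing errors, so this is only a picture of the principle: uniqueness comes from the flanking sequence that a longer read carries with it.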


In contrast, long-read sequencing can produce reads that are tens of kilobases in length, fully spanning these difficult regions or including enough unique sequence information to facilitate accurate mapping. Scientists studying microbial genomes with long-read sequencing often produce fully closed assemblies on their first try, with the organism genome in a single piece and the accessory genome represented as well. In more complex genomes, researchers have used long-read sequencing to generate the highest-quality assemblies, many times filling gaps in existing reference genomes. Long-read sequencing has produced the most contiguous assemblies ever generated, providing an essential resource for scientists.


AMERICAN LABORATORY 42


To maximize the information obtained from long-read sequencing, scientists have paired it with automated DNA size selection. This step removes smaller fragments from libraries, allowing sequencers to work with the longest fragments and produce longer reads. By combining these technologies, researchers have demonstrated the ability to significantly increase the average DNA fragment lengths sequenced, leading to even higher-quality assemblies.
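The effect of size selection on a library can be sketched in a few lines. The fragment lengths and the 7,000-base cutoff below are invented for illustration and do not describe any particular instrument or protocol; the function name `size_select` is likewise hypothetical.

```python
from statistics import mean


def size_select(fragment_lengths, min_length):
    """Keep only fragments at or above the cutoff, as a size-selection step would."""
    return [length for length in fragment_lengths if length >= min_length]


# Hypothetical library: fragment lengths in bases (illustrative values only).
library = [800, 1500, 3000, 7000, 12000, 20000, 25000]

selected = size_select(library, min_length=7000)

print(mean(library))   # 9900
print(mean(selected))  # 16000
```

Removing the three shortest fragments raises the mean length of this toy library from 9,900 to 16,000 bases, which is the kind of shift the paragraph above describes.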


Building better assemblies


Loomis and Eid et al.1 presented the first known sequence of the gene responsible for fragile X syndrome, a repeat expansion disorder. Previously intractable with short-read sequencers, the gene and its hundreds of triplet repeats were fully sequenced with Single Molecule, Real-Time (SMRT) Sequencing from Pacific Biosciences (Menlo Park, Calif.). The authors noted that getting an accurate repeat count is critical for patient prognosis: having more than 200 repeats causes fragile X syndrome, while 55–200 copies are indicative of a related but different syndrome.
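To show why an exact repeat count matters, the thresholds quoted above can be written as a small classifier. This is a simplified sketch and not the authors' method. It assumes the repeated triplet is CGG, the triplet associated with the fragile X gene, and that the read covers the whole repeat tract without interruptions or sequencing errors. The function names are hypothetical.

```python
import re


def longest_triplet_run(sequence: str, triplet: str = "CGG") -> int:
    """Return the number of copies in the longest uninterrupted run of `triplet`."""
    pattern = f"(?:{re.escape(triplet)})+"
    runs = re.findall(pattern, sequence.upper())
    return max((len(run) // len(triplet) for run in runs), default=0)


def classify_repeat_count(count: int) -> str:
    """Apply the thresholds quoted in the text above."""
    if count > 200:
        return "more than 200 repeats: fragile X syndrome range"
    if count >= 55:
        return "55-200 repeats: range of the related syndrome"
    return "fewer than 55 repeats: below both ranges"


# Hypothetical read: unique flanks around 60 copies of the triplet.
read = "AT" + "CGG" * 60 + "TTA"
copies = longest_triplet_run(read)
print(copies, "->", classify_repeat_count(copies))
```

A read that ends partway through the tract would undercount the copies, which is why a read spanning the full expansion is needed before thresholds like these can be applied.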


In another example, scientists from the Icahn School of Medicine at Mount Sinai (New York, N.Y.) determined that many more structural variants could be detected in the human genome from long-read data than from short-read data.2 These elements are important for human health, especially in interrogating cancer. The findings suggest that estimates of structural variation based on short-read data alone may significantly underrepresent the actual variation across a genome.


In a large study of microbes, scientists at the U.S. Department of Agriculture (Washington, D.C.) and the National Biodefense Analysis and Countermeasures Center (Frederick, Md.) reported that long-read sequencing has made it possible to produce finished microbial genome assemblies in an automated pipeline.3 Based on a comprehensive


MARCH 2016

