Navigating the sea of genes

Few areas of study have seen their use of data expand as much as in the field of genetics, explains

Felix Grant

Over my working life, statistical data analysis has expanded explosively in significance across every area of scientific endeavour. In the last couple of decades, computerised methods have ceased to be an aid and become, instead, simply the way in which statistical data analysis is done. Partly as a result, and partly as a driver, data set sizes in many fields have grown steadily. As with every successful application of technology, a ratchet effect has kicked in and data volumes in many fields have reached the point where manual methods are unthinkable.

Some areas of study have seen their data mushroom more than others. Of those, few can match the expansion found in genetics, a field which has itself burgeoned alongside scientific computing over a similar time period. The drive to map whole genomes, in particular, generates data by the metaphorical ship load; an IT manager at one university quipped that 'it's called the selfish gene for a reason and not the one Dawkins gave: it selfishly consumes as much computational capacity as I can allocate and then wants more.'

What is called Next Generation Sequencing (NGS) is speeding up to a dizzying degree what was, not so long ago, the heroically laborious business of transcribing a genome. Sequencing was once a matter for representative model genotypes; now the time is fast approaching when it will, in principle at least, be available for the study of any and every individual organism. That fact is a fundamental game changer not only for genomic and genetic studies, but for the data analyses which inform them. Both statistical tools and the computing platforms that support them must accept that their future is one of continuing exponential growth in the quantities of data they handle. The tiny sample


Comparisons of proteomes of human alveolar epithelial cells infected with nonstructural protein NS1-GFP hRSV and inserted gene WT-GFP hRSV, by two-dimensional differential gel electrophoresis. The numbered red spots indicate proteins whose abundance appeared to increase in response to WT-GFP hRSV compared with NS1-GFP hRSV, and the converse is true for numbered green spots. (From Hastie et al[1])


sizes that made up the bedrock of my training and professional life are still being used to educate the next generation of statisticians, but bear no relation to present analytic reality.

Nor is sheer size the only issue: in any such period of rapid development, heterogeneity is both a blessing and a curse. Different analytic tools and approaches are generated by those involved in specific tasks, then taken up for development, modification and adaptation by those with slightly different needs. Studies are likewise designed to suit the requirements of particular enquiries. The combination of new or altered procedures and variant study designs produces a huge and ever-growing ocean of information content, enriching the primordial soup from which the most productive methods will evolve, cross-fertilise and stabilise; but it also produces a messy landscape in which direct comparison of different results is often difficult. Somewhere in that soup lie the answers to an unimaginable spectrum of questions, asked or as yet unframed, but the data analyst must first fish out the relevant components and then figure out how to make them work together. As IDBS' Robin Munro points out (see box: Struggling to keep pace), 'good quality metadata management is vital.'

Scale and diversity have required the development of new analytic, meta-analytic and platform technologies of various kinds. IDBS provides, in Munro's words, ways 'to ensure well managed data and results as well as orchestration of industry standard genetic and genomic tools.' Thermo Fisher Scientific's Proteome Discoverer offers the opportunity to automate large parts of the proteomics (that part of genetics which studies the complete protein set encoded by a genome) informational management and analysis loop – not necessarily doing analyses, but choreographing them in time with, among other things, data search, acquisition and management. Companies like SGI provide hardware that can run analyses in
