
statistical tools in genetics

Profiling cellular response to GSK1120212 (from Gilmartin et al)[13]

a single ‘on chip’ memory space of 16 terabytes, on in excess of 2,500 computing cores, to accommodate the need for completing ever-larger analyses within viable timescales. More and more analyses are being approached in the broad bandwidth, high volume manner that this collateral ballooning of data sets and scientific computing technologies makes possible. Hastie et al describe[1] how, in their investigation of respiratory syncytial virus, ‘tandem mass spectra... searches were submitted... using Proteome Discoverer.’

Not every genetic study is drowning in immensity, however, and genome mapping is only one end of an investigation spectrum which also encompasses the operation of individual genes. Much valuable data collection and analysis, though it inevitably becomes part of the larger ocean, is done in much smaller tributary contexts and is analysed in established standard desktop software. This is especially so when it is part of experimental work on closely focused topics, often linking genetic and environmental or other effects. Dipping into a pile of recent papers, for instance, I find a range of data analyses underpinning genetic links to macro concerns from agriculture to oncology conducted in generic software statistics tools. Statistical analyses in a recent investigation[2] of the periplasmic chaperone role played by HdeA and HdeB genes in acid tolerance of shiga toxin producing E coli, to take one example, were conducted using Systat’s widely popular generic SigmaPlot software. Deletion of these genes was found to reduce acid survival rates by two or three orders of magnitude in various haemorrhagic strains, but not in strain O157:H7 serotype where, by contrast, loss of hdeB had no effect and hdeA produced only half an order of magnitude effect. A point mutation which altered the subsequent sequence seemed to be the key to this divergent evolution.

VSNi’s GenStat has specific provision (as an extension of generic mixed models) for quantitative trait loci analysis and other genetics-centred analyses, as well as a strong life sciences history in general and agriculture in particular. It’s not, therefore, surprising to find it well represented in areas as diverse as gene expression under varying phosphorus levels[3] and germination timing[4] in brassicæ, polymorphism and intramuscular fat[5] in pigs, heterosis[6] in maize, blood and lymphoblastoid cell lines[7] in humans, or rain damage resistance[8] in Australian strawberries. ASReml (a separate product also from VSNi) is also represented, often in the same studies, although my personal favourite was one[9] that attempts to disentangle heritable from learned antipredator behaviour components.

An important focus, particularly in agricultural genetics, is on mapping of marker loci in the DNA sequence onto trait variations in the phenotype. Association mapping (AM),

www.scientific-computing.com

Struggling to keep pace

Given the advances in genomic sequencing technology, genetic analysis and the management of the resulting data have been struggling to keep pace. Recently a plethora of statistical tools for genetic analysis has come to market, along with a growing understanding that good data management is crucial. Open-source tools, together with some proprietary algorithms from commercial vendors, are the basis for how genetics data is generated and analysed using Next Generation Sequencing (NGS). Notable mentions include the SOAP set of tools from BGI, GATK from the Broad Institute and CASAVA from Illumina. These tools are configured to achieve the best results for a given study, which can introduce variation when trying to analyse data consistently across studies.

This is why most of the NGS technology vendors now offer streamlined ways of analysing primary data with freely available tools, onsite or in cloud environments. For example, Ion Reporter from Life Technologies and BaseSpace from Illumina offer cloud access to their users as part of their services. Sequence service providers like BGI and Complete Genomics also support this model. These offerings aim to control the data, metadata and information, allowing a more standardised approach. However, one still needs to be able to interpret the mass of data. Genetic data analysis presents problems, not least around the size of the data that can be generated, and good quality metadata management is vital for all studies. The data itself is often collected, but information about the experimental design and the sequenced individuals is sometimes lacking or incomplete, which can cause problems.

Getting to the value in the data by being able to perform tertiary analysis and meta-analysis, allowing for interpretation, is the key. Also essential is providing an overview of cohort identification for patients with a given set of variants so that scientists can then look at comparison groups of patients to determine statistical significance, as in a case control study.

Part of the solution to these problems is providing a good foundation in data and results management: knowing where data comes from, what a statistician did and which parameters were used for variant calling. These can all vary depending on the type of tools used and the way in which they are used. It is vital that these workflows and results are properly managed.
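As a rough sketch of how the two ideas above might fit together, the Python below runs a case-control comparison for carriers of a variant set and stores the result alongside a record of where the data came from and which parameters were used. Everything in it is an assumption made for illustration: the counts are invented, the file name, analyst and version strings are hypothetical, `AnalysisRecord` is not part of any product mentioned here, and Fisher's exact test is simply one common choice for a small 2x2 table rather than a method this article prescribes.

```python
import json
from dataclasses import dataclass, asdict, field
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: cases / controls. Columns: carriers / non-carriers of the variant set.
    """
    cases, controls, carriers = a + b, c + d, a + c
    total = comb(cases + controls, carriers)

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k carriers among the cases.
        return comb(cases, k) * comb(controls, carriers - k) / total

    observed = prob(a)
    lowest = max(0, carriers - controls)
    highest = min(cases, carriers)
    # Sum every possible table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


@dataclass
class AnalysisRecord:
    """Hypothetical provenance record: data origin, tool, parameters, result."""
    source_files: list
    tool: str
    tool_version: str
    parameters: dict
    analyst: str
    result: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Invented counts: 3 of 4 cases and 1 of 4 controls carry the variant set.
counts = {"case_carriers": 3, "case_noncarriers": 1,
          "control_carriers": 1, "control_noncarriers": 3}
p_value = fisher_exact_two_sided(*counts.values())

record = AnalysisRecord(
    source_files=["cohort_A.vcf"],          # hypothetical file name
    tool="fisher_exact_two_sided",
    tool_version="0.1-sketch",              # hypothetical version label
    parameters={"alternative": "two-sided", **counts},
    analyst="example-analyst",              # hypothetical analyst id
    result={"p_value": p_value},
)
print(record.to_json())
```

Serialising the record as sorted JSON keeps it easy to compare between runs, which is one simple way to notice when a parameter has changed from one analysis to the next.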

Content by Robin Munro, Translational Medicine Solutions director with IDBS

JUNE/JULY 2012 13
