Big data opens new avenues for genomics research


Scientific Computing World, October/November 2018

What was the earliest example of digital data? The answer is surprisingly clear, and probably earlier than you think. Some 30,000 years ago, in the Palaeolithic era, someone put 57 scratches on a wolf bone. These are arranged in groups of five, making the first known example of something that is still used from time to time: a tally stick. It is also an example of digital data.

For almost all of human history, however, such data was scarce, relatively few people handled it, and it was easy to manage. Only half a century ago, the Moon landings were controlled by banks of computers that together held less data than one of today’s iPhones; more than 200 million of these were sold in 2017 alone.

So we are now in an age of big data, but what makes it big? The term was coined in 2006, but five years earlier an analyst called Doug Laney had described the growing complexity of data through ‘three Vs’: volume, velocity and variety. That is, there is a lot of data, it comes in many different kinds, and it is growing very fast. Two more Vs have been added since: veracity (the data can be verified) and value (it can, and should, be useful).

The 57 scratches on that Palaeolithic wolf bone form one byte of data, as does any integer up to 255. If you are interested enough to be reading this, you probably own at least one device with a hard drive of at least 1TB capacity. You could fit one trillion virtual wolf bones, or, only just less implausibly, almost 20,000 copies of the complete works of Shakespeare onto such a drive. Scientific data is a few orders of magnitude further on. Within the sciences, particle physics sits at the top of the data league: the data centre at CERN, home of the Large Hadron Collider, processes about 1PB (1,000TB, or 20 million Shakespeares) of data every day. Biomedicine may be some way behind, but it is catching up fast, driven first and foremost by genomics.
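The arithmetic above is easy to check. A few lines of Python, using decimal units and the article's implied figure of roughly 50MB per ‘complete works of Shakespeare’ (1TB ÷ 20,000 — an assumption derived from the text, not a measured file size), reproduce the numbers:

```python
# Back-of-envelope check of the figures above (decimal units).
TB = 10**12   # terabyte, in bytes
PB = 10**15   # petabyte, in bytes

wolf_bone = 1                # 57 tally marks fit in a single byte (max value 255)
shakespeare = TB // 20_000   # ~50 MB per copy, implied by the article's 20,000-per-TB figure

print(TB // wolf_bone)     # one trillion "wolf bones" per 1TB drive
print(TB // shakespeare)   # ~20,000 Shakespeares per TB
print(PB // shakespeare)   # ~20 million Shakespeares per PB (CERN's daily volume)
```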

Sequencing the human genome

The first human genome sequence was completed in 2003, after 15 years’ research and an investment of about $3bn. The same task today takes less than a day and costs less than a thousand dollars. There is probably no tally of the number of human genomes now known, but Genomics England’s project to sequence 100,000 genomes from people with cancer and rare diseases, and their close relatives, within a few years gives an idea of what is now possible. The raw sequence data for a single genome occupies about 30GB (30 × 10⁹ bytes) of storage, and the processed data about 1GB, so you could in theory fit a thousand genomes on your home hard drive (albeit with little space left for anything else).

“Data analysis on the fly is commonly used in some disciplines, including particle physics and crystallography, but it is only recently becoming common in bioinformatics, which is a younger science”
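The storage figures in that paragraph can be sketched in the same back-of-envelope style (decimal units; the 100,000-genome multiplication is an extrapolation from the article's per-genome figure, not a number the article states):

```python
# Storage arithmetic for genome sequence data (decimal units).
GB, TB, PB = 10**9, 10**12, 10**15

raw_genome = 30 * GB        # raw sequence data per genome
processed_genome = 1 * GB   # processed data per genome
drive = 1 * TB              # a typical home hard drive

print(drive // processed_genome)   # 1,000 processed genomes fit on 1TB
print(drive // raw_genome)         # but only ~33 raw genomes
print(100_000 * raw_genome // PB)  # raw data for 100,000 genomes: ~3PB
```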

About a third of that first human genome was sequenced at the Wellcome Trust Sanger Institute, south of Cambridge. This is now one of the largest repositories of gene sequences and associated data in the world, and possibly the largest in Europe. Data in the form of DNA sequences – As, Cs, Gs and Ts – pours off its sequencers at an unprecedented rate, initially destined for the Sanger Institute’s private cloud and massive data centre. This centre was extended from three ‘quadrants’ to four in the summer of 2018, giving the Institute 50PB of storage space. ‘We now generate raw data at the rate of about 6PB per year, but even keeping scratch space free for processing, we should have enough capacity for the next few years,’ says the Institute’s director of ICT, Paul Woobey.

The data is managed using iRODS, an open-source data management system that is becoming a favourite of research funding bodies in the UK and elsewhere. ‘One benefit of iRODS is that it is highly queryable,’ adds Woobey. ‘This makes it easy to locate, for instance, all the data produced by a particular sequencer on a particular day.’

By itself, DNA sequence data means very little; like almost any form of data, it only becomes meaningful once it is analysed. Much of this analysis is done in-house, but many researchers worldwide need access to the Sanger data.
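As a rough illustration of what ‘highly queryable’ buys you – this is a schematic analogy using SQLite, not iRODS’s actual catalogue schema or query interface, and the table, column names and file paths are invented for the example – the kind of lookup Woobey describes amounts to:

```python
import sqlite3

# Toy stand-in for a metadata catalogue of the kind iRODS maintains.
# Real iRODS deployments query through iRODS's own interfaces rather
# than raw SQL; this only sketches the principle.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_objects (
        path      TEXT,   -- logical path of the file
        sequencer TEXT,   -- instrument that produced it
        run_date  TEXT    -- date of the sequencing run
    )
""")
conn.executemany(
    "INSERT INTO data_objects VALUES (?, ?, ?)",
    [
        ("/seq/run1/sample_a.cram", "HiSeq-01", "2018-06-14"),
        ("/seq/run1/sample_b.cram", "HiSeq-01", "2018-06-14"),
        ("/seq/run2/sample_c.cram", "NovaSeq-02", "2018-06-14"),
    ],
)

# "All the data produced by a particular sequencer on a particular day":
rows = conn.execute(
    "SELECT path FROM data_objects WHERE sequencer = ? AND run_date = ?",
    ("HiSeq-01", "2018-06-14"),
).fetchall()
print([r[0] for r in rows])
```

Because every file is registered with searchable metadata at the moment it lands, locating a day’s output from one instrument is a catalogue query rather than a crawl over petabytes of storage.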

