HPC APPLICATIONS: BIOMEDICINE





on a desktop computer, but you really don’t get the same sense of immersion and you don’t get the big picture of what’s going on,’ Borcherding continues. ‘3D visualisation allows the scientist to see spatial relationships, which is critical when talking about these types of datasets.’ The pixel density of the Cave allows the large datasets acquired to be displayed and analysed effectively.


Advances in genomics

Biomedical research and personalised medicine are becoming more dependent on technologies that can quickly sequence large portions of a genome. As the resolution of next-generation sequencing instruments increases, so does the size of the data sets – an Illumina high-throughput DNA sequencer, for example, can produce around 1 to 2TB of raw data per run. The European Bioinformatics Institute (EBI), based at the Wellcome Trust Genome Campus in the UK, runs a 200-gigabyte database of flat files on protein and genomic sequencing, which is expected to double in file size every year simply due to the scale of sequencing data being produced.
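To put the doubling figure in perspective, the short Python sketch below projects the size of a 200GB database that doubles every year. The function name and the five-year horizon are illustrative assumptions, not figures from EBI, and the calculation simply assumes the stated growth rate holds steady.

```python
def projected_size_gb(initial_gb: float, years: int, growth_factor: float = 2.0) -> float:
    """Return the size after `years` of compounding annual growth.

    growth_factor=2.0 corresponds to 'doubling every year'.
    """
    return initial_gb * growth_factor ** years


if __name__ == "__main__":
    # 200 GB is the starting size quoted in the article; the horizon is arbitrary.
    for year in range(6):
        size = projected_size_gb(200.0, year)
        print(f"year {year}: {size:,.0f} GB ({size / 1000:.1f} TB)")
```

Under that assumption the database passes 1TB within three years, which is why growth rate, rather than current size, tends to drive the infrastructure decisions.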


The database is made available to the research community, and EBI implemented a grid solution using Platform MultiCluster to control and coordinate jobs across multiple geographic locations. The solution allowed EBI to connect its own infrastructure with other international research organisations. In addition to increases in the resolution of next-gen sequencing, the cost is coming down (currently around $10,000 to sequence a complete human genome). However, this needs to drop further before next-gen sequencing can become a more widely used tool, with $1,000 considered the price at which the technique could have clinical applications.


Researchers at the University of Illinois Urbana-Champaign (UIUC) are using supercomputing to design nanopore gene sequencers that, if successful, would have the potential to sequence a human genome for less than $1,000. The sequencers use an electric field to drive a strand of DNA through a nanopore, either in silicon or a biological membrane. In theory, the sequencer would then be able to read each base pair in order by measuring the change in ionic current as the DNA moves through the membrane. But experimental design of




Storing genomics data


Storing the volumes of data associated with next-gen sequencing, which can reach terabytes per sequencer run, can be problematic, and organisations will invest in expensive storage solutions specifically for this purpose. The Friedrich Miescher Institute (FMI) for Biomedical Research, based in Basel, Switzerland and part of the Novartis Research Foundation, conducts fundamental biomedical research in various fields, including oncology, epigenetics and neurobiology. The institute runs next-gen sequencers, and its strategy for storing the data generated is to keep only the post-processed data online, which is often one tenth of the original raw data size. The raw data sets are archived or deleted, as they will only be needed in the unlikely event of a problem with the original processing. ‘Some people will try and keep all the data online all the time,’ comments Dean Flanders, CIO at FMI. ‘This is a losing proposition, because the data sets will keep getting larger.’ FMI uses a SAM-FS file system from Sun Microsystems, which archives frequently accessed data to Nexsan storage arrays. This is then automatically backed up into Spectra Logic’s T950 tape libraries. Once the post-processed data is analysed and known to be accurate, the original raw data is often deleted




these sequencers is difficult, with researchers experiencing blockages in the pores or base pairs passing through too quickly, as well as noisy signals.


The researchers are running molecular dynamics simulations on the Ranger supercomputer at the Texas Advanced Computing Center (TACC) to produce atom-by-atom models of both experimental and untested nanopore designs. The models provide insights into how to optimise the design of the nanopore.


‘The difference between A, C, G and T nucleotides is just a few atoms, literally between four and eight,’ says Dr Aleksei Aksimentiev, a computational physicist


SCIENTIFIC COMPUTING WORLD OCTOBER/NOVEMBER 2010


– or, in certain circumstances, stored on tape. ‘In the majority of cases, it’s cheaper to re-run the experiments than to keep these raw data sets, because they’re so large,’ explains Flanders.


Flanders feels that using expensive disk-based petabyte storage systems simply to dump all the data in is ineffective, unnecessary and potentially dangerous. ‘You don’t need to spend millions of dollars to build data systems to handle sequence data,’ Flanders says. ‘The systems can be much less expensive if you decide what data to throw away and what data to store on tape. It’s such a waste of money to keep raw data on disk; it’s data that will probably never be looked at again, so why spend the installation and maintenance cost to keep this data online?

‘Also, the data volumes have become so large they can’t be backed up,’ he continues. ‘Therefore, there is the potential of losing not only the raw data but the data that is important, because it’s all intermixed.’

FMI’s system, which uses Spectra Logic’s tape libraries, is scalable to many petabytes. Flanders advises other institutes involved in gene sequencing and biomedical research to think carefully about their IT architecture and not just dump data into a large disk-based file system.
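The arithmetic behind this strategy can be sketched in a few lines of Python. The example takes the figures quoted in this article (up to 2TB of raw data per run, with post-processed data at roughly one tenth of the raw size) and estimates how much would need to stay online. The number of runs and the function name are hypothetical, chosen only for illustration, and are not FMI's actual figures.

```python
def storage_split_tb(raw_tb_per_run: float, runs: int,
                     processed_fraction: float = 0.1) -> tuple[float, float]:
    """Return (online_tb, offline_tb) for a set of sequencer runs.

    online_tb  : post-processed data kept on disk
    offline_tb : raw-data volume that is archived to tape or deleted
    processed_fraction=0.1 reflects the 'often one tenth' figure in the text.
    """
    raw_total = raw_tb_per_run * runs
    online = raw_total * processed_fraction
    return online, raw_total


if __name__ == "__main__":
    # 100 runs is an assumed workload, not a number from the article.
    online, offline = storage_split_tb(raw_tb_per_run=2.0, runs=100)
    print(f"kept online: {online:.0f} TB, archived or deleted: {offline:.0f} TB")
```

On these assumptions, a hundred runs would leave about 20TB on disk while 200TB of raw data goes to tape or is discarded, which illustrates why keeping everything online quickly becomes the more expensive option.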


leading the project. ‘So you have to have all-atom resolution, you have to get the physics right, and you have to simulate for a long time.’ The work requires a lot of computational power; last year the project required more than seven million hours on Ranger, TACC’s most powerful system.

It was discovered that the key requirement for sequencing the DNA is to hold the strand in position in the pore for long enough to get an accurate read. ‘If the DNA moves too fast, then one cannot read out the signal to distinguish the difference between the base pairs,’ Aksimentiev says. ‘We have to find a way to trap the DNA.’ By stretching DNA with an electric field, the strands fit into a pore smaller than their unstretched diameter. Turning off the field traps the DNA in the hole. Then, by pulsing the field, stretching and relaxing the DNA, the strand moves base-by-base through the pore. The most recent simulations aim to improve the functioning of the double-stranded sequencer through variations in its pore geometry and the concentration of electrolytes.
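One way to see why dwell time matters is a simple statistical sketch, which is not the UIUC group's model. If the ionic current is sampled repeatedly while a base sits in the pore, and the noise on each sample is assumed to be independent, the uncertainty in the averaged reading falls with the square root of the number of samples. The Python below uses that textbook relationship; every numerical value, the function names and the three-standard-error threshold are hypothetical choices for illustration, and real nanopore noise is likely to be correlated and more complex.

```python
import math


def standard_error(noise_sd: float, dwell_time_s: float, sample_rate_hz: float) -> float:
    """Uncertainty of the mean current while one base occupies the pore.

    Assumes independent noise samples (a simplification).
    """
    n_samples = max(1, round(dwell_time_s * sample_rate_hz))
    return noise_sd / math.sqrt(n_samples)


def distinguishable(level_gap: float, noise_sd: float, dwell_time_s: float,
                    sample_rate_hz: float, z: float = 3.0) -> bool:
    """True if two current levels differ by more than z standard errors."""
    return level_gap > z * standard_error(noise_sd, dwell_time_s, sample_rate_hz)


if __name__ == "__main__":
    # Arbitrary units and made-up values: gap between two bases' current
    # levels = 5, per-sample noise = 10, sampling at 100 kHz.
    for dwell in (1e-5, 1e-4, 1e-3):
        ok = distinguishable(5.0, 10.0, dwell, 1e5)
        print(f"dwell {dwell:.0e} s -> distinguishable: {ok}")
```

In this toy setting the two levels only separate once the base lingers for about a millisecond, which mirrors the qualitative point in the text: slowing or trapping the strand buys the time needed to average out noise.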


www.scientific-computing.com

