Conference Review
Informatics for the Sciences at Molecular Med Tri-Con 2014: Show Me the Data!
by Robert L. Stevenson
Data management is the current choke point in many scientific endeavors. New workflows enabled by advances in hardware and software are making big data a reality for many labs and staff. To me, big data starts when the volume of data exceeds my capacity to manage it in my head. My Ph.D. dissertation involved synthesis and characterization of about 30 new chelating agents. This was manageable, but if it had been 300 new compounds, I'd never have made sense of them. Any hope of insightful analysis would have exceeded my limited cranial bandwidth.
Big data

In September 2010, the number of chemicals registered in Chemical Abstracts was 55 million, up from 4 million when I was in graduate school in the mid-1960s. In our data-driven world, data are proliferating with a doubling time of less than two years. Computer processor performance improves by about 60% per year. Sure, a good part of these macro measures of growth may be due to espionage and brute-force attempts to improve public safety, but one only has to look at how Amazon and Google mine data for commercial advantage. Data mining combined with analytics is being used to record, and then engineer, consumer behavior, the most important driver of our economy.
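A quick back-of-the-envelope check (a Python sketch, assuming roughly 45 years between the mid-1960s and September 2010) shows what the Chemical Abstracts figures imply about doubling time:

    import math

    # Implied doubling time of the Chemical Abstracts registry, assuming
    # ~45 years between the mid-1960s (4 million entries) and 2010 (55 million).
    start, end, years = 4e6, 55e6, 45
    doublings = math.log2(end / start)                       # ~3.8 doublings
    print(f"doubling time: {years / doublings:.1f} years")   # ~11.9 years

At roughly 12 years per doubling, even the chemical registry grows far more slowly than data overall, which underscores how aggressive a sub-two-year doubling time really is.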
So why big data now? The simple answer: people now have tools that are starting to enable the process. Plus there is a need, since some data may remain relevant for several decades. FDA inspectors ask, "Is the product you made today the same as that licensed in 19XX? Show me the data…." Drug regulation is a huge, data-intense endeavor that continues for the lifetime of the product.
Spotlight on bioinformatics/genomics

Three tracks at the Molecular Med Tri-Con 2014 provided a forum to record successes and spotlight problems. The particular focus was bioinformatics, especially genomics, but several speakers noted that "data are just data." Domain-specific knowledge provides the context and is thus the key differentiator. Ontologies are essential to understanding context.
Labs are data generators. A report from Cycle Computing (New York, NY) described a high-throughput screen of a 205,000-member library of candidate semiconductors for potential use in photovoltaics. Data analysis was complex, requiring integration of many data sources. Using the cloud, the team assembled a network of 156,000 microprocessor cores. Throughput was 1.21 petaflops, which compressed 264 years of computing into 18 hours, for a total computing cost of only $33,000.
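The headline numbers are self-consistent, as a rough check shows (this sketch assumes "264 years" means 264 single-core years of work):

    # Back-of-the-envelope check of the Cycle Computing figures:
    # 264 single-core years of work finished in 18 hours on 156,000 cores.
    HOURS_PER_YEAR = 8766                      # 365.25 days
    core_hours = 264 * HOURS_PER_YEAR          # ~2.31 million core-hours of work
    wall_clock_hours = 18

    effective_cores = core_hours / wall_clock_hours
    print(f"effective cores: {effective_cores:,.0f}")               # ~128,600
    print(f"implied utilization: {effective_cores / 156_000:.0%}")  # ~82%

An implied utilization in the low 80% range is quite plausible for a cloud run of that scale.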
Life science labs generate huge data files, but electronic medical records, including whole-genome sequences for individual patients, will be even larger. Where to store the data? The first response is in the cloud, which promises to reduce costs by sharing servers. Dr. Angel Pizarro of Amazon Web Services (Seattle, WA) described Amazon's commercial cloud service. On an intuitive level, this seems attractive, since server farms are expensive to build and operate. There are problems, though some have been solved in the last few months.
Uploads to the cloud

Genomics data repositories around the world provide redundancy and local access for research. And, as data are distilled into knowledge, these farms will be essential in serving the needs of regional populations.
IBM (Armonk, NY) is one firm that seems well positioned to lead in personalized diagnostics and therapy. For example, the principal sequencing operations, such as the Broad Institute, are committed to daily updates of the files of other similar organizations. Open communication is essential for intercenter cooperation. However, the files are too large for uploading via file transfer protocol (FTP). The accepted workaround has been the "FedEx transfer," in which today's data are loaded onto several-terabyte hard drives and shipped by FedEx to the receiving sites for direct connection to their server farms. Since the receiving site is generally also generating sequences, it transfers its own results to the hard drives and sends them back to the original source.
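A rough throughput comparison makes clear why shipping drives wins; the drive capacity, transit time, and link speed below are illustrative assumptions, not figures from the conference:

    # Rough comparison: shipping hard drives vs. a network transfer.
    # All figures are illustrative assumptions.
    payload_tb = 8            # e.g., four 2-TB drives in one FedEx box
    transit_hours = 24        # overnight shipping
    link_mbps = 100           # a respectable sustained FTP rate in 2014

    payload_bits = payload_tb * 1e12 * 8
    fedex_mbps = payload_bits / (transit_hours * 3600) / 1e6
    ftp_hours = payload_bits / (link_mbps * 1e6) / 3600

    print(f"FedEx: {payload_tb} TB overnight, ~{fedex_mbps:.0f} Mbps effective")       # ~740 Mbps
    print(f"FTP at {link_mbps} Mbps: ~{ftp_hours:.0f} h (~{ftp_hours / 24:.1f} days)") # ~178 h

Even with generous network assumptions, a box of drives delivers several times the effective bandwidth of the link.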
Aspera (acquired by IBM in January 2014) developed a capability to transfer very large files by bypassing FTP and looking for unused bandwidth between the sending and receiving nodes. Its patented software finds unused bandwidth and fills it with portions of the large files. Over time (usually seconds), the large file is on its way.
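Aspera's patented protocol is proprietary, but the general idea of breaking a large file into independently transferable pieces can be sketched; the chunking scheme below is purely illustrative and is not Aspera's actual algorithm:

    import os

    # Illustrative sketch only: split a large file into chunks that separate
    # senders can push in parallel, filling whatever bandwidth is available.
    # This is NOT Aspera's patented protocol, just the general chunking idea.
    CHUNK = 64 * 1024 * 1024  # 64 MB per chunk (arbitrary choice)

    def chunk_offsets(path):
        """Yield (offset, length) pairs covering the whole file."""
        size = os.path.getsize(path)
        for offset in range(0, size, CHUNK):
            yield offset, min(CHUNK, size - offset)

    def read_chunk(path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

Because each chunk carries its own offset, the receiver can write pieces in any order, so many streams can run at once and spare bandwidth gets used as it appears.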
On the exhibition floor, IBM and Accelrys (San Diego, CA) exhibited results of their cooperation in applying modern analytics to life science and medical applications. Accelrys has domain-specific knowledge in the life science space that is complementary to IBM's expertise. One example is a project at SUNY Buffalo (Buffalo, NY) to elucidate the genetics of multiple sclerosis (MS). About a million people globally suffer a loss of cognitive ability due to inflammation and degeneration of the brain and spinal cord. The project started in 2008 with the collection and scanning of genomes from MS patients to see if genomics could identify factors in asymptomatic patients contributing to the risk of developing MS. The simplistic look for a