data intensive computing


Big data, big dreams


We’re starting to drown in data, yet we want immediate answers to our queries. Paul Schreier examines how researchers and computer suppliers are addressing these seemingly contradictory issues


We’ve made tremendous strides in collecting data. Some of today’s advanced experiments can collect Tbytes of data per day. Now the job is making sense of all this data, and effectively looking for the proverbial needle in a haystack. This task is sometimes referred to as ‘data-intensive computing’ or, in scientific jargon, simply ‘big data’. Addressing the associated demands requires epochal advances in software, hardware and algorithms. Meanwhile, hardware and software suppliers have recognised this problem and market opportunity, and started developing products directed towards this area.

This trend is leading to what a number of researchers at Microsoft now refer to as a fourth paradigm for science. They do so in a free book called The Fourth Paradigm: Data-Intensive Scientific Discovery1, which presents an excellent overview of the issues surrounding big data and also looks at its impact on fields such as the earth and the environment, health and well-being, the scientific infrastructure and scholarly communications.


The intersection of IT and science

In that book, the late Jim Gray speaks about ‘eScience’ as being the intersection of IT and science. He notes that we started with empirical science thousands of years ago, and then moved into theoretical science with Kepler, Newton, Maxwell and others. In the last few decades a computational branch has arisen by simulating complex phenomena. Today, he wrote, we’re embarking on explorations where data is captured by instruments or generated by a simulator, is processed by software, and the resulting knowledge is stored in computers; scientists then analyse databases and files using data management and statistics. People no longer actually look through instruments such as telescopes; instead they ‘look’ through large-scale, complex ‘software instruments’ that relay data to data centres, and only then do they look at the information on their computers.

Issues of dealing with large volumes of unstructured, multidimensional data took on a face that was visible to the public at large recently when IBM’s Watson computer beat two human champions on the Jeopardy TV quiz show.




Watson had access to 200 million pages of structured and unstructured content, including the full text of Wikipedia, consuming four Tbytes of disk storage. But when the clue ‘Stylish elegance, or students who all graduated in the same year’ came up, Watson guessed ‘chic’, whereas one human contestant correctly responded ‘class’. Just having massive amounts of data doesn’t always help us find answers, and sometimes it makes finding them far more difficult. Big data clearly brings big challenges.

We’re coming upon a discovery paradigm of deep analysis where you’re not necessarily intuitively aware of what you’re looking for, adds Shoaib Mufti, who heads up Cray’s XMT project. A clearly understandable example comes from the world of commerce and the internet. When you call up a search engine, you expect it to search all the immense data on the web and produce reasonably refined results – and we expect this in a few moments. It’s hard to imagine the computing power and sophisticated tools needed to do so. Or consider credit card companies trying to reduce fraud. When you swipe your card, a computer system might look at a number of data files to check your purchasing habits (did you recently purchase an airline ticket, so that today’s purchase from a trip abroad is realistic?) – and again, both you and the merchant want this to happen in near real time.
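That fraud screen amounts to a very fast, rule-based lookup over a cardholder’s recent history. A minimal sketch of the idea follows – the rule, field names and data are hypothetical, not any card network’s actual logic.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified fraud screen: accept a foreign transaction only if
# recent history (e.g. an airline-ticket purchase) makes travel plausible.
def plausible_foreign_purchase(transactions, txn_country, home_country, now,
                               lookback_days=30):
    """transactions: list of dicts with 'category', 'country' and 'timestamp' keys."""
    if txn_country == home_country:
        return True                      # domestic purchases pass this particular check
    cutoff = now - timedelta(days=lookback_days)
    for t in transactions:
        if t["timestamp"] < cutoff:
            continue                     # ignore anything older than the lookback window
        # A recent airline ticket, or an earlier purchase in the same country,
        # makes today's foreign transaction consistent with travel.
        if t["category"] == "airline" or t["country"] == txn_country:
            return True
    return False

history = [
    {"category": "airline", "country": "UK", "timestamp": datetime(2011, 4, 1)},
    {"category": "grocery", "country": "UK", "timestamp": datetime(2011, 4, 10)},
]
print(plausible_foreign_purchase(history, "FR", "UK", datetime(2011, 4, 15)))  # True
```

The rule itself is trivial; the challenge described in the article is evaluating such rules against large, distributed records in near real time, while the customer and merchant wait at the terminal.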


Distributed data

‘The fact that we have inherently distributed data in a wide range of formats makes life particularly difficult,’ comments Ian Gorton, laboratory fellow and chief architect for the Pacific Northwest National Laboratory’s Data Intensive Computing Initiative. ‘Here we need tools and algorithms that can be put together relatively quickly.’

One such open source tool that PNNL has developed is MeDICi (Middleware for Data Intensive Computing), an evolving platform for building complex high-performance analytical applications. These apps typically consist of a pipeline of software components, each of which performs some analysis on the incoming data and passes the results on to the next step. Previously, programmers built pipelines with scripts, Perl or graphical tools, but when parallelisation came into the picture, pipelines became overly complex. MeDICi’s goal is to make the pipeline approach more rigorous using prebuilt components.
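To make the pipeline pattern concrete, here is a minimal sketch in Python of a chain of analysis components, each consuming the previous stage’s output. It is purely illustrative – the stage names and data are invented, and it does not use MeDICi’s actual API.

```python
# A minimal analysis pipeline: each component is a callable that takes the
# previous stage's output and returns its own result for the next stage.
def ingest(raw_records):
    """Parse raw text records into (sensor_id, value) string pairs."""
    return [tuple(line.split(",")) for line in raw_records]

def clean(records):
    """Drop malformed records and convert values to floats."""
    out = []
    for sensor_id, value in records:
        try:
            out.append((sensor_id, float(value)))
        except ValueError:
            pass  # skip records whose value field is not numeric
    return out

def analyse(records):
    """Toy analysis step: average value per sensor."""
    totals, counts = {}, {}
    for sensor_id, value in records:
        totals[sensor_id] = totals.get(sensor_id, 0.0) + value
        counts[sensor_id] = counts.get(sensor_id, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def run_pipeline(data, stages):
    """Feed data through the stages in order; each output becomes the next input."""
    for stage in stages:
        data = stage(data)
    return data

raw = ["s1,1.0", "s1,3.0", "s2,bad", "s2,2.0"]
print(run_pipeline(raw, [ingest, clean, analyse]))  # {'s1': 2.0, 's2': 2.0}
```

The appeal of a middleware layer such as MeDICi is that stages like these become prebuilt, reusable components that can be distributed and run in parallel – exactly the point at which hand-rolled scripts tend to become unmanageable.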


Addressing unstructured problems

In big data, a particular issue is how to search for small pieces of information within an



