big data

Only connect

Big data offers an unprecedented opportunity to draw greater levels of insight about the world around us than ever before. However, this data can be hard to manage. There are too many disparate data sets for many analytics to be really accurate and useful.

MPL gene expression scatter (left) and PCA (right) plots extracted by Qlucore from paediatric acute myeloid leukemia research data in a study at Cincinnati Children’s Hospital Medical Center, USA. Image courtesy of Dr James Mulloy.

(or otherwise) smaller and more focused data collection, as for instance in Ansolabehere and Hersh’s study[1] of survey misreporting. As technology gives us expanding data capture capabilities at ever-finer levels of resolution, all areas of scientific endeavour are becoming increasingly data intensive. That means (in principle, at least) knowing the nature of our studies in greater detail than statisticians of my generation could ever have dreamed.

A couple of issues back, to look at the smaller end of the scale, I mentioned[2] the example of an automated entomological field study regime simultaneously sampling 2,000 variables at a resolution of several hundred cases per second. That’s not, by any stretch of the imagination, in LHC territory, but it is big enough data to make a significant call on a one terabyte portable hard drive. It’s also a goldmine opportunity for small-team, or even individual, study of phenomena that not long ago would have been beyond the reach of even the largest government-funded programme: big data has revolutionised small science.

There is, in any case, no going back; big

data is here to stay – and it will grow ever bigger, because it can. Like all progress, it’s a double-edged sword, and the trick, as always, is to manage the obstacles in ways that deliver the prize. Most of the LHC’s raw output is not used or stored; only the crucial data points for a truly useful sample are analysed.

From a scientific computing point of view, the first problem to be so managed is described succinctly by Informatica’s Greg Hanson (see box: Only Connect): integration of often wildly different data sets to allow treatment as a single analytic entity. Traditional relational database management systems (RDBMS) run into difficulties here. Where big data results from the aggregation of similarly structured systems (suggested exploitation of Britain’s National Health Service records as a clinical research base is an example to which I shall also return), old approaches might still work, but most cases don’t fit that model. RDBMS rely upon consistency of field and record structure that is, by definition, missing from multiple data sources compiled for different purposes by different research programmes in different places and times, scattered higgledy-piggledy across the reaches of the internet cloud. Indeed, there may not even be enough coherence between the data streams emerging for immediate examination from different captures within a single organisation.

Matt Asay, at scalable data specialist

MongoDB (see box: Space science in real time), describes RDBMS as ‘one of the world’s most successful inventions’, having for four decades ‘played an integral part in a wide array

If we truly want to take advantage of big data, these enormous amounts of disparate data need to be brought together quickly and easily. Without this first step of effective data integration, analytics cannot take place, and insights cannot inform useful action. Only this way can data unleash their potential.

Greg Hanson, chief technology officer for Europe, the Middle East and Africa, Informatica

of industries, and any significant scientific discovery that required a data set’. Now, however, the flood of big data has burst the chreodic banks that made that approach viable.

The solution is emerging in the form of NoSQL (Not only SQL) database management systems, particularly the document-oriented approach that, to oversimplify considerably, retrieves documents based on their content and key reference, using their internal structure (rather than one which is externally imposed) to assemble the data within them in a useful form.

Once that data is integrated, by whatever means, there remain issues to be resolved. There is the human difficulty of mentally getting a handle on huge data volumes. There is, in some cases, the physical impossibility of storing that volume or processing it meaningfully in a finite time. And there are issues around the potential compromise of research quality by the siren call of supply quantity.

Despite all our current analytic computing
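As a rough illustration of the document-oriented idea described in the main text, the following minimal Python sketch keeps records of differing structure under a key and retrieves them either by that key or by their content. The `DocumentStore` class, its methods and the sample records are hypothetical, invented purely for illustration; this is not the interface of MongoDB or of any other NoSQL product.

```python
from typing import Any, Dict, Iterator, Optional


class DocumentStore:
    """Toy in-memory document store (hypothetical; not a real NoSQL API)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, doc: Dict[str, Any]) -> None:
        # No schema is imposed: each document carries its own structure.
        self._docs[key] = doc

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        # Retrieval by key reference; None if the key is unknown.
        return self._docs.get(key)

    def find(self, **criteria: Any) -> Iterator[Dict[str, Any]]:
        # Retrieval by content: yield documents whose fields match every
        # criterion. Documents lacking a field simply do not match.
        for doc in self._docs.values():
            if all(k in doc and doc[k] == v for k, v in criteria.items()):
                yield doc


store = DocumentStore()
# Invented sample records from two differently structured sources.
store.put("trap-17", {"site": "meadow", "species": "Apis mellifera", "count": 42})
store.put("obs-203", {"site": "meadow", "observer": "field team", "notes": ["windy"]})
store.put("trap-18", {"site": "hedgerow", "species": "Bombus terrestris", "count": 7})

# Both meadow documents are found, although they share only one field.
meadow_docs = list(store.find(site="meadow"))
```

The point of the sketch is that a query on `site` succeeds across records that a fixed table layout would struggle to hold side by side; a production system adds indexing, persistence and distribution on top of the same basic idea.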

A picture is worth a million data points

Data pours in from multiple automated locations. Yesterday’s information is replaced or supplemented. Complexity grows, new variables await analysis, and processes are reworked. Long-term trends are easy to overlook when data accumulate by the microsecond; minor nuances can be mistaken for significant events. Visualisation is critical in these large data situations. It highlights possible lines of enquiry, brings fuzzy mountains of data into clear focus, and displays multiple variables in a single simple image. Colours and symbols highlight data shifts.

Multiple images combined as video show changes in variables over time. Visualisation communicates big data to audiences: quickly updating a director, funding provider or non-technical team member. New collaborators are teased to produce their own ideas to explore. That simple, well-designed graph or map puts everyone on the same page, regardless of whether they understand the underlying data.

Sabrina Pearson, Grapher product manager, Golden Software
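One common preparatory step before drawing the kind of picture described above is to smooth rapidly arriving readings so that a slow trend is not lost in short-term jitter. The sketch below is a minimal example under stated assumptions: the function name `moving_average`, the window size and the sample readings are all illustrative inventions, and nothing here represents how Grapher or any other product works.

```python
from collections import deque
from typing import Deque, Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Return the trailing mean of each full window of `window` readings."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[float] = deque(maxlen=window)
    running_sum = 0.0
    smoothed: List[float] = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # oldest reading is about to be dropped
        recent.append(value)
        running_sum += value
        if len(recent) == window:
            smoothed.append(running_sum / window)
    return smoothed


# Invented readings: a slow upward drift hidden under alternating jitter.
readings = [i * 0.1 + (1.0 if i % 2 else -1.0) for i in range(20)]
trend = moving_average(readings, window=4)
```

Plotted raw, these readings zigzag by two units from one sample to the next; the smoothed series rises steadily, which is the long-term trend a chart should make visible. The window length is a judgement call: too short and the jitter survives, too long and genuine events are flattened.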

