DATA ANALYSIS: INFORMATION MINING
Galway. As SAS’s David Smith points out (see ‘Keep taking the tablets’), tweets and blog entries can contain pointers to early identification of potentially vital phenomena. This is an aspect of what is known in the industry as pharmacovigilance, which ‘can be defined as a set of practices aiming at the detection, understanding and assessment of risks related to the use of drugs in a population, and the prevention of consequential adverse effects [or] in a narrower sense ... postmarket surveillance’.[1] A leader in making such text searching
accessible to smaller, nontraditional users, and demonstrating the value of placing intelligent defaults in their hands, is the Data Miner Recipes (DMR) tool in StatSoft’s Statistica. It provides a clear-cut, step-by-step path through a data mining project, from initial cleaning and preparation of the data through to building and evaluating a model. It doesn’t quite pass my ‘10 year old test’ (in which I ask a child to try to use a software tool for a practical purpose), but that is mainly a cognitive difficulty in grasping the ideas involved. Over the past month, on the other hand, I have seen a dozen 14-year-olds embrace my copy with enthusiasm and derive meaningful results from experimental data in a few mouse clicks. Once the data file and variables are selected, the project can carry the user through to finished models without much interference – though any degree of sophisticated control is possible at every stage. There are other approaches, but the DMR wizard removes most of the barriers to initial adoption and familiarity.
Science in the information age
Gerald Weissmann, editor-in-chief, FASEB Journal, says: ‘But the ’omic revolution has not just given us new facts: it has changed the way biologists think. Pioneers of ’omics and systems biology claim that they have either overturned traditional, hypothesis-driven research completely, or at the very least found an alternate way to do science. The novel techniques of microarray analysis, of “connectivity” or “molecular interaction” mapping, of kinetic simulation of cell processes have been made possible by information technologies that owe as much to Oracle and SAP as to Krebs and Chargaff. As discovery research becomes replaced by information-mining, data no longer lead to hypothesis, but make hypothesis unnecessary.’[12]
The data mining framework used by Han et al to automate annotation of gene expressions[8]
Extracting connections from text bases
is, of course, neither the only way to skin a data set nor separate from other approaches. I’ve recently been watching a team of farmers without formal statistical training use DMR to interrogate a combination of numeric and text data. Seeking exploitable patterns in the behaviour and influence of natural pollination vectors, they join a wide range of observational and administrative records using time as a common key and then let DMR do the rest. Agriculture is a rich recipient of the
benefits accruing from information mining approaches. A quick dip into the literature on trait selection over the past year showed them to be behind six of the first seven results: adipocytes[2] and growth[3] in meat stock, milk production[4], neuroendocrine correlation in poultry[5], protein interactions in yeast[6], and crop breeding[7]. This is not surprising, since
information mining is one of the stable of new methods that accompany the explosive blossoming of genetics. As the editor-in-chief of the Federation of American Societies for Experimental Biology’s FASEB Journal observed a few months ago (see ‘Science in the information age’), information mining rides a wave of new approaches which, in aggregate, represent a radical shift away from traditional hypothesis-based science. Or, to borrow a Dutch colleague’s picturesque metaphor, ‘We no longer ask Sinterklaas and Zwarte Piet for a specific named present at the beginning of the research season and wait patiently to see whether we get it at the end; we go and burrow through their lucky dip bran tub to see what they’ve got.’ This is one of those areas where traditional hypothesis-based approaches may never uncover a linkage, because there is nothing to suggest the hypothesis in the first place. Or, alternatively, one function of information mining can be seen as an enhancement of intuition, dramatically accelerating the rate at which hypotheses can be generated. While the benefits of information mining
are real and valuable at the small user end of the research spectrum, they grow exponentially with scale and are significant building blocks for science and for national or regional economies. At that scale, methods development has to be explored outside the user or provider framework. In Europe, the EPCC (Edinburgh Parallel Computing ➤
Wizard-assisted text mining in Statistica
www.scientific-computing.com SCIENTIFIC COMPUTING WORLD AUGUST/SEPTEMBER 2010 9