
DATA ANALYSIS: INFORMATION MINING

Crossing main authors of the ‘Hipparcos’ collaboration with topical key words. From Egret et al, Information mining in astronomical literature with TETRALOGIE [11]

Centre) manages the ADMIRE (Advanced Data Mining and Integration Research for Europe) project, which ‘aims to deliver a consistent and easy-to-use technology for extracting ... meaningful information by data mining combinations of data from multiple heterogeneous and distributed resources... which will give users and developers the power to cope with complexity and heterogeneity of services, data and processes’. ADMIRE emphasises the need to accommodate the increasing size of information stores, sophistication of extraction requirements, and complexity of the resulting systems, responding with a unified conceptual structure based on integration of separate expertise types in relation to a defined component library structure. The outputs are ‘a framework, an architecture and a set of use cases that illustrate how they can be used to improve DMI [data mining and integration]’. Alongside conceptual development and implementation, it has applications on which its methods can be demonstrated in practice as well as a growing number of studies that call on its capabilities.

Sticking, for the moment, with my genetics thread, ADMIRE-supported projects include automated gene annotation (see ‘Labelling the building blocks’) through ‘a new extensible data mining framework that integrates both the images in the file systems and annotation databases and combines image processing with statistical pattern recognition techniques to automatically identify gene expressions in images’[8].

A different example, picked from the list of accepted papers on the ADMIRE website, just because it appeals to me, is ‘An ant-colony-optimisation based approach for determination of parameter significance of scientific workflows’[9].

Ant colonies segue me neatly away from life sciences to social network analysis (a topic upon which I focused in the last issue), where once again massive data sets contain numerous yet-to-be-discovered relations. Those relations may be anything from individual two-node edges to whole complex networks, or a blend of the two in the form of association between one or more unsuspected individual nodes and a network or networks or intermediate structures, such as cliques.

This is one of those areas, mentioned above, where hypotheses emerge where they otherwise might not, because information mining throws up associational links and

Labelling the building blocks

Liangxiu Han and others[8] (conclusions of an ADMIRE-supported study) state: ‘...we have developed a new data mining framework to facilitate the automatic annotation of gene expression patterns of mouse embryos. There are several important features of our framework: (1) the combination of statistical pattern recognition with image processing techniques can help to reduce the cost for processing large amounts of data and improve the efficiency. We have adopted the image processing method to standardise and denoise images. Wavelet transform and Fisher Ratio techniques have been chosen for feature generation and feature extraction. The classifiers are constructed using LDA. (2) For enhancing the extensibility of our framework, we formulate our multi-class problem into a two-class problem and design our classifiers with a binary status: ‘yes’ or ‘no’. One classifier only identifies one anatomical component. Classifiers for each gene expression are independent on each other. If new anatomical component need be annotated, we do not have to train previous classifiers again. The classifiers can be assembled and deployed into the system based on user requirements. (3) We have evaluated our proposed framework by using images with multi-gene expression patterns and the preliminary result shows our framework works well for the automatic annotation of gene expression patterns of mouse embryos.’
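The design described in points (1) and (2) of the sidebar can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature vectors, the component names and the helper functions are all hypothetical, and a simple midpoint threshold on the single best-ranked feature stands in for the LDA classifier that the study actually uses. What it does show is Fisher-ratio ranking of features and the 'one independent yes/no classifier per anatomical component' arrangement.

```python
from statistics import mean, pvariance


def fisher_ratio(pos, neg):
    """Fisher ratio of one feature: squared mean gap over summed class variances."""
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_features(samples, labels):
    """Return feature indices ordered from most to least discriminative."""
    scores = []
    for j in range(len(samples[0])):
        pos = [s[j] for s, y in zip(samples, labels) if y]
        neg = [s[j] for s, y in zip(samples, labels) if not y]
        scores.append(fisher_ratio(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def train_binary(samples, labels):
    """Train one yes/no classifier (assumption: a threshold stand-in for LDA)."""
    best = rank_features(samples, labels)[0]
    pos_mean = mean([s[best] for s, y in zip(samples, labels) if y])
    neg_mean = mean([s[best] for s, y in zip(samples, labels) if not y])
    threshold = (pos_mean + neg_mean) / 2
    sign = 1 if pos_mean >= neg_mean else -1
    return lambda x: sign * (x[best] - threshold) > 0


# Hypothetical toy data: each row is a feature vector for one image
# (standing in for wavelet coefficients); each component has its own labels.
features = [
    [0.9, 0.8, 0.1],
    [0.8, 0.1, 0.9],
    [0.1, 0.9, 0.8],
    [0.2, 0.2, 0.2],
]
annotations = {
    "heart": [True, True, False, False],
    "liver": [True, False, True, False],
}

classifiers = {name: train_binary(features, ys) for name, ys in annotations.items()}
heart_before = classifiers["heart"]

# Adding a new component trains one new classifier; existing ones are untouched.
annotations["brain"] = [False, True, True, False]
classifiers["brain"] = train_binary(features, annotations["brain"])


def annotate(x):
    """List every component whose classifier answers 'yes' for feature vector x."""
    return sorted(name for name, clf in classifiers.items() if clf(x))


print(annotate([0.95, 0.05, 0.9]))  # ['brain', 'heart']
```

Because each classifier is trained against its own label column only, extending the set of components is a matter of adding one entry to the dictionary, which is the extensibility argument the quoted authors make.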

matrices of links not obvious to unaided human intellectual perception. The results can be dramatic.

A single node that is discovered to have links with two (or more) discrete networks, for instance, becomes a bridge and instantly, by definition, converts those networks into (potentially highly complex) cliques of a single larger network. Discovery of the subtle clues that a particular node acts as a subliminal bellwether can shift weightings in the whole network of which that node is a part.

Professor David Bader, from the Georgia Institute of Technology (see ‘Keeping a grip on the bigger picture’) and working with the Pacific Northwest National Laboratory (PNNL) under a Center for Adaptive Supercomputing Software for Multithreaded Architectures (CASS-MT) sponsorship umbrella, is one of many who are developing HPC software for supercomputer handling of this kind of problem. Across a number of publications he documents his group’s development tools for analysing massive streaming data sets on Cray machines, currently the XMT, which ‘has the unique ability to process massive volumes of irregular, data-intensive applications, and is ideal for graphs: data that connects to other data in varying ways’[10].

References and sources

For a full list of the sources and references cited in this article, please visit www.scientific-computing.com/features/referencesaug10.php

10 SCIENTIFIC COMPUTING WORLD AUGUST/SEPTEMBER 2010 www.scientific-computing.com

Keeping a grip on the bigger picture

‘Algorithms that work on complex networks with hundreds to thousands of vertices often disintegrate on real networks with millions (or more) of vertices. For example, betweenness centrality is not robust to noisy data (biased sampling of the actual network, missing friendship edges, etc.) [They require] niche computing systems that can offer irregular and random access to large global address spaces. ... the newest breed of supercomputers have hardware set up not just for speed, but also to better tackle large networks of seemingly random data. ... Applications include anywhere complex webs of information can be found: from internet security and power grid stability to complex biological networks.’ David A Bader, Georgia Institute of Technology[13].
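To make the point about betweenness centrality concrete, here is a small Python sketch using Brandes' well-known algorithm on an unweighted, undirected graph. It is an illustration only, not code from Bader's group: the seven-node network (two triangles joined through a bridge node `m`) and the helper names are invented for the example. It shows both ideas raised in the article: a single bridging node dominates the centrality ranking, and dropping one edge, as a 'missing friendship edge' would, changes the scores drastically.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def betweenness(graph):
    """Brandes' algorithm: shortest-path betweenness for an unweighted graph."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        dep = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += paths[v] / paths[w] * (1.0 + dep[w])
            if w != source:
                score[w] += dep[w]
    # Every undirected pair has been counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}


# Hypothetical network: two triangles joined only through node "m".
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "m"), ("m", "x"),
]
full = betweenness(make_graph(edges))

# The same network with one edge missing from the sample.
noisy = betweenness(make_graph([e for e in edges if e != ("m", "x")]))

print(full["m"], noisy["m"])  # 9.0 0.0
```

The loop over every source node is also why the quoted warning about scale applies: each pass touches the whole graph with irregular memory access, which is cheap for seven nodes and punishing for millions.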
