SCW_APRMAY13

statistical science: data retrieval

FlexPRO, whose central focus is time series and which refers to variables as signals, emphasises an uncompromisingly database-oriented view of its numerical content over the more usual worksheet approach. Again, a product-specific automation model allows programmed retrieval of specified data from this store, with the database approach providing a rigorous environment for sample design planning. The provision of automation methods

A database subset extracted by progressively designed SQL queries into a GenStat spreadsheet for analysis
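The idea of progressively designed queries in the caption above can be sketched in a few lines. This is a minimal illustration using Python's standard sqlite3 module, not GenStat's own query mechanism; the table name `captures`, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical store: table name, columns and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE captures (site TEXT, species TEXT, day INTEGER, n_caught INTEGER)"
)
conn.executemany(
    "INSERT INTO captures VALUES (?, ?, ?, ?)",
    [
        ("dairy", "moth", 1, 12),
        ("dairy", "beetle", 1, 3),
        ("motorway", "moth", 1, 7),
        ("motorway", "moth", 2, 9),
        ("dairy", "moth", 2, 15),
    ],
)


def run(query, params=()):
    """Run one query and return all rows as a list of tuples."""
    return conn.execute(query, params).fetchall()


# Step 1: a broad query, to see what the store holds.
everything = run("SELECT site, species, day, n_caught FROM captures")

# Step 2: narrow the selection to one species.
moths = run(
    "SELECT site, day, n_caught FROM captures WHERE species = ? ORDER BY site, day",
    ("moth",),
)

# Step 3: aggregate the narrowed subset into the form wanted for analysis.
moth_totals = run(
    "SELECT site, SUM(n_caught) FROM captures "
    "WHERE species = ? GROUP BY site ORDER BY site",
    ("moth",),
)

print(len(everything), len(moths), moth_totals)
```

Each step reuses the previous query and tightens it, so the subset finally handed to the analysis package is small and its derivation is easy to audit.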

factory, an adjacent motorway, an industrial dairy, a weather station – and that sort of combinatorial multiple-database approach is becoming increasingly common.

The retrieval tools do not necessarily have to be in the same software as the analytic, of course. Flexibility often argues for their separation, in fact. Efficiency, however, favours an integrated system, and this is increasingly reflected by software suppliers.

Statsoft’s Statistica analysis product, for instance, long ago evolved into the core of much bigger aggregate solutions aimed at specific purposes. There was built-in database management from an early stage. Data mining has been a priority for a long time, leading to development of various product clusters including process control and investigation. Automation is handled by SVB, Statistica’s specific implementation of VBA (Visual Basic for Applications). For the Enterprise Edition there is a specific add-in which retrieves analytic data from OSIsoft’s PI data historian product (which, by comparison with my insect study’s collection rate of well under a million cases a day, can cope with a capture resolution of half a million events per second) and also extends retrieval to other VBA methods. An interface provides for defining the data repository and the method by which data are to be retrieved, collections of queries specifying data to be retrieved and analysed, and metadata specifying appropriate treatment of the retrieved data for the analysis in hand. In a different direction,

Weisang’s analytic software,

12 SCIENTIFIC COMPUTING WORLD

sometimes obscures the fact that most heavyweight data analysis programs are, behind their graphical user interfaces, actually programming languages in their own right with automated data retrieval potential of their own. There are also long-term developments, such as Microsoft’s ODBC (Open Database Connectivity) standards, which facilitate access to generic data stores by analytic tools. VSNi’s GenStat is a good example of this,

Further information

Adept Science www.adeptscience.com

BioMart Project www.biomart.org

Ensembl Project ensembl.org

OSIsoft www.osisoft.com

Statsoft www.statsoft.co.uk

erpconnect.umd.edu/~toh/models/index.html

VSN International www.vsni.co.uk

Weisang GmbH www.weisang.com

its present interactive graphical face being a relatively recent development on top of a data-analysis-specific high-level language with a long scientific pedigree. Logical structures and expressions, loops and conditional branching, free (as well as fixed) field input, and the ability to incorporate user programs into the main program resources as transparent extensions alongside native directives and procedures all provide far more scope for automated, responsively adaptive approaches to data retrieval than most users ever dream of. It also has an unusually flexible and extensible file import facility, which permits users to design their own format templates or (see box: ‘Retrieving the recent past’) draw on the experience of a user community that may already have trodden the same or similar paths. Genetics is, as I noted earlier,

one of the drivers behind the flood of data that has made retrieval such a high-priority area. From it, and particularly from the rise of genome sequencing and the HGP, have developed two key practical coping concepts that point the way to more general solutions: federated database systems (FDBS) and genome browsers, of which BioMart and Ensembl are good representative examples. BioMart, like other FDBS,

is a project designed to provide single entry point access via portals to multiple and disparate databases.
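The single-entry-point idea can be illustrated with a toy federation. The sketch below is only that: it does not use BioMart's actual interface, and the source names, schemas and rows are hypothetical. Two independent in-memory SQLite databases with different schemas stand in for disparate remote sources, and one function sends the same question to each and merges the answers.

```python
import sqlite3


def make_source(schema, insert, rows):
    """Build one independent in-memory database (a stand-in for a remote source)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert, rows)
    return conn


# Hypothetical sources. Each has its own schema, so each is paired with the
# query that answers "which identifiers belong to this organism?" for it.
SOURCES = {
    "source_a": (
        make_source(
            "CREATE TABLE genes (gene_id TEXT, organism TEXT)",
            "INSERT INTO genes VALUES (?, ?)",
            [("g1", "mouse"), ("g2", "zebrafish")],
        ),
        "SELECT gene_id FROM genes WHERE organism = ? ORDER BY gene_id",
    ),
    "source_b": (
        make_source(
            "CREATE TABLE records (species TEXT, accession TEXT)",
            "INSERT INTO records VALUES (?, ?)",
            [("mouse", "g3")],
        ),
        "SELECT accession FROM records WHERE species = ? ORDER BY accession",
    ),
}


def federated_query(organism):
    """Single entry point: ask every source the same question and merge the results."""
    merged = []
    for name in sorted(SOURCES):
        conn, query = SOURCES[name]
        rows = conn.execute(query, (organism,)).fetchall()
        merged.extend((name, identifier) for (identifier,) in rows)
    return merged


print(federated_query("mouse"))
```

The caller never sees the individual schemas; adding another source means registering one more connection and its query, which is the sense in which existing databases can be incorporated without being restructured.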

Geographical distribution is irrelevant: of the 45 databases federated at the time of writing, 30 are in Europe, 11 in the Americas and four in Asia. Open source in structure, it is designed to ‘promote collaboration and code reuse’, provide ‘unified access to disparate, geographically distributed data sources’ and be ‘data agnostic and platform independent, such that existing databases can easily be incorporated’[1]. In this it offers hope of backward and forward compatibility for some of the storage format obsolescence issues mentioned above, as well as addressing contemporary incompatibility problems. Ensembl is one of several genomic

browsers designed to bring bioinformatic data retrieval and relational database principles under a single interface umbrella, providing researchers with a unified retrieval view. Automated annotation of sequence data


produces a MySQL database, which Ensembl then makes freely available to researchers. Several levels of access are available, from a web-based GUI to large dataset retrieval through the BioMart data mining tool or tightly defined direct SQL queries. Developed in early response to the HGP, it now includes other key model organisms (such as fruit fly, mouse and zebrafish) and an expanded range of genomic data. It focuses on vertebrates, but a sister project, Ensembl Genomes, has extended the scope to bacteria, fungi, invertebrate metazoa, plants and protists[2]. There are similar tools and approaches

being developed in other areas of science, though some are less open and distributed than others. Mapping extreme orbit objects in the

solar system to make them predictable is a particular case, crying out for a shared database in which all positional ‘stranger’ observations can be logged for subsequent query-based retrieval and analysis as its size grows and patterns begin to be discerned. Which brings us neatly full circle to Brahe and Kepler.

References and Sources

1. BioMart project. www.biomart.org. [cited 2013-03-01]

2. Kinsella, R.J., et al., Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford), 2011. 2011: p. bar030.

@scwmagazine l www.scientific-computing.com
