SCW_APRMAY13

statistical science: data retrieval

Some areas of scientific work generate more product, and correspondingly bigger headaches, than others: the life sciences, and the mapping of genomes in particular, are especially fecund, for example. A scan of patents shows a marked rise in the number of applications relating to data retrieval methods since the blossoming of the Human Genome Project (HGP). Clinical research programmes increasingly accumulate and share data, with a concomitant need to manage its use efficiently. Few fields are slow to feed the flood, however. Whether probing the picoseconds after the Big Bang or seeking to map every fragment of rock in our solar system, the same pattern of explosive data growth is present everywhere.

The key word so far, and not one universally common among statisticians, is ‘database’. Data analysts (especially of my generation, but things are really only just beginning to change) tend to think in terms of lists, or tables, or worksheets of data, rather than databases. In local terms, we are perhaps right to do so; most statistical tests are applied to quite small subsets of the data available. When a subset contains a few hundred cases from two variables, it looks like (and is) a pair of lists, just as much as if it were only the dozen or so cases in a textbook. But these days the subset will probably have been retrieved from a very much larger database – and the criteria for that retrieval will range far and wide through other variables, and intercase comparators, not visible in the final subset itself.

The ability to retrieve sequence information for genes of interest is a powerful feature of the BioMart tool. Here a user can download the coding sequence for all genes on chromosome 22, as well as additional information about each gene, and this can be exported in a useful format

In an ongoing investigation which I’ve just pulled up, for instance, the comparative concentrations in a scent of two specific chemicals (call them C1 and C2) have been retrieved at 73 different time points for hypothesis testing. Those paired lists, however, have been retrieved from an operational historian database currently containing more than 300 million cases (and growing) of just over 2,000 variables. They do not represent a randomised selection, nor a systematised extraction; they are the result of painstakingly developed queries based on the hypotheses to be tested. They represent the concentrations of C1 and C2 at, and only at, all moments when a dozen other variables meet specific criteria: other concentrations, temperature range, light level, atmospheric humidity, wind direction and the occurrence of a very specific and relatively infrequent phenomenon in insect flight. Depending on what this and previous tests show, the query will be adjusted to extract different cases from the same two variables; and so on. The queries also allow for switching between inclusion or exclusion, as appropriate, from any time frame, of the new entries being

Retrieving the recent past

I was asked to revisit and reanalyse data from a 20-year-old longitudinal study in the light of new knowledge. Large and detailed, the data set was backed up onto VHS video cassettes. Magnetic tape deteriorates with time, and the stored signal also tends to ‘print through’ from layer to layer. Finding a VHS cassette player that could be connected to a PC was a challenge. But a university IT department was able to solve these problems, copying the content onto a backed-up network.

After investigating several defunct VHS backup systems, I was rescued by a helpful hobbyist in Albania who had a copy of the necessary restore program and a 286 PC on which it would run. We were now able to decode the backups to yield sets of files created and saved by the spreadsheet Wingz.

Wingz was a spreadsheet program from Informix, ground-breaking in its day, far ahead of its time... but that day and time came to an end in 1994. GenStat, from VSNi, has a well-deserved reputation for being able to import a wide range of file formats, so I tried it. No dice: even GenStat was stumped. I sent a hopeful email off to VSNi (suppliers of GenStat), asking if they had any suggestions, and their ‘expert in NZ’ offered to develop a complete direct file format import solution in a week or two, but also suggested an immediate workaround. The immediate workaround involved a ‘player’ application for Wingz files used for educational chemistry models by Professor Tom O’Haver at the University of Maryland. I could load the epidemiology files into this and then save them as WK2 (middle period Lotus 123) format. WK2 files are readable by a number of current worksheet-oriented products, so could then be saved yet again in any form I wished. (An alternative would have been to copy and paste via the Windows clipboard, but file saves were more elegant and preserved fuller numerical precision.)

After rigorously checking that the results were preserving the integrity of the original data, it then became a fairly simple process to convert the whole archive. But the fact that this degree of ingenuity should be required to read information recorded only two decades ago does make you think.


continually added to the database. In essence, it’s an idea that Kepler (who arrived at his elliptic orbit solution by progressing from analysis of retrieved triplets on a mistaken circular assumption) would have understood perfectly 400 years ago; in detailed execution, it is possible only with very recent computerised methods.

While that example was an in situ entomological observational study, historian databases are most widely used in process analysis, fed by sensors and other data generators integrated into the machinery of industrial contexts. Since they are time-series based, they are (in principle, at least) very easy to link with other databases – not necessarily from the same or even similar context – to produce burgeoning data complexity. It would be perfectly feasible, for example, to relate the database from my insect habitat to that from a nearby
