textual analysis


kinds. While they have different customer profiles, a significant area of overlap is the life sciences where text mining is, as noted earlier, a major resource. Although Europe doesn't yet have a consistent (or, sometimes, any) electronic medical records (EMR) system, some of its jurisdictions have started down this route with some success. In the US, there is widespread adoption, if not universal satisfaction. Such systems are certain to arrive everywhere, sooner or later, and they bring with them wells of data of immense value to both healthcare and research communities. Toldo and Scheer assess[3] the use of Luxid (core product of Temis) to access information within the free text sections of these records, not just to track adverse events, but for other strands such as 'clinical trial optimisation and pharmacovigilance purposes'. Temis' list of clients includes a dozen household names in the medical and life sciences area, from agriculture to industry, plus a string of others from the American Association for the Advancement of Science to aerospace technology leader, Thales.

Linguamatics has an equally impressive spread of application. Again the importance of textual data in medicine and the life sciences is reflected here, with the company's key I2E product well established in genetics and molecular biology among other scientific growth fields. Particularly intriguing are a number of projects to extract and curate useful research data from conference Twitter feeds and other microblogging sources. 'We've got more than a hundred possible leads to new ideas from I2E processing of hashtagged backchat at just one recent international jolly,' said a researcher at a European pharma company, cheerfully. 'If only one of those turns into an actual research proposal, it was still a high-profit exercise.'

Both providers offer online options as well as client-side processing and links to other software with, for example, Accelrys Pipeline connectivity for I2E and several powerful specialist expansion modules for Luxid. 'Adopters of these commercial tools can realise savings because of the scale of their operations, despite significant investment to purchase the tools,' as Hirschman et al[4] point out.

But what of small-scale research which is 'typically funded by grants with limited resources to invest in infrastructure'? There's a lot of hand-curated work going on in this area, right down to the level of individuals using desktop tools in innovative ways to shorten the loop between source and result. A popular approach is to use a statistics package and a bibliographic database manager in concert, sometimes with a home-coded utility or two to automate transactions between them.
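By way of illustration only, a 'home-coded utility' of this kind often amounts to a few dozen lines that turn an export from the bibliographic manager into something the statistics package will swallow. The sketch below is generic, not any particular researcher's tool: the RIS export name, the fields kept and the CSV layout are all assumptions for the example.

```python
# Sketch of a home-coded 'glue' utility: convert a bibliographic manager's
# RIS export into a flat CSV that a statistics package can import.
# File names and field choices are illustrative assumptions only.
import csv

RIS_TAGS = {"TI": "title", "PY": "year", "JO": "journal"}

def parse_ris(path):
    """Yield one dict per reference in a simple RIS export."""
    record = {"keywords": []}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tag, value = line[:2].strip(), line[6:].strip()
            if tag == "TY":                      # start of a new record
                record = {"keywords": []}
            elif tag == "ER":                    # end of the current record
                yield record
            elif tag == "KW":
                record["keywords"].append(value)
            elif tag in RIS_TAGS:
                record[RIS_TAGS[tag]] = value

def ris_to_csv(ris_path, csv_path):
    """Flatten the references into one row per record for the stats package."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["year", "journal", "title", "keywords"])
        for rec in parse_ris(ris_path):
            writer.writerow([rec.get("year", ""), rec.get("journal", ""),
                             rec.get("title", ""), "; ".join(rec.get("keywords", []))])

if __name__ == "__main__":
    ris_to_csv("endnote_export.ris", "for_stats_package.csv")
```

The 'transaction' between the two packages is then nothing more exotic than a file format both sides already understand.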


Whilst schlepping around from one lab doorstep to another, gathering background for this article, I also sought reactions to developments in the latest releases of EndNote and OriginPro, both of which I happened to have on review. This serendipitously led me to discover the young materials science researcher who has built up a series of EndNote databases to which automated search feeds contribute raw material. A search and filter utility written in BASIC (remember BASIC?) extracts specific material using equivalent terms lists, summarising it in CSV files for exploration using OriginPro.
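That BASIC utility was not available for publication, but the equivalent-terms step it performs is easy to sketch. The synonym groups and file names below are invented for illustration; the idea is simply to collapse variant terms into one label and count hits per year, so the resulting CSV plots cleanly in a package such as OriginPro.

```python
# Illustrative re-imagining (in Python rather than BASIC) of an
# equivalent-terms filter: collapse synonyms into one label, then
# summarise hits per label per year into a CSV for plotting.
# The synonym groups and file names are assumptions for the example.
import csv
from collections import Counter

EQUIVALENT_TERMS = {                     # hypothetical synonym groups
    "graphene": ["graphene", "graphene oxide", "rGO"],
    "perovskite": ["perovskite", "CH3NH3PbI3", "MAPbI3"],
}

def summarise(records_csv, summary_csv):
    counts = Counter()                   # (label, year) -> number of records
    with open(records_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            text = (row["title"] + " " + row["keywords"]).lower()
            for label, variants in EQUIVALENT_TERMS.items():
                if any(v.lower() in text for v in variants):
                    counts[(label, row["year"])] += 1
    with open(summary_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["term", "year", "records"])
        for (label, year), n in sorted(counts.items()):
            writer.writerow([label, year, n])

if __name__ == "__main__":
    summarise("for_stats_package.csv", "term_counts_by_year.csv")
```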


Further information

Abbyy www.abbyy.com
Accelrys accelrys.com
IBM www-01.ibm.com/software/analytics/spss/products/statistics
Linguamatics www.linguamatics.com
NotaBene www.notabene.com
OriginLab www.originlab.com
SAS www.sas.com
StatSoft www.statsoft.co.uk
Thomson Reuters endnote.com


I also met two post-grad researchers in a university technology spinoff programme, creatively using the flexible combination of text acquisition and organisation tools (Archiva, Ibidem, Ibidem Plus, Orbis) available through NotaBene. Once again, equivalent term lists (provided within the NotaBene cluster) were used to aggregate material into summaries, which were exported to spreadsheets for further processing in data analytic applications. By running these processes on the fly, or in otherwise idle time, they extracted, processed and fed into their workflow surprising quantities of statistical data without onerous increases in overhead workload.
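The 'otherwise idle time' part deserves a note. NotaBene's own tools are proprietary, so the following is only a generic sketch of the scheduling idea: run the summarising job unattended at quiet hours rather than during the working day. The script name and timing are assumptions, and an operating-system scheduler (cron, Task Scheduler) does the same job with no code at all.

```python
# Generic sketch of the 'run it in idle time' idea: call a summarising
# step (for example, a script wrapping summarise() above) once a night
# rather than interactively. Schedule and script name are assumptions.
import datetime
import subprocess
import time

def run_nightly(command, hour=2):
    """Run `command` once a day at the given hour (24-hour clock)."""
    while True:
        now = datetime.datetime.now()
        next_run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += datetime.timedelta(days=1)
        time.sleep((next_run - now).total_seconds())
        subprocess.run(command, check=False)     # don't kill the loop on failure

if __name__ == "__main__":
    run_nightly(["python", "summarise_terms.py"])   # hypothetical job script
```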


All this is all very well, but for text to be analysed it must first be readable – and that usually means readable by a computer. These days most text is electronically originated, but legacy material often is not and a lot of ad hoc notes may not be. Optical Character Recognition (OCR) is the workhorse of text transcription and, while we all grumble about its shortcomings, it does a good job of rendering graphic images of printed fonts into digitised text for analysis.

Even at the lowliest manual level, OCR is a useful tool. A colleague and I recently had to add a 650-page 18th century text to a digitised textbase for analysis. The rare and valuable paper original which we located was in a library and could not be removed, and filing a request for the digitisation to be carried out would take weeks. With the consent of the library we used a smartphone, a 10-year-old copy of Abbyy FineReader 5 (now in release 11, and correspondingly more developed, as part of a software range for different text tasks) and a netbook. Even allowing for manual error correction, we had our validated data within three hours and the library added a copy to its own digital records. A similarly-sized text already available as graphic-only PDF was transferred more quickly still.
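For readers who want to experiment without a commercial package such as FineReader, the same photograph-to-text round trip can be approximated with open-source tools. The sketch below assumes the pytesseract wrapper around the Tesseract OCR engine and Pillow are installed, and that the phone photographs sit in a local folder; it is a minimal illustration, not a substitute for the validation step described above.

```python
# Minimal open-source approximation of the photograph-to-text workflow:
# OCR every page photograph in a folder and append the results to one
# plain-text file ready for a textbase. Assumes Tesseract, pytesseract
# and Pillow are installed; folder and file names are illustrative.
from pathlib import Path

from PIL import Image
import pytesseract

def ocr_folder(photo_dir, output_txt, lang="eng"):
    pages = sorted(Path(photo_dir).glob("*.jpg"))
    with open(output_txt, "w", encoding="utf-8") as out:
        for page in pages:
            text = pytesseract.image_to_string(Image.open(page), lang=lang)
            out.write(f"=== {page.name} ===\n{text}\n")
    return len(pages)

if __name__ == "__main__":
    n = ocr_folder("page_photos", "transcribed_text.txt")
    print(f"OCR complete: {n} pages transcribed (manual checking still needed).")
```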


Nevertheless, OCR does have its limitations; one of them is in dealing with handwritten material. Where the material to be transcribed is written by known authors, training akin to that used for voice recognition can bring error rates down to vanishingly low levels; one medical research centre with which I work absorbs large volumes of handwritten notes and (often parallel) vocal commentary into its textbase. Handwriting by authors not known to the transcription system, however, is a different kettle of fish. Submitting to the same medical system an A4 page containing 238 words in my own handwriting (which it had not seen before) produced not one correctly transcribed letter.

Acquaintances in intelligence-related occupations tell me of systems which are used to transcribe unknown handwriting with 'useful' levels of recognition. These, however, are based on training for key word recognition as a way of identifying matter for manual reading – an example given by a customs officer involved training a system using numerous handwritten examples of words like 'bhang', 'coke', 'MDMA', 'methamphetamine', 'weed', and so on. Where such words accumulate, increasing levels of priority for closer examination are applied. For such tasks, a class of approaches collectively known as word spotting is more productive than OCR. The neat thing about word spotting is that it can be used to search directly for text entities within graphic images, without any requirement to first convert those images into text. A researcher need only offer an image of the required word and the system will seek statistically similar visual segments within a set of JPEG files.
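Production word-spotting systems rely on trained statistical models of word shape, but the core idea – searching the images themselves for regions that look like a query word, with no OCR step – can be demonstrated with simple template matching. The sketch below uses OpenCV; the file names and the similarity threshold are assumptions, and it will only find close visual matches, not the writer-to-writer variation a real system copes with.

```python
# Simplified word-spotting demonstration: scan a set of JPEG page images
# for regions visually similar to a query word image, using OpenCV
# template matching. No OCR is performed at any point. File names and
# the similarity threshold are illustrative assumptions.
from pathlib import Path

import cv2
import numpy as np

def spot_word(query_image, page_dir, threshold=0.75):
    """Return (page name, x, y, score) for each candidate match."""
    query = cv2.imread(str(query_image), cv2.IMREAD_GRAYSCALE)
    hits = []
    for page_path in sorted(Path(page_dir).glob("*.jpg")):
        page = cv2.imread(str(page_path), cv2.IMREAD_GRAYSCALE)
        scores = cv2.matchTemplate(page, query, cv2.TM_CCOEFF_NORMED)
        for y, x in zip(*np.where(scores >= threshold)):
            hits.append((page_path.name, int(x), int(y), float(scores[y, x])))
    return hits

if __name__ == "__main__":
    for name, x, y, score in spot_word("query_word.png", "scanned_pages"):
        print(f"{name}: possible match at ({x}, {y}), score {score:.2f}")
```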


References and Sources
For a full list of references and sources, visit www.scientific-computing.com/features/referencesfeb13.php



