textual analysis


The joy of text

Teasing out valuable data from a sea of text is no mean feat, as Felix Grant discovers

It’s a mundane truism, not normally worth mentioning, that words and phrases as signification units in natural language have only the fuzziest of relations to that which


they signify. It is, however, a live issue for the many researchers trying to computerise data-analytic activity using text as raw material. It’s also a truism of which I have been reminded afresh as I discussed this issue’s topic with practitioners of textual analysis – no two of whom used the term in exactly the same way.

Strictly speaking, textual analysis describes


a social sciences methodology for examining and categorising communication content. In practice, though, it is widely used to cover a range of activities in which unstructured or partially structured textual material is submitted to rigorous analytic treatment.

What they all have in common is a desire to wrestle the petabytes of potentially valuable information locked up in an ever-inflating text reservoir (blogs, books, chat rooms, clinical notes, departmental minutes, emails, field journals, lab notebooks, patents, reports, specification sheets, websites and a million other sources) into a form that is susceptible to useful, objective data-analytic treatment. Temis, of whom more below, has on its


website a headline which sums it up neatly: ‘Big data issue #1: a lot of content and no insights’.

Text mining, the consequent knowledge bases, and analysis of the results have become a major component of biomedical and pharmaceutical research. For our purposes here, I have taken it to


mean analysis whose purpose is to extract scientific value from texts, to examine those texts scientifically, or some combination of the two. A case history which meets both of those


criteria is the application of SAS Text Analytics to patient records at Lillebælt hospital in Denmark[1,2]


with a payoff in dramatically


improved error trapping. Quite apart from the value inherent in better-validated information, records can be compared here (literally or statistically) on the basis of their content, and statistical data on medical issues can be derived from them to inform practice. As The Guardian’s Jane Dudman comments: ‘All healthcare policy decisions are based on the statistics that each clinic contributes by registering data. If data is wrong, the basis for decision-making is also faulty.’ She might also have added that, if the data is not accessible for analysis, it is missing from those statistics which, again, therefore become flawed. This issue of accessibility, for data buried in unstructured text, is a crucial one – and one which text analytic methods seek to address.

[Figure: Investigating indirect gene-gene relationships between a drug compound, Raptiva, and a disease, Psoriasis, using interaction network visualisation in Linguamatics I2E]

18 SCIENTIFIC COMPUTING WORLD | www.scientific-computing.com

Another Danish example of the first type


(extracting scientific data from texts) is the use of SPSS by the not-for-profit information




arm of cooperative retail conglomerate FDB. By using text analytic approaches to mine supermarket data in combination with interview and survey records, they have generated dietary healthcare outcome indicators and provided a public interactive exploratory interface for immense data reserves that would normally be inaccessible through sheer contextual volume.

For tidiness, let’s stay in Denmark for


an instance of the second application type: scientific comparison of texts which are not


necessarily scientific in themselves. StatSoft’s Statistica text mining tools are being used by researchers at three Danish universities to analyse similarities and differences between different north European mythologies and storytelling traditions. While this work has attracted interest from sociologists, human geographers, ethnologists and others, the primary motivation is scientific classification of what one of the group describes as ‘literary DNA’ – the fundamentals of story as a form. From ancient oral folk tales to modern magical realist novels such as Peter Høeg’s Forestilling om det Tyvende århundrede and the influence of Muslim immigration, threads of connection and degrees of separation are statistically defined in objective ways. This is not dissimilar, in essence, to the more familiar quest to decide authorial attribution of Shakespeare’s plays – a favourite playground of textual statisticians for as long as I can remember.

All of those examples, as it happens, use


specialised tools within well-known general data analytic products, but it needn’t be so. There are an increasing number of products which are designed from the beginning to either specialise in, or be weighted towards, text analytics. There are also plenty of people using scientific computing methods to analyse smaller, focused sets of textual material through wholly generic means and a little ingenuity.

Two representatives of the rapidly growing


market sector that focuses specifically on this area are Linguamatics and Temis, both of which apply cutting-edge natural language processing methods to the task of acquiring, organising and analysing data locked up in textbases of various


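The statistical comparison of texts described above – whether between storytelling traditions or candidate authors – can be sketched in its simplest form as a bag-of-words model: count word frequencies in each text and measure the angle between the resulting vectors. This is a generic illustration, not the method used by Statistica, SPSS or any vendor named here; the sample ‘tales’ and function names are invented for the example.

```python
from collections import Counter
import math
import re


def word_freqs(text):
    """Lower-case the text and count word occurrences (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-zæøå']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency Counters: 1.0 for
    identical distributions, 0.0 for texts sharing no vocabulary."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy texts standing in for two related folk tales and an unrelated document.
tale_a = "the troll under the bridge asked the goat a riddle"
tale_b = "the troll asked the fisherman a riddle by the bridge"
other = "quarterly revenue grew despite currency headwinds"

print(cosine_similarity(word_freqs(tale_a), word_freqs(tale_b)))  # high: shared vocabulary
print(cosine_similarity(word_freqs(tale_a), word_freqs(other)))   # 0.0: no shared words
```

Real stylometric work would go much further – weighting terms, stripping function words or, conversely, focusing on them for authorship attribution – but the underlying idea of reducing texts to comparable numeric vectors is the same.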