Data


Digging into data in new ways


New techniques for large-scale analysis of many different types of data set can give researchers new insight, writes Alastair Dunning


What do you do with a million books? Or a million newspaper pages? Or a million art images? Scholars now have access to huge repositories of digitised data – far more than they could read in a lifetime. And yet the way researchers access these documents is still based on the relatively linear process of searching for specific data such as keywords, names, events, places or dates. Once users have entered a search term, they are presented with a list of hits to scan and choose from. While it accelerates the speed of research, such searching does not permit the researcher to exploit the full breadth and richness of a digitised resource. Rather than analysing the body of material as a whole, it restricts us to looking at little snippets – like tearing pages out of a book rather than reading whole chapters.

New techniques of large-scale data analysis can, however, allow researchers to discover relationships, detect discrepancies and perform computations on data sets that are so large they can only be processed using powerful software. The Digging into Data challenge asked teams of international researchers to exploit new web tools and contemporary computing power to explore larger bodies of data. Bringing together four funding bodies from either side of the Atlantic – JISC in the UK, the National Endowment for the Humanities and the National Science Foundation in the USA, and the Social Sciences and Humanities Research Council of Canada – the programme has created teams combining the expertise of social scientists, humanists and computer scientists.

One of these teams is immersed in approximately 23,000 hours of recorded music – around 350,000 individual songs and compositions – which are now being tagged by researchers at the ‘structural analysis of large amounts of music information’ (Salami) project. The sheer quantity of music being analysed – from a cappella to Zydeco, Appalachian to Zambian, and medieval to post-modern – allows the team to rescale the traditional research questions music scholars ask. By using a range of software tools to tag each piece according to elements such as rhythm or harmony, the music can then be analysed to compare genres and find changes in musical patterns over time.
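Salami’s actual pipeline is not described here, but a minimal sketch of the general idea – extracting simple rhythm and harmony descriptors from audio so that large numbers of recordings can be compared programmatically – might look like this in Python, using the open-source librosa library. The file names and feature choices are illustrative assumptions, not the project’s method:

```python
# Illustrative sketch only: extract coarse rhythm and harmony
# descriptors from audio files, as one might when tagging a large
# music collection for comparison across genres and eras. File
# names and feature choices are assumptions, not Salami's pipeline.
import librosa
import numpy as np

def describe(path):
    """Return coarse rhythm and harmony descriptors for one recording."""
    y, sr = librosa.load(path, mono=True)             # decode the audio
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)    # rhythm: estimated tempo
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # harmony: pitch-class energy
    return {
        "tempo_bpm": float(np.squeeze(tempo)),
        "chroma_mean": chroma.mean(axis=1),  # 12-dimensional harmonic profile
    }

# Tag a (tiny) collection; a real corpus would run this over
# hundreds of thousands of files, almost certainly in parallel.
collection = ["song_a.wav", "song_b.wav"]
tags = {path: describe(path) for path in collection}

# Compare two pieces by the distance between their harmonic profiles.
a, b = tags["song_a.wav"], tags["song_b.wav"]
print(f"harmonic distance: {np.linalg.norm(a['chroma_mean'] - b['chroma_mean']):.3f}")
```

Once every recording is reduced to a small set of numbers like these, questions about genres or changes over time become straightforward computations rather than listening exercises.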




Meanwhile, supercomputing is also helping to provide fresh insights into an old scholarly question: how to determine authorship. Looking at medieval manuscripts, early modern maps and more recent knitted quilts, the ‘digging into authorship’ team is making use of high-quality digital images to check for repeating motifs, patterns and other traces of artistic identity.

Such analyses do not necessarily provide academics with final answers. However, the ability of processing power to analyse digital images with huge file sizes rapidly can do what a human can’t: read everything quickly and synthesise it in minutes. This offers the scholar new clues to determine the creator of a given piece.
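As a purely illustrative sketch of how repeating motifs might be sought across digitised images, the fragment below matches local keypoints between two images using OpenCV’s ORB detector. The detector choice and file names are assumptions made for illustration; this is not the digging into authorship team’s actual tooling:

```python
# Illustrative sketch only: find visually similar local features
# between two digitised images, a standard first step when hunting
# for repeating motifs or workshop 'fingerprints'. The detector
# (ORB) and file names are assumptions, not the team's method.
import cv2

img1 = cv2.imread("manuscript_page.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("map_detail.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)  # local keypoint detector
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors; cross-checking
# keeps only mutually best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# A heuristic cutoff: many strong matches hint at shared motifs
# worth a scholar's closer attention.
strong = [m for m in matches if m.distance < 40]
print(f"{len(strong)} strong candidate motif matches")
```

The output is not an attribution; it is a shortlist of visual correspondences that a human expert can then inspect, which is precisely the ‘new clues’ role described above.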


Although data mining is a technique already used elsewhere in the humanities – to establish the identity of a playwright, for example – this is probably the first time that researchers have applied this kind of analysis to very large collections of images.

Data analysis needn’t be confined to researchers in the sciences. Projects like these are designed to foster interdisciplinary collaboration. They can help pool resources and promote international collaboration, getting more out of the research funds that we strive so hard for.

It’s clear that valuable data mining can be done on data sets that were created for an entirely different purpose. Researchers are not just generators, but gatekeepers, of their data. It’s now the task of researchers and curators to make data available using open standards, and of repositories that hold large digital collections to ensure efficient access to these materials for research – for example, by converting text and data into machine-readable formats. We also need to create tools that extend the research capacity of ordinary databases. This might be software that can ‘read’ data from different disciplines and could, for example, search biology and chemistry data for features that are useful to academics working in either discipline. The scale of these resources can have a positive impact on research – but only if those millions of pages, images and datasets are made accessible in the first place.
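In miniature, such a cross-disciplinary tool might amount to nothing more than a shared index over differently sourced records. The hypothetical sketch below – records and field names invented purely for illustration – answers a single keyword query from a biology and a chemistry data set at once:

```python
# Hypothetical sketch: one keyword index over records from different
# disciplines, so a single query surfaces biology and chemistry data
# together. Records and field names are invented for illustration.
from collections import defaultdict

records = [
    {"id": "bio-001", "discipline": "biology",
     "text": "protein binding affinity in enzyme catalysis"},
    {"id": "chem-042", "discipline": "chemistry",
     "text": "catalysis rates of transition-metal complexes"},
]

index = defaultdict(set)  # word -> ids of records containing it
for rec in records:
    for word in rec["text"].lower().split():
        index[word].add(rec["id"])

# One query, answered from both disciplines at once.
print(sorted(index["catalysis"]))  # ['bio-001', 'chem-042']
```

The hard part, of course, is not the index but the open standards and readable formats that let records from different disciplines be read into it in the first place.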


Alastair Dunning is programme manager for digitisation at JISC in the UK



