FEATURE


Text and data mining


Semantic tools help extract meaning


Siân Harris speaks to some companies that provide organisations with tools for text and data mining


Research Information, August/September 2013

When companies want to do more with their content they often turn to text and data mining to help them add value. Such companies might be big pharmaceutical or healthcare companies, eager to reap more insight from their high-throughput screening, clinical trials and literature. They might also be scholarly publishers wanting to do semantic enrichment on the content they publish. This could be in order to help readers to search or find related content better, or help the publisher derive other packages of content they can sell in different ways. New resources such as thematically-focused topic pages, linked data knowledge bases, or content APIs, rely on semantic metadata.

‘Text mining is about adding meaning,’ noted Rob Virkar-Yates, marketing and communications director at Semantico, which provides some base level text and data mining through its Scolaris platform and also uses TEMIS’s Luxid platform to extract meaning from large passages of unstructured text. ‘Our most frequently requested use case is “related documents”, where we can use the information gathered from Luxid to suggest other documents that may be of interest to the reader of a given article. In the past, the only way of achieving this would have been to tag “related documents” manually, and keep them maintained, or to create taxonomy and link documents together using the subjects and facets they are tagged with. Both of these could easily be a full-time job for one or more people,’ said Virkar-Yates.

According to Daniel Mayer, VP of marketing at TEMIS, there are several key requests that TEMIS receives from customers. ‘In industry, we see two types of use cases,’ he said. ‘The first relates to what we call “discovery” activities. They involve the processing and analysis of massive amounts of available scientific literature, patents, research publications and/or news to extract and aggregate available knowledge about a given subject of interest.’

He continued: ‘The second type are “enrichment” use cases that are mostly focused on internal/corporate documents that are stored in-house, with a goal of enhancing the way they are archived (to prevent loss of knowledge, for example, or to ensure enhanced compliance) and enhancing the capture, preservation and exploitation of the organisational knowledge they contain.’

He also noted the benefits of text and data mining in the area of analytics and visualisation. ‘The Luxid platform also includes Luxid Information Analytics – a tool that enables researchers to investigate their field of interest visually and quantitatively, based on information extracted from unstructured documents.’

Linked data


Linked data is a related buzz phrase in the industry today, and is an approach being taken by several large publishers and information organisations. Peter Camilleri, business development director of TSO, which works with Nature Publishing Group, the Royal Society of Chemistry, the British Library and others, explained: ‘Many documents are made up of vast amounts of unstructured or semi-structured text, which contain valuable information buried within it. Linked data is a way of making available data as RDF (Resource Description Framework) or “triples”. RDF presents data as statements, with the statements being made up of a subject, a predicate and an object. This allows datasets from different domains to be linked together in flexible ways without predefined schemas. RDF also means relationships between different datasets can be expressed, leading to greater discovery of information held within the data.’

As with any technology, there is a trade-off between making it simple to use and making it flexible. Phil Hastings, SVP for sales and marketing at Linguamatics, described the company’s natural language processing ability: ‘Its strength is that it is very flexible, agile and scalable. How it is used is up to the customer. We try not to apply any preconceived ideas to what might be important to users. It is about applying the right filters.’

This is important, he said, because different types of content might need to be considered differently. For example, social media provides plenty of useful information but also lots of noise. In addition, it also uses different
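To make the subject, predicate and object model described by Camilleri more concrete, the following is a minimal sketch in plain Python rather than a real RDF toolkit. The `ex:` identifiers, the two sample datasets and the helper names `match` and `conditions_related_to` are all invented for illustration and do not come from TSO or any of the organisations mentioned. It shows two small datasets from different domains being merged without a shared schema, and a relationship being followed across them.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical dataset 1: statements a publisher might hold about an article.
articles = {
    Triple("ex:article1", "ex:title", "A study of pain relief"),
    Triple("ex:article1", "ex:mentions", "ex:aspirin"),
}

# Hypothetical dataset 2: statements from an unrelated chemistry source.
compounds = {
    Triple("ex:aspirin", "ex:label", "Aspirin"),
    Triple("ex:aspirin", "ex:treats", "ex:headache"),
}

# No predefined schema is needed to combine them: merging is a set union,
# and the shared identifier "ex:aspirin" is what links the two datasets.
graph = articles | compounds


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if (
            (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        ):
            yield t


def conditions_related_to(triples: Iterable[Triple], article: str) -> list:
    """Follow article -> mentions -> compound -> treats -> condition."""
    triples = list(triples)
    found = set()
    for mention in match(triples, subject=article, predicate="ex:mentions"):
        for treats in match(triples, subject=mention.obj, predicate="ex:treats"):
            found.add(treats.obj)
    return sorted(found)


print(conditions_related_to(graph, "ex:article1"))  # prints ['ex:headache']
```

Neither dataset on its own connects the article to a medical condition; the link only appears once the statements are combined, which is the kind of discovery across datasets that the quote refers to. A production system would more likely use full URIs and a dedicated triple store than Python sets.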


@researchinfo www.researchinformation.info



