
Semantic enrichment FEATURE

enhancements, and one of the ways they were doing that was in increasing the findability of their content using semantic enrichment.’

From subject headings to ontologies

Whereas computers struggle to understand the meaning of natural language, enriching content with additional semantic metadata from controlled vocabularies can reduce ambiguity. A controlled vocabulary is a knowledge organisation system that reduces natural language ambiguity by restricting the terms that may be used, and how they should be used. In natural language, ‘Apple’ may refer to a computer, the associated company, the Beatles’ record company, or even occasionally a piece of fruit. Within a controlled vocabulary, ‘Apple’ can only refer to one of these things.

Controlled vocabularies take many forms, including subject headings, taxonomies, thesauri, and ontologies. Somewhat ironically, the terms themselves are often used inconsistently, with taxonomy and ontology in particular having a wide variety of definitions. The fundamental difference between the types of controlled vocabulary, however, is in the variety of relationships that exist between the terms. Whereas subject headings may have no relationships between the individual terms, thesauri express hierarchical and associative relationships between terms, whilst ontologies often incorporate a wider variety of both.
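As a rough illustration of how a thesaurus differs from a flat list of subject headings, the sketch below models a toy vocabulary in Python. Every identifier, label, and document name in it is hypothetical and chosen only for this example; it is not drawn from any real vocabulary. Each concept has one unambiguous identifier, so the two senses of ‘Apple’ are kept apart, and the hierarchical (broader) and associative (related) relationships are what allow a search to be widened.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """One entry in a toy thesaurus (all identifiers are hypothetical)."""
    concept_id: str
    preferred_label: str
    broader: set[str] = field(default_factory=set)   # hierarchical links
    related: set[str] = field(default_factory=set)   # associative links


THESAURUS = {
    "c1": Concept("c1", "Apple Inc.", broader={"c4"}, related={"c2"}),
    "c2": Concept("c2", "Macintosh computers", broader={"c5"}, related={"c1"}),
    "c3": Concept("c3", "Apples (fruit)", broader={"c6"}),
    "c4": Concept("c4", "Technology companies"),
    "c5": Concept("c5", "Computers"),
    "c6": Concept("c6", "Fruit"),
}

# Documents are tagged with concept identifiers, never with free text.
DOCUMENTS = {
    "doc-orchard": {"c3"},
    "doc-keynote": {"c1"},
    "doc-repair": {"c2"},
}


def narrower(concept_id: str) -> set[str]:
    """Concepts that name the given concept as a broader term."""
    return {c.concept_id for c in THESAURUS.values() if concept_id in c.broader}


def expand(concept_id: str) -> set[str]:
    """The concept itself plus everything beneath it in the hierarchy."""
    found = {concept_id}
    stack = [concept_id]
    while stack:
        for child in narrower(stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(concept_id: str) -> list[str]:
    """Documents tagged with the concept or any of its narrower concepts."""
    wanted = expand(concept_id)
    return sorted(doc for doc, tags in DOCUMENTS.items() if tags & wanted)


def see_also(concept_id: str) -> list[str]:
    """Preferred labels of associatively related concepts."""
    return sorted(THESAURUS[r].preferred_label for r in THESAURUS[concept_id].related)
```

With this data, `search("c6")` (Fruit) finds only the orchard document, while `search("c4")` (Technology companies) finds only the keynote document. A plain keyword search for ‘Apple’ could not make that distinction.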

Once terminology is used consistently it becomes meaningful to computers, and as we move from subject headings to ontologies with an increasingly rich set of relationships, there is far more potential for discovering related content and inferring new relationships. For Kasenchak there are signs that this move is beginning to happen: ‘Fifteen years ago it was common to have a subject headings list, today forward thinking organisations have a thesaurus, and the move is towards more ontological structures. The standards are evolving that way. The cost in assembling an ontology from scratch may be prohibitive, but people are taking their existing taxonomy or thesaurus and slowly developing an instance ontology over time. Either that or a company with a lot of energy behind their library initiative might hire someone in our sector to expedite that process of building those ontologies.’

www.researchinformation.info @researchinfo

Reducing ambiguity: entities and relationships

• Subject headings are a controlled set of terms designed to describe the subject of a resource, one of the best known of which is the Library of Congress Subject Headings (LoCSH) (http://id.loc.gov/authorities/subjects.html). Subject headings do not necessarily include the relationships between terms, although the LoCSH has been published as RDF with thesaurus-type relationships.

• A thesaurus provides hierarchical relationships between concepts (i.e., broader and narrower terms), as well as equivalence and associative relationships. Popular examples include Getty’s Art and Architecture Thesaurus and the Getty Thesaurus of Geographic Names (www.getty.edu/research/tools/vocabularies). Once again these resources are now publicly available as RDF.

• An ontology provides a formal representation of knowledge with rich semantic relationships between terms. As well as concepts it may include a wide range of other entities. The BBC has been at the forefront of developing a wide range of ontologies for modelling its content (www.bbc.co.uk/ontologies). By breaking content into chunks of data, and curating links to external resources, rich resources may be created dynamically.

From documents to data

Applied at the document level, ontologies may allow people to find related content in ways that were not previously possible, but semantic enrichment doesn’t have to stop at the document level. Semantic enrichment may be applied at far finer levels of granularity, enriching increasingly smaller parts of documents: sections, paragraphs, tables, diagrams, even the appearance of individual concepts or entities. This facilitates the retrieval not only of relevant resources, but also of the relevant parts of documents. Today such enrichment is often in the form of XML files, but increasingly RDF triples are becoming important: a web of data to integrate with an ontological web of terms.

Kasenchak sees such a move as opening a host of new opportunities and insights from content: ‘RDF triples are on the rise in a big way. We’re going to see people converting data, and rethinking how they store that data as RDF triples instead of adding just XML to a document.

‘Once you have content marked up with semantic terms, then you have things you can count. As soon as you have countable quantified things from a controlled list you’ve turned your content into big data. You can make inferences, ask questions, cluster things together, and make wonderful visual displays and graphs for exploring these data sets. It’s just beginning, and the curve is accelerating.’
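The idea of enriching parts of documents with triples, and then counting the controlled terms that result, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples are plain tuples rather than real RDF, and the identifiers (`article-1#para-2`, `concept:ontology`, and so on) and the predicate names `mentions` and `broader` are invented for the example. A production system would more likely use full URIs, an RDF library, and a triple store.

```python
from collections import Counter

# Each triple is (subject, predicate, object). All identifiers are made up
# for illustration; real RDF would use full URIs.
TRIPLES = [
    ("article-1#para-2", "mentions", "concept:thesaurus"),
    ("article-1#para-5", "mentions", "concept:ontology"),
    ("article-1#table-1", "mentions", "concept:ontology"),
    ("article-2#para-1", "mentions", "concept:ontology"),
    ("article-2#para-3", "mentions", "concept:subject-headings"),
    ("concept:ontology", "broader", "concept:controlled-vocabulary"),
    ("concept:thesaurus", "broader", "concept:controlled-vocabulary"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def concept_counts(triples):
    """Count how often each controlled term is mentioned across all fragments."""
    return Counter(o for _, _, o in match(triples, p="mentions"))


def fragments_about(triples, concept):
    """List the document fragments (not just documents) that mention a concept."""
    return sorted(s for s, _, _ in match(triples, p="mentions", o=concept))


counts = concept_counts(TRIPLES)
ontology_fragments = fragments_about(TRIPLES, "concept:ontology")
```

Here `counts["concept:ontology"]` is 3, and `ontology_fragments` lists two paragraphs and a table rather than whole articles. That is the finer level of retrieval described above. Because the terms come from a controlled list, the counts are comparable across documents.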

Conclusion

Information overload is not a new problem; the history of scientific publishing is filled with innovations to cope with the expanding quantity of literature, enabling information retrieval without wasting prohibitively excessive amounts of time.

The difference now is that the scale of the problem means that automated processes need to provide more semantically rich solutions. Scholarly publishers may be at the leading edge of semantic enrichment, but anyone who regularly makes use of their online offerings will recognise that there is a wide variation in the quality of the current services, and most still have a long way to go. As William Gibson might say: ‘The future has arrived – it’s just not evenly distributed yet.’

Semantic enrichment is not necessarily a cheap option (especially with high-quality bespoke ontologies) and there are competing technologies that offer the promise of helping people find the content that they need. This is especially the case with open content, where altmetrics and social media both harness the size of the audience to offer alternative ways of filtering data and of finding and identifying related resources. Unfortunately, recognition of information problems, even if accompanied by information solutions, does not always equate to the necessary investment in information services. Semantic enrichment undoubtedly has a lot to offer in the long term, but only time will tell how many organisations are willing to make the investment it requires in the short term.

AUGUST/SEPTEMBER 2015 Research Information 35
