RI August 2018

Feature

Please explain what you understand by ‘semantic enrichment’

Babis Marmanis, executive vice president and CTO at Copyright Clearance Center (CCC): Word representation is central to natural language processing. The default approach of representing words as discrete and distinct symbols is insufficient for many tasks and suffers from poor generalisation. Semantic enrichment is the enhancement of content with information about its meaning. This process augments the amount of information carried by specific words or compositions of words, thereby enhancing the content’s value by making it easily discoverable and relatable to other data sets or information assets.

Giuliano Maciocci, head of product, eLife: The addition of meaning and context to data.

Donald Samulack, president, U.S. operations, Editage: In research communication semantic enrichment can, on the one hand, mean the design or packaging of content to increase human or machine comprehension, but it can also mean the augmentation, association with, or the embedding of additional content in a format other than text, such as an infographic, video explanation, or other form of data visualisation. Semantic enrichment strategically brings focus to the main message of the content, makes specific content stand out above the rest of the narrative, and further enhances the discoverability of the content – through either ‘human’ or ‘machine’ processing of information.

Jordan White, director of content operations, ProQuest: For Alexander Street-branded databases within ProQuest’s portfolio, we have a concept called ‘semantic indexing’ in use since 2000, wherein we take disparate pieces of content, add metadata and other functionality, revealing utility of the content in context. We begin with a scholar and a discipline in mind – a musicologist studying Bach, or a human rights scholar studying Rwanda. We ask: ‘What would they want to know and how would they want to approach the question?’ This leads us to certain themes and metadata concepts – specific instruments or musical keys; historical origins, legal concepts and social trends; ‘metaworks’ such as a play or musical composition that provide a semantic link between various instantiations of that work, sharing certain metadata. All content included in the database is then organised and indexed under these frameworks. Our systems, our controlled vocabularies, and our presentation of our products are all designed to make content discoverable in this light, attuned and adjusted to the expectations and practice of a specific scholarly discipline.

It’s more than 10 years since we reported that RSC Publishing became the first primary research publisher to publish semantically-enriched articles. What do you see as the key developments, industry-wide, since then?

Marmanis, CCC: Advances in natural language processing (NLP) through machine learning (ML) are the key development, one that has had and will continue to have a significant impact on both the production and consumption of semantically-enriched articles. For example, in life sciences, text mining has become an important tool for researchers, and the most fundamental task is the recognition of biomedical named entities, such as proteins, species, chemicals, genes, diseases, and so on. The ability to automatically develop effective word embeddings for biomedical literature has substantially enhanced text mining in that area. Word embeddings capture semantic similarities between words that are not visible from their lexicographic form; for instance, the words ‘enables’ and ‘allows’ are syntactically very different, yet their meaning is somewhat related.
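As a rough illustration of the point about word embeddings, the sketch below compares words by the cosine similarity of their vectors. The vectors, the `TOY_EMBEDDINGS` table and the helper names are hypothetical and hand-picked for this example; real embeddings are learned from large text corpora and typically have hundreds of dimensions.

```python
import math

# Hypothetical 4-dimensional vectors, hand-picked for illustration only.
# Real word embeddings are learned from text, not written by hand.
TOY_EMBEDDINGS = {
    "enables": [0.81, 0.52, 0.10, 0.05],
    "allows": [0.78, 0.58, 0.12, 0.02],
    "protein": [0.05, 0.10, 0.90, 0.40],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity(word_a, word_b):
    """Look up two words in the toy table and compare their vectors."""
    return cosine_similarity(TOY_EMBEDDINGS[word_a], TOY_EMBEDDINGS[word_b])


if __name__ == "__main__":
    # The spellings share few letters, but the toy vectors point the same way.
    print("enables vs allows: ", round(similarity("enables", "allows"), 3))
    print("enables vs protein:", round(similarity("enables", "protein"), 3))
```

With these made-up vectors, ‘enables’ and ‘allows’ score much closer to each other than either does to ‘protein’, which is the kind of relationship a spelling-based comparison would miss.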

Maciocci, eLife: The adoption of structured, semantically rich formats, such as JATS XML, for the digital publishing of academic articles has since become much more prevalent, making more of the available literature – particularly in the open access space – easier to search, mine, cite and cross-link. Additionally, an increasing number of services are now using artificial intelligence (AI) and data-mining techniques to add additional layers of context to the published literature, going well beyond the semantic information contained at the individual paper level and uncovering trends and connections across a much broader corpus. Integration with persistent identifiers (PIDs), including ORCIDs for researchers and Research Resource Identifiers (RRIDs) for resources, along with the use of DOIs and initiatives such as FAIR sharing and the Joint Declaration of Data Citation Principles, are also playing an important role in helping to trace the relationships between papers across publishers.

Samulack, Editage: There was first a movement toward adding more supplemental information in association with a research article. Then there was the recognition – through technology and the general lay-up and design of the average journal page – that it is the attention which is drawn to an article that makes research findings come to life. On a more sociological level, constraints of time and the vastness of the literature have forced readers to be more discerning in what they pay attention to. Brevity and discoverability, as well


August/September 2018 Research Information
