Interview


“The recent pandemic illustrates how important it is to keep metadata content relevant and current”


structure based on the ‘look of the page’. We developed software that we called MindReader because it needed to infer what the coding should have been if the editor had known about codes. Upon reflection, this was a very early implementation of artificial intelligence for structuring content. MindReader worked well and converted more than two million pages in 1983! While early coding systems were somewhat ad hoc, SGML, and later XML, provided more structure and standardisation, allowing the handling and distribution of larger and larger volumes of content. DCL’s role grew as these new capabilities allowed the data industries to grow and handle the ever-increasing data streams we process on a daily basis today.


How does the organisation fit in with the world of scholarly communications? Starting in the mid-1990s, DCL became more heavily involved in supporting scholarly communications. The industry started looking for new and innovative approaches to deal with the ever-growing mountains of content, and the need to reduce costs and become more efficient. Since then, DCL has developed services to support scholarly publishing from the beginning to the end of the publishing workflow. We specialise in complex content transformations. Some examples:


• Ingesting author manuscripts, composing them, and standardising them into JATS and other standard formats;


• Normalising legacy collections into standard formats and loading the content onto the various platforms the industry supports;


• Identifying and extracting metadata to support taxonomies and ontologies;


• Coding and verifying bibliographies and references against third-party data sources;


• Ongoing distribution of content and metadata to discovery vendors with our Discovery Bridge service;


• Analysis of large document collections to identify content reuse across multiple document sets and source formats with our Harmonizer software; and




• QA validation and independent review of previously converted content.


What is the biggest challenge for data companies at present? Keeping up with the rapidly growing volume of the world’s research output, while assuring quality and truthfulness, is a big challenge. It goes beyond the research paper itself, and data companies should take the lead.

How content consumers interact with information is still basically the same as when we were in a print-driven world – someone wants information, looks for the information, discovers a topic of interest, and then consumes that information. The tools we use to search and read are certainly different and constantly updated. But the volumes of material are so much larger today – and finding the right information, and not missing that critical piece of information, is much harder. I think the pandemic illustrates how important it is to keep metadata content relevant and current. Many publishers and other content-centric organisations deeply understand the importance of taxonomies and metadata. But how do you ensure the language you implemented 10 (or 20) years ago keeps pace today? And when a search identifies hundreds, or thousands, of articles – how do you scan them while assuring you are not missing critical information? Much of what’s done in scholarly communications is artisan work, and without more automation it’s difficult to keep up.

It’s time to revisit complex data and content structure challenges. Advances in automation and artificial intelligence, in all its facets (machine learning, natural language processing, computer vision, and so on), hold answers that were not feasible even five years ago. Projects that were previously impractical due to budget constraints are now within reach. I always like to listen to our customers detail big-picture projects that they want to explore, and to find ways to make them affordable to undertake. For example, the New York Public Library knew that they wanted to make ‘a resource for the world’. The thought was that they wanted to provide access to all books that are out of copyright. The first step was to ensure the copyright records were structurally and intelligently tagged. At DCL we took the Internet Archive’s digitised (but unstructured) bibliographic references and put them into XML. This serves as the basis for the NYPL’s large resource. I always ask people: are there unstructured data streams and data collections that could be structured – and improve your processes?
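As a toy illustration of that question (not DCL’s actual software, nor the schema used for the NYPL records), the sketch below shows how a flat, unstructured bibliographic line might be tagged into a simple XML record. The element names and the naive parsing rule are assumptions made purely for this example.

```python
# Hypothetical example: structure an unstructured reference line as XML.
# The <record>/<author>/<title>/<year> names and the regular expression are
# illustrative assumptions, not the schema or logic used in the NYPL project.
import re
import xml.etree.ElementTree as ET

def structure_reference(raw: str) -> ET.Element:
    """Parse an 'Author. Title. Year.'-style line into a small XML element."""
    record = ET.Element("record")
    match = re.match(r"\s*(?P<author>[^.]+)\.\s*(?P<title>.+?)\.\s*(?P<year>\d{4})", raw)
    if match:
        for field in ("author", "title", "year"):
            ET.SubElement(record, field).text = match.group(field).strip()
    else:
        # Keep anything we cannot parse so no data is silently dropped.
        ET.SubElement(record, "unparsed").text = raw.strip()
    return record

if __name__ == "__main__":
    line = "Twain, Mark. The Innocents Abroad. 1869."
    print(ET.tostring(structure_reference(line), encoding="unicode"))
    # <record><author>Twain, Mark</author><title>The Innocents Abroad</title><year>1869</year></record>
```

In a real conversion the parsing would be far more robust and the output would target a standard such as MARCXML or JATS, but the principle is the same: turning flat text into explicit, queryable structure.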


Jump forward 20 years … what will be the role of data in research and academia? I think many of the issues with increasing volume and standardising information will likely be solved over the next few decades, and the role of data may not be that different – though there will be much more of it, and there will be a need for more efficient ways to find what you need.

The looming problem today is trusting the research results. How do we verify research data and make sure that what gets out is accurate and honest? Attempts to reproduce and verify research results are often not successful. Should base data be required? Should independent verification be required? How do we avoid plagiarism and faulty research? With the need to get research out faster in the form of preprints, how does one ensure the scientific process and validation? The concept of big data is not new, and bigger data is already here. Intelligence and structure will allow us to better sort what is meaningful, and what is not. Content structure and metadata help separate the wheat from the chaff and might help us identify faulty research to some extent – but the biggest challenge may not be a data problem. It may be a trust problem.


Any interesting facts, pastimes or hobbies that you want to tell us about? I’m an avid skier and have a strong interest in history, as well as artificial intelligence. A few years ago, I learned to play the saxophone!


Interviews by Tim Gillett



