Text and data mining
student I knew I wanted to do TDM but I never asked. And a lot of people wish Google Scholar had an API. I think these people are actually wishing to do TDM, but don’t know it yet.’ In addition, Neylon suggested that some low-level TDM goes on below the radar. ‘Text and data miners at universities often have to hide their location to avoid auto cut-offs of traditional publishers. This makes them harder to track. It’s difficult to draw the line between what’s text mining and what’s for researchers’ own use, for example, putting large volumes of papers into Mendeley or Zotero,’ he explained.
Formats and standards
When it comes to how people want content to be presented for mining, this varies depending on the topic of the research and the methodology. Rutt of NPG noted: ‘In more recent requests people have been leaning towards XML, but that definitely varies. Not all publishers do XML so researchers may want content that fits with what other publishers offer.’

‘Some researchers prefer XML and some publishers only publish PDF,’ agreed UK-based palaeontologist Ross Mounce. However, he said that he requires PDFs because he needs to mine images for the data behind them. Nonetheless, he said that the ‘PDF is a horrible container. You have to develop tools to mine PDF text.’ But there are things that can make mining easier, he argued. He urged publishers to specify that images derived from data are submitted as vector, rather than raster, files, because then the underlying data ‘would be relatively easy to decode’.
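As a rough illustration of the kind of tooling Mounce describes, the sketch below pulls plain text out of a PDF with the open-source pdfminer.six library. The file name is purely hypothetical, and a real mining pipeline would need considerably more clean-up of hyphenation, column order and captions than this.

# Minimal sketch: extracting plain text from a PDF for mining, using the
# open-source pdfminer.six library (pip install pdfminer.six).
# The file name below is an invented example.
from pdfminer.high_level import extract_text

def pdf_to_text(path: str) -> str:
    """Return the full plain text of a PDF, ready for downstream mining."""
    return extract_text(path)

if __name__ == "__main__":
    text = pdf_to_text("example_paper.pdf")
    # Even a raw dump shows why PDF is a 'horrible container': hyphenation,
    # column order and figure captions all end up interleaved in the output.
    print(text[:500])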
For Piwowar, ‘PDFs are not good.’ She noted: ‘JSON is fantastically easy to parse and so is XML. Ideally, papers would all be in the same format but we don’t want to be stuck in a format of 15 years ago. And if papers are available as one format, someone could develop a way to convert to another. That’s the great thing with open access.’

Tripathy said: ‘I prefer working with the HTML full texts of the article but also work with the XML provided through Elsevier’s text-mining API. A nice thing about HTML is that it is a single standard that most publishers conform to – so I can use code that I’ve developed for one publisher on content provided by another.’ However, he also noted some challenges. The first is that, although the markup language is shared, there is no standard structure across publishers. ‘For each publisher that I extract information from, I need custom bits of code to find relevant content from each of these publishers,’ he explained.
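Tripathy’s point about needing custom code per publisher can be pictured with a small, hypothetical sketch: the HTML is parsed the same way everywhere, but the selectors that locate, say, the abstract have to be configured per site. The publisher names and CSS selectors below are invented for illustration and do not describe any real publisher’s markup.

# Hypothetical sketch: extracting the abstract from full-text HTML pages
# whose markup differs per publisher. The selectors are invented examples.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# One selector per publisher: the 'custom bits of code' described above.
ABSTRACT_SELECTORS = {
    "publisher_a": "div.abstract p",
    "publisher_b": "section#Abs1 p",
}

def extract_abstract(html: str, publisher: str) -> str:
    """Pull the abstract text out of a full-text HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.select(ABSTRACT_SELECTORS[publisher])
    return " ".join(p.get_text(" ", strip=True) for p in paragraphs)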
Another challenge he has is extracting structured information from academic publications, which are typically unstructured. ‘Challenges include identifying relevant bits of information in the publication and tagging that bit of the publication as being relevant to me with high accuracy. Because this problem is quite difficult, algorithmically, I often take a combined approach, with automated text-mining plus manual curation. I use text-mining to scan through thousands of articles to find the relevant ones and then go through and manually check everything that was extracted automatically and fix things as necessary. This problem would be substantially improved if there were better standards for how the information I’m extracting should be communicated within a publication (like there are for genomic information, which specify how different genes should be communicated).’
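In outline, the combined workflow Tripathy describes might look something like the sketch below: an automatic keyword filter narrows thousands of articles down to a ranked list of candidates, which is then written out for a human curator to check and correct. The keywords and file layout are assumptions made for illustration, not a description of his actual pipeline.

# Hedged sketch of 'automated text-mining plus manual curation': a keyword
# filter flags candidate articles, and a curator then verifies each one.
# The keywords and output layout are illustrative assumptions only.
import csv
import re

KEYWORDS = re.compile(r"\b(input resistance|membrane potential|firing rate)\b", re.I)

def find_candidates(articles: dict[str, str]) -> list[tuple[str, int]]:
    """Return (article_id, keyword hits) for articles worth a manual look."""
    hits = [(aid, len(KEYWORDS.findall(text))) for aid, text in articles.items()]
    return [(aid, n) for aid, n in hits if n > 0]

def write_curation_sheet(candidates: list[tuple[str, int]], path: str = "to_curate.csv") -> None:
    """Write candidates to a CSV that a curator checks and corrects by hand."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["article_id", "keyword_hits", "curator_verdict"])
        for aid, n in sorted(candidates, key=lambda c: -c[1]):
            writer.writerow([aid, n, ""])  # verdict filled in manually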
Kamila Markram, CEO of Swiss OA publisher Frontiers, agreed on the need for standards. ‘Standardisation is never easy but needed,’ she said, adding that Frontiers works with various groups of researchers and research organisations to get consensus on how data should be presented. ‘Publishers need to make content available and annotate it. The challenge is in finding what you need to annotate. Researchers need to come to consensus about what they need.’
However, Neylon warned that, sometimes, standardising on database formats and structures can be a challenge and may not always be the right approach. ‘There’s been a tendency to think that, because the early success was with big databases, we should standardise on those formats,’ he observed. ‘It was the right thing for big things like protein crystal structures or gene sequences because they are such a clear kind of data. As you move into how you do things in the lab, for example, it’s less easy to define and those kinds of models don’t fit so well.’
Downloads or crawling

Another consideration for people wishing to do TDM is how to handle publisher content. The two main approaches are bulk downloads and crawling the content on publishers’ sites.
However, there are limitations with both approaches – mainly due to data rates. Mounce finds this frustrating: ‘The literature isn’t actually that large. I’ve got all of PLOS on my computer. I’m looking to extract all 20th century phylogenetic data. I estimate that, in the last decade, there have been more than 100,000 papers on this but that would be less than 100Gb. However, even where I have legitimate access I can’t download too many – and if I download them, it’s at a really slow data rate.’
He noted that each publisher has different limits for how many papers can be downloaded in a given time-frame. ‘They all claim that it’s a technical limit, protecting users from denial of service, but we’re only pinging the content quickly,’ he said. He argued that it is not in researchers’ interests to bring down publishers’ servers either, so TDM is done considerately. He is also frustrated that, even when content is published under a Creative Commons licence, there are still often restrictions on download speeds and crawling.

Alicia Wise said that Elsevier enables TDM through its API, or content can be accessed by bulk download, and that content is available for mining by robots in a way that does not upset normal use. She noted that the API route is particularly suited to relatively low-scale text mining, while bulk download might be more suitable for larger-scale TDM. ‘We see both approaches as rough and ready. We support them but we see the opportunity for better tools,’ she observed, adding that the company is doing a number of pilot projects with institutions around the world in this area.

Markram of Frontiers said: ‘We allow bots but we do evaluate them and stop them if their behaviour is funny. We prefer people to come and ask first because not all bots on the internet are benign.’
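In practice, the ‘considerate’ mining Mounce describes, and the well-behaved bots publishers say they are happy to allow, comes down to identifying the crawler, honouring robots.txt and pausing between requests. The sketch below shows one way this might be done; the base URL, contact address and delay are illustrative assumptions rather than any publisher’s stated policy.

# Hedged sketch of 'considerate' crawling: identify the bot, honour
# robots.txt, and pause between requests so servers are not hammered.
# The base URL, contact address and delay are illustrative assumptions.
import time
import urllib.robotparser

import requests  # pip install requests

BASE = "https://example-publisher.org"
HEADERS = {"User-Agent": "tdm-research-bot/0.1 (mailto:researcher@example.org)"}
DELAY_SECONDS = 5  # an assumed polite delay, not any publisher's stated limit

robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

def fetch(paths: list[str]) -> list[str]:
    """Download article pages politely, skipping any the crawl rules disallow."""
    pages = []
    for path in paths:
        url = BASE + path
        if not robots.can_fetch(HEADERS["User-Agent"], url):
            continue  # respect the publisher's crawl rules
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(DELAY_SECONDS)  # throttle so normal use is not disrupted
    return pages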
Rutt explained NPG’s position: ‘We allow TDM on subscribed content but there are a few practical constraints: we want the IP address of users and ask that they don’t violate copyright with the output of TDM. We also ask librarians of site licences to sign an addendum to their site licence as a one-off. We allow users to come into our system and crawl it, but there is a pretty slow crawl rate so it does not disrupt the system. We can also deliver data on CD or via FTP but we tend to charge for this.’

However, Neylon is dismissive of the idea of TDM causing problems to servers. ‘There’s a lot of nonsense about crawling causing problems. If you’re a publisher of any size you should be able to deal with this traffic. We get over four million