FEATURE Text and data mining
unique visitors per month. We’re talking about a few tens of thousands of hits from TDM. It’s just day-to-day operation of a high-traffic website. And the people who are doing TDM tend to be the most polite and considerate users. Security testing can cause many more problems to publisher servers,’ he said.
Access and imagination
Aside from technical details, there are some big themes that text and data miners want. As Neylon summarised: ‘For anyone to do anything with a corpus of literature, they need to be able to discover and identify sets of literature. They also have to be able to access it, ideally through OA to the material of papers. Thirdly, they need to have the legal rights to do whatever they need to do with it.’
He continued: ‘The technology to support it is all there. Technical tools for building indexes also exist. The only real problem is access to content and the reason it is blocked is sheer lack of imagination and thinking of business models.’
Piwowar agreed on the challenges. ‘I want to use literature to do research on researchers and gather evidence that’s only in full text. The largest challenge is that there is no place to search it all. The closest thing is Google Scholar but that doesn’t have an API. Other places could do it but do not offer a full set of literature,’ she said. ‘There is very limited support because many articles are not OA.’
She continued: ‘Another obstacle is, once you’ve done text mining, how you distribute the results. The NC clause [Creative Commons’ non-commercial designation in some licences] is very ambiguous. ImpactStory [the altmetrics organisation that she founded with Jason Priem] is not for profit but it is incorporated as a company, and at some point we might charge for some premium services. Copyright laws are different in different countries. It’s hard to figure out what you are allowed to do and so it’s easier not to do TDM.’
‘If there’s any risk we might get sued, we won’t do it. It has a huge chilling effect,’ agreed Mounce.
For content licensed under the Creative Commons CC BY licence the issue is simpler, according to Markram. ‘If you don’t have restrictions it’s just a matter of instructions to typesetters. We do ask researchers to acknowledge us.’ She explained that the people behind Frontiers are ‘researchers but also publishers so we have built tools that we want to use’. Indeed, Henry Markram (Markram’s husband and co-founder of the company) is director of a major brain mapping project so he has a huge need to mine data in his own research.
18 Research Information AUGUST/SEPTEMBER 2013
Permissions
Another stumbling block that researchers find is the requirement to ask permission to do TDM. Although academic researchers often have access to a large body of literature through institutional subscriptions, this does not give them automatic rights to do TDM with the content. Many publishers require researchers to ask permission individually, which can present a significant time barrier. ‘I have physical access to content already. Negotiating again genuinely blocks research,’ noted Mounce.
‘This has been a non-trivial challenge,’ observed Tripathy. ‘While my institutional librarians have been very helpful in helping me obtain licences, in cases where I have had to wait for licences, it has slowed my research. However, I’ve found that, when communicating with publishers about text mining (for example, Elsevier or Wiley), they have been very excited to hear about my use case and have been willing to work with me, both in giving me access to content and doing things on their end to help with extracting content.’
He continued: ‘One thing I’ve observed is that editors of the journals I’m extracting information from usually have a hard time understanding why it is even an issue, as long as my institution has a journal subscription. They also appreciate that I’m providing the extracted information back to the community in a useful form.’
Meanwhile, Piwowar recounted how she gained permission to use a large body of content from one subscription publisher but by the time this permission was approved it was too late for her to use the content in her project. ‘Researchers need it when they need it, not months later. Even if you shorten the approval process to two weeks, it’s too long,’ she said, noting that this is a particular problem for early-career researchers who are often on short-term contracts. A related issue is that permission to use subscription content for TDM is generally tied to the subscribing institution, but early-career researchers frequently change institution and therefore have to renegotiate permissions to do TDM.
Neylon is not a fan of the approach of requiring researchers to seek permission: ‘Getting researchers to request to do TDM is barking mad,’ he said. ‘Traditional publishers have built up an entire business model based on control. Structurally and functionally they have to understand how things are used so they can see if they can make money from it – but the reality is that this is actively blocking people from experimenting.’ He added: ‘For most people there is very little point to have output from just one publisher. They need content from all publishers. If you need to spend six weeks negotiating with Elsevier and then another six weeks negotiating with Wiley to get different use conditions, this blocks the project.’ PLOS, he said, does not require users to request permission to do TDM.
Green access
Much of the discussion around TDM focuses on gold OA content but there is, of course, another body of content available – green OA. Discussions at a Westminster Higher Education Forum held in London in February highlighted how some participants favour the green route, despite concerns raised by others over the ability to do TDM on green OA content.
And long-term green OA ‘archivangelist’ Stevan Harnad tweeted similarly from an event in May: ‘#wilbis keeps dwelling on special cases where data-mining important – ignores vast majority where it is not.’ He went on to tell Research Information: ‘@researchinfo Green *will* lead to as much CC-BY and re-use as authors want and need: but all need to mandate immediate-Green first! #wilbis’.
However, this does not seem to reflect researchers’ experiences on the ground today. ‘I haven’t tried using green OA content. The reason is that I don’t know how to go about finding green OA content that is relevant to my use case,’ said Tripathy.
Mounce said: ‘Most green things basically ignore licensing and there is very little CC BY content. Searching across repositories is very difficult, although I’m sure it will get better. I think that saying free access is good enough is selling out future generations. If we get licensing wrong it will be with us for another 70 years.’ And there are practical challenges too, as NPG has found. ‘In 2007 we had a change of policy for our green OA content in PubMed Central (PMC) that put it in a subset that can be mined,’ said Rutt. However, her colleague Grace Baynes, NPG’s head of corporate communications, added that it took a long time for PMC to put this content into the TDM subset.
@researchinfo
www.researchinformation.info