
FEATURE Data management

reveal personal information on Twitter users. ‘Though some methods [exist] to mitigate this risk, simple anonymisation may not fully prevent accidental disclosure,’ she says.

Indeed, the recent ‘Wisdom of the Crowd’ project from Ipsos MORI, UK, which explored social media use and spawned numerous publications, validates Day-Thomson’s concerns. Titles such as A Guide to Embedding Ethics in Social Media Research and The Road to Representivity demand that better ethical standards and rigour be built into the research process.

Predictably, representivity is a thorny issue, as social media users are not necessarily representative of the entire national population. Recent analysis from the US-based Pew Research Center for Internet, Science and Technology on the demographics of social media users revealed that only 20 per cent of the entire adult population use Twitter and 62 per cent use Facebook, while 22 per cent use LinkedIn.

But Weller takes a different tack on representivity. ‘Most of my colleagues do not study social media data to get a representative sample of the general population, but rather to learn something about people,’ she says. ‘This research is not a survey and [we don’t have] sampling mechanisms,’ she adds. ‘So we focus on a research question, which is representative for a specific platform and doesn’t create this problem.’

As Weller highlights, when probing Twitter, correctly phrasing a search query is crucial. Even if the researcher hits the target here, other issues still come into play.

‘More detailed research has uncovered a clear discipline bias’

World map with countries coloured according to the most popular social networking site. For almost all countries, this is Facebook

Perhaps most noteworthy is that, for Twitter researchers, a public API only provides a one per cent sample of the Twitter data, with no means for the researcher to focus on a particular user or pattern. What’s more, a researcher has no idea how this data has been sampled. As Day-Thomson points out: ‘Twitter’s one per cent streaming API is popular with researchers and includes a lot of data – and research institutions may not have the capacity to cope with more data than this. But a problem is that this one per cent is pulled out by Twitter and the algorithm hasn’t been released, so for social scientists there is no way to account for any bias here, as they don’t know what’s been included and why.’ Weller agrees and adds: ‘You just don’t have a clue how this Twitter data is sampled and then, by the way you decide to collect your data, you may create other biases.’

At the same time, interrogating public APIs for past data isn’t easy. According to Weller, Twitter provides access to historical tweets, so you can access the last 3,200 tweets of a single user, such as Barack Obama. But, as she highlights: ‘For topical keyword or hashtag searches, you can’t go back into the past, you can only say “I’ll search from now”.’

Clearly, for spontaneous events such as the Arab Spring protests, a researcher will have to set up data collection methods very quickly if he or she wants to rely on public APIs rather than buy data from Twitter’s data reseller, Gnip. But as Weller emphasises, academics can still work with these issues. ‘Researchers can understand the issues and critically refer to these in a publication to make it explicitly clear to another researcher what he or she actually did,’ she says.

Where next for the Library of Congress and Twitter?

From January to June 2015, Katrin Weller held a research fellowship at the Library of Congress. The GESIS information scientist had hoped to work with Twitter datasets in the archive, but the lack of availability meant her research never actually took place. What’s more, she doesn’t hold much hope for any solutions soon.

‘From my experience I can say the archive isn’t going to happen soon or even in the next 10 to 20 years, so as a researcher I wouldn’t rely on this being available,’ she says. ‘The library is also archiving text-based tweet formats, which you can get through an API or Gnip. This means you won’t get the look and feel of what the platform looked like in, say, 2006, and images and URLs inside a tweet may lose their value if they can’t be resolved.’

6 Research Information APRIL/MAY 2016

Dare to share?

Beyond ethics and representivity, legal issues abound. Under Twitter’s API Terms of Service, researchers can share tweet identification numbers but sharing larger datasets is prohibited.

‘This means that the next researcher who wants to look at [your research] will have to retrieve every single tweet based on its ID, which takes time,’ she says. ‘And if the tweet has been deleted, it just doesn’t show up anyway, which for research purposes is a disaster in some ways. The Terms of Service also means you cannot preserve your research data, which is a big problem.’

As many in the field acknowledge, including Day-Thomson and Weller, the terms and conditions from Twitter and other social media platforms largely exist to protect the profits gained by selling user data to commercial companies. However, one noteworthy exception has raised a glimmer of hope for researchers worldwide. In 2010, Twitter gave the US Library of Congress permission to archive every public tweet since its inception in 2006, and to continue archiving future tweets. At the

@researchinfo www.researchinformation.info

Christallkeks
