SCW_APRMAY11

Inside view

Casting the net

The internet is changing the way we look at data, says Cameron Neylon, senior scientist at the UK’s Science and Technology Facilities Council

At its core, the web is a communication technology, but one we’re not used to working with. As scientists our traditional approach is to pass documents up and down chains, and through specific filtration points. Scientific papers are an obvious example: once something is written, it gets passed up to the editor of a journal. Through a variety of processes, he or she then makes a decision as to whether it deserves publication – and, if so, sends it out through a fairly defined distribution network.

That’s how we communicate science and, to a certain extent, the way we’ve always managed the flow of information. There have been lots of efforts to look at matrix management, but none of them has tended to be successful. In an industry where we are fundamentally focused on leveraging information and data, that has probably been the most efficient – but not the cheapest – way of doing things. By dropping those distribution costs to practically zero, the web has changed all that: for all intents and purposes, we are no longer constrained to defined flows of data and information.

This does, of course, bring its own set of challenges. If we look at a company like Facebook, which has a share value akin to that of significant pharmaceutical companies, we see that it is essentially a knowledge-management business that deals in data and information mined entirely from non-structured sources. The question we must ask is whether or not that template can transfer from pure social information in a large-scale, very high penetration market, into areas where the data is more specific and the concerns over the quality of information are greater. It doesn’t matter if someone lies about their pet, but it matters a great deal if people put misinformation about the safety of a drug into the system. We’re not talking about the same thing by any means, but companies like this challenge our preconceptions of how data can and should be handled.

To a large extent these companies have solved the problem of mining vast quantities of unstructured data – an issue that continues to be viewed as intractable by the scientific community. I always hear that we can’t possibly take free-text records of laboratory notebooks and make any sense out of them, or that we can’t use ‘search’ as a mechanism for organising data. But you just have to consider organisations such as Facebook, Google and Amazon Booksellers that are actually doing this right now; just not with the type of data we’re used to.

There is often an assumption that if data is unstructured, then we can’t do anything with it, but the success of many companies based on leveraging diffuse and unstructured data from the consumer web shows that isn’t true. We aren’t able to do this today, but if we consider how to get our data into a form where those tools can be applied, then that day will come. It may be a result of the influence of the commercial sector, or efforts on behalf of pharma to push more data into a pre-competitive space, but the key point is that if we have enough data, we, and indeed other people, can do amazing things with it.

Part of the web’s success has been down to people taking data and doing exciting new things with it. I’m sure there are lots of individuals who would love the opportunity to take the toxicology data from failed lead compounds and try to figure out if there is perhaps some useful information there that can be used to approach other challenges, or indeed guide future developments. There is a lot of interest out there, and there are skilled people, including amateur scientists, to do the work, but they will go nowhere unless that data is available.

We work in a world where risk assessment is central, and that risk is always focused on the release of data. But we don’t ever consider the risk of not releasing – how do you find the person who has the answer to your problem, and what happens if you don’t? We need to create a community where information can come in and be available for people to exploit. And for that environment to thrive, we need to contribute to those systems as well as take from them.

One of the reasons there has been relatively little success in putting electronic laboratory notebooks (ELNs) in academic settings is the cost of adapting the existing off-the-shelf product to specific needs. I suspect a lot of that adaptation involves redoing something that has already been done for someone else, but not made publicly available. Sharing that work would have real potential for reducing software costs significantly, and vendors are addressing this at the moment.

I do think things are moving in that direction with the development of open-source systems and increasing modularisation in commercial vendor offerings.

Interview by Beth Sharp

SCIENTIFIC COMPUTING WORLD www.scientific-computing.com
