LABORATORY INFORMATICS


20 Scientific Computing World October/November 2018

rest of Oxford's biomedical research in a new Big Data Institute, part-funded with a generous donation from one of Asia's richest and most influential entrepreneurs, the Hong Kong billionaire Li Ka-Shing. This institute hosts the single biggest computer centre in Oxford and takes in data from all over the world. 'Our largest datasets come from imaging and genomics, but we also hold gene expression and proteomics data, as well as consumer data and NHS records,' says the Institute's director, Gil McVean. 'And "imaging data" itself covers a huge variety of modalities and therefore of image types, from whole-body imaging and functional MRI brain scans to digital pathology slides.'

The Big Data Institute holds all the data collected through one of the largest epidemiology initiatives yet launched: the UK Biobank, designed to track diseases of middle and old age in the general population. This recruited half a million middle-aged British individuals between 2006 and 2010. Initially, they provided demographic and health information and blood, saliva and urine samples, and consented to long-term health follow-up. This information, including universal genome-wide genotyping of all participants, is being combined with health records, including primary and secondary care and national death and cancer registries, to link genetics, physiology, lifestyle and disease. Many participants have also been asked to answer further web-based questionnaires or to wear activity trackers for a given period, and the genetic part of the study is still being extended, as its senior epidemiologist Naomi Allen explains. 'A consortium of pharma companies, led by Regeneron, has funded exome sequencing – that is, sequencing the two per cent of the genome that actually codes for functional genes – of all participants. They hope to complete the sequencing by the end of 2019, and UKB will make the data publicly available to the wider research community a year later.' The eventual, if ambitious, aim of UK Biobank is to sequence all 500,000 complete genomes over the next few years and release that data, too, to the research community.

Ensuring data access and anonymity

All academic and even some commercial biomedical research is now carried out under open access principles. These can be hard to apply to human health data, however, as issues of privacy are also important. The UK Biobank restricts its data to bona fide scientists seeking to use the data in the interest of public health; as such, the type of research projects varies widely. All the data are 'pseudonymised' using encrypted identifiers that are unique to each research project, before being made available to researchers. However, with the amount of detailed data available nowadays, it is becoming harder to guarantee anonymity. 'Some datasets, particularly those including genomic data, are so rich that it becomes possible – if very hard – to track down an individual from their data, even if all identification has been removed,' explains McVean.

One perhaps drastic solution is the creation of artificial data records. These have similar characteristics to patient records and can be fully analysed, but represent no individuals. Edwin Morley-Fletcher, CEO of the Rome-based technology consultancy Lynkeus, has pioneered machine learning methods for creating this type of data for e-clinical trials. 'We use a method called recursive conditional parameter aggregation to create data for a set of artificial patients that is statistically indistinguishable from the real patient data it was derived from,' he explains.

Any health study that depends on volunteers providing data can also suffer from the problem of volunteer bias: people interested enough in their health to join such studies are generally likely to be healthier than the population as a whole. It is possible to collect data directly from the millions of Fitbit fitness trackers in regular use, but this would represent an even more biased dataset, as most people who own Fitbits are richer and more tech-savvy, as well as more health-conscious, than average.

Martin Landray, deputy director of the Big Data Institute, was involved in the Biobank study of exercise and health, which overcame some of this bias by sending accelerometers to 100,000 volunteers and asking them to wear them for a week. Each one returned 100 data-points per second, generating a huge dataset. 'This presented a big data challenge in three ways: in logistics, in "cleaning" the raw data for analysis, and in machine learning: picking out movement patterns associated with each type of activity,' he says. 'Only then could we – after discarding outliers – detect any relationship between actual (as opposed to self-reported) activity levels and health outcomes.'

But whose data is it anyway?

All biomedical data, including 'omics data, is derived from one or more individuals, and much of the data collection depends on willing volunteers. And, particularly as anonymisation cannot be fully guaranteed in all circumstances, all these individuals, as 'data subjects', must have some rights over their data. Anna Middleton, head of society and ethics research at the Wellcome Trust Genome Campus (the home of the Sanger Institute), leads a global project called 'Your DNA Your Say' that aims to discover how willing people are to donate and share their genomic and other biomedical data. 'We are finding that people are more willing to share their data if they understand what it means, and if they trust the scientists who will be using it,' she says. 'Unfortunately, levels of knowledge and trust are low in many countries, and scientists must do more to communicate the meaning and value of our big data and how it drives our work.'
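The per-project pseudonymisation described above – encrypted identifiers unique to each research project – can be illustrated with a short sketch. This is not UK Biobank's actual implementation, and the function and key names are invented for the example; it simply shows the general idea of keyed hashing (here HMAC-SHA256), under which one participant maps to different, unlinkable identifiers in different projects, while records within a single project still link together.

```python
import hashlib
import hmac


def pseudonymise(participant_id: str, project_key: bytes) -> str:
    """Derive a project-specific pseudonym from a participant ID.

    The same participant yields different, unlinkable pseudonyms under
    different project keys, but the mapping is deterministic within a
    project, so a participant's records can still be joined up.
    """
    digest = hmac.new(project_key, participant_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability


# Hypothetical per-project secret keys
key_a = b"project-A-secret"
key_b = b"project-B-secret"

# Same participant, different projects -> different pseudonyms
a = pseudonymise("P000123", key_a)
b = pseudonymise("P000123", key_b)
assert a != b

# Deterministic within one project, so records still link up
assert a == pseudonymise("P000123", key_a)
```

Because the pseudonym is derived with a secret key rather than a plain hash, a researcher holding one project's dataset cannot recompute or cross-match identifiers without that key – though, as McVean notes above, rich data such as genomes can still re-identify individuals regardless of how the identifier itself is protected.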


@scwmagazine | www.scientific-computing.com

