SCW_OCTNOV15

high-performance computing

Transforming discovery through integrated, comprehensive data management

Dan Bedard describes the progress that the integrated Rule-Oriented Data System is making in removing the data management roadblocks that inhibit the wider research use of data

By now the readers of Scientific Computing World will have heard plenty about the promise of big data and the challenges that inhibit realisation of that promise. Technologies that generate data – such as distributed sensor networks, social media, and low-cost genome sequencers – enable researchers to study problems at unprecedented levels of detail. Studies that will transform our understanding of issues such as climate change, intergenerational poverty, and debilitating diseases are now within our reach. To unlock the promise of these prolific data sets, however, we need to address several tricky data management problems.

The realities of modern-day scientific research complicate data management. Scientific problems are multidisciplinary and messy; data sets are often distributed among many institutions, each with its own storage technologies and data management practices. And while collaborative research requires data sharing, research on sensitive data requires that personal information be secured and protected. For example, social science data often includes health and education records that must be anonymised or secured. In addition, the available analytic methods – the tool belt of data processing and analysis – continue to evolve, compelling researchers to preserve data for an essentially unlimited lifespan.

As head of the iRODS Consortium, based at the University of North Carolina’s Renaissance Computing Institute (RENCI), I see our members grapple with these competing forces every day. The data management questions asked by research and business organisations include:

• Can we find a consistent, sensible mechanism for accessing and administering our data, which spans departments and institutions?

• What tools can we use to organise and explore subsets within our data? And

• How do we implement policies that ensure the integrity, security, privacy, and efficient processing of our data?

[Figure: The Sanger Institute Sequencing Centre. The Sanger Institute has one of the largest sequencing centres in the world]

18 SCIENTIFIC COMPUTING WORLD | @scwmagazine | www.scientific-computing.com

The iRODS Consortium was founded to sustain the integrated Rule-Oriented Data System (iRODS), free open-source software that provides policy-based management of unstructured data (i.e. files). iRODS presents a standard interface to data that is spread across multiple file systems and object stores, enabling a multitude of web clients, command line tools, and APIs to access the user’s data. Files in iRODS are associated with system and user-level metadata in a central, indexable catalogue. The iRODS rule engine implements data management policies for access control, retention, and any automated task imaginable across a data grid. To enable broad collaboration, iRODS deployments can be federated, a process that allows different data sets with independently defined management policies to appear to function as a single entity.
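To make the idea of a metadata catalogue driven by policy rules more concrete, here is a minimal Python sketch. It is not the iRODS API or the iRODS rule language: the names `Catalogue`, `DataObject`, `access_rule` and `retention_rule`, the metadata keys, and the `"register"` event are all hypothetical, chosen only to illustrate how policies might fire automatically when a file is registered and how the resulting metadata could then be searched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    """One file known to the catalogue, with user-level metadata (hypothetical model)."""
    logical_path: str
    owner: str
    metadata: Dict[str, str] = field(default_factory=dict)


Rule = Callable[[DataObject], None]


class Catalogue:
    """Toy central catalogue that stores objects and runs policy rules on events."""

    def __init__(self) -> None:
        self.objects: Dict[str, DataObject] = {}
        self.rules: Dict[str, List[Rule]] = {}

    def on(self, event: str, rule: Rule) -> None:
        """Attach a policy rule to a named event (event names are assumptions)."""
        self.rules.setdefault(event, []).append(rule)

    def register(self, obj: DataObject) -> None:
        """Add an object, then fire every rule attached to the 'register' event."""
        self.objects[obj.logical_path] = obj
        for rule in self.rules.get("register", []):
            rule(obj)

    def query(self, key: str, value: str) -> List[str]:
        """Return the sorted logical paths whose metadata has key == value."""
        return sorted(
            path
            for path, obj in self.objects.items()
            if obj.metadata.get(key) == value
        )


def access_rule(obj: DataObject) -> None:
    """Illustrative access policy: restrict anything flagged as sensitive."""
    sensitive = obj.metadata.get("sensitive") == "yes"
    obj.metadata["access"] = "restricted" if sensitive else "open"


def retention_rule(obj: DataObject) -> None:
    """Illustrative retention policy: record a fixed ten-year retention period."""
    obj.metadata["retain_years"] = "10"


# Small demonstration with made-up paths and metadata values.
catalogue = Catalogue()
catalogue.on("register", access_rule)
catalogue.on("register", retention_rule)
catalogue.register(
    DataObject("/zone/study1/a.dat", "alice", {"study": "S1", "sensitive": "yes"})
)
catalogue.register(
    DataObject("/zone/study1/b.dat", "bob", {"study": "S1", "sensitive": "no"})
)
print(catalogue.query("access", "restricted"))
print(catalogue.query("study", "S1"))
```

The point of the design is that users only register files and describe them; the policies live in one place and are applied uniformly, which is the general pattern the paragraph above attributes to a rule engine working alongside a central catalogue.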

How are organisations using iRODS to take control of their data? In June, users from industry, academic research centres, and government gathered to share their experiences at the seventh annual iRODS User Group Meeting, hosted by the iRODS Consortium in Chapel Hill, North Carolina.

Jon Nicholson of the Wellcome Trust’s Sanger Institute, based at Hinxton, near Cambridge in the UK, explained how it is using iRODS to manage petabytes of genome sequence data. After sequencing, the aligned data files are annotated with metadata indicating parameters such as the study ID and whether or not human DNA is included in the sequence. Using iRODS rules, the researchers automate several critical tasks.

Checksums, an error-detection technique used in data transfers, are calculated on the data and stored as metadata. iRODS uses these checksums to verify periodically that the data has not


Wellcome Library, London
