high-performance computing
all the data coming from the Hubble Space Telescope, it consulted the physicists at the Stanford Linear Accelerator (SLAC) BaBar experiment, and applied their metadata-based techniques to astronomy. Data collected from Hubble over the decades is meticulously annotated with rich metadata so that future generations of scientists, armed with more powerful tools, can discover things we can't today. In fact, because of rich metadata, more research papers are being published on decades-old archived Hubble data than on current observations.
General solutions to managing metadata

So what if your organisation isn't part of a multi-billion dollar, multinational big science project with the resources to build a custom system for managing metadata? Good news: there are a couple of broadly available and generally applicable metadata-oriented data management systems already used by hundreds of scientific organisations: iRODS and Nirvana. These 'twin brothers from different mothers' were both invented by Dr Reagan Moore (a physicist, of course!), formerly with General Atomics and the San Diego Supercomputing Center, and now with the Data Intensive Cyber Environments (DICE) research group at the University of North Carolina. iRODS is the Integrated Rule-Oriented Data System, an open source project developed by DICE. Reagan Moore discussed the system in his article 'How can
we manage exabytes of distributed data?’ on the Scientific Computing World website in March 2014. Nirvana is a commercial product
developed by the General Atomics Energy and Advanced Concepts group in San Diego, which grew out of a joint effort with the San Diego Supercomputing Center's Storage Resource Broker (SRB). ('Taking action on big data' is a recurrent
theme for North Carolina, as Stan Ahalt, director of the Renaissance Computing Institute (RENCI), professor of computer science at UNC-Chapel Hill, and chair of the steering committee for the National Consortium for Data Science (NCDS), discusses in his article on these pages.)
How they work

These systems have agents that can mount virtually any number and kind of file or object-based storage system, and then 'decorate' their files with rich metadata that is entered into a catalogue sitting on a standard relational database such as Postgres or Oracle. GUI or command-line interfaces are used for querying and accessing the data. Data can then
be discovered and accessed through an object's detailed attributes, such as creation date, size, frequency of access, author, keywords, project, study, data source, and more. All this data can reside on very different, incompatible platforms crossing multiple administrative domains, yet be tied together under a single searchable global name space. Several processes run in the background of this federation that move data from one location to another, based on policies or events, to coordinate scientific workflows and data protection, much like the systems at Cern. These systems can also generate audit trails, track and ensure data provenance and reproducibility, and control data access – exactly what's needed to manage and protect scientific data.
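The attribute-driven discovery described above can be illustrated with a toy catalogue: files scattered across different storage systems are 'decorated' with metadata rows in a relational database, and found by attribute rather than by location. The schema, paths, and field names below are illustrative assumptions for the sketch, not the actual iRODS or Nirvana interfaces.

```python
import sqlite3

# Minimal metadata catalogue: one row per data object, with searchable attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE catalogue (
        path        TEXT PRIMARY KEY,   -- location on any backing store
        author      TEXT,
        project     TEXT,
        keywords    TEXT,
        created     TEXT,
        size_bytes  INTEGER
    )
""")

# 'Decorate' files living on different, incompatible platforms with rich metadata.
records = [
    ("/archive/hubble/obs_001.fits", "E. Smith",  "deep-field", "galaxy,survey", "1995-12-18", 2_097_152),
    ("/scratch/slac/run_042.h5",     "J. Jones",  "babar",      "collision",     "2001-03-02", 8_388_608),
]
conn.executemany("INSERT INTO catalogue VALUES (?, ?, ?, ?, ?, ?)", records)

# Discovery: find objects by their attributes, not by knowing where they live.
rows = conn.execute(
    "SELECT path FROM catalogue WHERE project = ? AND keywords LIKE ?",
    ("deep-field", "%galaxy%"),
).fetchall()
print(rows)  # → [('/archive/hubble/obs_001.fits',)]
```

Production systems add agents that keep the catalogue in sync with the backing stores, plus policy engines that act on these rows (replication, migration, audit), but the core idea is the same: a single queryable name space over heterogeneous storage.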
Metadata is the future of scientific data management

Scientific big data, and the metadata-based techniques that manage it, are no longer the preserve of big science. Ever-increasing sensor resolution from sequencers, cameras, microscopes, scanners and instruments of all types is driving a deluge of data across all of science. Fortunately, robust tools are readily available for effectively managing all this data. Now it's up to you to use them!
Bob Murphy is big data programme manager at General Atomics
Despite the recent proliferation of advanced degree programmes in data science and analytics, there are as yet no standards for data science curricula, and whether these programmes will meet the needs of data specialists in the commercial world is still unclear. That's why the NCDS sponsors events that bring talented students in analytics, information sciences, and data-driven domain sciences together with representatives of the organisations who need them. It's why we've formed teams consisting of faculty and corporate NCDS members to develop plans for data science curricula, and why we're working to build a data observatory that will give students the chance to work with real –
and very large – data sets.

Data challenges are ubiquitous and universal, which means solutions must break down barriers among scientific domains, and between the public and private sectors. It's relatively simple for
corporate sponsors to fund university researchers interested in investigating data science questions. It's tougher to bring together university researchers, software and hardware specialists, and professionals in multiple business sectors to work side by side on broad-based efforts to translate data into knowledge and products that enable a better quality of life. That kind of work requires bridging cultural barriers, finding common ground, learning new lexicons, adapting, and sometimes, compromising. Within the NCDS, our working groups all
include members from industry, academia, and the non-profit sector. Our Data Fellows are early-career faculty who address interesting data science questions and also want to translate their work into commercial settings. This kind of collaboration is still relatively rare, but it is essential to finding big data solutions that will lead to personalised healthcare, informed product development,
and real-time decision-making based on the latest data. For the NCDS, the last two years have
been busy, sometimes frustrating, always interesting, and often inspiring. We don't claim to have all the answers to daunting data management problems, nor have we figured out how to ensure the world will have the data-literate workforce it will need for the future. But we have learned that data solutions
must come from people of all backgrounds working together, and we believe we have created a framework for those critical, action-oriented collaborations.
Dr Stanley Ahalt is director of the Renaissance Computing Institute (RENCI), professor of computer science at UNC-Chapel Hill, and chair of the steering committee for the National Consortium for Data Science (NCDS).
@scwmagazine
www.scientific-computing.com