resources

How can we manage exabytes of distributed data?

Reagan W Moore argues that policy-based data-management systems are the next stage in the evolution of large-scale data management


With the exabytes of data that are being generated today, it has become essential to integrate networking technology and data-management technology to manage the movement and storage of data. Policy-based data-management systems provide a way to proceed. They represent perhaps the latest stage in the evolution of data-management systems from file-based systems, to information-based systems, and now to knowledge-based systems. File-based systems focused on the management of bits, and provided a standard I/O interface for reading and writing files. Information-based systems added support for information about the files, including provenance, descriptive, and structural information – stored as metadata. Knowledge-based systems add support for procedures that either extract or generate information, and enable the processing of data within the storage environment.

For 16 years, the Data Intensive Cyber Environments (DICE) group at the University of North Carolina at Chapel Hill has been developing data-management systems called data grids – software that makes it possible to organise distributed data into sharable collections, while enforcing access controls. The original system, the Storage Resource Broker (SRB), focused on ensuring consistency across all operations performed in a distributed environment. Implemented as middleware, the SRB was installed where data would be stored. Applications included: the BaBar High Energy Physics project, which moved two petabytes of data between Palo Alto, California and Lyon, France; the US National Optical Astronomy Observatory, which managed the migration of data from telescopes in Cerro Tololo, Chile, to archives in Illinois; and the United Kingdom’s e-Science data grid. The SRB provided a standard I/O interface, while managing metadata about the distributed files. The applications managed hundreds of millions of files.

38 SCIENTIFIC COMPUTING WORLD

Despite SRB’s success in managing data and information, users requested the ability to modify consistency constraints and implement multiple types of data-management policies. A requirement from the UK e-Science data grid, for example, was to create a collection in which files were permanently managed and could never be deleted. But, at the same time, it was desirable that administrators should be able to replace corrupted files, and users update their own files. This implied the need to manage at least three different constraints on data deletion within the same system: no deletion allowed; deletion by administrator; and deletion by file owner.

The DICE group developed a policy-based
system to extract knowledge about management policies from the software, and apply the knowledge via computer-actionable rules. Effectively, every software-encoded consistency constraint was replaced by a policy-enforcement point. Actions by clients were trapped at the policy-enforcement points. By searching the rule base, an appropriate rule could then be identified, which controlled the execution of a workflow that applied the required management policy. This meant that the knowledge needed to manage the system could be captured in computer-actionable rules. The system was no longer restricted to managing files and static representations of information. Instead, a data-management system could use rules to control the system and dynamically change the rules in a rule base.

The integrated Rule Oriented Data System (iRODS) was developed over the past seven years, and has replaced the SRB. Within iRODS, policies can be enforced for: preservation (authenticity, integrity, chain of custody, original arrangement, retention, disposition); data publication in a digital library (descriptive metadata annotation, arrangement, creation of presentation versions such as image thumbnails); sharing in a data grid (access controls, distribution, caching); reproducible data-driven research in a processing pipeline (workflow procedures, workflow provenance, workflow re-execution); or validating assessment criteria (repository trustworthiness, compliance with regulations).

Today, viable data-management systems automate enforcement of management policies within storage controllers, administrative tasks such as data migration, and the validation of assessment criteria. They capture knowledge, and automate processing of data within workflow pipelines. The automation of these tasks corresponds to the creation of knowledge procedures that can be applied by a policy-based data-management system.

Through policy-based data-management systems, it will be possible to implement feature-based indexing of data collections. Discovery of data can be driven by the presence of desired features within the data set, instead of descriptive metadata. This requires the ability to apply a procedure to the data, determine whether the desired feature is present, and build an associated index. Policy-based systems can control the execution of the associated procedures.

Through policy-based data-management systems, it will be possible to link virtual collections to virtual networks, and access data by name instead of network location. A data-management system can be integrated with network routers, using technology such as OpenFlow, and dynamically define the network path that is used to access a file. If a file is replicated within the logical collection across multiple storage locations, the request for a file can be automatically routed to the closest copy.

These applications imply that policy-based systems will become pervasive, and migrate into storage controllers and into the internet. The knowledge required for processing or transferring data can be captured as procedures that are automatically applied under policy-based control.
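
To make the idea of a policy-enforcement point concrete, the Python sketch below shows how the three deletion constraints mentioned earlier (no deletion, deletion by administrator, deletion by file owner) might coexist in one rule base. It is an illustration only, not the iRODS rule language or its API: the names Request, Rule, RuleBase, enforce and in_collection, and the collection names, are all hypothetical. The sketch also simplifies a rule to an allow-or-deny decision, whereas the article describes rules that control the execution of a workflow, and it assumes that a request matching no rule is denied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Request:
    """A client action trapped at a policy-enforcement point."""
    action: str       # operation requested, e.g. "delete"
    user: str         # who is asking
    is_admin: bool    # whether the user is an administrator
    owner: str        # owner of the target file
    collection: str   # logical collection that holds the file


@dataclass(frozen=True)
class Rule:
    """A computer-actionable rule: when it applies and what it decides."""
    name: str
    applies: Callable[[Request], bool]
    allow: Callable[[Request], bool]


class RuleBase:
    """Rules grouped by the action they govern; editable at run time."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Rule]] = {}

    def add(self, action: str, rule: Rule) -> None:
        self._rules.setdefault(action, []).append(rule)

    def find(self, request: Request) -> Optional[Rule]:
        """Return the first rule whose condition matches the request."""
        for rule in self._rules.get(request.action, []):
            if rule.applies(request):
                return rule
        return None


def enforce(rule_base: RuleBase, request: Request) -> bool:
    """Policy-enforcement point: look up a rule and apply its decision.

    Assumption: a request that matches no rule is denied.
    """
    rule = rule_base.find(request)
    return rule is not None and rule.allow(request)


def in_collection(name: str) -> Callable[[Request], bool]:
    """Build a condition that matches requests against one collection."""
    return lambda request: request.collection == name


# Three deletion policies in one system (collection names are made up).
rules = RuleBase()
rules.add("delete", Rule("no-deletion", in_collection("archive"), lambda r: False))
rules.add("delete", Rule("admin-only", in_collection("curated"), lambda r: r.is_admin))
rules.add("delete", Rule("owner-only", in_collection("home"), lambda r: r.user == r.owner))
```

In this sketch, changing a policy means adding or replacing an entry in the rule base rather than editing the code that performs the deletion, which is the property the article attributes to policy-based systems.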


Reagan W Moore is lead developer of iRODS and principal investigator for the DataNet Federation Consortium at the University of North Carolina at Chapel Hill


@scwmagazine | www.scientific-computing.com

