HPC 2013-14 | Data management

Data management system evolution

Policy-based systems will become pervasive, says Reagan W. Moore

Data management has evolved from using file-based systems to information-based systems – and now knowledge-based systems. The file-based system focused on the management of bits and provided a standard I/O interface for reading and writing files. The information-based systems added support for information about the files, including provenance, descriptive, and structural information stored as metadata. Knowledge-based systems add support for the application of procedures that either extract or generate information, and enable the processing of data within the storage environment.

For 16 years, the Data Intensive Cyber
Environments (DICE) group has been developing data grids – software to organise distributed data into sharable collections while enforcing access controls. The original system was called the Storage Resource Broker (SRB) and focused on ensuring consistency across all operations in a distributed environment. The SRB was implemented as middleware – software that was installed at locations where data would be stored.

The SRB system provided a standard I/O
interface, while managing metadata about the distributed files. The applications managed petabytes of data and hundreds of millions of files in international collaborations.

Despite the success of SRB, users requested
the ability to modify consistency constraints and implement multiple data management policies. A driving requirement from the UK e-Science data grid was the ability to create a collection in which files were permanently managed and could never be deleted. But the ability to manage a collection in which administrators could replace corrupted files was also desired, as was the ability for users to update files in their own collections. This implied the need to manage at least three consistency constraints on data deletion within the same system.

The DICE group developed a policy-based system to extract knowledge about management
policies from the software, and apply the knowledge via computer-actionable rules stored in a rule base. Every software-encoded consistency constraint was replaced by a policy enforcement point. Actions by clients were trapped at the policy enforcement points. By searching the rule base, an appropriate rule could be identified to control the execution of a workflow that applied the required management policy. This meant that the knowledge needed could be captured in computer-actionable rules. The system was no longer restricted to managing files and static representations of information. Instead, a data management system could use rules that controlled the behaviour of the system, and dynamically change the rules in a rule base. It became possible to use generic infrastructure to implement archives, digital libraries, data grids for sharing data, project collections, and processing pipelines simply by changing the rules and procedures enforced by the system.

The integrated Rule Oriented Data System
(iRODS) was developed over the last seven years, and has replaced SRB technology. Within iRODS, policies can be enforced for preservation, for data publication in a digital library, for sharing in a data grid, for reproducible data-driven research in a processing pipeline, or for validating assessment criteria.

Today, viable data management systems automate enforcement of management
policies within storage controllers, automate administrative tasks such as data migration, automate validation of assessment criteria, capture knowledge (processes) associated with creating derived data products, capture knowledge (communication protocols) needed to interact with remote systems, and automate processing of data within workflow pipelines.

Through policy-based data management
systems, it will be possible to implement feature-based indexing of data collections. Discovery of data can be driven by the presence of desired features within the data set, instead of descriptive metadata. This requires the ability to apply a procedure to the data, determine whether the desired feature is present, and build an associated index.

Through policy-based systems it will be
possible to link virtual collections to virtual networks, and access data by name. A data management system can be integrated with network routers, through technology such as OpenFlow, and dynamically define the network path that is used to access a file. If a file is replicated within the logical collection across multiple storage locations, the request for a file can be automatically routed to the closest copy.

These types of applications imply that
policy-based systems will become pervasive, and migrate into storage controllers (for automated data processing) and into the internet (for intelligent networks). In each case, the knowledge required for processing or transferring data can be captured as procedures that are automatically applied under policy-based control. The integration of networking technology and data management technology has become essential for managing the movement and storage of the exabytes of data that are being generated today. Policy-based systems provide a way to proceed.
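
For readers who want a concrete picture of the policy enforcement point and rule base described earlier, the idea can be sketched in a few lines of Python. This is a minimal illustration only, not the iRODS rule language or API: every name in it (Collection, Request, RULE_BASE, policy_enforcement_point and the three policy labels) is a hypothetical one chosen for this sketch. It models the three deletion constraints mentioned above as entries in a rule base that is searched each time a client action is trapped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Collection:
    name: str
    owner: str
    policy: str  # hypothetical label selecting which rules apply
    files: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class Request:
    user: str
    role: str    # "user" or "admin"
    action: str  # "delete" or "replace"
    path: str


# A rule decides whether a trapped action may proceed.
Rule = Callable[[Request, Collection], bool]


def never(request: Request, collection: Collection) -> bool:
    return False


def admin_only(request: Request, collection: Collection) -> bool:
    return request.role == "admin"


def owner_only(request: Request, collection: Collection) -> bool:
    return request.user == collection.owner


# The rule base: (policy label, action) -> rule. Hypothetical contents.
RULE_BASE: Dict[Tuple[str, str], Rule] = {
    ("permanent", "delete"): never,        # files can never be deleted
    ("curated", "delete"): never,
    ("curated", "replace"): admin_only,    # admins may replace corrupted files
    ("personal", "delete"): owner_only,    # users manage their own files
    ("personal", "replace"): owner_only,
}


def policy_enforcement_point(rule_base, collection, request) -> bool:
    """Search the rule base for a matching rule; deny if none is found."""
    rule = rule_base.get((collection.policy, request.action))
    return rule is not None and rule(request, collection)


def delete_file(rule_base, collection, request) -> None:
    """Client action that is trapped at the enforcement point before it runs."""
    if not policy_enforcement_point(rule_base, collection, request):
        raise PermissionError(f"{request.action} denied on {collection.name}")
    del collection.files[request.path]
```

In this sketch the deletion code itself contains no policy. Changing an entry in the rule base, for example mapping ("permanent", "delete") to admin_only, changes the behaviour of the system without touching delete_file, which is the property the article attributes to policy-based systems.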

Reagan W. Moore is a professor at the School of Information and Library Science, University of North Carolina at Chapel Hill, and chief data scientist at Renaissance Computing Institute (RENCI)
