
LABORATORY INFORMATICS

scientists lacked the infrastructure and tools required to do their work. I therefore set about creating an infrastructure that would support the team. This included building a large expression data repository that would support various queries and interface with other data sources and applications. These types of tools and infrastructure have since been developed by Genestack to support pharma and other organisations that have large, complex data sets. These technologies address data management issues, so an exploration of the underlying enablers might provide useful background.

Define your terms with metadata

The game-changing technology developed by Genestack is a powerful mechanism for describing data, whether that data is a project, an experiment, a study or an individual data type such as a chemical structure or dose.

own data in a bigger context. But even when the data is accessible, issues like missing provenance information need to be addressed before meaningful insights can be gained.

This highlights the third bottleneck: lack of consistency between the data sets. The same kind of data might be stored in one format in the US and in another in a European institute. Part of the problem is a lack of standardisation, but it is also historical: whoever started up a group picked the format they liked, and we ended up with many disparate formats.

The way that data is described is also often inconsistent. Collaborators working in different labs will use different terms, or metadata, to describe data. For example, data obtained from a human sample could be put in a repository tagged as ‘Homo sapiens’, ‘H. sapiens’, ‘Homo sapiens sapiens’ or ‘human’. So a biologist querying ‘human’ samples could choose any one of these terms, at the risk of overlooking a significant portion of the data.

Create structure

I first became aware of the data management issues when I joined the European Bioinformatics Institute in 2002. It was clear that functional genomics

www.scientific-computing.com | @scwmagazine

“Scientists want to find out where all the data is, both within the organisation and in public repositories, and then figure out what is relevant to their research”

This is achieved by applying descriptors to the different types of data. For example, if it is a sequencing assay you can define its type (array or VCF data), its source (from a private or public repository) and its attributes.
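As a rough illustration of the descriptor idea, the sketch below attaches a small metadata record to each data set and filters a catalogue by those descriptors. The class name `DatasetDescriptor`, its fields, the `find` helper and the example values are all hypothetical, chosen only to mirror the type, source and attributes mentioned above; this is not Genestack's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescriptor:
    """Hypothetical metadata record describing one data set."""
    name: str
    data_type: str                      # e.g. "array" or "VCF"
    source: str                         # e.g. "private" or "public"
    attributes: dict = field(default_factory=dict)  # free-form extras

    def matches(self, **criteria):
        """Return True if every criterion equals the descriptor's value."""
        for key, wanted in criteria.items():
            if key in ("name", "data_type", "source"):
                actual = getattr(self, key)
            else:
                # Anything else is looked up in the free-form attributes.
                actual = self.attributes.get(key)
            if actual != wanted:
                return False
        return True


def find(descriptors, **criteria):
    """Return the descriptors that satisfy all of the given criteria."""
    return [d for d in descriptors if d.matches(**criteria)]


# Illustrative catalogue; the study names and values are made up.
catalogue = [
    DatasetDescriptor("study_A", "VCF", "public", {"organism": "Homo sapiens"}),
    DatasetDescriptor("study_B", "array", "private", {"organism": "Homo sapiens"}),
    DatasetDescriptor("study_C", "VCF", "private", {"organism": "Mus musculus"}),
]

hits = find(catalogue, data_type="VCF", organism="Homo sapiens")
print([d.name for d in hits])  # ['study_A']
```

Because the query is expressed against the descriptors rather than the files themselves, the same filter works regardless of where each data set is stored or what format it is in.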

The other element is the use of ontologies: controlled vocabularies based on curated lists of agreed-upon terms to describe genomics elements, processes and interactions. By using an ontology, all synonyms of a term can automatically be taken into account, and the way the data is described can be controlled.

Once created, the ontologies need to be tested with sample data, and a validation report produced, to ensure the data has been described as desired. Using these ontologies enables consistency between data sets, both within an organisation and when shared with partners and collaborators. Ontologies can be tailored to meet the needs of a particular organisation.

Use of metadata and ontologies makes it possible to search for data across multiple public and private databases. So a search for ‘human’ will automatically take into account ‘Homo sapiens’ and ‘Homo sapiens sapiens’, for example. This results in a system where the user no longer has to worry about where the data is, what format it is in or how it is described. Additionally, the system can index existing public repositories.

Most of the metadata people are familiar with, such as the track organisation in iTunes, is produced automatically, already organised, and does not require manual input. This is the type of experience that we are looking to create in the field of genomics data management.
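The synonym-aware search and the validation report described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tiny `ORGANISM_ONTOLOGY` mapping, the sample records and the function names are hypothetical, and a real ontology would be far larger and curated rather than hard-coded.

```python
# Hypothetical mini-ontology: canonical term -> accepted synonyms.
ORGANISM_ONTOLOGY = {
    "Homo sapiens": {"h. sapiens", "homo sapiens sapiens", "human"},
    "Mus musculus": {"m. musculus", "mouse"},
}


def build_lookup(ontology):
    """Map every lower-cased term and synonym to its canonical term."""
    lookup = {}
    for canonical, synonyms in ontology.items():
        lookup[canonical.lower()] = canonical
        for synonym in synonyms:
            lookup[synonym.lower()] = canonical
    return lookup


LOOKUP = build_lookup(ORGANISM_ONTOLOGY)


def normalise(term):
    """Return the canonical term, or None if the term is not in the vocabulary."""
    return LOOKUP.get(term.strip().lower())


def search(samples, query):
    """Return ids of samples whose organism tag means the same as the query."""
    wanted = normalise(query)
    if wanted is None:
        return []
    return [s["id"] for s in samples if normalise(s["organism"]) == wanted]


def validation_report(samples):
    """Return ids of samples whose organism tag the ontology does not cover."""
    return [s["id"] for s in samples if normalise(s["organism"]) is None]


# Illustrative sample records, tagged inconsistently on purpose.
samples = [
    {"id": "S1", "organism": "Homo sapiens"},
    {"id": "S2", "organism": "H. sapiens"},
    {"id": "S3", "organism": "human"},
    {"id": "S4", "organism": "mouse"},
    {"id": "S5", "organism": "Hmo sapiens"},  # typo, not in the vocabulary
]

print(search(samples, "human"))    # ['S1', 'S2', 'S3']
print(validation_report(samples))  # ['S5']
```

Normalising both the stored tags and the query to one canonical term is what lets a search for ‘human’ find every variant, while the validation step flags tags that no term or synonym accounts for.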

Improved search makes translational medicine possible

Pharmaceutical companies have many compounds that have failed efficacy trials but were successful in 30 per cent of the subjects tested. There is considerable scope to reanalyse the results and develop a companion diagnostic that will identify those patients who will respond to the treatments. This may be sufficient to resubmit for licensing a therapy that was previously shelved. These types of investigation would make the development of therapies for orphan diseases more cost-effective.

There is a strong need within the community for an improved ability to search for data relevant to their research problems, and yet, most of the time, these searches are difficult. Searching is hindered because different R&D groups create data in different formats, so you need to find ways to make it shareable and understand its provenance before meta-analysis is possible.

To enable an efficient query system, a strong metadata system is needed to exploit genomics and other omics data. Those metadata ultimately allow information to be accurately stored in the proper repositories, easily retrieved, collated and cross-analysed. An accurate metadata system is also crucial to being able to reproduce experiments. Good metadata systems have the potential to fast-track analysis by sometimes removing the need to analyse the data itself. Improved data management has the potential to make discovery scientists more productive, less frustrated by mundane, repetitive tasks and freer to work on innovative projects.

www.genestack.com

Dr Misha Kapushesky is the founder and CEO of bio data management company Genestack. Prior to Genestack, Dr Kapushesky was a team leader in Functional Genomics at the European Bioinformatics Institute, where he led a team developing bioinformatics data systems, such as the Expression Atlas and R Cloud, tools designed to manage and query the data EBI was archiving.

October/November 2017 Scientific Computing World


Gorodenkoff/Shutterstock.com
