LABORATORY INFORMATICS


Today’s DNA sequencing technologies now make it possible to sequence whole human genomes cost-effectively and at speed. Sequencing initiatives are generating vast volumes of data that – theoretically – give scientists a starting point to drill down into individual patient genomes in the hunt for disease-related variants, and to mine huge public datasets to aid our understanding of the genetic basis of disease, unpick disease mechanisms, identify drug and diagnostic targets, and stratify patients for clinical trials and personalised medicine.

In practice, analysing this wealth of genomic data in the context of associated biological and clinical data is challenging. Gene variants identified through genotyping studies are stored in variant call format (VCF) files, but deriving patterns and insight from these files, and connecting disparate data types, isn’t necessarily intuitive. And with relational datasets generated through large public and private initiatives (containing potentially millions of variants from many thousands of individuals) there are immediate issues associated with scale, as well as with how one can formulate the right queries.

“If you’ve ever worked with a relational database, you have to typically join data across lots of tables”

In contrast with relational databases, graph databases can help to transform large-volume unstructured data into actionable knowledge, explains Alicia Frame, senior director of graph data science at Neo4j. ‘In the case of genomic research, the key problem is how to integrate the large volumes of highly heterogeneous data and gain maximum insight,’ she says. That is true whether for diagnosis, personalised therapies or drug development, she is keen to stress. ‘Graph databases are an ideal way to represent biomedical knowledge and offer the necessary flexibility to keep up with scientific progress. Using graph databases, a well-designed data model and query can deliver in seconds what previously took days of manual analysis.’

Graph platforms are effectively a way
of representing and storing data as connected concepts, Frame explains. ‘You can think of the graph as built on nodes that are concepts, and then the relationships that connect them,’ she says. ‘In ‘everyday speak’, we might well consider the nodes as nouns. So, in the genomics or bioinformatics space, these ‘nouns’ are the genes, chemicals, diseases, variants and phenotypes. And then, of course, the relationships between them are effectively the verbs, which connect the concepts. It’s – kind of – a real-world systems biology model.’

Under the Neo4j platform, the data is
stored in the same way that the ‘nouns’ and ‘verbs’ relate to each other in biology, says Frame, so getting the data you want back out is very intuitive. In a relational database, where everything is stored as rows and columns, you need to join the data, and that means spending a lot of time thinking about how the computer stores the data and working out how to connect it. Cypher, Neo4j’s query language, lets a domain expert query far more naturally for patterns in the data. ‘So the user can literally ask the database to find chemicals that bind to receptors for particular genes that are associated with a particular disease,’ says Frame. This makes it very easy to express a ‘mental model’, phrase questions naturally and retrieve the relevant information from the underlying database.

‘If you’ve ever worked with a relational database, you have to typically join data across lots of tables,’ she says. ‘The more complex the query, the more complex it is to join the proverbial dots in the table. The more joins you have, the slower it is and the more difficult it is to write the query,’ Frame acknowledges. ‘Use a labelled property graph model based on nodes and relationships and there is no need to consider joins between tables, because the data is already joined.’ It also becomes intuitive to add new data as it is derived.
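The ‘already joined’ point can be illustrated with a toy labelled property graph in plain Python: each node keeps its own outgoing relationships, so Frame’s example question becomes a traversal rather than a series of table joins. All node and relationship names below are invented for illustration; this is a sketch of the idea, not real biomedical data or Neo4j code.

```python
# Toy labelled property graph: nodes are concepts (chemicals, receptors,
# genes, diseases); relationships are the 'verbs' connecting them.
# All names here are invented examples.
edges = [
    ("aspirin", "BINDS", "PTGS2_receptor"),
    ("drugX",   "BINDS", "GENE9_receptor"),
    ("PTGS2_receptor", "RECEPTOR_FOR", "PTGS2"),
    ("GENE9_receptor", "RECEPTOR_FOR", "GENE9"),
    ("PTGS2", "ASSOCIATED_WITH", "inflammation"),
    ("GENE9", "ASSOCIATED_WITH", "diseaseZ"),
]

# Adjacency index: node -> list of (relationship, neighbour).
# Because each node carries its connections, no join step is needed.
graph = {}
for src, rel, dst in edges:
    graph.setdefault(src, []).append((rel, dst))

def chemicals_for_disease(disease):
    """Walk BINDS -> RECEPTOR_FOR -> ASSOCIATED_WITH paths ending at disease."""
    hits = []
    for chem, out in graph.items():
        for rel1, receptor in out:
            if rel1 != "BINDS":
                continue
            for rel2, gene in graph.get(receptor, []):
                if rel2 != "RECEPTOR_FOR":
                    continue
                for rel3, dis in graph.get(gene, []):
                    if rel3 == "ASSOCIATED_WITH" and dis == disease:
                        hits.append(chem)
    return hits

print(chemicals_for_disease("inflammation"))  # prints: ['aspirin']
```

In Cypher itself, the same pattern would read roughly as `MATCH (c)-[:BINDS]->(r)-[:RECEPTOR_FOR]->(g)-[:ASSOCIATED_WITH]->(d {name: 'inflammation'}) RETURN c` – again with invented labels, but it shows how the query mirrors the shape of the question.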


Open-source and user-friendly

Graph databases also make it much easier to build applications for every end user – think again of clinicians and researchers – and, at the back end, it becomes relatively easy for the person building the graph to maintain the resource, update it and deliver it to those end users. Neo4j has focused on making the open-source platform easily accessible and user-friendly for novices and smaller initiatives. ‘For the community edition, we offer the database, plugins for data science and visualisation tools,’ explains Frame. ‘If you are a researcher or an individual, you can download our database and our software from our website for free. In fact, many groups start there.’ The pivot point between the free, open-source version and the commercial enterprise platform will depend on the volume of data and the number of people who will be using the system, she adds.

“Graph databases are an ideal way to represent biomedical knowledge and offer the necessary flexibility to keep up with scientific progress”

‘One of the primary differences between the free community version and the enterprise system is parallelisation. The community platform will use up to four cores, whereas users of the enterprise platform can tap unlimited numbers of cores for faster computation when datasets are really huge and speed is important.’

In fact, many public genomic datasets are already encoded as graph databases. ‘The NCBI, for example, has downloadable graph representations of many of its public databases,’ Frame says. ‘We also have a ‘graphs for good’ programme, through which we offer the commercial, enterprise software for free to nonprofits, charities, researchers and academics in order for them to do their research; we also licence the database and the plugins to drug discovery companies such as Novo Nordisk.’

The most obvious – although not the only – challenge associated with managing and analysing genomics data is its scale, comments Ignacio (Nacho) Medina, CTO of Zetta Genomics and founder of the open-source computational biology (OpenCB) platform. OpenCB is a bioinformatics suite designed to allow genotypic data management and analysis on a scale relevant to the massive sets of genome sequencing results that the research and clinical communities are generating. Medina describes the platform as a full-stack open-source software solution, enabling large-scale genomic data storage, indexing, analysis and visualisation.
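The raw material that platforms such as OpenCB manage at scale is the VCF file mentioned earlier. As a rough illustration of what a single variant record holds, here is a minimal Python sketch that splits one VCF data line into its standard fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, per the VCF specification); the example line and its values are invented, and real tooling would use a dedicated VCF library rather than this hand-rolled parser.

```python
# Minimal sketch of reading one variant record from a VCF data line,
# assuming the standard tab-separated fixed columns. Example values are
# invented; production code should use a proper VCF parsing library.

def parse_vcf_line(line):
    """Split one VCF data line into a dict of its fixed fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),           # 1-based position on the chromosome
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),     # ALT may list several alternate alleles
        "qual": qual,
        "filter": filt,
        "info": dict(              # INFO is a semicolon-separated key list
            kv.split("=", 1) if "=" in kv else (kv, True)
            for kv in info.split(";")
        ),
    }

record = parse_vcf_line("1\t69511\trs75062661\tA\tG\t100\tPASS\tAC=2;AF=0.9")
print(record["pos"], record["alt"], record["info"]["AF"])  # prints: 69511 ['G'] 0.9
```

Multiply records like this by millions of variants across many thousands of individuals and the scale problem Medina describes becomes clear.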


Scalable genomics research

The need for a dedicated, genomics-focused platform became increasingly evident to Medina more than a decade ago, with the emergence of next-generation sequencing technologies and the application of genotyping – not just for basic disease research, but also in clinical settings, for potential applications in disease diagnosis and the development of personalised medicine. As the first scalable solution enabling genotypes – recorded in variant call


Summer 2022 | Scientific Computing World | 21

