Sequencing
Table 1
Processing step | Input files | Output files | Common algorithms and toolkits
Read QC | FASTQ or SAM/BAM | FASTQ or SAM/BAM | FastQC, GATK ClipReads, Trimmomatic
Alignment | FASTQ (for sample) and FASTA (for reference) | SAM/BAM/CRAM | BWA and a host of others
Variant calling | SAM/BAM | VCF/gVCF/BCF | SAMtools, GATK UnifiedGenotyper, a host of others
Variant annotation | VCF/BCF | VCF/BCF, BED or TXT | SIFT, PolyPhen, SnpEff, ANNOVAR, VEP, Varant, ClinVar, a host of others
typically provide parameters that help tune the algorithm to the quality and depth of the data, the types and frequencies of variants expected, the characteristics of the genome, or indeed for computational efficiency. These parameters have default values that work well for most cases, but significant experience is often required to know when and how to adjust them for less straightforward data. Finally, some algorithms (especially the computationally-intensive alignment algorithms) introduce stochastic effects by design. They may use heuristics for the sake of computational efficiency, or depend on the order of execution when running parallel threads. By way of example, the PrecisionFDA Consistency Challenge evaluated the reproducibility of secondary analyses of the same known input across multiple executions of the same pipeline. Of 18 pipelines that participated in the challenge, eight were denoted as ‘Deterministic’, giving the same set of variants each time. The remaining 10 had inter-run differences ranging from 0.01% to 2.6% of the total number of variants detected. These discrepancies may seem small in numerical terms, but the number of clinically-relevant variants in any analysis is often small too, and one must be sure that these are not among the variants that differ from run to run.
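The kind of inter-run check described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the PrecisionFDA challenge: the function names, the toy VCF records and the choice of metric (variants present in only one run, as a percentage of all variants seen in either run) are assumptions made for the example.

```python
def variant_keys(vcf_text):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF-formatted text."""
    keys = set()
    for line in vcf_text.splitlines():
        # Skip blank lines and VCF header lines.
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # A record may list several alternate alleles separated by commas.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def inter_run_difference(run_a, run_b):
    """Percentage of variants seen in either run that are missing from the other.

    Assumed metric for illustration only.
    """
    union = run_a | run_b
    if not union:
        return 0.0
    return 100.0 * len(run_a ^ run_b) / len(union)


# Hypothetical toy outputs from two executions of the same pipeline.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n"
RUN_1 = HEADER + "chr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\nchr2\t300\t.\tG\tA,C\n"
RUN_2 = HEADER + "chr1\t100\t.\tA\tG\nchr2\t300\t.\tG\tA,C\n"

keys_1 = variant_keys(RUN_1)
keys_2 = variant_keys(RUN_2)
difference = inter_run_difference(keys_1, keys_2)
print(f"{difference:.2f}% of variants differ between runs")
print("Variants not shared by both runs:", sorted(keys_1 ^ keys_2))
```

Listing the variants that differ, and not only the percentage, matters here: it lets one check whether any clinically-relevant variant is among those that change from run to run.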
Interpretation
If one has a secondary analysis pipeline that is robust and reproducible, ie analytically valid, one then faces the next challenge: to interpret properly the meaning of the variants that are found in terms of their clinical impact, and to make sound decisions based on that interpretation. Public annotation databases such as ClinVar and The Cancer Genome Atlas offer curated sources of information about variants which have reasonable evidence of clinical effect. For variants that have not yet reached this level of certainty, tools such as SIFT, PolyPhen, Variant Effect Predictor, etc, can use other methods to assess the likely biological (if not clinical) impact of variants. These tools provide qualitative assessments such as ‘benign’, ‘possibly damaging’, or ‘likely damaging’ to convey the predicted impact of a variant on the associated protein. The FDA has issued draft guidance for assessing whether a public annotation database provides valid scientific evidence that might support claims of clinical validity of NGS-derived variants.
In addition, studies have investigated the variability in variant data interpretation between different locations, such as the nine-lab study run by the Clinical Sequencing Exploratory Research (CSER) Consortium. This study demonstrates that consistent interpretation of the clinical impact of variants remains a challenge, even when the same guidelines are being followed by different organisations.
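One simple way to quantify this kind of inter-laboratory variability is pairwise concordance: for every pair of labs and every variant both have classified, count how often the two classifications match. The sketch below illustrates that idea only. It is not the metric used in the CSER study, and the lab names, variant identifiers and classification labels are hypothetical.

```python
from itertools import combinations


def pairwise_concordance(calls):
    """Fraction of (lab pair, shared variant) comparisons with identical classifications.

    `calls` maps lab name -> {variant id -> classification}.
    Returns None when no two labs have classified the same variant.
    """
    agree = 0
    total = 0
    for lab_a, lab_b in combinations(sorted(calls), 2):
        # Only compare variants that both labs have classified.
        shared = calls[lab_a].keys() & calls[lab_b].keys()
        for variant in shared:
            total += 1
            if calls[lab_a][variant] == calls[lab_b][variant]:
                agree += 1
    return agree / total if total else None


# Hypothetical toy data: three labs classifying the same two variants.
TOY_CALLS = {
    "lab_A": {"variant_1": "pathogenic", "variant_2": "uncertain significance"},
    "lab_B": {"variant_1": "pathogenic", "variant_2": "likely benign"},
    "lab_C": {"variant_1": "likely pathogenic", "variant_2": "uncertain significance"},
}

concordance = pairwise_concordance(TOY_CALLS)
print(f"Pairwise concordance: {concordance:.2f}")
```

Exact-match concordance is deliberately strict: it treats ‘pathogenic’ versus ‘likely pathogenic’ as a disagreement, which may or may not matter for a given clinical decision.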
Enabling technologies
In addition to ongoing development of new and better NGS algorithms, there have been many efforts to develop higher-level technologies that can help address these issues. The Common Workflow Language (CWL) is an open-source language for specifying the exact steps and parameters used in a lengthy analysis pipeline. There are many examples of reproducible analysis platforms, including Galaxy and Taverna, that can record and replay an analysis flow. The combination of a common workflow language with new container technologies such as Docker means that these frameworks can be implemented in a way that scales within and across computing environments and cloud configurations (for example, Rabix from Seven Bridges Genomics). The FDA has been working on the specification of a BioCompute Object (BCO). The goal is to
Drug Discovery World Winter 2017/18