A new bioinformatics analysis platform tailored towards clinical use

MULTIVARIATE STATISTICAL METHODS FOR INTEGRATING HIGH-DIMENSIONAL DATASETS

A central aim of WP3 was to develop new analytical strategies for integrating multiple related phenotypes. We have previously established the statistical foundation and derived a statistical framework for linking genetic variants in gene sets to multiple correlated phenotypes. By exploiting the correlation structure due to shared genetic and non-genetic factors, such integrative modeling can lead to dramatic increases in the statistical power for detecting true associations. Within PANCANRISK we have now extended this model to derive new insights into the genetic architecture, elucidating the commonalities and differences of genetic effects across the traits under study. In particular, we have derived a method we denote the interaction set test (iSet), which is designed to increase power and interpretability when testing for genotype-context interactions. These occur if genetic effects are present in some contexts but absent in others. Understanding interaction effects is important, for example to elucidate the interplay between genetic risk variants and the environment. In PANCANRISK, we will apply iSet to detect context-specific effects of (rare) germline variants on cancer risk. This analysis will help to elucidate the aetiology of risk factors across different cancer types. Additionally, these methods will be applied to dissect the genetic control of different molecular traits measured in primary tumors.
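To make the notion of a genotype-context interaction concrete, the following toy sketch may help. It is not the iSet method, which is a set-based test; it only fits a separate least-squares slope of phenotype on genotype dosage in each context and compares them. The data values, the context labels and the helper name `slope` are all hypothetical.

```python
from statistics import mean

def slope(genotypes, phenotypes):
    """Ordinary least-squares slope of phenotype on genotype dosage."""
    g_bar, y_bar = mean(genotypes), mean(phenotypes)
    cov = sum((g - g_bar) * (y - y_bar) for g, y in zip(genotypes, phenotypes))
    var = sum((g - g_bar) ** 2 for g in genotypes)
    return cov / var

# Hypothetical toy data: dosage of the alternate allele (0, 1 or 2) per individual,
# with the phenotype measured in two contexts.
genotypes = [0, 0, 1, 1, 2, 2]
phenotype_by_context = {
    "context_A": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],  # phenotype rises with dosage
    "context_B": [1.1, 0.9, 1.0, 1.0, 0.9, 1.1],  # no dependence on dosage
}

# Per-context genetic effect, and their difference as a crude interaction measure.
effects = {ctx: slope(genotypes, y) for ctx, y in phenotype_by_context.items()}
interaction = effects["context_A"] - effects["context_B"]
print(effects, interaction)
```

In this toy case the genetic effect is close to one phenotype unit per allele in the first context and close to zero in the second, which is the pattern a genotype-context interaction test is meant to detect.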

PIPELINES TO SUMMARIZE GENOME-WIDE EPIGENETIC TRACKS USING STATE-OF-THE-ART METHODS

We have developed a novel supervised machine-learning method for genome-wide identification of active enhancers, hereafter referred to as GEP (Genome-wide Enhancer Predictor). Unlike previous supervised methods, the GEP model has been trained exclusively on experimentally characterized enhancers (Kheradpour et al., 2014) and a heterogeneous, balanced background dataset. The heterogeneous negative training set was selected from non-enhancer cis-regulatory elements (i.e., promoters), exons, introns and intergenic regions by random sampling subject to the following constraints: (1) the genome-wide location of the instances followed the same distribution across chromosomes as that of the enhancer set; and (2) the negative training set as a whole comprised promoter, gene body and heterochromatin elements in approximate proportions of 50, 30 and 20 per cent, respectively. We have chosen Random Forest (RF) and Support Vector Machine (SVM) classifiers to build predictive models for the classification of active enhancers.
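The two sampling constraints can be illustrated with a minimal sketch. It is not the GEP implementation. It assumes that "balanced" means one negative per enhancer, that candidate regions are already grouped by class and chromosome, and that rounding uses a largest-remainder rule; the names `CLASS_PERCENT`, `split_counts` and `sample_negatives` and all toy region identifiers are hypothetical.

```python
import random
from collections import Counter

# Approximate class proportions stated in the text, in per cent.
CLASS_PERCENT = {"promoter": 50, "gene_body": 30, "heterochromatin": 20}

def split_counts(total, percents):
    """Split `total` into integer counts per class (largest-remainder rounding)."""
    counts = {k: total * p // 100 for k, p in percents.items()}
    remainders = {k: total * p % 100 for k, p in percents.items()}
    leftover = total - sum(counts.values())
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts

def sample_negatives(enhancer_chroms, candidates, seed=0):
    """Draw negatives whose per-chromosome counts mirror the enhancer set.

    candidates maps (class, chromosome) to a list of candidate region ids.
    """
    rng = random.Random(seed)
    negatives = []
    for chrom, n in Counter(enhancer_chroms).items():
        for cls, k in split_counts(n, CLASS_PERCENT).items():
            pool = candidates[(cls, chrom)]
            negatives.extend((cls, chrom, r) for r in rng.sample(pool, k))
    return negatives

# Hypothetical toy input: 10 enhancers on chr1, 5 on chr2.
enhancer_chroms = ["chr1"] * 10 + ["chr2"] * 5
candidates = {
    (cls, chrom): [f"{cls}_{chrom}_{i}" for i in range(20)]
    for cls in CLASS_PERCENT
    for chrom in ("chr1", "chr2")
}
negatives = sample_negatives(enhancer_chroms, candidates)
print(Counter(cls for cls, _, _ in negatives))
```

Sampling within each chromosome keeps the chromosomal distribution identical to that of the enhancers by construction, while the class proportions are only approximate whenever the per-chromosome count does not divide evenly.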

A manuscript describing GEP and the benchmark analysis has been written. The GEP software is available here.

We have also developed the Multiple Signal Integrator (MSI), an R-based package for the identification of differentially active genomic regions using multi-dimensional omics data (epigenomic modifications, chromatin conformation, gene expression).

MODELLING AND INTEGRATION OF 3D CONNECTIVITY DATA USING HI-C

eQTL mapping approaches are one of the main strategies we propose for the identification of regulatory variants. Classical methods consider all variants in large regions around genes, which imposes a substantial multiple testing burden (Stegle et al., 2010). We have explored strategies based on promoter capture Hi-C for limiting the search space, thereby testing for associations in a more targeted manner, which promises to increase power. The methods were implemented as open-source software within our LIMIX framework (https://github.com/limix/limix). Applying the software requires two essential input data types: first, a tissue-type-matched promoter capture Hi-C dataset, and second, an eQTL dataset in an input format appropriate for LIMIX.
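The idea of limiting the search space can be sketched as follows. This is an illustration only and does not use the LIMIX API. It assumes that Hi-C contacts are given as half-open genomic intervals per gene and that variants are simple records; the function names, the gene name `GENE_A` and all coordinates are hypothetical.

```python
def variants_in_window(variants, chrom, tss, window):
    """Classical cis window: all variants within `window` bp of the TSS."""
    return [v for v in variants
            if v["chrom"] == chrom and abs(v["pos"] - tss) <= window]

def variants_in_contacts(variants, fragments):
    """Targeted set: variants inside fragments that contact the promoter.

    fragments is a list of (chrom, start, end) tuples, half-open [start, end).
    """
    return [v for v in variants
            if any(v["chrom"] == chrom and start <= v["pos"] < end
                   for chrom, start, end in fragments)]

# Hypothetical toy data.
variants = [
    {"id": "v1", "chrom": "chr1", "pos": 1500},
    {"id": "v2", "chrom": "chr1", "pos": 30000},
    {"id": "v3", "chrom": "chr1", "pos": 51000},
    {"id": "v4", "chrom": "chr2", "pos": 1500},
]
promoter_contacts = {"GENE_A": [("chr1", 1000, 2000), ("chr1", 50000, 52000)]}

window_ids = [v["id"] for v in variants_in_window(variants, "chr1", 1200, 100000)]
contact_ids = [v["id"] for v in variants_in_contacts(variants, promoter_contacts["GENE_A"])]
print(len(window_ids), "tests in the cis window;", len(contact_ids), "after Hi-C filtering")
```

Fewer candidate variants per gene means fewer tests, so a less stringent multiple-testing correction is needed for the same error rate.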

DEVELOP META ANALYSIS APPROACH

We have described the strategy for a meta-analysis approach that predicts the potential functional impact of non-coding variants by integrating multiple sources of evidence, such as evolutionary conservation, potential motif damage and epigenetic context, in a tissue-specific manner. We also described a proof-of-concept implementation of this strategy, which we evaluated and benchmarked against state-of-the-art methods.

Our results so far demonstrate that: (1) using a context-specific variant annotation substantially improves the ability of the classifier to distinguish between positive and negative cases; and (2) our meta-analysis performs noticeably better than the current gold-standard method when variants are annotated with epigenetic information from the appropriate context.

REGULATORY VARIANT PREDICTIONS THROUGH CANCER-GENOME eQTLs

Regulatory germline variants have tissue- and context-specific roles. Despite the wide-ranging dysregulation of tissue in cancer, our results indicate that germline variants largely retain their regulatory roles from normal tissue contexts. Although the regulatory landscape is predominantly shared, we have also identified notable exceptions where the effect of regulatory variants changes in cancer contexts. One driver of these differences is variation in gene expression, although the more interesting differences are likely due to epigenetic changes. These results also support the need to combine epigenetic and transcriptional information for the identification of regulatory variants.

QTL MAPPING IN CANCER GENOMES

Somatic regulatory effects are difficult to identify because the effects are rare. To address this, we have devised two complementary strategies. First, to obtain a global view of different classes of mutations and their regulatory roles, we have used allele-specific expression analyses. Although this approach does not identify individual associations, it enables regulatory effects to be classified in a fine-grained manner and is not limited by the allele frequency of these events. Second, we have used somatic eQTL mapping to identify individual associations between recurrently mutated elements and gene expression. We have optimized a number of important parameters and developed a method that aggregates across somatic variants, weighted by the fraction of cells in the tissue that carry these specific mutations.
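A minimal sketch of such a weighted aggregation is given below. The exact weighting scheme is not specified here, so the sketch assumes the simplest choice: a per-sample, per-element burden equal to the sum of the cell fractions of the somatic variants falling in that element. The function name `weighted_burden`, the record fields and the toy values are hypothetical.

```python
from collections import defaultdict

def weighted_burden(mutations):
    """Sum cell-fraction weights of somatic variants per (sample, element)."""
    burden = defaultdict(float)
    for m in mutations:
        burden[(m["sample"], m["element"])] += m["cell_fraction"]
    return dict(burden)

# Hypothetical toy calls: each somatic variant is assigned to a recurrently
# mutated element and carries the fraction of cells estimated to harbour it.
mutations = [
    {"sample": "S1", "element": "enhancer_1", "cell_fraction": 0.5},
    {"sample": "S1", "element": "enhancer_1", "cell_fraction": 0.25},
    {"sample": "S2", "element": "enhancer_1", "cell_fraction": 1.0},
    {"sample": "S2", "element": "promoter_7", "cell_fraction": 0.125},
]

burden = weighted_burden(mutations)
print(burden)
```

The resulting burden per element can then take the place of a genotype dosage in an association test against gene expression, so that subclonal mutations contribute less than clonal ones.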

INTEGRATED MODELLING OF EQTLS AND EPIGENETIC VARIATION DATA, BENCHMARKING

Based on the matched proxy tissues and epigenetic data, we derived epigenetic scores that reflect the likelihood of tissue-specific regulatory activity for each cancer eQTL variant. Briefly, we used the meta-analysis approach for epigenome-based variant scoring to calculate an activity score for each eQTL variant in the corresponding tissue type.

We made a considerable effort to integrate and combine epigenome- and genetics-based scoring of regulatory variants. A notable observation was the extent of tissue-specific regulatory effects of eQTLs. This observation differs from the common notion that the majority of eQTLs are shared across tissues (e.g. GTEx 2015). This finding may point to priming of eQTL effects in upstream lineages, with epigenome-based tissue specification. The resource we have established in this deliverable will enable further mechanistic studies to explore these relationships in more detail.

INTEGRATION OF SOMATIC AND GERMLINE REGULATORY VARIANTS

Here we describe our efforts to integrate regulatory germline variants, somatic mutational signatures and gene expression profiles. At the present resolution of our data, based on the PCAWG cohort, we identify a considerable number of signature-expression associations, many of which point to relevant biological processes that are consistent with the aetiology of the corresponding mutational process. The most interesting opportunity is the causal integration of mutational variants with germline risk factors. At present, this analysis remains conceptual, as we have little power to detect such effects. We have reported an association for APOBEC, which serves as a positive control and hence validates our approach. Larger datasets will be required for effective discovery of such relationships. The core of the findings presented here is part of the PCAWG working group 3 marker paper.