Causality between genetic variants and survival months for cancer patients
In this post, I show how to run Mendelian randomization, a popular causal inference method that uses genetic variants, with public data and open-source software.
Mendelian randomization (MR) estimates the causal effect of an exposure (e.g. alcohol intake) on an outcome (e.g. heart disease) by using a genetic variant as an instrumental variable, in the presence of confounders (e.g. ageing, shoe size). To measure causality correctly, the variables in an MR analysis must satisfy a strict set of associations.
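To make the role of the instrument concrete, here is a minimal sketch on simulated data, using only the Python standard library. It is not the analysis from this article: the variable names, effect sizes and the simple ratio (Wald) estimator are all assumptions chosen for illustration. A confounder U drives both exposure X and outcome Y, while the instrument Z only affects Y through X.

```python
import random

def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def wald_ratio(z, x, y):
    """Ratio (Wald) instrumental variable estimate: Cov(Z, Y) / Cov(Z, X)."""
    return covariance(z, y) / covariance(z, x)

def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulated (hypothetical) data with an unobserved confounder."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                    # unobserved confounder
        zi = rng.randint(0, 2)                 # instrument: variant count 0, 1 or 2
        xi = 0.5 * zi + u + rng.gauss(0, 1)    # exposure depends on Z and U
        yi = true_effect * xi + 2.0 * u + rng.gauss(0, 1)  # outcome depends on X and U
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y

z, x, y = simulate()
naive = covariance(x, y) / covariance(x, x)    # regression slope of Y on X, ignores U
iv = wald_ratio(z, x, y)                       # uses only the variation in X driven by Z
print(f"naive slope: {naive:.2f}, IV estimate: {iv:.2f}, true effect: 2.00")
```

In this setup the naive slope should land clearly above the true effect, because U pushes X and Y in the same direction, while the IV estimate should sit close to 2.0. That only holds if Z is unrelated to U and affects Y solely through X, which is exactly the strict association MR requires.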
To run MR, I use public datasets (cBioPortal, ClinVar) and an open-source library (DoWhy). cBioPortal provides cancer study datasets containing patient profiles and genetic mutation information. ClinVar aggregates information about genomic variation and its relationship to human health. The DoWhy library provides end-to-end functions for causal model identification, estimation and validation.
I implemented every step in this article as a Python Jupyter notebook. To follow along, download the ClinVar data and the MSK dataset from cBioPortal, as referenced in the script.
/* The purpose of this analysis is to implement an MR analysis using open data and libraries. The models are not properly validated here, so the MR results in this article should not be accepted until possible counterfactual models have been verified. */
1. Public data — cBioPortal, ClinVar
1.1. Dataset description
cBioPortal provides cancer study data. As of Jan. 2022, it holds 337 studies and 168578 patient records with genetic information (mutation, CNA, RNA-Seq). I chose the MSK MetTropism dataset (MSK, Cell 2021), a pan-cancer cohort that integrates tumour genomics and clinical outcomes for 25775 patients. Table 1 shows the patient profile and genetic mutation information in the MSK dataset.
ClinVar links genomic variation to human health. It contains clinical significance and disease or phenotype information for genetic variants. Clinical significance indicates how likely a variant (identified by e.g. a dbSNP ID) is to be disease-causing (e.g. benign, pathogenic).
1.2. Variable association
I generate a merged dataset for the MR analysis from the cBioPortal and ClinVar datasets. The two datasets share one common key, the dbSNP RS identifier, which I use to merge them. Table 3 shows the MR variables and their data sources.
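The merge on the RS identifier can be sketched as a simple lookup. This is a minimal illustration using only the standard library (in a notebook, a pandas merge would be the more usual tool). The column names `Tumor_Sample_Barcode` and `dbSNP_RS`, the sample IDs and the RS IDs below are made-up placeholders, not values from the real datasets.

```python
# Hypothetical ClinVar-derived lookup: dbSNP RS ID -> clinical significance.
clinvar_lookup = {
    "rs111": "Pathogenic",
    "rs222": "Benign",
}

# Hypothetical mutation rows as they might come from the cancer study data.
mutation_rows = [
    {"Tumor_Sample_Barcode": "S1", "dbSNP_RS": "rs111"},
    {"Tumor_Sample_Barcode": "S1", "dbSNP_RS": "rs999"},   # not present in the lookup
    {"Tumor_Sample_Barcode": "S2", "dbSNP_RS": "rs222"},
]

def attach_significance(rows, lookup, key="dbSNP_RS"):
    """Inner join: keep only rows whose RS ID has a clinical significance."""
    merged = []
    for row in rows:
        significance = lookup.get(row[key])
        if significance is None:
            continue  # RS ID unknown to the lookup, so the row is dropped
        merged.append({**row, "clinical_significance": significance})
    return merged

merged_rows = attach_significance(mutation_rows, clinvar_lookup)
for row in merged_rows:
    print(row)
```

Dropping unmatched rows is a design choice of this sketch; keeping them with an empty significance would correspond to a left join instead.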
The variable associations in an MR analysis should satisfy the relationships in Fig 1. Normally, the presence of cancer is used as the outcome (Y) in a standard MR analysis. However, all patients in the MSK dataset have cancer and genetic mutations, so cancer presence cannot serve as the outcome here. Our causal question is therefore whether the level of clinical significance is associated with survival months.
2. Data processing
2.1. Mutation — clinical significance list generation from ClinVar
ClinVar provides more than 1 million assertion criteria records. Each record contains a variant assessment term such as pathogenic or uncertain significance. From these records, we generate the list in Table 4, which maps each mutation (dbSNP) to its clinical significance (e.g. pathogenic).
ClinVar updates these records monthly and publishes them as a large file (20 GB of XML) on its website, so we download the file and extract the mutation and clinical significance data. Fig 2 shows a sample of the XML containing the dbSNP code and the clinical significance of one record.
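A file of that size should not be loaded into memory at once. The sketch below streams the XML with `xml.etree.ElementTree.iterparse` from the standard library and yields (RS ID, clinical significance) pairs record by record. The element and attribute names (`ClinVarSet`, `ClinicalSignificance/Description`, `XRef` with `DB="dbSNP"`) are my assumption about the file layout, and the inline sample is a heavily simplified stand-in, so check both against the file you actually download.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; real records carry many more elements.
SAMPLE_XML = b"""<ReleaseSet>
  <ClinVarSet>
    <ClinicalSignificance><Description>Pathogenic</Description></ClinicalSignificance>
    <XRef Type="rs" ID="111" DB="dbSNP"/>
  </ClinVarSet>
  <ClinVarSet>
    <ClinicalSignificance><Description>Benign</Description></ClinicalSignificance>
    <XRef Type="rs" ID="222" DB="dbSNP"/>
  </ClinVarSet>
  <ClinVarSet>
    <ClinicalSignificance><Description>Pathogenic</Description></ClinicalSignificance>
    <XRef ID="12345" DB="OMIM"/>
  </ClinVarSet>
</ReleaseSet>"""

def extract_pairs(source, record_tag="ClinVarSet"):
    """Yield (rs_id, clinical_significance) pairs from a file path or file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        significance = elem.findtext(".//ClinicalSignificance/Description")
        for xref in elem.iter("XRef"):
            if xref.get("DB") == "dbSNP" and significance:
                yield "rs" + xref.get("ID"), significance
        elem.clear()  # drop the finished record's children to keep memory use low

pairs = list(extract_pairs(io.BytesIO(SAMPLE_XML)))
print(pairs)
```

For the real download, a file path (or an opened gzip stream) would replace the `io.BytesIO` object.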
2.2. Patient record generation from MSK dataset
The MSK dataset contains 25775 patient records and their genetic variants. Table 5 shows a sample of the patient profiles used for the MR analysis.
Table 6 shows two sample genetic variants. The patient profile is joined to the genetic variant table on Tumor_sample_barcode.
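As a rough sketch of that join, the snippet below groups variant rows by sample barcode and attaches the distinct RS IDs, and their count, to each patient row. It uses only the standard library, and the field names (`SAMPLE_ID`, `OS_MONTHS`, `Tumor_Sample_Barcode`, `dbSNP_RS`) and values are hypothetical placeholders rather than the exact columns of the MSK files.

```python
from collections import defaultdict

# Hypothetical patient profile rows.
patients = [
    {"SAMPLE_ID": "S1", "OS_MONTHS": 12.5},
    {"SAMPLE_ID": "S2", "OS_MONTHS": 40.0},
    {"SAMPLE_ID": "S3", "OS_MONTHS": 7.0},
]

# Hypothetical genetic variant rows, one row per mutation.
variants = [
    {"Tumor_Sample_Barcode": "S1", "dbSNP_RS": "rs111"},
    {"Tumor_Sample_Barcode": "S1", "dbSNP_RS": "rs222"},
    {"Tumor_Sample_Barcode": "S1", "dbSNP_RS": "rs222"},   # duplicate RS ID
    {"Tumor_Sample_Barcode": "S2", "dbSNP_RS": "rs333"},
]

def rs_ids_per_sample(variant_rows):
    """Collect the distinct RS IDs seen for each sample barcode."""
    ids = defaultdict(set)
    for row in variant_rows:
        ids[row["Tumor_Sample_Barcode"]].add(row["dbSNP_RS"])
    return ids

def join_profiles(patient_rows, variant_rows):
    """Attach the RS ID list and its size to every patient row."""
    ids = rs_ids_per_sample(variant_rows)
    joined = []
    for patient in patient_rows:
        rs_ids = sorted(ids.get(patient["SAMPLE_ID"], ()))
        joined.append({**patient, "rs_ids": rs_ids, "n_rs_ids": len(rs_ids)})
    return joined

joined = join_profiles(patients, variants)
for row in joined:
    print(row)
```

Patients without any matching variant keep a count of zero here; whether such patients belong in the analysis is a separate decision.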
3. Causal inference analysis
The DoWhy library provides end-to-end functionality for causal inference, from identifying the target estimand to verifying the model estimate. Two of the approaches it offers for computing a causal effect are the back-door criterion and instrumental variables. I use the instrumental variable method to implement the MR analysis.
3.1. Simple MR model with binary group value of instrument and exposure
I start with a simple MR model that uses three variables, Pathogenic (X), Survival months (Y) and dbSNP (Z), plus a generated dummy confounder (U).
The Pathogenic variable (X) from ClinVar is a categorical variable representing the level of clinical significance (e.g. benign, likely pathogenic, pathogenic). In this simple MR model, it is converted to a binary value with two groups: 0 for benign and 1 for pathogenic.
The overall survival months variable (Y) is a real-valued number of months, so no conversion is applied.
The number of dbSNP codes (RS IDs) per patient (Z) is an integer. A patient can have more than one RS ID depending on their sample test results. The mean number of RS IDs per patient is 5.4 and the median is 2.0. In this simple model, the count is grouped into 0 (fewer than 2 RS IDs) or 1 (more than 2).
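The two conversions can be sketched as small helper functions. Two points are assumptions of this sketch rather than something stated above: only the exact labels "benign" and "pathogenic" are mapped (everything else returns `None`), and a count of exactly 2 RS IDs is placed in group 1.

```python
def binarize_significance(label):
    """Map a clinical significance label to 0 (benign) or 1 (pathogenic).

    Returns None for any other label (assumption: such rows are left out).
    """
    groups = {"benign": 0, "pathogenic": 1}
    return groups.get(label.strip().lower())

def binarize_rs_count(n_rs_ids, threshold=2):
    """Map the number of RS IDs to 0 (below threshold) or 1 (otherwise).

    Assumption: a count equal to the threshold falls into group 1.
    """
    return 0 if n_rs_ids < threshold else 1

# Hypothetical rows: (clinical significance, number of RS IDs).
rows = [("Benign", 1), ("Pathogenic", 7), ("Uncertain significance", 3), ("pathogenic", 2)]
encoded = [(binarize_significance(label), binarize_rs_count(count)) for label, count in rows]
print(encoded)
```

Since the median count is 2.0, how the boundary case is handled moves a noticeable share of patients between groups, so it is worth stating explicitly in any real run.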
Fig 3 shows the Python script that runs DoWhy. It consists of the four steps of the standard DoWhy process: model, identify, estimate and refute.
The mean estimate in the simple model is -6.81, indicating that the pathogenic group survives 6.81 months less than the benign group. However, the p-value (0.262) is too high to reject the null hypothesis, so this result does not justify a causal association between clinical significance and survival months.
3.2. Simple MR analysis with integer value of instrument and exposure
In this test, I use the integer values of the dbSNP and clinical significance variables instead of grouping them. The instrumental variable (number of RS IDs) is already an integer, so it needs no conversion.
The exposure variable (clinical significance) is categorical, so it needs to be converted to an ordinal number. Table 8 shows the mapping between clinical significance categories and ordinal numbers. To keep the process simple, I assume that adjacent clinical significance levels are equally spaced. Some clinical significance values cannot be converted to an ordinal number (not available).
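A mapping of this kind can be sketched as a plain dictionary. The categories and numbers below are hypothetical and are not a copy of Table 8. I picked them only so that the levels are equally spaced and benign and pathogenic end up six levels apart, which matches the gap used in this section. Labels outside the scale return `None`.

```python
# Hypothetical, equally spaced ordinal scale for clinical significance.
SIGNIFICANCE_TO_ORDINAL = {
    "Benign": 0,
    "Benign/Likely benign": 1,
    "Likely benign": 2,
    "Uncertain significance": 3,
    "Likely pathogenic": 4,
    "Pathogenic/Likely pathogenic": 5,
    "Pathogenic": 6,
}

def to_ordinal(label):
    """Return the ordinal level for a label, or None if it is not on the scale."""
    return SIGNIFICANCE_TO_ORDINAL.get(label)

# Hypothetical labels; the last two cannot be placed on the scale.
labels = ["Pathogenic", "Benign", "Likely benign", "drug response", "not provided"]
mapped = [(label, to_ordinal(label)) for label in labels]
usable = [(label, level) for label, level in mapped if level is not None]
print(usable)
```

Equal spacing is a strong simplification: it treats the step from benign to likely benign as the same size as the step from likely pathogenic to pathogenic.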
Table 9 shows the number of patients for each clinical significance level, excluding patients whose values could not be mapped to an ordinal number.
The mean estimate in this exercise (Fig 5) is -3.93, which means a pathogenic patient survives 23.58 (= 3.93 * 6) months less than a benign patient, because pathogenic and benign are six levels apart. The p-value (0.077) is smaller than in exercise 3.1 but still too high to reject the null hypothesis.
Fig 4 also shows the result of the refutation step used to verify the MR model. In this case, the new effect and the p-value are both 0.0, which suggests the model estimate is robust.
3.3. MR analysis with real confounder variables
In this exercise, I use age, sex and race as confounder variables. Tables 10, 11 and 12 show basic statistics for these variables.
Fig 6 shows the result of the MR model with age as a confounder. Compared to Fig 5, the p-value decreases.
Fig 7 shows the result of the MR model with multiple confounders. Compared to Fig 6, the p-value decreases again, and the null hypothesis can be rejected.