In my last blogpost, I reviewed molecular pathways that are influental on Alzheimer's disease (AD) pathogenesis. The main pathway of my interest was PI3K/Akt pathway that is essential for neuronal growth, proliferation, cell survival, synaptic plasticity and cognitive function in central nervous system (CNS). This pathway is mediated by insulin signalling, impairment of which can lead to severe outcome in AD pathogenesis. Type 2 diabetes (T2D) is one major factor that contributes to impairment of insulin signalling, and also causes chronic systemic inflammation, which can furthermore attenuate AD disease progression.
AD is a complex disease and is a sum total of multiple factors, be it environmental, lifestyle or genetic. However, if taken proactive measures to diminish or outright remove some of these factors that cause AD, end result could be much more effective than starting the treatment once symptoms of AD present themselves. One measure could be altering one's lifestyle to diminish the individual's risk to develop AD. My point of interest is T2D, potentially major comorbidity factor to AD [1], and perhaps with proper diet, training and exercise, the risk of developing T2D could be nullified.
However, does T2D actually have functional connection with AD? This can be studied with observational study, where most common design is to control for confounding variables between the exposure and the outcome, but this strategy can easily lead to biased estimates and false conclusions [2]. Has relationship between AD and T2D been misled by confounding factors and is T2D not really causally related to AD, but associates with other factors that contribute to AD development? As chronic inflammation is associated with AD [3], regardless of T2D, could the causal relationship even be reversal, where AD is actually causing T2D? I'm interested in finding out, and with Mendelian randomization (MR) this could be possible!
MR utilizes the random assortment of parents genes to offspring during recombination, providing a method to assess the causal associations between exposure (in this case T2D) and outcome (in this case AD) [4, 5]. This method was originally coined with evaluating effectiveness of allogenic sibling bone marrow transplantation in treatment of acute myeloid leukaemia [6].
Human genome is enriched with single nucleotide polymorphisms (SNPs), where at a specific genomic locus, there can be a variation of single nucleotide from person to person, and these variations at specific loci create specific homozygous or heterozygous genotypes, determined by Mendelian heritability. If such SNPs produce phenotypic differences that mirror modifiable, disease risk altering environmental exposures, the different SNPs should be related to disease risk to the extent predicted by their influence on the phenotype [4].
Mendelian randomization has been propelled by the increasing availability and scale of genome-wide association studies (GWAS) summary data, and new meta-analysis methods have been developed around this data structure. However, if one doesn't plan the study accordingly, their study will make unrealistic simplifying assumptions and generally lack theoretical support such as statistical consistency and asymptotic sampling distribution results [2].
I want to estimate the causal effect of an exposure variable X (T2D) on an outcome variable Y (AD), which may be confounded by unobserved variables. I have p amount of SNPs, that could be instrumental variables (IV), which can help me obtain unbiased estimate of the causal effect between T2D and AD, despite unmeasured confounding. The SNP-exposure effect and SNP-outcome effect can be computed from two different samples with linear or logistic regression [2].
With the assumption that for every SNP, the variances of The SNP-exposure effect and SNP-outcome are known and SNP-exposure effect and SNP-outcome are mutually independent, we can create a following model for GWAS summary data:
Γj ≈ β0 γj for almost all j ∈ [p]
Where Γj is the SNP-outcome effect, γj is the SNP-exposure effect, j represents a single SNP, [p] represents all SNPs and β0 is the outcome effect between X (T2D) on Y (AD) [2]. In order for the MR to be valid for testing causality, each specific SNP must be robustly associated with the exposure, independent of any confounders of the exposure and outcome, and be independent of the outcome given the exposure and the confounders (figure 1) [7].
Figure 1. The general causal diagram and the three criteria for valid instrumental variable (IV). The diagram represents the hypothesized relationship between genetic instrument (G), exposure (X), outcome (Y) and all unmeasured variables (U) which confound X and Y. β is the causal effect of X on Y. δ is the genetic effect on X. Dashed lines and crosses indicate violations of the standard IV assumptions which can lead to bias [7].
Thankfully some streamlined tools have been made available to conduct MR experiments, such as MR-Base. MR-Base was developed to provide easy access to analysis-ready data and allow systematic application of Mendelian randomization methods. One convenient aspect about doing exposure-outcome assessment with MR-Base, is the accessibility of research data. MR-Base combines a database of summary statistics on traits and health outcomes from >20,000 GWAS datasets [5]. It is available as a web application or as an R package. I introduced myself to some handy tutorials, that made the installation and use of the tools a breeze.
In:
library (devtools)
install_github(c("MRCIEU/TwoSampleMR","MRCIEU/MRInstruments"))
# install_github("WSpiller/MRPracticals",build_opts = c("--no-resave-data", "--no-manual"),build_vignettes = FALSE) # useful tutorial vignettes for learning to use MR-Base
library(dplyr)
library(MRPracticals)
library(TwoSampleMR)
library(MRInstruments)
library(rsnps) # this is a handy tool for fetching metadata about discovered SNPs
Here I'll fetch from NHGRI-EBI GWAS catalog of significant associations relevant datasets with "Type 2 diabetes" keyword in the phenotype_simple column. I'll also add European ancestry "EA" as inclusion criterion in the phenotype_info column.
In:
data("gwas_catalog")
# Using regular expressions can be handy in finding type 2 diabetes datasets, since there can be other vocabulary mixed in
exposure_gwas <- gwas_catalog[stringi::stri_detect_fixed(gwas_catalog$Phenotype_simple, "Type 2 diabetes"), ]
# EA = european ancestry
exposure_gwas <- subset(exposure_gwas, grepl("EA", Phenotype_info))
# Also, if one would be interested, it is also possible to select focus on either sex in the study,
# but I'm not going to implement that in this experiment
#exposure_gwas <- exposure_gwas[!grepl("women", exposure_gwas$Phenotype_info),]
#exposure_gwas <- exposure_gwas[!grepl("men", exposure_gwas$Phenotype_info),]
Let's inspect the dataset, and set p-value treshold, which in this case will be <5*10^-8. Setting the p-value treshold will ensure robust instruments.
In:
# set p-value treshold
exposure_gwas <- exposure_gwas[exposure_gwas$pval<5*10^-8,]
# format the data
exposure_data <- format_data(exposure_gwas)
Out:
other_allele column has some values that are not A/C/T/G or an indel comprising only these characters or D/I. These SNPs will be excludedMore than one type of unit specified for Type 2 diabetesThe following SNP(s) are missing required information for the MR tests and will be excluded
rs13266634
rs3923113
rs516946
rs7202877
rs7674212
rs7754840
rs7903146
rs944801
Also, I'll prune SNPs with clump_data function with linkage disequilibrium treshold value of 0.001. This will prune the set of SNPs for linkage disequilibrium, as using correlated SNPs can lead to biased causal effect estimates.
In:
# clump the gwas data based on LD
exposure_data <- clump_data(exposure_data, clump_r2 = 0.001)
Out:
API: public: http://gwas-api.mrcieu.ac.uk/
Please look at vignettes for options on running this locally if you need to run many instances of this command.
Clumping KoMTD8, 2 variants, using EUR population reference
Clumping x6K7fU, 24 variants, using EUR population reference
Removing 1 of 24 variants due to LD with other variants or absence from LD reference panel
Clumping KivTpa, 21 variants, using EUR population reference
Removing 1 of 21 variants due to LD with other variants or absence from LD reference panel
Here I'll fetch outcome datasets with the keyword "Alzheimer", from which I'll select a dataset, which has greatest number of SNPs. After this, I'll harmonize the data with harmonise_data function, where the effect of a SNP on the exposure and the effect of that SNP on the outcome must each correspond to the same allele.
In:
ao<-available_outcomes()
outcome_gwas <- subset(ao, grepl("Alzheimer", trait))
# pick GWAS outcome dataset, which has greatest nsnp value
highest_nsnp_count <- outcome_gwas %>% filter(nsnp == max(nsnp))
# fetch the outcome dataset
outcome_data <- extract_outcome_data(snps = exposure_data$SNP, outcomes = highest_nsnp_count[[1,1]])
Out:
Extracting data for 45 SNP(s) from 1 GWAS(s)
Finding proxies for 21 SNPs in outcome finn-a-G6_ALZHEIMER
Extracting data for 21 SNP(s) from 1 GWAS(s)
Because there are several id.exposure identificators, they create different groups in results, from which MR-Base creates several results. This can cause a flood of information. Therefore, I'll create a dplyr pipeline, which will select most numerous exposure by simply summing the number of each id.exposure and picking the one that is most numerous. harmonise_data function creates random identificators, so it will be easiest to pick the most numerous id.exposure this way.
In:
highest_count <- H_data %>% count(id.exposure) %>% filter(n == max(n))
highest_count
Out:
id.exposure | n | |
---|---|---|
1 | ctCHA2 | 20 |
In:
H_data2 <- subset(H_data, grepl(highest_count[1,1], id.exposure))
mr_results <- mr(H_data2)
mr_results
Out:
id.exposure | id.outcome | outcome | exposure | method | nsnp | b | se | pval | |
---|---|---|---|---|---|---|---|---|---|
1 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | MR Egger | 20 | 0.36 | 0.41 | 0.40 |
2 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Weighted median | 20 | 0.19 | 0.16 | 0.24 |
3 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Inverse variance weighted | 20 | 0.30 | 0.11 | 0.01 |
4 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Simple mode | 20 | 0.09 | 0.27 | 0.75 |
5 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Weighted mode | 20 | 0.14 | 0.25 | 0.59 |
Let's see the odds ratio scale if there is any statistically significant relationship between T2D and AD.
In:
generate_odds_ratios(mr_results)
Out:
id.exposure | id.outcome | outcome | exposure | method | nsnp | b | se | pval | lo_ci | up_ci | or | or_lci95 | or_uci95 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | MR Egger | 20 | 0.36 | 0.41 | 0.40 | -0.45 | 1.17 | 1.43 | 0.64 | 3.21 |
2 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Weighted median | 20 | 0.19 | 0.16 | 0.24 | -0.13 | 0.50 | 1.21 | 0.88 | 1.66 |
3 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Inverse variance weighted | 20 | 0.30 | 0.11 | 0.01 | 0.08 | 0.51 | 1.34 | 1.08 | 1.67 |
4 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Simple mode | 20 | 0.09 | 0.27 | 0.75 | -0.44 | 0.61 | 1.09 | 0.64 | 1.85 |
5 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Weighted mode | 20 | 0.14 | 0.25 | 0.59 | -0.35 | 0.63 | 1.15 | 0.70 | 1.87 |
Only inverse variance weighted (IVW) method gave p-value of ~0.007. IVW method calculates the Wald ratio for each of the instruments and combines the results using an inverse-variance weighted meta-analysis approach. The slope from this approach can be interpreted as the causal effect of the exposure on the outcome. With mr_pleiotropy_test, we can review MR-Egger regression intercept, which provides an estimate of the magnitude of horizontal pleiotropy.
In:
mr_pleiotropy_test(H_data2)
mr_heterogeneity(H_data2, method_list=c("mr_egger_regression", "mr_ivw"))
Out:
id.exposure | id.outcome | outcome | exposure | method | Q | Q_df | Q_pval | |
---|---|---|---|---|---|---|---|---|
1 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | MR Egger | 17.00 | 18.00 | 0.52 |
2 | ctCHA2 | finn-a-G6_ALZHEIMER | Alzheimer disease || id:finn-a-G6_ALZHEIMER | Type 2 diabetes (unit decrease) | Inverse variance weighted | 17.02 | 19.00 | 0.59 |
mr_scatter_plot function produces method comparison plot, that is a graphical representation of the effect of the SNPs on the exposure, that is always positive and the effect of the SNPs on the outcome is directed accordingly. I'll inspect individual SNP effects with forest plots and leave-one-out plots. Lastly, I'll depict a funnel plot, which presents the effect estimates against a measure of precision, which in this case will give visual assessment of heterogeneity.
In:
res_singles <- mr_singlesnp(H_data2)
res_loo <- mr_leaveoneout(H_data2)
mr_funnel <- mr_funnel_plot(res_singles)
mr_scatter_plot(mr_results, H_data2)
Out:
Figure 2. Method comparison plot. Indicates some minor effect exposure-outcome effect between T2D and AD.
Forest plots depict summary graphs showing the individual effects of SNPs, calculated using the Wald ratio, along with the overall results to assess the consistency across SNPs.
In:
mr_forest_plot(res_singles)
Out:
Figure 3. Forest plot of causal effects of individual SNPs to the outcome.
Leave-one-out graph shows the results of MR analyses using the IVW method when leaving one SNP out each time. This analysis can be used to assess whether the SNPs are consistent in terms of their effect on the overall outcome or whether the results are being driven by a single outlying SNP.
In:
mr_leaveoneout_plot(res_loo)
Out:
Figure 4. Leave-one-out analysis plot.
Horizontal pleiotropy occurs when the outcome is affected by the SNPs through a pathway that is independent of the exposure and invalidates the second Mendelian randomization assumption. A funnel plot can be used to assess heterogeneity, particularly horizontal pleiotropy, which is likely if points are spread. Directional horizontal pleiotropy may be present if the graph is not symmetrical. If the spread is weighted toward certain direction, this can be evidence of directional pleiotropy.
In:
mr_funnel_plot(res_singles)
Out:
Figure 5. Funnel plot visually to assess heterogeneity.
It thought it would handy, if the script would also automatically return a report about the SNPs presented in the forest plot. This would make inspecting SNPs much simpler, since the user doesn't necessarily have search from external sources metadata concerning these SNPs.
In:
snps <- res_singles$SNP[1:as.numeric(length(res_singles$SNP)-2)]
ncbi_snp_query(snps)
Out:
query | chromosome | bp | class | rsid | gene | alleles | ancestral_allele | variation_allele | seqname | hgvs | assembly | ref_seq | minor | maf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | rs10811661 | 9 | 22134095.00 | snv | rs10811661 | T,A,C | T | A,C | NC_000009.12 | NC_000009.12:g.22134095T>A,NC_000009.12:g.22134095T>C | GRCh38.p12 | T | C | 0.14 | |
2 | rs1111875 | 10 | 92703125.00 | snv | rs1111875 | C,G,T | C | G,T | NC_000010.11 | NC_000010.11:g.92703125C>G,NC_000010.11:g.92703125C>T | GRCh38.p12 | C | T | 0.38 | |
3 | rs11708067 | 3 | 123346931.00 | snv | rs11708067 | ADCY5 | A,G | A | G | NC_000003.12 | NC_000003.12:g.123346931A>G | GRCh38.p12 | A | G | 0.18 |
4 | rs12571751 | 10 | 79182874.00 | snv | rs12571751 | ZMIZ1 | A,G | A | G | NC_000010.11 | NC_000010.11:g.79182874A>G | GRCh38.p12 | A | G | 0.47 |
5 | rs13266634 | 8 | 117172544.00 | snv | rs13266634 | SLC30A8/LOC105375716 | C,A,T | C | A,T | NC_000008.11 | NC_000008.11:g.117172544C>A,NC_000008.11:g.117172544C>T | GRCh38.p12 | C | T | 0.27 |
6 | rs1552224 | 11 | 72722053.00 | snv | rs1552224 | ARAP1 | A,C | A | C | NC_000011.10 | NC_000011.10:g.72722053A>C | GRCh38.p12 | A | C | 0.13 |
7 | rs1801282 | 3 | 12351626.00 | snv | rs1801282 | PPARG | C,G | C | G | NC_000003.12 | NC_000003.12:g.12351626C>G | GRCh38.p12 | C | G | 0.10 |
8 | rs2237892 | 11 | 2818521.00 | snv | rs2237892 | KCNQ1 | C,T | C | T | NC_000011.10 | NC_000011.10:g.2818521C>T | GRCh38.p12 | C | T | 0.09 |
9 | rs2421016 | 10 | 122407996.00 | snv | rs2421016 | PLEKHA1 | C,T | C | T | NC_000010.11 | NC_000010.11:g.122407996C>T | GRCh38.p12 | C | T | 0.48 |
10 | rs2796441 | 9 | 81694033.00 | snv | rs2796441 | LOC101927502 | G,A,T | G | A,T | NC_000009.12 | NC_000009.12:g.81694033G>A,NC_000009.12:g.81694033G>T | GRCh38.p12 | G | A | 0.35 |
11 | rs3923113 | 2 | 164645339.00 | snv | rs3923113 | A,C,G,T | A | C,G,T | NC_000002.12 | NC_000002.12:g.164645339A>C,NC_000002.12:g.164645339A>G,NC_000002.12:g.164645339A>T | GRCh38.p12 | A | C | 0.43 | |
12 | rs4457053 | 5 | 77129124.00 | snv | rs4457053 | ZBED3-AS1 | G,A | G | A | NC_000005.10 | NC_000005.10:g.77129124G>A | GRCh38.p12 | G | A | 0.76 |
13 | rs4607103 | 3 | 64726228.00 | snv | rs4607103 | ADAMTS9-AS2 | C,G,T | C | G,T | NC_000003.12 | NC_000003.12:g.64726228C>G,NC_000003.12:g.64726228C>T | GRCh38.p12 | C | T | 0.27 |
14 | rs5215 | 11 | 17387083.00 | snv | rs5215 | KCNJ11 | C,T | C | T | NC_000011.10 | NC_000011.10:g.17387083C>T | GRCh38.p12 | C | T | 0.70 |
15 | rs7202877 | 16 | 75213347.00 | snv | rs7202877 | T,C,G | T | C,G | NC_000016.10 | NC_000016.10:g.75213347T>C,NC_000016.10:g.75213347T>G | GRCh38.p12 | ||||
16 | rs7578326 | 2 | 226155937.00 | snv | rs7578326 | LOC646736 | A,G | A | G | NC_000002.12 | NC_000002.12:g.226155937A>G | GRCh38.p12 | A | G | 0.35 |
17 | rs7578597 | 2 | 43505684.00 | snv | rs7578597 | THADA | T,C | T | C | NC_000002.12 | NC_000002.12:g.43505684T>C | GRCh38.p12 | T | C | 0.13 |
18 | rs7957197 | 12 | 121022883.00 | snv | rs7957197 | OASL | T,A | T | A | NC_000012.12 | NC_000012.12:g.121022883T>A | GRCh38.p12 | T | A | 0.18 |
19 | rs864745 | 7 | 28140937.00 | snv | rs864745 | JAZF1 | T,C | T | C | NC_000007.14 | NC_000007.14:g.28140937T>C | GRCh38.p12 | T | C | 0.42 |
20 | rs9505118 | 6 | 7290204.00 | snv | rs9505118 | SSR1 | A,G | A | G | NC_000006.12 | NC_000006.12:g.7290204A>G | GRCh38.p12 | A | G | 0.37 |
Unfortunately, the results are quite inconsistent, and only inverse variance weighted (IVW) method gave p-value of ~0.007. Other methods present non-significant causal effect between T2D and AD, that may be the result from too small number of SNPs in my dataset, too differing ethnic background between exposure and outcome datasets, or there isn't significant causal relationship between T2D and AD and there may may be bias in IVW method.
Method comparison plot depicted some exposure-outcome effect between T2D and AD, but the different method regressions spread out as the magnitude increased, indicating non-constant variability in the measurement. Pleiotropy levels were ~-0.0057 (SE: ~0.037, p = ~0.88). Thus, MR-Egger intercept was not significant using a 95% threshold, neither there was any heterogeneity between individual SNP estimates at a global level, which was indicated by the Q statistics and corresponding p-values for heterogeneity. This was further supported by funnel plot, which, compared to examples [5], doesn't have too wide spread, and also seems symmetrical. However, the relatively small number of SNPs leave some room for speculation.
Overall, these results don't have the final say on causal relationship between T2D and AD, but it was a good exercise in using R and MR-Base.