linear_modeling

Last updated: 2018-09-28

Code version: e0ca9da

Linear Modeling

Having performed some basic quality control and filtering on the data, I now move to quantify how well Hi-C interaction frequency values can be predicted from species identity. To accomplish this, I utilize limma to linearly model Hi-C interaction frequencies as a function of all the metadata about each sample’s species, sex, and batch. I then add information from linear modeling to the full dataframe, including log-fold changes between species, p-values, betas, and variances. I then make a plot of the mean vs. the log-fold change as an MA plot proxy for quality control. I move on to look at distributions of all the p-values and betas for each of the terms, make some QQ plots to check for normality in the p-values for each term, and represent the log-fold change and p-values in a volcano plot of the data.

###Read in data, normalized in initial_QC document, both with all hits and conditioned upon a hit being discovered in 4 individuals.
full.data <- fread("~/Desktop/Hi-C/full.data.10.2018", header = TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)
data.4 <- fread("~/Desktop/Hi-C/data.4.10.2018", header=TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)

###Reassign metadata here, just for quick reference.
meta.data <- data.frame("SP"=c("H", "H", "C", "C", "H", "H", "C", "C"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2))

###Now to move on to linear modeling, doing it in both the full data set and the | 4 individuals condition. I utilize limma to make this quick, parllelizable, and simple. First make the model, then do the actual fitting, and finally do multiple testing adjustment with topTable.
design <- model.matrix(~1+meta.data$SP+meta.data$SX+meta.data$Batch) #Parameterize the linear model, with an intercept, and fixed effect terms for species, sex, and library prep batch. If you prefer to think in contrasts, my contrast is humans minus chimps. I prefer to think of one species as the baseline in the linear model, and in this case that's chimps (so chimps get a 0 for species, and humans get a 1).
lmFit(full.data[,304:311], design) %>% eBayes(.) -> model.full
lmFit(data.4[,304:311], design) %>% eBayes(.) -> model.4
volc.full <- topTable(model.full, coef = 2, sort.by = "none", number = Inf)
volc.4 <- topTable(model.4, coef = 2, sort.by = "none", number = Inf)

###Now append the information extracted from linear modeling (p-values and betas for species, sex, and batch) to the 2 data frames:
full.data$sp_beta <- volc.full$logFC #Beta already is logFC since values are log2(obs/exp) Hi-C reads.
full.data$sp_pval <- volc.full$P.Value #Unadjusted P-val
full.data$sp_BH_pval <- volc.full$adj.P.Val #Benjamini-Hochberg adjusted p-values
full.data$avg_IF <- volc.full$AveExpr #Average Interaction Frequency value across all individuals. Useful later for a variety of plots.
full.data$t_statistic <- volc.full$t #t-statistic from limma topTable
full.data$B_statistic <- volc.full$B #B-statistic (LOD) from limma topTable
full.data$sx_pval <- model.full$p.value[,3]
full.data$btc_pval <- model.full$p.value[,4]
full.data$sx_beta <- model.full$coefficients[,3]
full.data$btc_beta <- model.full$coefficients[,4]
full.data$SE <- sqrt(model.full$s2.post)*model.full$stdev.unscaled[,2]


data.4$sp_beta <- volc.4$logFC
data.4$sp_pval <- volc.4$P.Value #Unadjusted P-val, normal scale
data.4$sp_BH_pval <- volc.4$adj.P.Val #Benajimini-Hochberg adjusted p-values
data.4$avg_IF <- volc.4$AveExpr #Average interaction frequency
data.4$t_statistic <- volc.4$t #t-statistic from limma topTable
data.4$B_statistic <- volc.4$B #B-statistic (LOD) from limma topTable
data.4$sx_pval <- model.4$p.value[,3]
data.4$btc_pval <- model.4$p.value[,4]
data.4$sx_beta <- model.4$coefficients[,3]
data.4$btc_beta <- model.4$coefficients[,4]
data.4$SE <- sqrt(model.4$s2.post) * model.4$stdev.unscaled[,2]


###Before moving to any actual assessment of the linear modeling, do a QC check by producing MA plots for both the full set and | 4 individuals. The MA plot here is the mean of Hi-C contact frequencies (avg_IF) on the x-axis, and the logFC (species beta, in this case) on the y-axis. What we are generally looking for is a fairly uniform cloud around 0 stretching across the x-axis.
ggplot(data=full.data[,c('avg_IF','sp_beta')], aes(x=avg_IF, y=sp_beta)) + geom_point() + xlab("Means of Hi-C Contact Frequencies") + ylab("Species Beta (log ratio H/C)") + ggtitle("MA Plot, All Data")

ggplot(data=data.4[,c('avg_IF', 'sp_beta')], aes(x=avg_IF, y=sp_beta)) + geom_point() + xlab("Means of Hi-C Contact Frequencies") + ylab("Species Beta (log ratio H/C)") + ggtitle("MA Plot | 4 Individuals")

#Both of these plots certainly have some outliers, and both have a linear trend I have previously examined that is symmetric across betas--I believe these lines to primarily be created by an orthology-calling issue, as many of the hits along them have all contact frequencies in one species extremely close to 0 (and not near-zero in the other species). Aside from this, there is also a large cloud in the negative beta side that is somewhat disturbing; this is something I still need to check. At the very least, the linear trends seen are symmetric across betas, so this is hopefully an issue that is affecting the two species SOMEWHAT equally.

###Check some of the more basic linear modeling issues: distributions of betas, p-vals, QQplot, volcano plot, etc.
##Check p-vals and betas for all covariates in the full model, on all the data.
ggplot(data=full.data[,322:332], aes(x=sp_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Species") + xlab("P-value")

ggplot(data=full.data[,322:332], aes(x=sx_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Sex") + xlab("P-value")

ggplot(data=full.data[,322:332], aes(x=btc_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Batch") + xlab("P-value")

ggplot(data=full.data[,322:332], aes(x=sp_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Species") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 3500))

ggplot(data=full.data[,322:332], aes(x=sx_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Sex") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 3500))

ggplot(data=full.data[,322:332], aes(x=btc_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Batch") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 3500))

##Check p-vals and betas for all covariates in the full model, on data conditioned upon hit discovery in 4 individuals.
ggplot(data=data.4[,322:332], aes(x=sp_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Species | 4") + xlab("P-value")

ggplot(data=data.4[,322:332], aes(x=sx_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Sex | 4") + xlab("P-value")

ggplot(data=data.4[,322:332], aes(x=btc_pval)) + geom_histogram(binwidth=0.001) + ggtitle("P-vals for Batch | 4") + xlab("P-value")

ggplot(data=data.4[,322:332], aes(x=sp_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Species | 4") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

ggplot(data=data.4[,322:332], aes(x=sx_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Sex | 4") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

ggplot(data=data.4[,322:332], aes(x=btc_beta)) + geom_histogram(binwidth=0.001) + ggtitle("Betas for Batch | 4") + xlab("Beta") + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 850))

#These all look pretty good, with p-val distributions for sex and batch being fairly uniform, and betas all around being fairly normally distributed. We see distinct enrichment for significant p-values for the species term, which is what we were hoping for! Also of note is the fact that the beta distribution for species looks wider than those for batch and sex, reassuring us that species is a driving factor in differential Hi-C contacts.

###Now, to double-check on the p-values with some QQplots. First, I define a function for creating a qqplot easily and coloring elements of the distribution in order to understand where most of the density on the plot is:
newqqplot=function(pvals, quant, title){  
  len = length(pvals)
  res=qqplot(-log10((1:len)/(1+len)),pvals,plot.it=F)
  plot(res$x,res$y, main=title, xlab="Theoretical", ylab="Actual", col=ifelse(res$y>as.numeric(quantile(res$y, quant[1])), ifelse(res$y>as.numeric(quantile(res$y, quant[2])), "red", "blue"), "black"))
  abline(0, 1)
}

##Start QQplotting some of these actual values.
newqqplot(-log10(full.data$sp_pval), c(0.5, 0.75), "QQ Plot, Species P-vals, Full Data")

newqqplot(-log10(full.data$sx_pval), c(0.5, 0.75), "QQ Plot, Sex P-vals, Full Data")

newqqplot(-log10(full.data$btc_pval), c(0.5, 0.75), "QQ Plot, Batch P-vals, Full Data")

newqqplot(-log10(data.4$sp_pval), c(0.5, 0.75), "QQ Plot, Species P-vals, Data | 4")

newqqplot(-log10(data.4$sx_pval), c(0.5, 0.75), "QQ Plot, Sex P-vals, Data | 4")

newqqplot(-log10(data.4$btc_pval), c(0.5, 0.75), "QQ Plot, Batch P-vals, Data | 4")

#The batch and sex QQ plots look fairly decent, but the species QQplot has the distribution rising off the axis extremely quickly, before even reaching 50% of the data. I will examine this in a separate markdown document for quality control on the linear modeling. For now, I show volcano plots of the data before looking to rectify this issue:
volcplot.full <- data.frame(pval=-log10(full.data$sp_pval), beta=full.data$sp_beta, species=full.data$disc_species)
volcplot.4 <- data.frame(pval=-log10(data.4$sp_pval), beta=data.4$sp_beta, species=data.4$disc_species)
ggplot(data=volcplot.full, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("P-values") + ggtitle("Contact Frequency Differences Colored by Species of Discovery")

ggplot(data=volcplot.4, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 p-values") + ggtitle("Contact Frequency Differences Colored by Species of Discovery | 4") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) #Modify to add clearer labels

ggplot(data=volcplot.full, aes(x=beta, y=pval)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Contact Frequency Differences")

ggplot(data=volcplot.4, aes(x=beta, y=pval)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 p-values") + ggtitle("Contact Frequency Differences | 4 Individuals")

#Write out the data with the new columns added on!
fwrite(full.data, "~/Desktop/Hi-C/full.data.10.2018", quote = TRUE, sep = "\t", row.names = FALSE, col.names = TRUE, na="NA", showProgress = FALSE)
fwrite(data.4, "~/Desktop/Hi-C/data.4.10.2018", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress = FALSE)

From this analysis two concerns emerge for further exploration: the QQ plots for the species p-values seem inflated, and the volcano plot seems highly asymmetrical. I’ll examine this further in the next analysis on linear modeling QC.

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bedr_1.0.4          forcats_0.3.0       purrr_0.2.4        
 [4] readr_1.1.1         tibble_1.4.2        tidyverse_1.2.1    
 [7] edgeR_3.20.9        RColorBrewer_1.1-2  heatmaply_0.14.1   
[10] viridis_0.5.0       viridisLite_0.3.0   stringr_1.3.0      
[13] gplots_3.0.1        Hmisc_4.1-1         Formula_1.2-2      
[16] survival_2.41-3     lattice_0.20-35     dplyr_0.7.4        
[19] plotly_4.7.1        cowplot_0.9.3       ggplot2_2.2.1      
[22] reshape2_1.4.3      data.table_1.10.4-3 tidyr_0.8.0        
[25] plyr_1.8.4          limma_3.34.9       

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2     class_7.3-14         modeltools_0.2-21   
 [4] mclust_5.4           rprojroot_1.3-2      htmlTable_1.11.2    
 [7] futile.logger_1.4.3  base64enc_0.1-3      rstudioapi_0.7      
[10] flexmix_2.3-14       mvtnorm_1.0-7        lubridate_1.7.3     
[13] xml2_1.2.0           R.methodsS3_1.7.1    codetools_0.2-15    
[16] splines_3.4.0        mnormt_1.5-5         robustbase_0.92-8   
[19] knitr_1.20           jsonlite_1.5         broom_0.4.3         
[22] cluster_2.0.6        kernlab_0.9-25       R.oo_1.21.0         
[25] httr_1.3.1           backports_1.1.2      assertthat_0.2.0    
[28] Matrix_1.2-12        lazyeval_0.2.1       cli_1.0.0           
[31] acepack_1.4.1        htmltools_0.3.6      tools_3.4.0         
[34] bindrcpp_0.2         gtable_0.2.0         glue_1.2.0          
[37] Rcpp_0.12.16         cellranger_1.1.0     trimcluster_0.1-2   
[40] gdata_2.18.0         nlme_3.1-131.1       iterators_1.0.9     
[43] fpc_2.1-11           psych_1.7.8          testthat_2.0.0      
[46] rvest_0.3.2          gtools_3.5.0         dendextend_1.7.0    
[49] DEoptimR_1.0-8       MASS_7.3-49          scales_0.5.0        
[52] TSP_1.1-5            hms_0.4.2            parallel_3.4.0      
[55] lambda.r_1.2         yaml_2.1.18          gridExtra_2.3       
[58] rpart_4.1-13         latticeExtra_0.6-28  stringi_1.1.7       
[61] gclus_1.3.1          foreach_1.4.4        checkmate_1.8.5     
[64] seriation_1.2-3      caTools_1.17.1       rlang_0.2.0         
[67] pkgconfig_2.0.1      prabclus_2.2-6       bitops_1.0-6        
[70] evaluate_0.10.1      bindr_0.1.1          labeling_0.3        
[73] htmlwidgets_1.0      magrittr_1.5         R6_2.2.2            
[76] pillar_1.2.1         haven_1.1.1          whisker_0.3-2       
[79] foreign_0.8-69       nnet_7.3-12          modelr_0.1.1        
[82] crayon_1.3.4         futile.options_1.0.0 KernSmooth_2.23-15  
[85] rmarkdown_1.9        locfit_1.5-9.1       grid_3.4.0          
[88] readxl_1.0.0         git2r_0.21.0         digest_0.6.15       
[91] diptest_0.75-7       webshot_0.5.0        VennDiagram_1.6.19  
[94] R.utils_2.6.0        stats4_3.4.0         munsell_0.4.3       
[97] registry_0.5

This R Markdown site was created with workflowr

linear_modeling

Ittai Eres

2017-10-11

Linear Modeling