Last updated: 2020-08-05

Checks: 6 1

Knit directory: HiCiPSC/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190311)

The command set.seed(20190311) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: absolute

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute	relative
/Users/ittaieres/HiCiPSC	.

Repository version: 75e11d0

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/TADs/.DS_Store
    Ignored:    data/TADs/Human_inter_30_KR_contact_domains/.DS_Store
    Ignored:    data/TADs/scripts/.DS_Store
    Ignored:    docs/.DS_Store
    Ignored:    docs/figure/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  Rplot.jpeg
    Untracked:  Rplot001.jpeg
    Untracked:  Rplot002.jpeg
    Untracked:  Rplot003.jpeg
    Untracked:  Rplot004.jpeg
    Untracked:  Rplot005.jpeg
    Untracked:  Rplot006.jpeg
    Untracked:  Rplot007.jpeg
    Untracked:  Rplot008.jpeg
    Untracked:  Rplot009.jpeg
    Untracked:  Rplot010.jpeg
    Untracked:  Rplot011.jpeg
    Untracked:  Rplot012.jpeg
    Untracked:  Rplot013.jpeg
    Untracked:  Rplot014.jpeg
    Untracked:  Rplot015.jpeg
    Untracked:  Rplot016.jpeg
    Untracked:  Rplot017.jpeg
    Untracked:  Rplot018.jpeg
    Untracked:  Rplot019.jpeg
    Untracked:  Rplot020.jpeg
    Untracked:  Rplot021.jpeg
    Untracked:  Rplot022.jpeg
    Untracked:  Rplot023.jpeg
    Untracked:  Rplot024.jpeg
    Untracked:  Rplot025.jpeg
    Untracked:  Rplot026.jpeg
    Untracked:  Rplot027.jpeg
    Untracked:  Rplot028.jpeg
    Untracked:  Rplot029.jpeg
    Untracked:  Rplot030.jpeg
    Untracked:  Rplot031.jpeg
    Untracked:  Rplot032.jpeg
    Untracked:  Rplot033.jpeg
    Untracked:  Rplot034.jpeg
    Untracked:  Rplot035.jpeg
    Untracked:  Rplot036.jpeg
    Untracked:  Rplot037.jpeg
    Untracked:  Rplot038.jpeg
    Untracked:  Rplot039.jpeg
    Untracked:  Rplot040.jpeg
    Untracked:  Rplot041.jpeg
    Untracked:  Rplot042.jpeg
    Untracked:  Rplot043.jpeg
    Untracked:  Rplot044.jpeg
    Untracked:  Rplot045.jpeg
    Untracked:  Rplot046.jpeg
    Untracked:  Rplot047.jpeg
    Untracked:  Rplot048.jpeg
    Untracked:  Rplot049.jpeg
    Untracked:  Rplot050.jpeg
    Untracked:  Rplot051.jpeg
    Untracked:  Rplot052.jpeg
    Untracked:  Rplot053.jpeg
    Untracked:  Rplot054.jpeg
    Untracked:  Rplot055.jpeg
    Untracked:  Rplot056.jpeg
    Untracked:  Rplot057.jpeg
    Untracked:  Rplot058.jpeg
    Untracked:  Rplot059.jpeg
    Untracked:  Rplot060.jpeg
    Untracked:  Rplot061.jpeg
    Untracked:  Rplot062.jpeg
    Untracked:  Rplot063.jpeg
    Untracked:  Rplot064.jpeg
    Untracked:  Rplot065.jpeg
    Untracked:  Rplot066.jpeg
    Untracked:  Rplot067.jpeg
    Untracked:  Rplot068.jpeg
    Untracked:  Rplot069.jpeg
    Untracked:  Rplot070.jpeg
    Untracked:  Rplot071.jpeg
    Untracked:  Rplot072.jpeg
    Untracked:  Rplot073.jpeg
    Untracked:  Rplot074.jpeg
    Untracked:  Rplot075.jpeg
    Untracked:  Rplot076.jpeg
    Untracked:  Rplot077.jpeg
    Untracked:  Rplot078.jpeg
    Untracked:  Rplot079.jpeg
    Untracked:  Rplot080.jpeg
    Untracked:  Rplot081.jpeg
    Untracked:  Rplot082.jpeg
    Untracked:  Rplot083.jpeg
    Untracked:  Rplot084.jpeg
    Untracked:  Rplot085.jpeg
    Untracked:  Rplot086.jpeg
    Untracked:  Rplot087.jpeg
    Untracked:  Rplot088.jpeg
    Untracked:  Rplot089.jpeg
    Untracked:  Rplot090.jpeg
    Untracked:  Rplot091.jpeg
    Untracked:  Rplot092.jpeg
    Untracked:  Rplot093.jpeg
    Untracked:  Rplot094.jpeg
    Untracked:  Rplot095.jpeg
    Untracked:  Rplot096.jpeg
    Untracked:  Rplot097.jpeg
    Untracked:  Rplot098.jpeg
    Untracked:  Rplot099.jpeg
    Untracked:  Rplot100.jpeg
    Untracked:  Rplot101.jpeg
    Untracked:  Rplot102.jpeg
    Untracked:  Rplot103.jpeg
    Untracked:  Rplot104.jpeg
    Untracked:  Rplot105.jpeg
    Untracked:  Rplot106.jpeg
    Untracked:  Rplot107.jpeg
    Untracked:  Rplot108.jpeg
    Untracked:  Rplot109.jpeg
    Untracked:  Rplot110.jpeg
    Untracked:  Rplot111.jpeg
    Untracked:  Rplot112.jpeg
    Untracked:  Rplot113.jpeg
    Untracked:  Rplot114.jpeg
    Untracked:  Rplot115.jpeg
    Untracked:  Rplot116.jpeg
    Untracked:  Rplot117.jpeg
    Untracked:  Rplot118.jpeg
    Untracked:  Rplot119.jpeg
    Untracked:  Rplot120.jpeg
    Untracked:  Rplot121.jpeg
    Untracked:  Rplot122.jpeg
    Untracked:  Rplot123.jpeg
    Untracked:  Rplot124.jpeg
    Untracked:  Rplot125.jpeg
    Untracked:  Rplot126.jpeg
    Untracked:  Rplot127.jpeg
    Untracked:  Rplot128.jpeg
    Untracked:  Rplot129.jpeg
    Untracked:  Rplot130.jpeg
    Untracked:  Rplot131.jpeg
    Untracked:  Rplot132.jpeg
    Untracked:  Rplot133.jpeg
    Untracked:  Rplot134.jpeg
    Untracked:  Rplot135.jpeg
    Untracked:  Rplot136.jpeg
    Untracked:  Rplot137.jpeg
    Untracked:  Rplot138.jpeg
    Untracked:  Rplot139.jpeg
    Untracked:  Rplot140.jpeg
    Untracked:  Rplot141.jpeg
    Untracked:  Rplot142.jpeg
    Untracked:  Rplot143.jpeg
    Untracked:  Rplot144.jpeg
    Untracked:  Rplot145.jpeg
    Untracked:  Rplot146.jpeg
    Untracked:  Rplot147.jpeg
    Untracked:  Rplot148.jpeg
    Untracked:  Rplot149.jpeg
    Untracked:  Rplot150.jpeg
    Untracked:  Rplot151.jpeg
    Untracked:  Rplot152.jpeg
    Untracked:  Rplot153.jpeg
    Untracked:  Rplot154.jpeg
    Untracked:  Rplot155.jpeg
    Untracked:  Rplot156.jpeg
    Untracked:  Rplot157.jpeg
    Untracked:  Rplot158.jpeg
    Untracked:  Rplot159.jpeg
    Untracked:  Rplot160.jpeg
    Untracked:  Rplot161.jpeg
    Untracked:  Rplot162.jpeg
    Untracked:  Rplot163.jpeg
    Untracked:  Rplot164.jpeg
    Untracked:  Rplot165.jpeg
    Untracked:  Rplot166.jpeg
    Untracked:  Rplot167.jpeg
    Untracked:  Rplot168.jpeg
    Untracked:  Rplot169.jpeg
    Untracked:  Rplot170.jpeg
    Untracked:  Rplot171.jpeg
    Untracked:  Rplot172.jpeg
    Untracked:  Rplot173.jpeg
    Untracked:  Rplot174.jpeg
    Untracked:  Rplot175.jpeg
    Untracked:  Rplot176.jpeg
    Untracked:  Rplot177.jpeg
    Untracked:  Rplot178.jpeg
    Untracked:  Rplot179.jpeg
    Untracked:  Rplot180.jpeg
    Untracked:  Rplot181.jpeg
    Untracked:  Rplot182.jpeg
    Untracked:  Rplot183.jpeg
    Untracked:  Rplot184.jpeg
    Untracked:  Rplot185.jpeg
    Untracked:  Rplot186.jpeg
    Untracked:  Rplot187.jpeg
    Untracked:  Rplot188.jpeg
    Untracked:  Rplot189.jpeg
    Untracked:  Rplot190.jpeg
    Untracked:  Rplot191.jpeg
    Untracked:  Rplot192.jpeg
    Untracked:  Rplot193.jpeg
    Untracked:  Rplot194.jpeg
    Untracked:  Rplot195.jpeg
    Untracked:  Rplot196.jpeg
    Untracked:  Rplot197.jpeg
    Untracked:  Rplot198.jpeg
    Untracked:  Rplot199.jpeg
    Untracked:  Rplot200.jpeg
    Untracked:  Rplot201.jpeg
    Untracked:  Rplot202.jpeg
    Untracked:  Rplot203.jpeg
    Untracked:  Rplot204.jpeg
    Untracked:  Rplot205.jpeg
    Untracked:  Rplot206.jpeg
    Untracked:  Rplot207.jpeg
    Untracked:  Rplot208.jpeg
    Untracked:  Rplot209.jpeg
    Untracked:  Rplot210.jpeg
    Untracked:  Rplot211.jpeg
    Untracked:  Rplot212.jpeg
    Untracked:  Rplot213.jpeg
    Untracked:  Rplot214.jpeg
    Untracked:  Rplot215.jpeg
    Untracked:  Rplot216.jpeg
    Untracked:  Rplot217.jpeg
    Untracked:  Rplot218.jpeg
    Untracked:  Rplot219.jpeg
    Untracked:  Rplot220.jpeg
    Untracked:  Rplot221.jpeg
    Untracked:  Rplot222.jpeg
    Untracked:  Rplot223.jpeg
    Untracked:  Rplot224.jpeg
    Untracked:  Rplot225.jpeg
    Untracked:  Rplot226.jpeg
    Untracked:  Rplot227.jpeg
    Untracked:  Rplot228.jpeg
    Untracked:  Rplot229.jpeg
    Untracked:  Rplot230.jpeg
    Untracked:  Rplot231.jpeg
    Untracked:  Rplot232.jpeg
    Untracked:  Rplot233.jpeg
    Untracked:  Rplot234.jpeg
    Untracked:  Rplot235.jpeg
    Untracked:  Rplot236.jpeg
    Untracked:  Rplot237.jpeg
    Untracked:  Rplot238.jpeg
    Untracked:  Rplot239.jpeg
    Untracked:  Rplot240.jpeg
    Untracked:  Rplot241.jpeg
    Untracked:  Rplot242.jpeg
    Untracked:  Rplot243.jpeg
    Untracked:  Rplot244.jpeg
    Untracked:  Rplot245.jpeg
    Untracked:  Rplot246.jpeg
    Untracked:  Rplot247.jpeg
    Untracked:  Rplot248.jpeg
    Untracked:  Rplot249.jpeg
    Untracked:  Rplot250.jpeg
    Untracked:  Rplot251.jpeg
    Untracked:  Rplot252.jpeg
    Untracked:  Rplot253.jpeg
    Untracked:  Rplot254.jpeg
    Untracked:  Rplot255.jpeg
    Untracked:  Rplot256.jpeg
    Untracked:  Rplot257.jpeg
    Untracked:  Rplot258.jpeg
    Untracked:  Rplot259.jpeg
    Untracked:  Rplot260.jpeg
    Untracked:  Rplot261.jpeg
    Untracked:  Rplot262.jpeg
    Untracked:  Rplot263.jpeg
    Untracked:  Rplot264.jpeg
    Untracked:  Rplot265.jpeg
    Untracked:  Rplot266.jpeg
    Untracked:  Rplot267.jpeg
    Untracked:  Rplot268.jpeg
    Untracked:  Rplot269.jpeg
    Untracked:  Rplot270.jpeg
    Untracked:  Rplot271.jpeg
    Untracked:  Rplot272.jpeg
    Untracked:  Rplot273.jpeg
    Untracked:  Rplot274.jpeg
    Untracked:  Rplot275.jpeg
    Untracked:  Rplot276.jpeg
    Untracked:  Rplot277.jpeg
    Untracked:  Rplot278.jpeg
    Untracked:  Rplot279.jpeg
    Untracked:  Rplot280.jpeg
    Untracked:  Rplot281.jpeg
    Untracked:  Rplot282.jpeg
    Untracked:  Rplot283.jpeg
    Untracked:  Rplot284.jpeg
    Untracked:  Rplot285.jpeg
    Untracked:  Rplot286.jpeg
    Untracked:  Rplot287.jpeg
    Untracked:  Rplot288.jpeg
    Untracked:  Rplot289.jpeg
    Untracked:  Rplot290.jpeg
    Untracked:  Rplot291.jpeg
    Untracked:  Rplot292.jpeg
    Untracked:  Rplot293.jpeg
    Untracked:  Rplot294.jpeg
    Untracked:  Rplot295.jpeg
    Untracked:  Rplot296.jpeg
    Untracked:  Rplot297.jpeg
    Untracked:  Rplot298.jpeg
    Untracked:  Rplot299.jpeg
    Untracked:  Rplot300.jpeg
    Untracked:  Rplot301.jpeg
    Untracked:  Rplot302.jpeg
    Untracked:  Rplot303.jpeg
    Untracked:  Rplot304.jpeg
    Untracked:  S2A.jpeg
    Untracked:  S2B.jpeg
    Untracked:  code/mediation_test.R
    Untracked:  data/Chimp_orthoexon_extended_info.txt
    Untracked:  data/Human_orthoexon_extended_info.txt
    Untracked:  data/Meta_data.txt
    Untracked:  data/TADs/Arrowhead_individuals/
    Untracked:  data/TADs/CH.10kb.closest.panTro5
    Untracked:  data/TADs/CTCF.overlap.computer.sh
    Untracked:  data/TADs/CTCF/
    Untracked:  data/TADs/HC.10kb.closest.hg38
    Untracked:  data/TADs/Human_inter_30_KR_contact_domains_PT6/
    Untracked:  data/TADs/Rao/
    Untracked:  data/TADs/TopDom/
    Untracked:  data/TADs/deprecated/
    Untracked:  data/TADs/mega.bounds.intersect.c.sh
    Untracked:  data/TADs/mega.bounds.intersect.merge.sh
    Untracked:  data/TADs/mega.bounds.rao.sh
    Untracked:  data/TADs/mega.domains.bedtoolsc.sh
    Untracked:  data/TADs/mega.domains.rao.sh
    Untracked:  data/TADs/overlaps/
    Untracked:  data/TADs/overlaps_rao_style/
    Untracked:  data/TADs/rao.chimp.subsample.tester.sh
    Untracked:  data/chimp_lengths.txt
    Untracked:  data/counts_iPSC.txt
    Untracked:  data/epigenetic_enrichments/
    Untracked:  data/final.10kb.homer.df
    Untracked:  data/final.juicer.10kb.KR
    Untracked:  data/final.juicer.10kb.VC
    Untracked:  data/hic_gene_overlap/
    Untracked:  data/human_lengths.txt
    Untracked:  data/old_mediation_permutations/
    Untracked:  output/DC_regions.txt
    Untracked:  output/IEE.RPKM.RDS
    Untracked:  output/IEE_voom_object.RDS
    Untracked:  output/data.4.filtered.lm.QC
    Untracked:  output/data.4.fixed.init.LM
    Untracked:  output/data.4.fixed.init.QC
    Untracked:  output/data.4.init.LM
    Untracked:  output/data.4.init.QC
    Untracked:  output/data.4.lm.QC
    Untracked:  output/full.data.10.init.LM
    Untracked:  output/full.data.10.init.QC
    Untracked:  output/full.data.10.lm.QC
    Untracked:  output/full.data.annotations.RDS
    Untracked:  output/gene.hic.filt.KR.RDS
    Untracked:  output/gene.hic.filt.RDS
    Untracked:  output/gene.hic.filt.VC.RDS
    Untracked:  output/homer_mediation.rda
    Untracked:  output/juicer.IEE.RPKM.RDS
    Untracked:  output/juicer.IEE_voom_object.RDS
    Untracked:  output/juicer.filt.KR
    Untracked:  output/juicer.filt.KR.final
    Untracked:  output/juicer.filt.KR.lm
    Untracked:  output/juicer.filt.VC
    Untracked:  output/juicer.filt.VC.final
    Untracked:  output/juicer.filt.VC.lm
    Untracked:  output/juicer_mediation.rda
    Untracked:  output/mc_de.rds
    Untracked:  output/mc_de_homer.rds
    Untracked:  output/mc_de_juicer.rds
    Untracked:  output/mc_node.rds
    Untracked:  output/mc_node_homer.rds
    Untracked:  output/mc_node_juicer.rds
    Untracked:  ~/

Unstaged changes:
    Modified:   analysis/TADs.Rmd
    Deleted:    analysis/alt_mediation.Rmd
    Modified:   analysis/enrichment.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	2ec8067	Ittai Eres	2019-05-09	Build site.
html	b69c73d	Ittai Eres	2019-05-05	Build site.
Rmd	2419813	Ittai Eres	2019-05-05	Update all files.
html	6f6db11	Ittai Eres	2019-04-23	Build site.
Rmd	29c50c3	Ittai Eres	2019-04-23	Update many analyses and move outputs/intermediaries to /data folder
html	972c85a	Ittai Eres	2019-03-14	Build site.
Rmd	ef02896	Ittai Eres	2019-03-14	Add Linear modeling QC and modify index.Rmd to reflect figures therein

First, load necessary libraries: limma, plyr, tidyr, data.table, reshape2, cowplot, plotly, dplyr, Hmisc, gplots, stringr, heatmaply, RColorBrewer, edgeR, tidyverse, and compiler

QC on QQ Plots

Having observed inflated p-values in QQ plots from doing linear modeling, as well as a stark asymmetry in the distribution of betas for the species term in the volcano plot, here I move to examine both issues. The QQ plot issue is not truly concerning, as we were not necessarily expecting normality in this set of hits enriched for significant Hi-C contacts. It seems likely that any set of significant Hi-C hits called independently in each species will have some inflation in the significance of a species term for linearly modeling the interaction frequencies. That the QQ plot does not look like a GWAS QQ plot is not truly concerning, however, just to be sure, first I will run a series of QQ plots once again after shuffling some of the metadata labels. This simply ensures the inflation seen is not an artifact of processing. I tackle the volcano plot asymmetry after alleviating QQ plot concerns and doing some further genomic filtering explained below.

#This file will examine first the inflation of p-values from the QQ plot at the end of linear_modeling.Rmd, and then the asymmetry seen in the volcano plots from linear modeling.

#First, grab the qqplotting function I utilize from linear_modeling.Rmd:
newqqplot=function(pvals, quant, title){  
  len = length(pvals)
  res=qqplot(-log10((1:len)/(1+len)),pvals,plot.it=F)
  plot(res$x,res$y, main=title, xlab="Theoretical", ylab="Actual", col=ifelse(res$y>as.numeric(quantile(res$y, quant[1])), ifelse(res$y>as.numeric(quantile(res$y, quant[2])), "red", "blue"), "black"))
  abline(0, 1)
}

#Now, as always, read in data files modified by initial_QC.Rmd and linear_modeling.Rmd
setwd("/Users/ittaieres/HiCiPSC")
full.data <- fread("output/full.data.10.init.LM", header = TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)
data.4 <- fread("output/data.4.init.LM", header=TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)
data.4.fixed <- fread("output/data.4.fixed.init.LM", header=TRUE, data.table=FALSE, stringsAsFactors = FALSE, showProgress = FALSE)

#This is the QQ plot for species from the linear model for Hi-C values from linear_modeling.Rmd. We can see a significant inflation of p-values here rising above the expected normal distribution alarmingly quickly:
newqqplot(-log10(data.4$sp_pval), c(0.5, 0.75), "QQ Plot, Species P-vals, Data | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#In order to check that this extreme inflation of p values is not merely a technical artifact, here I try shuffling the species labels and running the linear model again. I would hope to see a more normal QQplot and this would perhaps suggest this the inflation seen is due to true biological effects, rather than technical factors. The fake designs just have some species swapped; the first two fake designs are balanced across sex and batch, and the second two are balanced with respect to batch (equal numbers of humans and chimps in both), but any sex-species interaction would be confounded with batch (since all members of a species in one batch are the same sex). Note that this ultimately shouldn't matter since I'm just for checking QQ normality here, especially since I'll start with the full model but then also remove batch and sex as covariates and re-check the QQ plot, to rule out overfitting.
fake.meta1 <- data.frame("SP"=c("H", "C", "H", "C", "H", "C", "C", "H"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2))
fake.meta2 <- data.frame("SP"=c("C", "H", "C", "H", "C", "H", "H", "C"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2))
fake.meta3 <- data.frame("SP"=c("H", "C", "C", "H", "C", "H", "C", "H"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2))
fake.meta4 <- data.frame("SP"=c("C", "H", "H", "C", "C", "H", "C", "H"), "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"), "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2))

#First, test out the fake metadataframess utilizing the linear model with all covariates included--species, sex, and batch.
lmFit(data.4[,304:311], model.matrix(~1+fake.meta1$SP+fake.meta1$SX+fake.meta1$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake1
lmFit(data.4[,304:311], model.matrix(~1+fake.meta2$SP+fake.meta2$SX+fake.meta2$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake2
lmFit(data.4[,304:311], model.matrix(~1+fake.meta3$SP+fake.meta3$SX+fake.meta3$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake3

Coefficients not estimable: fake.meta3$SXM

Warning: Partial NA coefficients for 347206 probe(s)

Warning in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim =
stdev.coef.lim, : Estimation of var.prior failed - set to default value

lmFit(data.4[,304:311], model.matrix(~1+fake.meta4$SP+fake.meta4$SX+fake.meta4$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake4

newqqplot(-log10(fake1$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design 1 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake2$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design 2 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake3$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design 3 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake4$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design 4 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#While these QQ plots are nice in that they no longer show inflation of p-values quickly, many of them show slight deflation. Here I try doing the same thing again but without sex and then without batch, and then without both, as covariates--to account for the possibility that inclusion of these covariates in the model is overfitting:
lmFit(data.4[,304:311], model.matrix(~1+fake.meta1$SP+fake.meta1$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake1
lmFit(data.4[,304:311], model.matrix(~1+fake.meta2$SP+fake.meta2$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake2
lmFit(data.4[,304:311], model.matrix(~1+fake.meta3$SP+fake.meta3$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake3
lmFit(data.4[,304:311], model.matrix(~1+fake.meta4$SP+fake.meta4$Batch)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake4
newqqplot(-log10(fake1$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no SX, 1 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake2$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no SX, 2 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake3$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no SX, 3 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake4$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no SX, 4 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

lmFit(data.4[,304:311], model.matrix(~1+fake.meta1$SP+fake.meta1$SX)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake1
lmFit(data.4[,304:311], model.matrix(~1+fake.meta2$SP+fake.meta2$SX)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake2
lmFit(data.4[,304:311], model.matrix(~1+fake.meta3$SP+fake.meta3$SX)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake3

Coefficients not estimable: fake.meta3$SXM

Warning: Partial NA coefficients for 347206 probe(s)

Warning: Estimation of var.prior failed - set to default value

lmFit(data.4[,304:311], model.matrix(~1+fake.meta4$SP+fake.meta4$SX)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake4
newqqplot(-log10(fake1$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no BTC, 1 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake2$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no BTC, 2 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake3$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no BTC, 3 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake4$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design no BTC, 4 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

lmFit(data.4[,304:311], model.matrix(~1+fake.meta1$SP)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake1
lmFit(data.4[,304:311], model.matrix(~1+fake.meta2$SP)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake2
lmFit(data.4[,304:311], model.matrix(~1+fake.meta3$SP)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake3
lmFit(data.4[,304:311], model.matrix(~1+fake.meta4$SP)) %>% eBayes(.) %>% topTable(., coef=2, sort.by="none", number=Inf) -> fake4
newqqplot(-log10(fake1$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design SP only, 1 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake2$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design SP only, 2 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake3$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design SP only, 3 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(fake4$P.Value), c(0.5, 0.75), "QQ Plot, Shuffled Design SP only, 4 | 4")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

These are slightly better, but many show some small degree of deflation. After looking at all these QQ plots concerns about p-value inflation are fairly assuaged, and it’s again worth noting that there was no real cause for concern in the first place. Once again, I pulled out significant Hi-C hits from each species, so it makes sense that this set of hits might already be enriched for strong species differences. I conclude that the signal seen in the QQ plot is not an artifact of making incorrect assumptions in linear modeling.

Filtering on Genomic Differences in Bin Sizes and Pair Distances

However, we might still be concerned that some of the observed differences are not truly driven by biology, but due to differences in genome builds or other issues induced by liftOver. I now add several other metrics measuring differences in bin sizes and pair distances between humans and chimps, to help identify differences that may be primarily driven by mapping of orthology or different genome builds rather than true biology. I will then subset the QQ plots to particular classes of these hits and see if any pattern emerges. This utilizes the normal set of p-values from the original design conditioned upon discovery in 4 individuals. I also place hits into classes based on differences in bin size or distance between pair mates, and use a 3-dimensional plot to see if any class(es) of hits show strong inflation in their species p-values from linear modeling. This represents one of the final filtering steps.

#To do this, I first have to pull out the mean start and end positions of bins for each of the species from the data:
H1startCmean <- rowMeans(data.4[,c('Hstart1-C', 'Hstart1-D', 'Hstart1-G', 'Hstart1-H')], na.rm=TRUE)
H1endCmean <- rowMeans(data.4[,c('Hend1-C', 'Hend1-D', 'Hend1-G', 'Hend1-H')], na.rm=TRUE)
H2startCmean <- rowMeans(data.4[,c('Hstart2-C', 'Hstart2-D', 'Hstart2-G', 'Hstart2-H')], na.rm=TRUE)
H2endCmean <- rowMeans(data.4[,c('Hend2-C', 'Hend2-D', 'Hend2-G', 'Hend2-H')], na.rm=TRUE)

C1startHmean <- rowMeans(data.4[,c('Cstart1-A', 'Cstart1-B', 'Cstart1-E', 'Cstart1-F')], na.rm=TRUE) #this is identical and easier.
C1endHmean <- rowMeans(data.4[,c('Cend1-A', 'Cend1-B', 'Cend1-E', 'Cend1-F')], na.rm=TRUE)
C2startHmean <- rowMeans(data.4[,c('Cstart2-A', 'Cstart2-B', 'Cstart2-E', 'Cstart2-F')], na.rm=TRUE)
C2endHmean <- rowMeans(data.4[,c('Cend2-A', 'Cend2-B', 'Cend2-E', 'Cend2-F')], na.rm=TRUE)
  
#Now, I use these data to add columns to the data frame for the sizes of bins and distance between bins. Note that I am only really looking at differences in the size of the "orthologous" bins called through liftover, as compared to their original size (10kb). Bin size differences in the values at the start of each individual's portion of the data frame for coordinates matching that species are merely due to size differences in reciprocal best hits liftover. The true size of bins within their own species is always 10kb. So here bin sizes being appended to the data frame are for lifted-over bins. I do the distance differences based on the HC-pair values (H1/H2 and C1/C2) that have been rounded to the nearest 10kb from the original values given by homer; since this should be balanced across species it shouldn't matter much. Worst case it would make the estimate off by 20kb, maximum.
H1sizeC <- H1endCmean-H1startCmean
H2sizeC <- H2endCmean-H2startCmean
data.4$H1diff <- abs(10000-H1sizeC)
data.4$H2diff <- abs(10000-H2sizeC)
C1sizeH <- C1endHmean-C1startHmean
C2sizeH <- C2endHmean-C2startHmean
data.4$C1diff <- abs(10000-C1sizeH)
data.4$C2diff <- abs(10000-C2sizeH)
data.4$Hdist <- abs(as.numeric(sub(".*-", "", data.4$H2))-as.numeric(sub(".*-", "", data.4$H1)))
data.4$Cdist <- abs(as.numeric(sub(".*-", "", data.4$C2))-as.numeric(sub(".*-", "", data.4$C1)))
data.4$dist_diff <- abs(data.4$Hdist-data.4$Cdist)

#Now I look at the distributions of some of these metrics to inform me about how best to bin the data for filtering and further QC checking.
quantile(data.4$H1diff, na.rm=TRUE)

  0%  25%  50%  75% 100% 
   0    9   21   51 1094

quantile(data.4$H2diff, na.rm=TRUE)

  0%  25%  50%  75% 100% 
   0    8   21   51 1101

quantile(data.4$C1diff, na.rm=TRUE)

  0%  25%  50%  75% 100% 
   0    9   21   50 1096

quantile(data.4$C2diff, na.rm=TRUE)

  0%  25%  50%  75% 100% 
   0    8   21   50 1098

quantile(data.4$dist_diff, na.rm=TRUE)

      0%      25%      50%      75%     100% 
       0        0        0    20000 77730000

#From this I can see that the majority of bin size differences are relatively small (<25 bp), and the majority of hits do not have a difference in distance between the mates in a pair across the species (all the way up to 50th percentile the distance difference is still 0). Now I'll take a look at some QQ plots filtering along these values to get an idea of if any of these technical orthology-calling artifacts are driving the inflation we see.
newqqplot(-log10(filter(data.4, H1diff<21&H2diff<21&C1diff<21&C2diff<21)$sp_pval), c(0.5, 0.75), "QQ Plot, Bin Size Changes < 21 bp | 4") #We can see that the QQplot looks a bit better if we only utilize hits where the bin size difference is less than the 50% percentile of their distribution.

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, H1diff>=21&H2diff>=21&C1diff>=21&C2diff>=21)$sp_pval), c(0.5, 0.75), "QQ Plot, Bin Size Changes >= 21bp | 4") #This doesn't look much different--what about the case where there are large differences in the size?

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, H1diff>=100|H2diff>=100|C1diff>=100|C2diff>=100)$sp_pval), c(0.5, 0.75), "QQ Plot, Bin Size Changes >= 100 bp | 4") #It's hard to know what to make of this since here I am just allowing for any bin to have a size change greater than 100 bp. It looks fairly similar to the QQplot of the full data.

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, H1diff>=500|H2diff>=500|C1diff>=500|C2diff>=500)$sp_pval), c(0.5, 0.75), "QQ Plot, Bin Size Changes >= 500 bp | 4") #It's hard to know what to make of this since here I am just allowing for any bin to have a size change greater than 500 bp.

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, H1diff>=1000|H2diff>=1000|C1diff>=1000|C2diff>=1000)$sp_pval), c(0.5, 0.75), "QQ Plot, Bin Size Changes >= 1000 bp | 4") #It's hard to know what to make of this since here I am just allowing for any bin to have a size change greater than 1kb.

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#It may be hard to say anything super definitive about the bin size changes from these plots, but the fact that they all still show inflation, and I don't see any stark difference between bin size changes exceeding 100 bp, suggests to me that bin size is having a minimal effect here. Still, we will include it when looking at filtering criteria below for figuring out what hits might need to be removed still.

newqqplot(-log10(filter(data.4, dist_diff>=100000)$sp_pval), c(0.5, 0.75), "QQ Plot, Distance Diff >= 100kb | 4") #This is good to know--see EXTREMELY strong inflation amongst this class of p-values, perhaps giving another criteria for filtering

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, dist_diff>=50000)$sp_pval), c(0.5, 0.75), "QQ Plot, Distance Diff >= 50kb | 4") #This is good to know--see that the inflation is a little weaker if we move towards including more pairs that have smaller distance differences between the species. What about pairs where there is no difference?

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, dist_diff==0)$sp_pval), c(0.5, 0.75), "QQ Plot, NO Distance Diff | 4") #Values are still inflated, but the first solid 50% of the distribution stays along the normal line--this is great!

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

newqqplot(-log10(filter(data.4, dist_diff<=20000)$sp_pval), c(0.5, 0.75), "QQ Plot, Distance Diff <=20kb | 4") #And now we can see the trend in the opposite direction--SLIGHT inflation of the p-values when including more hits that have a larger distance difference.

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#Based on the quantiles from before, the size differences have relatively little variation in their distribution. Hence I will take them and summarize them as a single value here:
sizediffs <- rowMeans(select(data.4, H1diff, H2diff, C1diff, C2diff), na.rm=TRUE)
quantile(sizediffs, probs=seq(0, 1, 0.2))

    0%    20%    40%    60%    80%   100% 
  0.00  13.50  23.25  41.50 108.50 970.50

#Now what I would like to do is find classes of hits--combinations between size and distance differences--that comprise roughly 10% of the data, or ~30k hits here. I will look at these classes in a 3-dimensional plot that includes size and distance differences as two of the axes, and the FDR from linear modeling on the 3rd. What I am in search of are classes of hits that show inflated FDR.
plot3d <- data.frame(bin_min=pmin(data.4$H1diff, data.4$H2diff, data.4$C1diff, data.4$C2diff, na.rm=TRUE), bin_max=pmax(data.4$H1diff, data.4$H2diff, data.4$C1diff, data.4$C2diff, na.rm=TRUE), sizediffs=sizediffs, dist_diff=data.4$dist_diff, FDR=data.4$sp_BH_pval)

#Of course, I already inherently know I will have a difficult time finding sets large enough with the bigger differences in distance, as only 10k or so of the hits even have a distance difference greater than 50kb. About half the hits have no distance difference, and of the remaining half, about half have size differences of 1-2 bins (10-20 kb), and half have greater size differences.

#First, I go in search of sets that will be large enough.
nrow(filter(plot3d, sizediffs<=13.5&dist_diff<10000)) #35k, a class of minimal changes--no distance difference, and minimal size difference.

[1] 35260

nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff<10000)) #35k, still minimal changes

[1] 35426

nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff<10000)) #35k, minimal changes still, bin size getting up there

[1] 35933

nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff<10000)) #35k, minimal changes still, bin size even higher though

[1] 35105

nrow(filter(plot3d, sizediffs>109.5&dist_diff<10000)) #And there's another 35k! With big bin size changes.

[1] 35283

nrow(filter(plot3d, sizediffs<=13.5&dist_diff>=10000&dist_diff<=20000)) #23k, a class of minimal changes--no size difference, and minimal distance difference (1-2 bins off).

[1] 22902

nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff>=10000&dist_diff<=20000)) #23k, small size difference, and minimal distance difference (1-2 bins off, 10-20 kb).

[1] 23367

nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff>=10000&dist_diff<=20000)) #23k, minimal changes still, bin size getting up there, and minimal distance difference (1-2 bins off, 10-20 kb).

[1] 23082

nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff>=10000&dist_diff<=20000)) #23k, minimal changes still, bin size even higher though, and minimal distance difference (1-2 bins off, 10-20 kb).

[1] 22788

nrow(filter(plot3d, sizediffs>109.5&dist_diff>=10000&dist_diff<=20000)) #And there's another 23k! With big bin size changes, and minimal distance difference (1-2 bins off, 10-20 kb).

[1] 22924

nrow(filter(plot3d, sizediffs<=13.5&dist_diff>20000)) #11k, a class of large changes--big distance difference, and minimal size difference.

[1] 11638

nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff>20000)) #11k, a class of large changes--big distance difference, and relatively small size difference.

[1] 11113

nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff>20000)) #11k, a class of large changes--big distance difference, and median size difference.

[1] 10793

nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff>20000)) #11k, a class of large changes--big distance difference, and fairly large size difference.

[1] 10949

nrow(filter(plot3d, sizediffs>109.5&dist_diff>20000)) #11k, a class of the largest changes--big distance difference, and big size difference.

[1] 10643

#Now, create a data frame with these classes, their sizes, and the number of hits in each of the classes that is at FDR <= 0.05.
true3d <- data.frame(dist_class=c(rep("None", 5), rep("Short (<=20kb)", 5), rep("Long (>20kb)", 5)), bin_quant=rep(1:5, 3), set_size=c(nrow(filter(plot3d, sizediffs<=13.5&dist_diff<10000)), nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff<10000)), nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff<10000)), nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff<10000)), nrow(filter(plot3d, sizediffs>109.5&dist_diff<10000)), nrow(filter(plot3d, sizediffs<=13.5&dist_diff>=10000&dist_diff<=20000)), nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff>=10000&dist_diff<=20000)), nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff>=10000&dist_diff<=20000)), nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff>=10000&dist_diff<=20000)), nrow(filter(plot3d, sizediffs>109.5&dist_diff>=10000&dist_diff<=20000)), nrow(filter(plot3d, sizediffs<=13.5&dist_diff>20000)), nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff>20000)), nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff>20000)), nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff>20000)), nrow(filter(plot3d, sizediffs>109.5&dist_diff>20000))), num_sig=c(nrow(filter(plot3d, sizediffs<=13.5&dist_diff<10000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff<10000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff<10000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff<10000&FDR<=0.05)), nrow(filter(plot3d, sizediffs>109.5&dist_diff<10000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=13.5&dist_diff>=10000&dist_diff<=20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff>=10000&dist_diff<=20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff>=10000&dist_diff<=20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff>=10000&dist_diff<=20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs>109.5&dist_diff>=10000&dist_diff<=20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=13.5&dist_diff>20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=23.33&sizediffs>13.5&dist_diff>20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=42.18750&sizediffs>23.33&dist_diff>20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs<=109.5&sizediffs>42.18750&dist_diff>20000&FDR<=0.05)), nrow(filter(plot3d, sizediffs>109.5&dist_diff>20000&FDR<=0.05))))
true3d$prop <- true3d$num_sig/true3d$set_size #Get the proportion of these different sets that are significant (at or below 5% FDR)

#I will then take this data frame and export it to plot.ly's online interface, in order to make a 3D plot that will allow me to visualize FDR significance inflation in any of these sets (and subsequently filter them out). The 3D plot looks like this:
htmltools::includeHTML("output/QQQC.html")

#FIGS9B
#Reviewer 3 also suggested/requested turning this into a 2D scatterplot of dist diff vs. bin diff, with a color scale to indicate proportion significant:
levels(true3d$dist_class)

[1] "Long (>20kb)"   "None"           "Short (<=20kb)"

true3d$dist_class <- factor(true3d$dist_class, levels(true3d$dist_class)[c(2, 3, 1)]) #Reorder factor levels!
levels(true3d$dist_class)

[1] "None"           "Short (<=20kb)" "Long (>20kb)"

ggplot(data=true3d) + geom_point(aes(x=bin_quant, y=dist_class, size=set_size, color=prop)) + xlab("Increasing Bin Size Differences") + ylab("Mate Pair Distance Differences") + scale_color_gradient(low="blue", high="red", name="Proportion Significant") + scale_size(name="Set Size")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#Based on the 3D QQ quality control plot, the hit classes with larger distance-between-mates differences should be filtered out due to inflation of # hits significant @ 5% FDR. This gets rid of ~55k hits, or about a seventh of them. Essentially I am removing any hits that showed a difference in distance of greater than 20kb when lifting Over acros the species, to eliminate technical genomic differences that may drive signal in the species term:
highclass <- which(plot3d$dist_diff>20000)
data.filtered <- data.4[-highclass,] #Just removed high class since it has the starkest effect on proportion of hits significant @ 5% FDR

#Check on which of these hits that we have filtered out were discovered in one or both species; response to Reviewer 2 questions:
table(data.4$disc_species[highclass]) #~1500 H-specific, ~4000 C-specific, ~50k found in both--not really "due to liftOver"


    B     C     H 
49488  4101  1547

Now I have obliterated any remaining concerns about the QQ plot inflation, and filtered out another class of Hi-C significant hits where species differences may have been driven by issues with liftOver between genomes.

Volcano Plot Asymmetry

Now, I go to the asymmetry issue seen in the volcano plot on the linear modeling outputs. I first utilize the set of data conditioned upon discovery in 4 individuals, and break this information down on a chromosome-by-chromosome basis to see if there are particular chromosomes driving the asymmetry. Upon observation of chromosome-specific effects I quantify the extent of asymmetry on individual chromosomes using a null expectation of a binomial distribution with 50/50 probability of betas being positive or negative. I then repeat these analyses and visualize them on the final set of filtered data from above.

#Here I once again show the volcano plot for 4 individuals. Observe the stark asymmetry with a pile-up of hits on the left side as compared to the right.
volcplot.4 <- data.frame(pval=-log10(data.4$sp_pval), beta=data.4$sp_beta, species=data.4$disc_species, chr=data.4$Hchr)
ggplot(data=volcplot.4, aes(x=beta, y=pval)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Contact Frequency Differences | 4 Individuals") #No colors

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

volcplot.4$specorder <- ifelse(volcplot.4$species=="B", 3, ifelse(volcplot.4$species=="C", 2, 1))
volcplot.4 <- volcplot.4[order(volcplot.4$specorder),]

ggplot(data=volcplot.4, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Contact Frequency Differences Colored by Species of Discovery | 4") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) #Modify to add clearer labels

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#FIGS9A
ggplot(data=volcplot.4, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Contact Frequency Differences") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) + theme(plot.title=element_text(hjust=0.35)) #Modify to add clearer labels, hjust to adjust title location.

Version	Author	Date
b69c73d	Ittai Eres	2019-05-05
972c85a	Ittai Eres	2019-03-14

#As expected, we see that hits which produce a strong negative beta in the linear model (suggesting a marked decrease in contact frequency in humans as compared to chimps) are primarily discovered as significant by homer in chimpanzees. The inverse also holds true for human discoveries. This is reassuring, but still, why the asymmetry? Here I break this plot down on a chromosome-by-chromosome basis to see if this is being driven by individual chromosomes ore is an overall issue with the technique or its processing:
#First rearrange chrs to make a prettier plot that has chrs sequential:
levels(volcplot.4$chr)

 [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
 [9] "chr17" "chr18" "chr19" "chr2"  "chr20" "chr21" "chr22" "chr3" 
[17] "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chrX"  "chrY"

volcplot.4$chr <- factor(volcplot.4$chr, levels(volcplot.4$chr)[c(1, 12, 16:22, 2:11, 13:15, 23:24)]) #Reorder factor levels!
levels(volcplot.4$chr)

 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8" 
 [9] "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
[17] "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"

ggplot(volcplot.4, aes(x=beta, y=pval, color=species)) + geom_point(size=0.001) + ggtitle("Volcano Plots by Chr") + facet_wrap(~as.factor(chr), nrow=5, ncol=5) + guides(color=FALSE) + ylab("-log10 p-values") + xlab("Log2 Fold Change in Contact Frequency")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#This is extremely interesting. It appears that the asymmetry seen is being driven primarily by only some of the chromosomes, and particularly ones where large-scale rearrangements have transpired between humans and chimps (e.g. chrs 2, 16, 17). This will warrant further investigation in another element of the analysis; for now, I am satisfied that the asymmetry is not an issue with the entire dataset but is confined to individual chromosomes.

#Now, check this again, but on the data I have filtered accounting for potential genomic differences introduced by liftOver. From here on out all analyses will be run on this filtered set of data:
volcplot.filt <- data.frame(pval=-log10(data.filtered$sp_BH_pval), beta=data.filtered$sp_beta, species=data.filtered$disc_species, chr=data.filtered$Hchr)
#volcplot.filt$alpha <- ifelse(volcplot.filt$species=="B", 1, 0.3) #Not in love with this solution
volcplot.filt$specorder <- ifelse(volcplot.filt$species=="B", 3, ifelse(volcplot.filt$species=="C", 2, 1))
volcplot.filt <- volcplot.filt[order(volcplot.filt$specorder),]

#FIGS9C (same as FIG2A)
ggplot(data=volcplot.filt, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Filtered Contact Frequency Differences") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) #+ geom_hline(yintercept=-log10(0.05), color="red") #Modify to add clearer labels, draw points in different order, and add red line at p=0.05 (actually, removed this, not sure it really matters or helps to show that), perfect for presentations.

Version	Author	Date
b69c73d	Ittai Eres	2019-05-05
972c85a	Ittai Eres	2019-03-14

#FIG2A (same as FIGS9C)
FIG2A <- ggplot(data=volcplot.filt, aes(x=beta, y=pval, color=species)) + geom_point() + xlab("Log2 Fold Change in Contact Frequency") + ylab("-log10 BH-corrected p-values") + ggtitle("Contact Frequency Differences") + scale_color_manual(name="Species", labels=c("Both", "Chimp", "Human"), values=c("#F8766D", "#00BA38", "#619CFF")) + theme(plot.title=element_text(hjust=0.5, size=12), axis.text=element_text(size=10))
#This volcano plot makes it look as though the asymmetry goes away entirely, with both sides of the volcano showing a visually similar number of hits (no longer AS stark of asymmetry).

#Get 2A on graphics device; export as PDF and move to photoshop for paper.
FIG2A

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#If we break this down on a chromosome-by-chromosome basis:
levels(volcplot.filt$chr)

 [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
 [9] "chr17" "chr18" "chr19" "chr2"  "chr20" "chr21" "chr22" "chr3" 
[17] "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chrX"  "chrY"

levels(volcplot.filt$chr) <- gsub("chr", "Chr. ", levels(volcplot.filt$chr)) #Change names of factors and in actual data to make plotting easier
data.filtered$Hchr <- gsub("chr", "Chr. ", data.filtered$Hchr)
volcplot.filt$chr <- factor(volcplot.filt$chr, levels(volcplot.filt$chr)[c(1, 12, 16:22, 2:11, 13:15, 23:24)]) #Reorder factor levels!
levels(volcplot.filt$chr)

 [1] "Chr. 1"  "Chr. 2"  "Chr. 3"  "Chr. 4"  "Chr. 5"  "Chr. 6"  "Chr. 7" 
 [8] "Chr. 8"  "Chr. 9"  "Chr. 10" "Chr. 11" "Chr. 12" "Chr. 13" "Chr. 14"
[15] "Chr. 15" "Chr. 16" "Chr. 17" "Chr. 18" "Chr. 19" "Chr. 20" "Chr. 21"
[22] "Chr. 22" "Chr. X"  "Chr. Y"

ggplot(volcplot.filt, aes(x=beta, y=pval, color=species)) + geom_point(size=0.001) + ggtitle("Volcano Plots by Chr") + facet_wrap(~as.factor(chr), nrow=5, ncol=5) + guides(color=FALSE) + ylab("-log10 BH-adjusted p-values") + xlab("Log2 Fold Change in Contact Frequency")

Version	Author	Date
972c85a	Ittai Eres	2019-03-14

#We can still see that there are certain chromosomes with a lot of asymmetry to them in terms of distribution of significant betas for the linear modeling--chimps still seem to dominate chromosomes 2, 15-17, and some others.

options(scipen=0)
#Now try to quantify the extent of asymmetry in the chromosomes, with a particular focus on hits that are statistically significant.
asym.stats <- data.frame(chr=unique(data.filtered$Hchr), binom.p=rep(NA, 24), prop.h=rep(NA, 24), prop.c=rep(NA, 24))
rownames(asym.stats) <- asym.stats$chr
for(chromo in unique(data.filtered$Hchr)){
  mydat <- filter(data.filtered, Hchr==chromo, sp_BH_pval<=0.05) #Iterate through chromosomes
  side <- ifelse(mydat$sp_beta<0, 0, 1) #Assign sides of the beta dist'n to the betas
  asym.obs <- min(sum(side==1), sum(side==0)) #Find which side of the beta dist'n has less points, so I can use pbinom on default w/ lower.tail to find probability of observing a result as or MORE asymmetric than this one.
  asym.stats[chromo, 2] <- pbinom(asym.obs, length(side), 0.5) #Find that probability with the assumption of 50/50 chance of landing on either side.
  asym.stats[chromo, 3] <- sum(side==1)/length(side) #Find the proportion of the hits that fall on the human side of the distribution (positive, indicating increased contact frequency here in humans as compared to chimps)
  asym.stats[chromo, 4] <- sum(side==0)/length(side) #Same thing as above, but for chimps this time.
}
#Now we can look at the actual results to quantify how asymmetric the significant hits are from each chromosome:
asym.stats

            chr      binom.p    prop.h     prop.c
Chr. 10 Chr. 10 5.385967e-07 0.3284314 0.67156863
Chr. 12 Chr. 12 4.198505e-08 0.6445087 0.35549133
Chr. 13 Chr. 13 4.842465e-11 0.2775120 0.72248804
Chr. 14 Chr. 14 3.425391e-01 0.5115681 0.48843188
Chr. 15 Chr. 15 1.145818e-29 0.2111111 0.78888889
Chr. 16 Chr. 16 1.778800e-18 0.2971175 0.70288248
Chr. 17 Chr. 17 7.770944e-63 0.1908957 0.80910426
Chr. 18 Chr. 18 8.547464e-04 0.6428571 0.35714286
Chr. 19 Chr. 19 6.708114e-47 0.8058252 0.19417476
Chr. 1   Chr. 1 2.966482e-06 0.4072848 0.59271523
Chr. 20 Chr. 20 2.084187e-06 0.6602871 0.33971292
Chr. 21 Chr. 21 1.584002e-02 0.5676692 0.43233083
Chr. 22 Chr. 22 1.576454e-01 0.5279330 0.47206704
Chr. 11 Chr. 11 2.604538e-01 0.4829268 0.51707317
Chr. 2   Chr. 2 3.902883e-37 0.1737892 0.82621083
Chr. 3   Chr. 3 3.398324e-01 0.5165877 0.48341232
Chr. 4   Chr. 4 1.751408e-02 0.4180791 0.58192090
Chr. 5   Chr. 5 4.525288e-01 0.4946619 0.50533808
Chr. 6   Chr. 6 3.627575e-36 0.2710927 0.72890733
Chr. 7   Chr. 7 0.000000e+00 0.9532921 0.04670793
Chr. 8   Chr. 8 7.356143e-02 0.4476190 0.55238095
Chr. 9   Chr. 9 1.996949e-07 0.6168421 0.38315789
Chr. X   Chr. X 1.139666e-72 0.9512195 0.04878049
Chr. Y   Chr. Y 1.000000e+00       NaN        NaN

#ChrY did not have any significant hits, so get a p-value of 1. Chr7 is so asymmetric it is impossible to calculate, while chromosomes 1, 2, 6, 9, 11, 15, 16, 17, and X all showed asymmetry in direction of effects with extreme statistical significance.

#Now, make volcano plots by chromosome again, this time labeling each chromosome with its binomial p-value and the percentage of significant hits showing stronger contact frequencies in each species on either side of the distribution.
volcplot.noY.filt <- filter(volcplot.filt, chr!="Chr. Y") #Remove Chr Y as requested by a reviewer (makes sense, no real data there for this)
volcplot.noY.filt$chr <- factor(volcplot.noY.filt$chr)
levels(volcplot.noY.filt$chr)

 [1] "Chr. 1"  "Chr. 2"  "Chr. 3"  "Chr. 4"  "Chr. 5"  "Chr. 6"  "Chr. 7" 
 [8] "Chr. 8"  "Chr. 9"  "Chr. 10" "Chr. 11" "Chr. 12" "Chr. 13" "Chr. 14"
[15] "Chr. 15" "Chr. 16" "Chr. 17" "Chr. 18" "Chr. 19" "Chr. 20" "Chr. 21"
[22] "Chr. 22" "Chr. X"

#volcplot.noY.filt$chr <- factor(volcplot.noY.filt$chr, levels(volcplot.noY.filt$chr)[c(10, 15:22, 1, 14, 2:9, 11:13, 23)])
asym.stats <- asym.stats[-24,] #Get rid of the Chr. Y row from the asym stats table.
ggplot(data=volcplot.noY.filt, aes(x=beta, y=pval, color=species)) + geom_point(size=0.001) + ggtitle("Volcano Plots by Chromosome") + facet_wrap(~chr, nrow=5, ncol=5) + guides(color=FALSE) + ylab("-log10 BH-corrected p-values") + xlab("Log2 Fold Change in Contact Frequency") + geom_text(data=asym.stats, aes(x=-4.5, y=0.75, label=paste(round(prop.c*100, digits=1), "%", sep=""), color=NULL), show.legend=FALSE, size=2) + geom_text(data=asym.stats, aes(x=4, y=0.75, label=paste(round(prop.h*100, digits=1), "%", sep=""), color=NULL), show.legend=FALSE, size=2) + geom_text(data=asym.stats, aes(x=0, y=3.75, label=signif(binom.p, digits=3), color=NULL), show.legend=FALSE, size=2) + theme(strip.text=element_text(size=10, lineheight=0.5), axis.text.y=element_text(size=8), axis.text.x=element_text(size=8))

#FIG2B
FIG2B <- ggplot(data=volcplot.noY.filt, aes(x=beta, y=pval, color=species)) + geom_point(size=0.001) + ggtitle("Volcano Plots by Chromosome") + facet_wrap(~chr, nrow=5, ncol=5) + guides(color=FALSE) + ylab("-log10 BH-corrected p-values") + xlab("Log2 Fold Change in Contact Frequency") + geom_text(data=asym.stats, aes(x=-4.25, y=0.75, label=paste(round(prop.c*100, digits=1), "%", sep=""), color=NULL), show.legend=FALSE, size=3) + geom_text(data=asym.stats, aes(x=4.25, y=0.75, label=paste(round(prop.h*100, digits=1), "%", sep=""), color=NULL), show.legend=FALSE, size=3) + geom_text(data=asym.stats, aes(x=0, y=4.5, label=signif(binom.p, digits=3), color=NULL), show.legend=FALSE, size=3) + theme(strip.text=element_text(size=10, lineheight=0.5), axis.text.y=element_text(size=10), axis.text.x=element_text(size=10)) + coord_cartesian(xlim=c(-6, 6), ylim=c(0, 5)) + scale_y_continuous(breaks=c(0, 2, 4))

#Get 2B on graphics device to export as PDF and bring into photoshop for paper publication.
FIG2B

#FIG2
FIG2 <- plot_grid(FIG2A, FIG2B, labels=c("A", "B"))
save_plot("~/Desktop/Fig2.tiff", FIG2, ncol=2, nrow=1, base_aspect_ratio=1.6) #Will need to drag into Adobe and edit afterwards as well, to reduce size. Used this and just took cropped copies of each panel to fit into the psd file for PLOS.
#save_plot("~/Desktop/Fig2.final.tiff", FIG2, ncol=1, nrow=2, base_height=3, base_width=5)
######
#ggsave('~/Desktop/volcchr.jpg', device="jpeg", antialias="none") This was an earlier attempt at clearing up some image resolution blurriness with the text.
#ggsave('~/Desktop/volcchr2.jpg', device="jpeg", dpi=5000) #Works well!
#ggsave("~/Desktop/volcchr2.jpg", device="jpg", dpi=200)
#Write out the data with the new columns added on!
fwrite(full.data, "output/full.data.10.lm.QC", quote = TRUE, sep = "\t", row.names = FALSE, col.names = TRUE, na="NA", showProgress = FALSE)
fwrite(data.4, "output/data.4.lm.QC", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress = FALSE)
fwrite(data.filtered, "output/data.4.filtered.lm.QC", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress = FALSE)

#Extract DC regions for supplementary table 10 (Table S10) for paper:
DC.table <- filter(data.filtered, sp_BH_pval<=0.05)
DC.table <- select(DC.table, Hchr, H1, H2, Cchr, C1, C2, sp_BH_pval)
DC.table$Hchr <- gsub("Chr. ", "chr", DC.table$Hchr)
DC.table$H1 <- as.numeric(gsub(".*-", "", DC.table$H1))
DC.table$H2 <- as.numeric(gsub(".*-", "", DC.table$H2))
DC.table$C1 <- as.numeric(gsub(".*-", "", DC.table$C1))
DC.table$C2 <- as.numeric(gsub(".*-", "", DC.table$C2))
DC.table$H1.end <- DC.table$H1+10000
DC.table$H2.end <- DC.table$H2+10000
DC.table$C1.end <- DC.table$C1+10000
DC.table$C2.end <- DC.table$C2+10000
DC.table <- DC.table[,c(1, 2, 8, 3, 9, 4, 5, 10, 6, 11, 7)]
colnames(DC.table) <- c("Human.chromosome", "Human.1.start", "Human.1.end", "Human.2.start", "Human.2.end", "Chimp.chromosome", "Chimp.1.start", "Chimp.1.end", "Chimp.2.start", "Chimp.2.end", "DC FDR")
#fwrite(DC.table, "~/Desktop/Paper Drafts/PLOS/Revision/Revision_2/Final?/FINAL/supptables/S10 Table.txt", quote = FALSE, sep="\t")

From this analysis we have dealt with some quality control issues, and filtered down the data to a final set of biologically significant Hi-C interaction frequencies, many of which appear species-specific. There are clearly strong differences between the species that make their 3D regulatory landscapes divergent. Now, I move to orthogonal gene expression analyses.

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] forcats_0.4.0      purrr_0.3.2        readr_1.3.1       
 [4] tibble_2.1.3       tidyverse_1.2.1    edgeR_3.20.9      
 [7] RColorBrewer_1.1-2 heatmaply_0.16.0   viridis_0.5.1     
[10] viridisLite_0.3.0  stringr_1.4.0      gplots_3.0.1.1    
[13] Hmisc_4.2-0        Formula_1.2-3      survival_2.44-1.1 
[16] lattice_0.20-38    dplyr_0.8.3        plotly_4.9.0      
[19] cowplot_0.9.4      ggplot2_3.2.1      reshape2_1.4.3    
[22] data.table_1.12.0  tidyr_1.0.0        plyr_1.8.4        
[25] limma_3.34.9      

loaded via a namespace (and not attached):
 [1] nlme_3.1-137        bitops_1.0-6        fs_1.3.1           
 [4] lubridate_1.7.4     webshot_0.5.1       httr_1.4.1         
 [7] rprojroot_1.3-2     tools_3.4.0         backports_1.1.4    
[10] R6_2.4.0            rpart_4.1-15        KernSmooth_2.23-15 
[13] lazyeval_0.2.2      colorspace_1.4-1    nnet_7.3-12        
[16] withr_2.1.2         tidyselect_0.2.5    gridExtra_2.3      
[19] git2r_0.26.1        cli_1.1.0           rvest_0.3.4        
[22] htmlTable_1.13.2    TSP_1.1-7           xml2_1.2.2         
[25] labeling_0.3        caTools_1.17.1.2    scales_1.0.0       
[28] checkmate_1.9.4     digest_0.6.18       foreign_0.8-72     
[31] rmarkdown_1.12      base64enc_0.1-3     pkgconfig_2.0.3    
[34] htmltools_0.3.6     highr_0.8           readxl_1.3.1       
[37] htmlwidgets_1.3     rlang_0.4.0         rstudioapi_0.10    
[40] generics_0.0.2      jsonlite_1.6        gtools_3.8.1       
[43] acepack_1.4.1       dendextend_1.12.0   magrittr_1.5       
[46] Matrix_1.2-17       Rcpp_1.0.1          munsell_0.5.0      
[49] lifecycle_0.1.0     stringi_1.4.3       whisker_0.4        
[52] yaml_2.2.0          MASS_7.3-51.4       grid_3.4.0         
[55] gdata_2.18.0        crayon_1.3.4        haven_2.1.1        
[58] splines_3.4.0       hms_0.5.1           locfit_1.5-9.1     
[61] zeallot_0.1.0       knitr_1.22          pillar_1.4.2       
[64] codetools_0.2-16    glue_1.3.1          gclus_1.3.2        
[67] evaluate_0.13       latticeExtra_0.6-28 modelr_0.1.5       
[70] vctrs_0.2.0         foreach_1.4.7       cellranger_1.1.0   
[73] gtable_0.3.0        assertthat_0.2.1    xfun_0.5           
[76] broom_0.5.2         seriation_1.2-3     iterators_1.0.12   
[79] registry_0.5-1      workflowr_1.4.0     cluster_2.0.7-1

Quality Control on Linear Modeling

Ittai Eres

2019-03-13

QC on QQ Plots

Filtering on Genomic Differences in Bin Sizes and Pair Distances

Volcano Plot Asymmetry