Hi-C Data Normalization and Initial Quality Control, Juicer

Last updated: 2020-08-05

Checks: 7 passed, 0 failed

Knit directory: HiCiPSC/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20190311)

The command set.seed(20190311) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: eb19ae1

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/TADs/.DS_Store
    Ignored:    data/TADs/Human_inter_30_KR_contact_domains/.DS_Store
    Ignored:    data/TADs/scripts/.DS_Store
    Ignored:    docs/.DS_Store
    Ignored:    docs/figure/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  Rplot.jpeg
    Untracked:  Rplot001.jpeg
    Untracked:  Rplot002.jpeg
    Untracked:  Rplot003.jpeg
    Untracked:  Rplot004.jpeg
    Untracked:  Rplot005.jpeg
    Untracked:  Rplot006.jpeg
    Untracked:  Rplot007.jpeg
    Untracked:  Rplot008.jpeg
    Untracked:  Rplot009.jpeg
    Untracked:  Rplot010.jpeg
    Untracked:  Rplot011.jpeg
    Untracked:  Rplot012.jpeg
    Untracked:  Rplot013.jpeg
    Untracked:  Rplot014.jpeg
    Untracked:  Rplot015.jpeg
    Untracked:  Rplot016.jpeg
    Untracked:  Rplot017.jpeg
    Untracked:  Rplot018.jpeg
    Untracked:  Rplot019.jpeg
    Untracked:  Rplot020.jpeg
    Untracked:  Rplot021.jpeg
    Untracked:  Rplot022.jpeg
    Untracked:  Rplot023.jpeg
    Untracked:  Rplot024.jpeg
    Untracked:  Rplot025.jpeg
    Untracked:  Rplot026.jpeg
    Untracked:  Rplot027.jpeg
    Untracked:  Rplot028.jpeg
    Untracked:  Rplot029.jpeg
    Untracked:  Rplot030.jpeg
    Untracked:  Rplot031.jpeg
    Untracked:  Rplot032.jpeg
    Untracked:  Rplot033.jpeg
    Untracked:  Rplot034.jpeg
    Untracked:  Rplot035.jpeg
    Untracked:  Rplot036.jpeg
    Untracked:  Rplot037.jpeg
    Untracked:  Rplot038.jpeg
    Untracked:  Rplot039.jpeg
    Untracked:  Rplot040.jpeg
    Untracked:  Rplot041.jpeg
    Untracked:  Rplot042.jpeg
    Untracked:  Rplot043.jpeg
    Untracked:  Rplot044.jpeg
    Untracked:  Rplot045.jpeg
    Untracked:  Rplot046.jpeg
    Untracked:  Rplot047.jpeg
    Untracked:  Rplot048.jpeg
    Untracked:  Rplot049.jpeg
    Untracked:  Rplot050.jpeg
    Untracked:  Rplot051.jpeg
    Untracked:  Rplot052.jpeg
    Untracked:  Rplot053.jpeg
    Untracked:  Rplot054.jpeg
    Untracked:  Rplot055.jpeg
    Untracked:  Rplot056.jpeg
    Untracked:  Rplot057.jpeg
    Untracked:  Rplot058.jpeg
    Untracked:  Rplot059.jpeg
    Untracked:  Rplot060.jpeg
    Untracked:  Rplot061.jpeg
    Untracked:  Rplot062.jpeg
    Untracked:  Rplot063.jpeg
    Untracked:  Rplot064.jpeg
    Untracked:  Rplot065.jpeg
    Untracked:  Rplot066.jpeg
    Untracked:  Rplot067.jpeg
    Untracked:  Rplot068.jpeg
    Untracked:  Rplot069.jpeg
    Untracked:  Rplot070.jpeg
    Untracked:  Rplot071.jpeg
    Untracked:  Rplot072.jpeg
    Untracked:  Rplot073.jpeg
    Untracked:  Rplot074.jpeg
    Untracked:  Rplot075.jpeg
    Untracked:  Rplot076.jpeg
    Untracked:  Rplot077.jpeg
    Untracked:  Rplot078.jpeg
    Untracked:  Rplot079.jpeg
    Untracked:  Rplot080.jpeg
    Untracked:  Rplot081.jpeg
    Untracked:  Rplot082.jpeg
    Untracked:  Rplot083.jpeg
    Untracked:  Rplot084.jpeg
    Untracked:  Rplot085.jpeg
    Untracked:  Rplot086.jpeg
    Untracked:  Rplot087.jpeg
    Untracked:  Rplot088.jpeg
    Untracked:  Rplot089.jpeg
    Untracked:  Rplot090.jpeg
    Untracked:  Rplot091.jpeg
    Untracked:  Rplot092.jpeg
    Untracked:  Rplot093.jpeg
    Untracked:  Rplot094.jpeg
    Untracked:  Rplot095.jpeg
    Untracked:  Rplot096.jpeg
    Untracked:  Rplot097.jpeg
    Untracked:  Rplot098.jpeg
    Untracked:  Rplot099.jpeg
    Untracked:  Rplot100.jpeg
    Untracked:  Rplot101.jpeg
    Untracked:  Rplot102.jpeg
    Untracked:  Rplot103.jpeg
    Untracked:  Rplot104.jpeg
    Untracked:  Rplot105.jpeg
    Untracked:  Rplot106.jpeg
    Untracked:  Rplot107.jpeg
    Untracked:  Rplot108.jpeg
    Untracked:  Rplot109.jpeg
    Untracked:  Rplot110.jpeg
    Untracked:  Rplot111.jpeg
    Untracked:  Rplot112.jpeg
    Untracked:  Rplot113.jpeg
    Untracked:  Rplot114.jpeg
    Untracked:  Rplot115.jpeg
    Untracked:  Rplot116.jpeg
    Untracked:  Rplot117.jpeg
    Untracked:  Rplot118.jpeg
    Untracked:  Rplot119.jpeg
    Untracked:  Rplot120.jpeg
    Untracked:  Rplot121.jpeg
    Untracked:  Rplot122.jpeg
    Untracked:  Rplot123.jpeg
    Untracked:  Rplot124.jpeg
    Untracked:  Rplot125.jpeg
    Untracked:  Rplot126.jpeg
    Untracked:  Rplot127.jpeg
    Untracked:  Rplot128.jpeg
    Untracked:  Rplot129.jpeg
    Untracked:  Rplot130.jpeg
    Untracked:  Rplot131.jpeg
    Untracked:  Rplot132.jpeg
    Untracked:  Rplot133.jpeg
    Untracked:  Rplot134.jpeg
    Untracked:  Rplot135.jpeg
    Untracked:  Rplot136.jpeg
    Untracked:  Rplot137.jpeg
    Untracked:  Rplot138.jpeg
    Untracked:  Rplot139.jpeg
    Untracked:  Rplot140.jpeg
    Untracked:  Rplot141.jpeg
    Untracked:  Rplot142.jpeg
    Untracked:  Rplot143.jpeg
    Untracked:  Rplot144.jpeg
    Untracked:  Rplot145.jpeg
    Untracked:  Rplot146.jpeg
    Untracked:  Rplot147.jpeg
    Untracked:  Rplot148.jpeg
    Untracked:  Rplot149.jpeg
    Untracked:  Rplot150.jpeg
    Untracked:  Rplot151.jpeg
    Untracked:  Rplot152.jpeg
    Untracked:  Rplot153.jpeg
    Untracked:  Rplot154.jpeg
    Untracked:  Rplot155.jpeg
    Untracked:  Rplot156.jpeg
    Untracked:  Rplot157.jpeg
    Untracked:  Rplot158.jpeg
    Untracked:  Rplot159.jpeg
    Untracked:  Rplot160.jpeg
    Untracked:  Rplot161.jpeg
    Untracked:  Rplot162.jpeg
    Untracked:  Rplot163.jpeg
    Untracked:  Rplot164.jpeg
    Untracked:  Rplot165.jpeg
    Untracked:  Rplot166.jpeg
    Untracked:  Rplot167.jpeg
    Untracked:  Rplot168.jpeg
    Untracked:  Rplot169.jpeg
    Untracked:  Rplot170.jpeg
    Untracked:  Rplot171.jpeg
    Untracked:  Rplot172.jpeg
    Untracked:  Rplot173.jpeg
    Untracked:  Rplot174.jpeg
    Untracked:  Rplot175.jpeg
    Untracked:  Rplot176.jpeg
    Untracked:  Rplot177.jpeg
    Untracked:  Rplot178.jpeg
    Untracked:  Rplot179.jpeg
    Untracked:  Rplot180.jpeg
    Untracked:  Rplot181.jpeg
    Untracked:  Rplot182.jpeg
    Untracked:  Rplot183.jpeg
    Untracked:  Rplot184.jpeg
    Untracked:  Rplot185.jpeg
    Untracked:  Rplot186.jpeg
    Untracked:  Rplot187.jpeg
    Untracked:  Rplot188.jpeg
    Untracked:  Rplot189.jpeg
    Untracked:  Rplot190.jpeg
    Untracked:  Rplot191.jpeg
    Untracked:  Rplot192.jpeg
    Untracked:  Rplot193.jpeg
    Untracked:  Rplot194.jpeg
    Untracked:  Rplot195.jpeg
    Untracked:  Rplot196.jpeg
    Untracked:  Rplot197.jpeg
    Untracked:  Rplot198.jpeg
    Untracked:  Rplot199.jpeg
    Untracked:  Rplot200.jpeg
    Untracked:  Rplot201.jpeg
    Untracked:  Rplot202.jpeg
    Untracked:  Rplot203.jpeg
    Untracked:  Rplot204.jpeg
    Untracked:  Rplot205.jpeg
    Untracked:  Rplot206.jpeg
    Untracked:  Rplot207.jpeg
    Untracked:  Rplot208.jpeg
    Untracked:  Rplot209.jpeg
    Untracked:  Rplot210.jpeg
    Untracked:  Rplot211.jpeg
    Untracked:  Rplot212.jpeg
    Untracked:  Rplot213.jpeg
    Untracked:  Rplot214.jpeg
    Untracked:  Rplot215.jpeg
    Untracked:  Rplot216.jpeg
    Untracked:  Rplot217.jpeg
    Untracked:  Rplot218.jpeg
    Untracked:  Rplot219.jpeg
    Untracked:  Rplot220.jpeg
    Untracked:  Rplot221.jpeg
    Untracked:  Rplot222.jpeg
    Untracked:  Rplot223.jpeg
    Untracked:  Rplot224.jpeg
    Untracked:  Rplot225.jpeg
    Untracked:  Rplot226.jpeg
    Untracked:  Rplot227.jpeg
    Untracked:  Rplot228.jpeg
    Untracked:  Rplot229.jpeg
    Untracked:  Rplot230.jpeg
    Untracked:  Rplot231.jpeg
    Untracked:  Rplot232.jpeg
    Untracked:  Rplot233.jpeg
    Untracked:  Rplot234.jpeg
    Untracked:  Rplot235.jpeg
    Untracked:  Rplot236.jpeg
    Untracked:  Rplot237.jpeg
    Untracked:  Rplot238.jpeg
    Untracked:  Rplot239.jpeg
    Untracked:  Rplot240.jpeg
    Untracked:  Rplot241.jpeg
    Untracked:  Rplot242.jpeg
    Untracked:  Rplot243.jpeg
    Untracked:  Rplot244.jpeg
    Untracked:  Rplot245.jpeg
    Untracked:  Rplot246.jpeg
    Untracked:  Rplot247.jpeg
    Untracked:  Rplot248.jpeg
    Untracked:  Rplot249.jpeg
    Untracked:  Rplot250.jpeg
    Untracked:  Rplot251.jpeg
    Untracked:  Rplot252.jpeg
    Untracked:  Rplot253.jpeg
    Untracked:  Rplot254.jpeg
    Untracked:  Rplot255.jpeg
    Untracked:  Rplot256.jpeg
    Untracked:  Rplot257.jpeg
    Untracked:  Rplot258.jpeg
    Untracked:  Rplot259.jpeg
    Untracked:  Rplot260.jpeg
    Untracked:  Rplot261.jpeg
    Untracked:  Rplot262.jpeg
    Untracked:  Rplot263.jpeg
    Untracked:  Rplot264.jpeg
    Untracked:  Rplot265.jpeg
    Untracked:  Rplot266.jpeg
    Untracked:  Rplot267.jpeg
    Untracked:  Rplot268.jpeg
    Untracked:  Rplot269.jpeg
    Untracked:  Rplot270.jpeg
    Untracked:  Rplot271.jpeg
    Untracked:  Rplot272.jpeg
    Untracked:  Rplot273.jpeg
    Untracked:  Rplot274.jpeg
    Untracked:  Rplot275.jpeg
    Untracked:  Rplot276.jpeg
    Untracked:  Rplot277.jpeg
    Untracked:  Rplot278.jpeg
    Untracked:  Rplot279.jpeg
    Untracked:  Rplot280.jpeg
    Untracked:  Rplot281.jpeg
    Untracked:  Rplot282.jpeg
    Untracked:  Rplot283.jpeg
    Untracked:  Rplot284.jpeg
    Untracked:  Rplot285.jpeg
    Untracked:  Rplot286.jpeg
    Untracked:  Rplot287.jpeg
    Untracked:  Rplot288.jpeg
    Untracked:  Rplot289.jpeg
    Untracked:  Rplot290.jpeg
    Untracked:  Rplot291.jpeg
    Untracked:  Rplot292.jpeg
    Untracked:  Rplot293.jpeg
    Untracked:  Rplot294.jpeg
    Untracked:  Rplot295.jpeg
    Untracked:  Rplot296.jpeg
    Untracked:  Rplot297.jpeg
    Untracked:  Rplot298.jpeg
    Untracked:  Rplot299.jpeg
    Untracked:  Rplot300.jpeg
    Untracked:  Rplot301.jpeg
    Untracked:  Rplot302.jpeg
    Untracked:  Rplot303.jpeg
    Untracked:  Rplot304.jpeg
    Untracked:  S2A.jpeg
    Untracked:  S2B.jpeg
    Untracked:  code/mediation_test.R
    Untracked:  data/Chimp_orthoexon_extended_info.txt
    Untracked:  data/Human_orthoexon_extended_info.txt
    Untracked:  data/Meta_data.txt
    Untracked:  data/TADs/Arrowhead_individuals/
    Untracked:  data/TADs/CH.10kb.closest.panTro5
    Untracked:  data/TADs/CTCF.overlap.computer.sh
    Untracked:  data/TADs/CTCF/
    Untracked:  data/TADs/HC.10kb.closest.hg38
    Untracked:  data/TADs/Human_inter_30_KR_contact_domains_PT6/
    Untracked:  data/TADs/Rao/
    Untracked:  data/TADs/TopDom/
    Untracked:  data/TADs/deprecated/
    Untracked:  data/TADs/mega.bounds.intersect.c.sh
    Untracked:  data/TADs/mega.bounds.intersect.merge.sh
    Untracked:  data/TADs/mega.bounds.rao.sh
    Untracked:  data/TADs/mega.domains.bedtoolsc.sh
    Untracked:  data/TADs/mega.domains.rao.sh
    Untracked:  data/TADs/overlaps/
    Untracked:  data/TADs/overlaps_rao_style/
    Untracked:  data/TADs/rao.chimp.subsample.tester.sh
    Untracked:  data/chimp_lengths.txt
    Untracked:  data/counts_iPSC.txt
    Untracked:  data/epigenetic_enrichments/
    Untracked:  data/final.10kb.homer.df
    Untracked:  data/final.juicer.10kb.KR
    Untracked:  data/final.juicer.10kb.VC
    Untracked:  data/hic_gene_overlap/
    Untracked:  data/human_lengths.txt
    Untracked:  data/old_mediation_permutations/
    Untracked:  output/DC_regions.txt
    Untracked:  output/IEE.RPKM.RDS
    Untracked:  output/IEE_voom_object.RDS
    Untracked:  output/data.4.filtered.lm.QC
    Untracked:  output/data.4.fixed.init.LM
    Untracked:  output/data.4.fixed.init.QC
    Untracked:  output/data.4.init.LM
    Untracked:  output/data.4.init.QC
    Untracked:  output/data.4.lm.QC
    Untracked:  output/full.data.10.init.LM
    Untracked:  output/full.data.10.init.QC
    Untracked:  output/full.data.10.lm.QC
    Untracked:  output/full.data.annotations.RDS
    Untracked:  output/gene.hic.filt.KR.RDS
    Untracked:  output/gene.hic.filt.RDS
    Untracked:  output/gene.hic.filt.VC.RDS
    Untracked:  output/homer_mediation.rda
    Untracked:  output/juicer.IEE.RPKM.RDS
    Untracked:  output/juicer.IEE_voom_object.RDS
    Untracked:  output/juicer.filt.KR
    Untracked:  output/juicer.filt.KR.final
    Untracked:  output/juicer.filt.KR.lm
    Untracked:  output/juicer.filt.VC
    Untracked:  output/juicer.filt.VC.final
    Untracked:  output/juicer.filt.VC.lm
    Untracked:  output/juicer_mediation.rda
    Untracked:  output/mc_de.rds
    Untracked:  output/mc_de_homer.rds
    Untracked:  output/mc_de_juicer.rds
    Untracked:  output/mc_node.rds
    Untracked:  output/mc_node_homer.rds
    Untracked:  output/mc_node_juicer.rds
    Untracked:  ~/

Unstaged changes:
    Modified:   analysis/TADs.Rmd
    Deleted:    analysis/alt_mediation.Rmd
    Modified:   analysis/enrichment.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	eb19ae1	Ittai Eres	2020-08-05	Update formatting for headers
html	b69c73d	Ittai Eres	2019-05-05	Build site.
Rmd	2419813	Ittai Eres	2019-05-05	Update all files.
html	b7d82fc	Ittai Eres	2019-04-30	Build site.
Rmd	8380e8b	Ittai Eres	2019-04-30	Add variety of juicer analyses into website files.

Initial Data read-in and quality control

First, I read in the data and normalize it with pairwise cyclic loess normalization. Then I look at histograms of the contact frequency distributions on an individual-by-individual basis, to see if they are comparable. I also plot the mean against the variance as an additional quality control metric. This is typically done on RNA-seq count data, to check whether the normalized data follow a particular distribution (e.g. under a Poisson model the variance equals the mean, giving a linear mean-variance relationship). In the limma/voom pipeline, count data are log-cpm transformed and a loess trend of variance vs. mean is then fit to derive precision weights, which are fed into linear modeling with limma. Since this is not a QC metric typically used for Hi-C data, the only thing I hope to see is no particularly strong relationship between variance and mean.
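To make the normalization step concrete, here is a minimal base-R sketch of the idea behind pairwise cyclic loess: for every pair of samples, fit a loess curve to the difference (M) as a function of the average (A), then move each sample halfway toward the other. The function `cyclic_loess_pairs` and the simulated matrix `toy` are hypothetical illustrations. This is not limma's `normalizeCyclicLoess` implementation (which is what the analysis below actually calls), and the data are not the real Hi-C contacts.

```r
# Simplified pairwise cyclic loess on simulated data (illustration only).
set.seed(1)
n <- 500
base <- rexp(n, rate = 0.2)  # shared underlying signal, assumed for the example

toy <- cbind(
  s1 = base + rnorm(n, sd = 0.3),
  s2 = base + 1.5 + rnorm(n, sd = 0.3),   # constant offset relative to s1
  s3 = 1.1 * base + rnorm(n, sd = 0.3)    # intensity-dependent bias
)

cyclic_loess_pairs <- function(x, span = 0.3, iterations = 3) {
  x <- as.matrix(x)
  for (it in seq_len(iterations)) {
    for (i in seq_len(ncol(x) - 1)) {
      for (j in (i + 1):ncol(x)) {
        m <- x[, i] - x[, j]          # difference between the two samples
        a <- (x[, i] + x[, j]) / 2    # average of the two samples
        trend <- fitted(loess(m ~ a, span = span, degree = 1))
        # Split the fitted trend evenly between the two samples.
        x[, i] <- x[, i] - trend / 2
        x[, j] <- x[, j] + trend / 2
      }
    }
  }
  x
}

normed <- cyclic_loess_pairs(toy, span = 0.3, iterations = 3)

# Column medians before and after normalization.
apply(toy, 2, median)
apply(normed, 2, median)
```

The `span` and `iterations` arguments play the same role as the tuning values explored in the code comments below: a smaller span follows local structure more closely, and more iterations give the pairwise adjustments more chances to converge.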

###Read in VC and KR normalized data, normalize each with cyclic loess (pairwise), and remove any rows that have missing values.
full.KR <- fread("data/final.juicer.10kb.KR", header=TRUE, stringsAsFactors = FALSE, data.table=FALSE, showProgress = FALSE) #Start w/ roughly 31.5k hits here
full.VC <- fread("data/final.juicer.10kb.VC", header=TRUE, stringsAsFactors = FALSE, data.table = FALSE, showProgress = FALSE) #~84k hits...

#Note that many more hits show up as significant under the VC normalization paradigm than do under KR balancing. This is especially bad for A-21792 for some reason.

#Subsetting down to only complete cases (i.e. rows where none of the individuals have NA values) leaves ~28k and ~77k hits for KR and VC, respectively: a reduction of roughly 10% in both.
full.KR <- full.KR[complete.cases(full.KR[,112:119]),]
full.VC <- full.VC[complete.cases(full.VC[,112:119]),]

#Visualize both.
boxplot(full.KR[,112:119], ylim=c(0, 30), horizontal = TRUE, main="KR Distributions")

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

boxplot(full.VC[,112:119], ylim=c(0, 30), horizontal=TRUE, main="VC Distributions")

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

KR.contacts <- full.KR[,112:119]
VC.contacts <- full.VC[,112:119]

###Pearson is a product-moment correlation, meaning it evaluates the linear relationship b/t two continuous variables. Spearman is a rank-order correlation that evaluates the monotonic relationship b/t two continuous/ordinal variables (variables change together, but not necessarily at a constant rate). It's based on ranked values for each variable rather than the raw data itself, and is thus not quite as precise IMO. Ideally I would optimize my normalization schemes WRT seeing proper clustering in a Pearson correlation heatmap. In practice I do it WRT a Spearman heatmap, b/c Pearson is more quantitative and thus harder to fine-tune, and keeps messing up here: I can't find a single normalization scheme that gives good clustering that isn't super dirty on Pearson.
###For now, just use Spearman correlations.
full.contacts.VC <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.28, iterations=3, method="pairs"))#Pearson clusters properly, but correlation values are not very pretty for separation. Same situation in Spearman but slightly less ugly. #6 iterations of 0.1 aren't bad, and can also get beautiful results with method="fast" instead of pairs (but this takes loess to the average of all the data so that's why, maybe not as valid). 0.21 and 0.28 spans also worth checking? They produce nice looking boxplots but not great looking heatmaps in either case...

#VC Correlations
corheat <- cor(full.contacts.VC, use="complete.obs", method="pearson")
corheats <- cor(full.contacts.VC, use="complete.obs", method="spearman")
colnames(corheat) <- c("H_F1", "H_M1", "C_M1", "C_F1", "H_M2", "H_F2", "C_M2", "C_F2")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#VC Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation Clustering @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters well, but not super distinct. #FIGS1B

heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Same as above.
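As a small illustration of the Pearson vs. Spearman distinction described in the comments above, the sketch below uses made-up vectors (not the Hi-C data) in which one variable is a monotonic but non-linear function of the other. Because the ranks agree perfectly, the Spearman correlation reaches its maximum, while the Pearson correlation is lower because the relationship is not a straight line.

```r
# Toy vectors, assumed purely for illustration.
x <- 1:20
y <- exp(x / 3)  # strictly increasing, but far from linear

pearson <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

This is why a normalization scheme can give clean clustering on Spearman while still looking messy on Pearson: Spearman only asks whether samples rank the contacts similarly, not whether the values agree on a linear scale.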

#Can't manage to find a cyclic loess normalization here that makes samples cluster properly for both types of correlations...can get samples clustering in appropriate places but not get the dendrogram to look how I expect--just use same settings as above, in interest of consistency.
full.contacts.KR <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.36, iterations=3, method="pairs")) #Method="fast" doesn't give us the same awesomeness here, but that's not too surprising considering different normalization schemes. .26 and .32 may also be worth checking here.

#KR Correlations
corheat <- cor(full.contacts.KR, use="complete.obs", method="pearson") #Corheat for the full data set, and heatmap. Pearson clusters poorly.
corheats <- cor(full.contacts.KR, use="complete.obs", method="spearman")
colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#KR Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters poorly, but at least species groups are maintained!

heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters excellently!

#FINALLY have something that works for both, and boxplots look good. One last check on them before moving to hists of dist'ns:
boxplot(full.contacts.KR, ylim=c(0, 30), horizontal = TRUE, main="KR Distributions, Normalized")

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

boxplot(full.contacts.VC, ylim=c(0, 30), horizontal=TRUE, main="VC Distributions, Normalized")

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

###First, a quick look at histograms of the distributions of the significant Hi-C hits from Juicer, in both humans and chimps. Create the melted dfs for each first; otherwise it takes an absurd amount of time/memory:
VC.humans <- melt(full.contacts.VC[,c(1:2, 5:6)])

No id variables; using all as measure variables

VC.chimps <- melt(full.contacts.VC[,c(3:4, 7:8)])

No id variables; using all as measure variables

KR.humans <- melt(full.contacts.KR[,c(1:2, 5:6)])

No id variables; using all as measure variables

KR.chimps <- melt(full.contacts.KR[,c(3:4, 7:8)])

No id variables; using all as measure variables

#VC-normalized distributions:
ggplot(data=VC.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Humans") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Human dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=VC.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Chimps") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Chimp Dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Both sets of distributions look fairly bimodal, with chimps in particular showing a peak around 0.

###Now, the same thing for the KR-normalized interaction frequencies:
ggplot(data=KR.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Humans") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Human dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=KR.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Normalized Juicer Hi-C Contacts, Chimps") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Chimp Dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

###########Take a look at these again, but on the raw data before normalization between samples:
VC.humans <- melt(VC.contacts[,c(1:2, 5:6)])

No id variables; using all as measure variables

VC.chimps <- melt(VC.contacts[,c(3:4, 7:8)])

No id variables; using all as measure variables

KR.humans <- melt(KR.contacts[,c(1:2, 5:6)])

No id variables; using all as measure variables

KR.chimps <- melt(KR.contacts[,c(3:4, 7:8)])

No id variables; using all as measure variables

#VC-normalized distributions:
ggplot(data=VC.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Human dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=VC.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 3500)) #Chimp Dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Both sets of distributions look fairly bimodal, with chimps in particular showing a peak around 0.

###Now, the same thing for the KR-normalized interaction frequencies:
ggplot(data=KR.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Human dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=KR.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 1500)) #Chimp Dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#There is no longer a good case for not normalizing in a pairwise cyclic loess fashion, especially now that I have clustering working on both types of normalization--just use what we have from here on out.

######PICKUP HERE and see if I then slotted these pairwise cyclic loess values back into the full DF in the other paradigm. I did indeed slot things into the data frame again in the other context, but here it must be two separate DFs (one for KR and one for VC), as the significant hits differ b/t these normalization schemes anyway.
full.KR[,112:119] <- full.contacts.KR
full.VC[,112:119] <- full.contacts.VC

###As an initial quality control metric, take a look at the mean vs. variance plot for the normalized data--first for KR, then for VC. Also go ahead and add on columns for mean frequencies and variances both within and across the species, while they're being calculated here anyway:

#KR:
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR") %>% apply(., 1, mean) -> full.KR$Hmean
select(full.KR, "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, mean) -> full.KR$Cmean
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR", "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, mean) -> full.KR$ALLmean #Get means across species.
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR") %>% apply(., 1, var) -> full.KR$Hvar #Get human variances.
select(full.KR, "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, var) -> full.KR$Cvar #Get chimp variances.
select(full.KR, "A-21792_KR", "B-28126_KR", "E-28815_KR", "F-28834_KR", "C-3649_KR", "D-40300_KR", "G-3624_KR", "H-3651_KR") %>% apply(., 1, var) -> full.KR$ALLvar #Get variance across species.
{plot(full.KR$ALLmean, full.KR$ALLvar, main="Mean vs. Variance in Juicer KR contact frequency", xlab="Mean Hi-C Contact Frequency", ylab="Variance in Hi-C Contact Frequency")
  abline(lm(ALLvar ~ ALLmean, data=full.KR), col="red")} #Regress variance (y) on mean (x) so the fitted line matches the plot axes.

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Very flat line of the regression. Solid looking.

#VC:
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC") %>% apply(., 1, mean) -> full.VC$Hmean
select(full.VC, "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, mean) -> full.VC$Cmean
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC", "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, mean) -> full.VC$ALLmean #Get means across species.
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC") %>% apply(., 1, var) -> full.VC$Hvar #Get human variances.
select(full.VC, "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, var) -> full.VC$Cvar #Get chimp variances.
select(full.VC, "A-21792_VC", "B-28126_VC", "E-28815_VC", "F-28834_VC", "C-3649_VC", "D-40300_VC", "G-3624_VC", "H-3651_VC") %>% apply(., 1, var) -> full.VC$ALLvar #Get variance across species.
{plot(full.VC$ALLmean, full.VC$ALLvar, main="Mean vs. Variance in Juicer VC contact frequency", xlab="Mean Hi-C Contact Frequency", ylab="Variance in Hi-C Contact Frequency")
  abline(lm(ALLvar ~ ALLmean, data=full.VC), col="red")} #Regress variance (y) on mean (x) so the fitted line matches the plot axes.

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Very flat line of the regression. Solid looking.

On the whole, the distributions look fairly comparable across individuals and species, and the mean vs. variance plots show weak correlation at best between the two metrics. Now knowing the data are comparable in this sense, the next question would be whether they are different enough between the species to separate them from one another with unsupervised clustering and PCA methods.
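For reference, the mean vs. variance check can be sketched on simulated data, where the expected answer is known. In the hypothetical example below (toy matrices, not the Hi-C contacts), Poisson counts show a per-row variance that tracks the per-row mean, whereas Gaussian values with a fixed standard deviation show essentially no relationship. The second pattern is the one hoped for here. The helper `mean_var` is an assumption introduced for this illustration.

```r
# Simulated mean-variance comparison (illustration only).
set.seed(20190311)
n_rows <- 1000
n_samples <- 8
lambda <- runif(n_rows, min = 1, max = 50)  # per-row true mean

# matrix() fills column by column, so row i always uses lambda[i].
pois <- matrix(rpois(n_rows * n_samples, lambda = rep(lambda, times = n_samples)),
               nrow = n_rows)
gauss <- matrix(rnorm(n_rows * n_samples, mean = rep(lambda, times = n_samples), sd = 1),
                nrow = n_rows)

# Per-row mean and variance across samples.
mean_var <- function(m) {
  data.frame(mean = rowMeans(m), var = apply(m, 1, var))
}

pois_mv <- mean_var(pois)
gauss_mv <- mean_var(gauss)

# Slope of variance regressed on mean: close to 1 for Poisson, close to 0 for
# constant-variance Gaussian data.
pois_slope <- coef(lm(var ~ mean, data = pois_mv))[["mean"]]
gauss_slope <- coef(lm(var ~ mean, data = gauss_mv))[["mean"]]

c(poisson = pois_slope, gaussian = gauss_slope)
```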

Data clustering with Principal Components Analysis (PCA) and correlation heatmap clustering.

Now I use PCA, breaking down the interaction frequency values into principal components that best represent the variability within the data. This technique uses an orthogonal transformation to convert the many dimensions of variability in this dataset into a lower-dimensional picture that can be used to separate out functional groups in the data. In this case, the expectation is that the first principal component, which explains the largest share of the variance, will separate humans and chimps along its axis, as we expect interaction frequency values to be more similar within a species than between species. I then also compute pairwise Pearson correlations between each of the individuals across all Hi-C contacts, and use unsupervised hierarchical clustering to create a heatmap that once again groups individuals based on similarity in interaction frequency values. Again, I would expect this technique to separate the species very distinctly from one another.
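As a minimal sketch of what this section expects to see, the example below runs `prcomp` and correlation-based hierarchical clustering on a simulated matrix with eight samples from two groups. The data, the group labels, and the effect size are all assumptions made for illustration. Only the general approach (PCA on the transposed matrix, then clustering on pairwise Pearson correlations) mirrors the analysis that follows.

```r
# Toy two-group data set (illustration only).
set.seed(20190311)
n_features <- 200
species <- rep(c("H", "C"), each = 4)

toy <- matrix(rnorm(n_features * 8), nrow = n_features,
              dimnames = list(NULL, paste0(species, 1:4)))

# Give each group its own set of elevated features.
toy[1:100, species == "H"] <- toy[1:100, species == "H"] + 2
toy[101:200, species == "C"] <- toy[101:200, species == "C"] + 2

# PCA with samples as rows, as in the analysis below.
pca_toy <- prcomp(t(toy), center = TRUE, scale. = TRUE)
var_explained <- summary(pca_toy)$importance["Proportion of Variance", ]
pc1 <- pca_toy$x[, "PC1"]

# Hierarchical clustering on 1 - Pearson correlation, cut into two groups.
cor_toy <- cor(toy, method = "pearson")
clusters <- cutree(hclust(as.dist(1 - cor_toy)), k = 2)

round(var_explained, 3)
split(names(pc1), sign(pc1))
split(names(clusters), clusters)
```

In this toy setting the group difference is built in, so PC1 and the two-cluster cut both recover the groups. On the real data, how cleanly the species separate is exactly what the plots below are meant to assess.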

###Now do principal components analysis (PCA) on these data to see how they separate:
#Need the metadata first to make this interesting; obtain this information from my study design.
meta.data <- data.frame("SP"=c("H", "H", "C", "C", "H", "H", "C", "C"),
                        "SX"=c("F", "M", "M", "F", "M", "F", "M", "F"),
                        "Batch"=c(1, 1, 1, 1, 2, 2, 2, 2),
                        "PE_reads"=c(1084472930, 1103077950, 1015696574, 1047650944, 980287418, 1037054332, 930089380, 964085606),
                        "tags_per_BP"=c(0.351291, 0.357315, 0.342325, 0.353097, 0.317553, 0.335937, 0.313494, 0.324936),
                        "inter_reads"=c(0.1825411, 0.1734168, 0.1559306, 0.2010798, 0.1711258, 0.1479523, 0.1604700, 0.1712287))
pca <- prcomp(t(full.contacts.KR), scale.=TRUE, center=TRUE) #The prcomp argument is named scale. (trailing dot); scale= only works through partial matching.
pca2 <- prcomp(t(full.contacts.VC), scale.=TRUE, center=TRUE)
ggplot(data=as.data.frame(pca$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on KR normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", 100*summary(pca)$importance[2,1], "% of variance)")) + ylab((paste("PC2 (", 100*summary(pca)$importance[2,2], "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17))

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

barplot(summary(pca)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on KR normalized Hi-C contact frequency") #Scree plot showing all the PCs and the proportion of the variance they explain.

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=as.data.frame(pca2$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on VC normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", 100*summary(pca2)$importance[2,1], "% of variance)")) + ylab((paste("PC2 (", 100*summary(pca2)$importance[2,2], "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17))

Version	Author	Date
b69c73d	Ittai Eres	2019-05-05
b7d82fc	Ittai Eres	2019-04-30

barplot(summary(pca2)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on VC normalized Hi-C contact frequency")

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Scree plot

#PC Correlations for VC PCA (pca2):
#Actually statistically test correlation w/ PCs:
PC1 <- pca2$x[,1]
PC2 <- pca2$x[,2]
PC3 <- pca2$x[,3]
PC4 <- pca2$x[,4]
PC5 <- pca2$x[,5]
PC6 <- pca2$x[,6]
PC7 <- pca2$x[,7]
PCS <- data.frame(PC1, PC2, PC3, PC4, PC5, PC6, PC7)
pca2.summary <- summary(pca2) #Stored under its own name so it does not mask the base summary() function.
PC_pvals <- matrix(data=NA, nrow=7, ncol=6, dimnames=list(c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"), c("Species", "Sex", "Batch", "PE_reads", "tags_per_BP", "inter_reads")))
PC_rsquareds <- matrix(data=NA, nrow=7, ncol=6, dimnames=list(c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"), c("Species", "Sex", "Batch", "PE_reads", "tags_per_BP", "inter_reads")))
for(i in 1:7){
  #For species
  speciesPC1 <- lm(PCS[,i] ~ as.factor(meta.data$SP))
  fstat <- as.data.frame(summary(speciesPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,1] <- p_fstat
  PC_rsquareds[i,1] <- summary(speciesPC1)$r.squared
  
  #For sex
  sexPC1 <- lm(PCS[,i] ~ as.factor(meta.data$SX))
  fstat <- as.data.frame(summary(sexPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,2] <- p_fstat
  PC_rsquareds[i,2] <- summary(sexPC1)$r.squared
  
  #For batch
  btcPC1 <- lm(PCS[,i] ~ as.factor(meta.data$Batch))
  fstat <- as.data.frame(summary(btcPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,3] <- p_fstat
  PC_rsquareds[i,3] <- summary(btcPC1)$r.squared
  
  #For PE_reads
  PEPC1 <- lm(PCS[,i] ~ meta.data$PE_reads)
  fstat <- as.data.frame(summary(PEPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,4] <- p_fstat
  PC_rsquareds[i,4] <- summary(PEPC1)$r.squared
  
  #For tags
  tagsPC1 <- lm(PCS[,i] ~ meta.data$tags_per_BP)
  fstat <- as.data.frame(summary(tagsPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,5] <- p_fstat
  PC_rsquareds[i,5] <- summary(tagsPC1)$r.squared
  
  #For inter
  interPC1 <- lm(PCS[,i] ~ meta.data$inter_reads)
  fstat <- as.data.frame(summary(interPC1)$fstatistic)
  p_fstat <- 1-pf(fstat[1,], fstat[2,], fstat[3,])
  PC_pvals[i,6] <- p_fstat
  PC_rsquareds[i,6] <- summary(interPC1)$r.squared
}
PC_pvals

       Species        Sex      Batch   PE_reads tags_per_BP inter_reads
PC1 0.01142577 0.30096937 0.60793795 0.08091143   0.3127508 0.602368920
PC2 0.37165036 0.77102754 0.16803309 0.68908188   0.4036627 0.930390385
PC3 0.39399059 0.18911234 0.46144460 0.25667818   0.3601012 0.567914977
PC4 0.66390784 0.09822534 0.82210831 0.52928526   0.3513486 0.376834341
PC5 0.72877547 0.78735358 0.35036759 0.79662950   0.6501039 0.009185124
PC6 0.96541467 0.91378702 0.08580873 0.21316471   0.1163830 0.567808248
PC7 0.88337621 0.36756652 0.92420574 0.76637027   0.6741631 0.993834150

PC_rsquareds

         Species         Sex       Batch   PE_reads tags_per_BP
PC1 0.6830233808 0.175898521 0.046525011 0.42272655  0.16829338
PC2 0.1344364328 0.015218818 0.290544520 0.02856576  0.11853216
PC3 0.1231722619 0.267697271 0.093483148 0.20751138  0.14057675
PC4 0.0336042094 0.389394466 0.009111427 0.06912277  0.14538288
PC5 0.0215344493 0.013088684 0.145930021 0.01195358  0.03656567
PC6 0.0003403913 0.002120169 0.412768233 0.24426223  0.35915324
PC7 0.0038888744 0.136582070 0.001637640 0.01585773  0.03149592
     inter_reads
PC1 4.794721e-02
PC2 1.380815e-03
PC3 5.733129e-02
PC4 1.317522e-01
PC5 7.042157e-01
PC6 5.736197e-02
PC7 1.081408e-05

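#So PC1 correlates strongly and significantly with species (p = 0.011, R^2 = 0.68), and PC5 correlates strongly and significantly with inter_reads (p = 0.009, R^2 = 0.70). PC4 has only a moderate, non-significant association with sex (p = 0.098, R^2 = 0.39). Nothing else reaches p < 0.05; note that these p-values are not corrected for the 42 tests performed.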
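The per-covariate regressions in the loop above all follow the same pattern, which could be condensed into a small helper. The sketch below is an illustration only: `pc_assoc` is a hypothetical function that is not part of the original analysis, and the scores it is applied to are simulated. It computes the same overall F-test p-value as `1-pf(...)`, using `lower.tail=FALSE` instead.

```r
# Hypothetical helper: regress one PC on one covariate and return the
# overall F-test p-value and R-squared of the fit.
pc_assoc <- function(pc, covariate) {
  fit <- summary(lm(pc ~ covariate))
  f <- fit$fstatistic  # named vector: value, numdf, dendf
  c(p  = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    r2 = fit$r.squared)
}

# Simulated stand-ins for PC scores and metadata (8 samples).
set.seed(2)
sp <- factor(rep(c("H", "C"), each = 4))
pc_scores <- c(rnorm(4, mean = 5), rnorm(4, mean = -5))  # strongly species-driven
res <- pc_assoc(pc_scores, sp)
res
```

Applied to each column of `PCS` and each column of `meta.data`, a helper like this would fill `PC_pvals` and `PC_rsquareds` with the same quantities the loop computes.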

Variance exploration

Now I look at several metrics to assess variance in the data, checking which species each hit was called significant in and how many individuals it was discovered in. I look at this to understand whether some cutoffs should be applied to reduce the noisiness of the data and maximize the significance of further findings.

###Add columns to full.data to indicate species of discovery and number of individuals discovered in. These pieces of information are good to know about each of the hits generally, but can also be used to make decisions about filtering out certain classes of hits if variance is associated with any of these metrics.  
humNAs <- rowSums(is.na(full.KR[,1:53])) #52 is when there is no human info. 13 NAs per individual.
chimpNAs <- rowSums(is.na(full.KR[,54:105])) #Same as above of course.
full.KR$found_in_H <- (4-humNAs/13) #Set a new column identifying number of humans hit was found in
full.KR$found_in_C <- (4-chimpNAs/13) #Set a new column identifying number of chimps hit was found in
full.KR$disc_species <- ifelse(full.KR$found_in_C>0&full.KR$found_in_H>0, "B", ifelse(full.KR$found_in_C==0, "H", "C")) #Set a column identifying which species (H, C, or Both) the hit in question was identified in. Works with nested ifelses.
full.KR$tot_indiv_IDd <- full.KR$found_in_C+full.KR$found_in_H #Add a column specifying total number of individuals Juicer found the significant hit in.
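The NA-counting logic above can be checked on a toy table. This sketch assumes, as the comments in the code state, that each individual contributes 13 columns and that a hit not called in an individual is NA across all 13 of them. It also assumes that any extra column in the real human slice (1:53 spans 53 columns, not 52) is an identifier that is never NA. All object names here are hypothetical.

```r
# Toy layout: 4 individuals x 13 columns each = 52 columns for one species.
cols_per_indiv <- 13
n_indiv <- 4
toy <- matrix(1, nrow = 3, ncol = n_indiv * cols_per_indiv)

# Row 1: found in all 4 individuals; row 2: missing in individuals 1 and 2;
# row 3: missing in every individual.
toy[2, 1:(2 * cols_per_indiv)] <- NA
toy[3, ] <- NA

# Same arithmetic as the analysis: individuals found = 4 - (NA count / 13).
na_count <- rowSums(is.na(toy))
found_in <- n_indiv - na_count / cols_per_indiv

# Species-of-discovery label from per-species counts, via nested ifelse.
found_in_H <- c(4, 2, 0)
found_in_C <- c(1, 0, 3)
disc_species <- ifelse(found_in_C > 0 & found_in_H > 0, "B",
                       ifelse(found_in_C == 0, "H", "C"))
```

Here `found_in` evaluates to 4, 2, 0 and `disc_species` to "B", "H", "C". If NAs did not come in whole blocks of 13, the counts would be fractional, which makes a cheap sanity check on the real data.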

###Take a look at what proportion of the significant hits were discovered in either of the species (or both of them).
sum(full.KR$disc_species=="H") #~9.5k

[1] 9526

sum(full.KR$disc_species=="C") #~12k

[1] 11916

sum(full.KR$disc_species=="B") #~7k

[1] 6736

#This is reassuring, as similar numbers of discoveries in both species suggests comparable power for detection.

###Now I look at variance in interaction frequency as a function of the number of individuals in which a pair was independently called as significant. Essentially, I'm checking here to see if there is some kind of cutoff I should make for the significant hits--i.e., if the variance looks to drop off significantly after a hit is discovered in at least 2 individuals, this suggests possibly filtering out any hit only discovered in 1 individual.
myplot <- data.frame(disc_species=full.KR$disc_species, found_in_H=full.KR$found_in_H, found_in_C=full.KR$found_in_C, tot_indiv_IDd=full.KR$tot_indiv_IDd, Hvar=full.KR$Hvar, Cvar=full.KR$Cvar, ALLvar=full.KR$ALLvar)
ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in KR Interaction Frequency") + coord_cartesian(ylim=c(0, 10))

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + scale_y_continuous(limits=c(0, 1)) + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in KR Interaction Frequency")

Warning: Removed 4631 rows containing non-finite values (stat_boxplot).
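The warning above is probably a consequence of how the two plots restrict the y axis. `coord_cartesian(ylim=...)` only zooms the view, whereas `scale_y_continuous(limits=...)` by default turns out-of-range values into NA before the boxplot statistics are computed, so the two plots summarize different sets of values. The base-R sketch below mimics that difference on made-up numbers; it does not call ggplot2 and the values are hypothetical.

```r
# Toy variances: mostly small, with a few large outliers.
v <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2, 5, 20)

# Zooming (as coord_cartesian does): statistics still use every value.
median_zoomed <- median(v)

# Limiting the scale to [0, 1]: out-of-range values become NA and are dropped
# before the statistics are computed.
v_limited <- ifelse(v >= 0 & v <= 1, v, NA)
median_limited <- median(v_limited, na.rm = TRUE)
n_removed <- sum(is.na(v_limited))
```

In this toy case the median falls from 0.5 to 0.35 once the three values above 1 are dropped, so boxes drawn with scale limits should be read as summaries of the sub-1 variances only.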

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#There appears to be no clear trend here, so for now I simply require that a hit be discovered in at least 2 individuals to treat it as biologically meaningful rather than technical noise. Do the same process for the VC data after this as well:
filt.KR <- filter(full.KR, tot_indiv_IDd>=2) #Brings us down from ~28k hits to only ~11.7k hits though...
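As a numerical complement to the boxplots, the per-group variance can be summarized before settling on a cutoff. This is a minimal sketch on a made-up table; the column names follow the analysis, but the values and the objects `toy`, `med_var`, and `filt` are hypothetical. It uses base R subsetting in place of `filter`.

```r
# Toy table: number of individuals a hit was called in, and its variance.
toy <- data.frame(tot_indiv_IDd = c(1, 1, 2, 2, 3, 8),
                  ALLvar        = c(0.9, 1.1, 0.4, 0.6, 0.5, 0.2))

# Median variance within each discovery-count group.
med_var <- tapply(toy$ALLvar, toy$tot_indiv_IDd, median)

# Keep only hits called in at least 2 individuals.
filt <- toy[toy$tot_indiv_IDd >= 2, ]
```

A clear drop in `med_var` between the groups at 1 and 2 would support the cutoff. On the real data the boxplots suggest no such trend, so the threshold remains a judgment call.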


###EXACT same thing for VC data:
humNAs <- rowSums(is.na(full.VC[,1:53])) #52 is when there is no human info. 13 NAs per individual.
chimpNAs <- rowSums(is.na(full.VC[,54:105])) #Same as above of course.
full.VC$found_in_H <- (4-humNAs/13) #Set a new column identifying number of humans hit was found in
full.VC$found_in_C <- (4-chimpNAs/13) #Set a new column identifying number of chimps hit was found in
full.VC$disc_species <- ifelse(full.VC$found_in_C>0&full.VC$found_in_H>0, "B", ifelse(full.VC$found_in_C==0, "H", "C")) #Set a column identifying which species (H, C, or Both) the hit in question was identified in. Works with nested ifelses.
full.VC$tot_indiv_IDd <- full.VC$found_in_C+full.VC$found_in_H #Add a column specifying total number of individuals Juicer found the significant hit in.

###Take a look at what proportion of the significant hits were discovered in either of the species (or both of them).
sum(full.VC$disc_species=="H") #~56k

[1] 55851

sum(full.VC$disc_species=="C") #~12k

[1] 11718

sum(full.VC$disc_species=="B") #~9k

[1] 9228

#This is the opposite of reassuring, as different numbers of discoveries in both species suggests incomparable power for detection...how is the coverage normalization making such a big difference here for humans?! I need to probably go back and look at my initial subsetting and significance calling and extraction again to make sure it's all right...

###Now I look at variance in interaction frequency as a function of the number of individuals in which a pair was independently called as significant. Essentially, I'm checking here to see if there is some kind of cutoff I should make for the significant hits--i.e., if the variance looks to drop off significantly after a hit is discovered in at least 2 individuals, this suggests possibly filtering out any hit only discovered in 1 individual.
myplot <- data.frame(disc_species=full.VC$disc_species, found_in_H=full.VC$found_in_H, found_in_C=full.VC$found_in_C, tot_indiv_IDd=full.VC$tot_indiv_IDd, Hvar=full.VC$Hvar, Cvar=full.VC$Cvar, ALLvar=full.VC$ALLvar)
ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in VC Interaction Frequency") + coord_cartesian(ylim=c(0, 10))

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=myplot, aes(group=tot_indiv_IDd, x=tot_indiv_IDd, y=ALLvar)) + geom_boxplot() + scale_y_continuous(limits=c(0, 1)) + ggtitle("Number of Individuals a Hi-C Hit is Discovered in vs. Variance") + xlab("# Individuals where Hi-C contact called significant") + ylab("Variance in VC Interaction Frequency")

Warning: Removed 21571 rows containing non-finite values (stat_boxplot).

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#There appears to be no clear trend here either, so for now I simply require that a hit be discovered in at least 2 individuals to treat it as biologically meaningful rather than technical noise.
filt.VC <- filter(full.VC, tot_indiv_IDd>=2) #Brings us down from ~77k hits to only ~15.7k hits though...


sum(filt.VC$disc_species=="H") #~4k

[1] 3991

sum(filt.VC$disc_species=="C") #~2.5k

[1] 2448

sum(filt.VC$disc_species=="B") #~9k

[1] 9228

sum(filt.KR$disc_species=="H") #~2.1k

[1] 2136

sum(filt.KR$disc_species=="C") #~2.8k

[1] 2824

sum(filt.KR$disc_species=="B") #~6.7k

[1] 6736

###Take one last look at the new distributions after doing the filtering, just to see what they look like, and also take a look at the PCA and hierarchical clustering, just for reference (don't really think it matters at this point?):
VC.humans <- melt(filt.VC[,c(112:113, 116:117)])

No id variables; using all as measure variables

VC.chimps <- melt(filt.VC[,c(114:115, 118:119)])

No id variables; using all as measure variables

KR.humans <- melt(filt.KR[,c(112:113, 116:117)])

No id variables; using all as measure variables

KR.chimps <- melt(filt.KR[,c(114:115, 118:119)])

No id variables; using all as measure variables

#VC-normalized distributions:
ggplot(data=VC.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 750)) #Human dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=VC.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer VC-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 750)) #Chimp Dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Both sets of distributions look fairly bimodal, with chimps in particular showing a peak around 0.

###Now, the same thing for the KR-normalized interaction frequencies:
ggplot(data=KR.humans, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Humans") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 500)) #Human dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=KR.chimps, aes(x=value)) + geom_histogram(binwidth=0.1, aes(group=variable)) + facet_wrap(~as.factor(variable)) + ggtitle("Distributions of Juicer-Significant Hi-C Contacts, Chimps") + xlab("Juicer KR-Normalized Interaction Frequency") + coord_cartesian(xlim=c(-1, 12), ylim=c(0, 500)) #Chimp Dist'ns

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

####Check on clustering and PCA now:
corheat <- cor(filt.KR[,112:119], use="complete.obs", method="pearson") #Correlation matrix for the filtered KR data, used for the heatmap. Pearson clusters poorly.
corheats <- cor(filt.KR[,112:119], use="complete.obs", method="spearman")
colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#KR Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters poorly, but at least species groups are maintained!

heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters excellently!

#FOR KR, after filtering, Spearman holds up perfectly, Pearson has slight issues (dendrogram isn't right although larger groups clustered properly)

##Repeat for VC before looking at PCA on both:
corheat <- cor(filt.VC[,112:119], use="complete.obs", method="pearson") #Correlation matrix for the filtered VC data, used for the heatmap. Pearson clusters poorly.
corheats <- cor(filt.VC[,112:119], use="complete.obs", method="spearman")
colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
rownames(corheat) <- colnames(corheat)
colnames(corheats) <- colnames(corheat)
rownames(corheats) <- colnames(corheat)

#VC Clustering
heatmaply(corheat, main="Pairwise Pearson Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters poorly, but at least species groups are maintained!

heatmaply(corheats, main="Pairwise Spearman Correlation @ 10kb", k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)) #Clusters excellently!

#FOR VC as well, after filtering, Spearman holds up perfectly, Pearson has slight issues (dendrogram isn't right although larger groups clustered properly)
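One possible reason Spearman clusters more cleanly than Pearson here is that Pearson is sensitive to a few extreme interaction frequencies, while Spearman depends only on ranks. The source does not confirm that this is the cause, so the sketch below only illustrates the general behaviour on made-up vectors.

```r
# Two nearly monotone toy profiles.
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)

# Same profile with a single extreme value.
y_out <- y
y_out[10] <- 1000

pearson_clean    <- cor(x, y,     method = "pearson")
pearson_outlier  <- cor(x, y_out, method = "pearson")
spearman_clean   <- cor(x, y,     method = "spearman")
spearman_outlier <- cor(x, y_out, method = "spearman")

round(c(pearson_clean = pearson_clean, pearson_outlier = pearson_outlier,
        spearman_clean = spearman_clean, spearman_outlier = spearman_outlier), 3)
```

The single outlier pulls the Pearson correlation down substantially while the Spearman correlation stays above 0.9. A similar effect in the real data would explain why the Spearman dendrograms are more stable.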


###PCA:
pca <- prcomp(t(filt.KR[,112:119]), scale=TRUE, center=TRUE)
pca2 <- prcomp(t(filt.VC[,112:119]), scale=TRUE, center=TRUE)
ggplot(data=as.data.frame(pca$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on KR normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", 100*summary(pca)$importance[2,1], "% of variance)")) + ylab((paste("PC2 (", 100*summary(pca)$importance[2,2], "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17))

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

barplot(summary(pca)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on KR normalized Hi-C contact frequency") #Scree plot showing all the PCs and the proportion of the variance they explain.

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

ggplot(data=as.data.frame(pca2$x), aes(x=PC1, y=PC2, shape=as.factor(meta.data$SP), color=as.factor(meta.data$Batch), size=2)) + geom_point() +labs(title="PCA on VC Normalized Hi-C Contact Frequency") + guides(color=guide_legend(order=1), size=FALSE, shape=guide_legend(order=2)) + xlab(paste("PC1 (", signif(100*summary(pca2)$importance[2,1],2), "% of variance)")) + ylab((paste("PC2 (", signif(100*summary(pca2)$importance[2,2],2), "% of variance)"))) + labs(color="Batch", shape="Species") + scale_shape_manual(labels=c("Chimp", "Human"), values=c(16, 17)) #FIGS1A

Version	Author	Date
b69c73d	Ittai Eres	2019-05-05
b7d82fc	Ittai Eres	2019-04-30

barplot(summary(pca2)$importance[2,], xlab="PCs", ylab="Proportion of Variance Explained", main="PCA on VC normalized Hi-C contact frequency")

Version	Author	Date
b7d82fc	Ittai Eres	2019-04-30

#Scree plot

#Ok, well nice to know. Now to write out files.

fwrite(filt.KR, "output/juicer.filt.KR", quote = TRUE, sep = "\t", row.names = FALSE, col.names = TRUE, na="NA", showProgress = FALSE)
fwrite(filt.VC, "output/juicer.filt.VC", quote=TRUE, sep="\t", row.names=FALSE, col.names=TRUE, na="NA", showProgress=FALSE)

###No longer used, just notes about the normalization scheme.
#full.contacts2 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.25, iterations=3, method="pairs")) #Good Spearman, poor Pearson (not even proper clustering)
#full.contacts3 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.29, iterations=3, method="pairs")) #Both terrible!
#full.contacts4 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.39, iterations=3, method="pairs")) #Spearman clusters properly but correlation values not great in terms of separation; Pearson just terrible and doesn't cluster right.
#full.contacts5 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.35, iterations=3, method="pairs")) #Both TERRIBUL
#full.contacts6 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.2, iterations=2, method="pairs")) #Gives the right clustering but not great correlations on Pearson. Same but a little better with Spearman.
#full.contacts7 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.3, iterations=1, method="pairs")) #Spearman decent but Pearson terrible.
#full.contacts8 <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.333, iterations=5, method="pairs")) #Terrible!


#Normalization schemes
#KR.1 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.17, iterations=3, method="pairs"))
#KR.2 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.31, iterations=2, method="pairs"))
#VC.contacts <- as.data.frame(normalizeCyclicLoess(VC.contacts, span=0.31, iterations=3, method="pairs"))

# Kfull.contacts2 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=1/4, iterations=3, method="pairs")) #Awful Pearson, slightly worse than above Spearman.
# Kfull.contacts3 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.29, iterations=3, method="pairs")) #Terrible Pearson, middling Spearman.
# Kfull.contacts6 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.3, iterations=1, method="pairs")) #Terrible Pearson, solid Spearman.
# Kfull.contacts7 <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=0.5, iterations=2, method="pairs"))
# 
# for(myspan in seq(0.2, 0.25, 0.01)){
#   iteration <- 2
#     Kcontacts <- as.data.frame(normalizeCyclicLoess(KR.contacts, span=myspan, iterations=iteration, method="pairs"))
#     corheat <- cor(Kcontacts, use="complete.obs", method="pearson")
#     corheats <- cor(Kcontacts, use="complete.obs", method="spearman")
#     colnames(corheat) <- c("A_HF", "B_HM", "C_CM", "D_CF", "E_HM", "F_HF", "G_CM", "H_CF")
#     rownames(corheat) <- colnames(corheat)
#     colnames(corheats) <- colnames(corheat)
#     rownames(corheats) <- colnames(corheat)
#     print(heatmaply(corheat, main=paste("Pearson, 10kb, ", myspan, " span, ", iteration, " iterations.", sep=""), k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)))
#     print(heatmaply(corheats, main=paste("Spearman, 10kb, ", myspan, " span, ", iteration, " iterations.", sep=""), k_row=2, k_col=2, symm=TRUE, margins=c(50, 50, 30, 30)))
# }

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bedr_1.0.7         forcats_0.4.0      purrr_0.3.2       
 [4] readr_1.3.1        tibble_2.1.3       tidyverse_1.2.1   
 [7] edgeR_3.20.9       RColorBrewer_1.1-2 heatmaply_0.16.0  
[10] viridis_0.5.1      viridisLite_0.3.0  stringr_1.4.0     
[13] gplots_3.0.1.1     Hmisc_4.2-0        Formula_1.2-3     
[16] survival_2.44-1.1  lattice_0.20-38    dplyr_0.8.3       
[19] plotly_4.9.0       cowplot_0.9.4      ggplot2_3.2.1     
[22] reshape2_1.4.3     data.table_1.12.0  tidyr_1.0.0       
[25] plyr_1.8.4         limma_3.34.9      

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1     rprojroot_1.3-2      htmlTable_1.13.2    
 [4] futile.logger_1.4.3  base64enc_0.1-3      fs_1.3.1            
 [7] rstudioapi_0.10      lubridate_1.7.4      xml2_1.2.2          
[10] codetools_0.2-16     splines_3.4.0        R.methodsS3_1.7.1   
[13] knitr_1.22           zeallot_0.1.0        jsonlite_1.6        
[16] workflowr_1.4.0      broom_0.5.2          cluster_2.0.7-1     
[19] R.oo_1.22.0          shiny_1.3.2          httr_1.4.1          
[22] backports_1.1.4      assertthat_0.2.1     Matrix_1.2-17       
[25] lazyeval_0.2.2       cli_1.1.0            later_0.8.0         
[28] formatR_1.7          acepack_1.4.1        htmltools_0.3.6     
[31] tools_3.4.0          gtable_0.3.0         glue_1.3.1          
[34] Rcpp_1.0.1           cellranger_1.1.0     vctrs_0.2.0         
[37] gdata_2.18.0         nlme_3.1-137         crosstalk_1.0.0     
[40] iterators_1.0.12     xfun_0.5             testthat_2.2.1      
[43] rvest_0.3.4          mime_0.7             lifecycle_0.1.0     
[46] gtools_3.8.1         dendextend_1.12.0    MASS_7.3-51.4       
[49] scales_1.0.0         TSP_1.1-7            promises_1.0.1      
[52] hms_0.5.1            parallel_3.4.0       lambda.r_1.2.4      
[55] yaml_2.2.0           gridExtra_2.3        rpart_4.1-15        
[58] latticeExtra_0.6-28  stringi_1.4.3        gclus_1.3.2         
[61] foreach_1.4.7        checkmate_1.9.4      seriation_1.2-3     
[64] caTools_1.17.1.2     rlang_0.4.0          pkgconfig_2.0.3     
[67] bitops_1.0-6         evaluate_0.13        labeling_0.3        
[70] htmlwidgets_1.3      tidyselect_0.2.5     magrittr_1.5        
[73] R6_2.4.0             generics_0.0.2       pillar_1.4.2        
[76] haven_2.1.1          whisker_0.4          foreign_0.8-72      
[79] withr_2.1.2          nnet_7.3-12          modelr_0.1.5        
[82] crayon_1.3.4         futile.options_1.0.1 KernSmooth_2.23-15  
[85] rmarkdown_1.12       locfit_1.5-9.1       grid_3.4.0          
[88] readxl_1.3.1         git2r_0.26.1         digest_0.6.18       
[91] webshot_0.5.1        xtable_1.8-4         VennDiagram_1.6.20  
[94] httpuv_1.5.2         R.utils_2.9.0        munsell_0.5.0       
[97] registry_0.5-1

Hi-C Data Normalization and Initial Quality Control, Juicer

Ittai Eres

2018-01-23
