---
title: "Statistical analysis of TFBS disruption with tfboot"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)

# load the pre-cooked vignette data
mbres <- tfboot::vignettedata$mbres
mball <- tfboot::vignettedata$mball
```


## Introduction

### Motivation

The [motifbreakR](https://bioconductor.org/packages/motifbreakR/) package provides functions to assess how single nucleotide polymorphisms (SNPs) disrupt transcript factor binding sites (TFBS). The tfboot package provides helper functions to aid in assessing the statistical significance of TFBS disruption in _interval sets_, typically defined by -5kb upstream promoter regions on genes of interest. 

An example question might be: we have these 123 genes of interest that we think are important for XYZ phenotype. Can you assess the degree to which SNPs in the promoter region of these genes disrupt TFBS? Is this disruption in these 123 genes statistically significant as compared to SNPs from randomly selected sets of the same number of genes? 

### Setup

To run this vignette, you'll need the chicken BSGenome+TxDb+org.db packages, plyranges, and motifbreakR packages, which can be installed from bioconductor. You'll also need to install this tfboot package, which you can do from source, or directly from GitHub.

```{r, eval=FALSE}
install.packages("BiocManager")
BiocManager::install(c("BSgenome.Ggallus.UCSC.galGal6", 
                       "org.Gg.eg.db", 
                       "TxDb.Ggallus.UCSC.galGal6.refGene", 
                       "plyranges", 
                       "motifbreakR"), update=FALSE)

install.packages("devtools")
devtools::install_github("https://github.com/colossal-compsci/tfboot", upgrade=FALSE)
```

## Basic usage

First, load the libraries you need for this analysis. 

```{r libs, message=FALSE}
library(BSgenome.Ggallus.UCSC.galGal6)
library(TxDb.Ggallus.UCSC.galGal6.refGene)
library(org.Gg.eg.db)
library(plyranges)
library(motifbreakR)
library(tfboot)
```

Use `read_vcf()` to read in your VCF. The example used here is a VCF with variants from chicken (_Gallus gallus_) chromosome 33. Note that your VCF must be pre-filtered to only include variant sites. More information on how to do this can be found in the help page (`?read_vcf`).

```{r}
vcf_file <- system.file("extdata", "galGal6-chr33.vcf.gz", package="tfboot", mustWork = TRUE)
snps <- read_vcf(file=vcf_file, bsgenome=BSgenome.Ggallus.UCSC.galGal6)
snps
```

The `get_upstream_snps()` function from tfboot takes in a SNP list and a TxDb object, and turns intervals 5kb upstream of the transcription start site for all genes in the list. See the help for `?get_upstream_snps()` and the helper function it calls, `get_upstream()`. This function also respects strand information embedded in the genome. For an example on building a TxDb object from a GTF, see `?GenomicFeatures::makeTxDbPackage`.

Here we get the 5kb upstream promoter region of _all_ genes with _any_ SNPs (limited to chromosome 33 in this example).

```{r}
prosnps <- get_upstream_snps(snps, txdb=TxDb.Ggallus.UCSC.galGal6.refGene)

prosnps
```

Now, let's select a set of five random genes of interest.

```{r}
set.seed(123)
mygenes <- sample(unique(prosnps$gene_id), 5)
mygenes
```

What are those genes?

```{r}
AnnotationDbi::select(org.Gg.eg.db, 
                      key=mygenes, 
                      columns=c("ENTREZID", "SYMBOL", "GENENAME"), 
                      keytype="ENTREZID")
```


We can then pull out SNPs in the 5kb promoter region of just those SNPs.

```{r}
myprosnps <- 
  prosnps |> 
  filter(gene_id %in% mygenes)

myprosnps
```

How many SNPs in the promoter region of each of my genes of interest?

```{r}
table(myprosnps$gene_id)
```

What's the median number of SNPs across _all_ genes? 

```{r}
median(table(prosnps$gene_id))
```

Now, let's run motifbreakR analysis with the 2022 JASPAR database on those `r length(mygenes)` genes. First let's look at the TFBS motifs. See `?MotifDb` for more info.

```{r}
motifs <- subset(MotifDb, dataSource=="jaspar2022")
motifs
```

Now, let's run motifbreakR on these SNPs in the promoter regions of these `r length(mygenes)` genes. The result of the motifbreakR analysis is a GenomicRanges object. 

```{r, eval=FALSE}
mbresGR <- motifbreakR(myprosnps, pwmList=motifs)
```

Additionally, let's run motifbreakR on _all_ SNPs for promoter regions in _all_ genes (remember, this entire analysis is restricted to chromosome 33 only).

```{r, eval=FALSE}
mballGR <- motifbreakR(prosnps, pwmList=motifs)
```

Let's use tfboot's `mb_to_tibble()` function to create a compact tibble for each of these. We'll use this data for our statistical analysis coming up next.

```{r, eval=FALSE}
mbres <- mbresGR |> mb_to_tibble()
mball <- mballGR |> mb_to_tibble()
```

```{r}
mbres
mball
```

In reality, when you perform the motifbreakR results across the whole genome, the analysis will take some time (perhaps hours), depending on the number of cores allocated to the job. I recommend that once you perform this once for every gene in the genome, you save the compact tibble (`mball` in this example) using `save()` or `saveRDS()` to read in later, avoiding the need to re-run the genome-wide motifbreakR analysis each time. You can do this with something like this:

```{r, eval=FALSE}
# Save the data so you don't have to run this long-running analysis again
saveRDS(mball, file="genome-wide-motifbreakr-results.rds")
# Read them back in from a file
mball <- readRDS(file="genome-wide-motifbreakr-results.rds")
```

```{r, echo=FALSE, eval=FALSE}
# Save the data
vignettedata <- list()
vignettedata$mbres <- mbres
vignettedata$mball <- mball
usethis::use_data(vignettedata, overwrite=TRUE)
```

## Assessing statistical signficance

### Motivation

In our example above we have a set of five genes. We've run motifbreakR on SNPs in the 5kb upstream region of those genes, and we want to ask questions such as "are there statistically more SNPs in the promoters of _these_ five genes than a random selection of five genes?" or "do SNPs in the promoters of _these_ five genes statistically enriched for those that disrupt TFBS, more so than randomly selected genes?" The tfboot package can help answer these questions.

We initially ran motifbreakR on our set of five genes of interest. To address the kinds of questions above, we could go back to the set of all genes (again, here just those limited to chromosome 33 for demonstration purposes), and randomly draw 1000 sets of 5 genes, run motifbreakR on SNPs in those promoters, then compare our stats to the empirical null distribution to obtain a p-value. However, running motifbreakR on five genes takes time. Imagine our gene set was 300 genes. Then we would need to run motifbreakR 300,000 times to generate our null distribution. Further, if we had _multiple_ gene sets, e.g., one with 100 genes, another with 500 genes, another with 400 genes, we would need to run motifbreakR 1,000,000 times to generate null distrubitions of 1,000 bootstraps for each of these intervals.

Instead, we can generate the motifbreakR results for _all_ genes, do this only once, then save those results to file, as we've done above. Then, when we need to compute our bootstrap sampling to create the empirical null distribution, we only have to bootstrap precomputed results, rather than bootstrapping the motifbreakR procedure itself.

### Demonstration

Let's look again at our motifbreakR results on our genes of interest:

```{r}
# Look at the data
mbres
# How many genes
length(unique(mbres$gene_id))
# How many TFBS are disrupted for each gene?
table(mbres$gene_id)
# How many "strong" in each gene?
table(mbres$gene_id, mbres$effect)
```

Now, let's review the motifbreakR on _all_ genes of interest

```{r}
# Look at the data (note the number of rows)
mball
# How many genes
length(unique(mball$gene_id))
# How many TFBS are disrupted for each gene? Just show the top 5
table(mball$gene_id) |> 
  sort(decreasing = TRUE) |> 
  head(n=5)
```

The tfboot function `mb_summarize` will summarize the results from a motifbreakR analysis into a single-row table. See help for `?mb_summarize` for details. This returns a table with the following columns:

1. `ngenes`: The number of genes in the SNP set.
1. `nsnps`: The number of SNPs total.
1. `nstrong`: The number of SNPs with a "strong" effect.
1. `alleleDiffAbsMean` The mean of the absolute values of the `alleleDiff` scores.
1. `alleleDiffAbsSum` The sum of the absolute values of the `alleleDiff` scores.
1. `alleleEffectSizeAbsMean` The mean of the absolute values of the `alleleEffectSize` scores.
1. `alleleEffectSizeAbsSum` The sum of the absolute values of the `alleleEffectSize` scores.

Let's run it on our set of genes of interest:

```{r}
mbsmry <- mb_summarize(mbres)
mbsmry
```

So, how do these numbers compare to, for example, the number of SNPs, number of strong effects, the average absolute allele difference, or average absolute effect size, from a set of SNPs in promoters in a randomly chosen sample of the same number of genes?

We can use the `mb_bootstrap` function to generate an empirical null distribution given the precomputed set of motifbreakR for the universe of all genes. This function takes motifbreakR results as a tibble (from `mb_to_tibble`), draws `boots` random samples of `ngenes` genes, and returns as a list (1) a wide tibble with results for each bootstrap, and (2) another tibble with the distribution of each metric as a column. We set the random number generator for reproducibility, and we limit the number of bootstrap resamples here for speed.

```{r}
set.seed(42)
mbboot <- mb_bootstrap(mball, ngenes=5, boots=250)
mbboot$bootwide
mbboot$bootdist
```

Finally, we can use the `mb_bootstats()` function to take in the summary on _our_ genes of interest (output from `mb_summary()`) and the bootstrapped empirical null distribution (from `mb_bootstrap()` run on motifbreakR results from _all_ genes), to compare our critical values (in `mbsmry` here) to our empirical null distribution (called `mbboot` here).

```{r}
bootstats <- mb_bootstats(mbsmry = mbsmry, mbboot = mbboot)
bootstats
```

We can also visualize this graphically with the `plot_bootstats()` function, which takes the results from `mb_bootstats()` as input.

```{r, fig.width=7, fig.height=5}
plot_bootstats(bootstats)
```


From these results we see that there is no meaningful statistical enrichment of TFBS-disrupting SNPs in this set of genes compared to randomly selected genes (this is expected -- our five genes _were_ randomly chosen). 

### Caveats

**The set of motifbreakR results for all genes must be recomputed on each new sample.** This vignette demonstrates how to precompute the motifbreakR results _once_ for the universe of all genes. You can use this precomputed set to very quickly bootstrap an empirical null distribution of a set of _k_ randomly selected genes. However, this background set is unique to each sample. Each sample will have a different set of SNPs, and this precomputing procedure computes motifbreakR results for SNPs in upstream regions of all genes. Similarly, if the upstream region changes from the default 5,000 bp, results for all genes will need to be recomputed again.

**SNP density matters.** Promoter regions with a larger number of SNPs are going to have larger values for the sum of the absolute values of the allele difference and effect sizes. This should not affect the mean of these metrics.

**Genes, not transcripts.** This vignette demonstrates running motifbreakR on SNPs in 5kb upstream regions defined by _genes_ (see the help for `?get_upstream` and `?get_upstream_snps` and `?GenomicFeatures::genes`). This is using the most 5' transcript as _the_ transcription start site, and looking 5kb upstream of that position. This does not consider alternative transcription start sites. Further development is required to contend with multiple sets of TFBS motif disruption scores for the same gene if transcripts are to be used.

## Session Information

```{r sessionInfo, echo=FALSE}
sessionInfo()
```