-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy path8-PLIER_pathways_analysis.Rmd
115 lines (99 loc) · 4.87 KB
/
8-PLIER_pathways_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
title: "PLIER pathways analysis"
output: html_notebook
author: "Steven Foltz"
date: "December 2022"
---
```{r}
meaningful_difference <- 0.2
n_repeats <- 10
```
### What additional pathways are identified by PLIER after doubling the sample size?
Here we look for oncogenic pathways (defined by [GSEA MSigDB C6](https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp?collection=C6)) that are found more frequently in the data with the full sample size than in data with half the sample size.
By focusing on pathways that are reliably detected more often in the full sample size data, we can identify stable patterns that emerge when more data is available.
We ran PLIER `r n_repeats` times using each combination of data normalization method and sample size (half or full).
Out of those `r n_repeats` runs, we found the proportion of times each oncogenic pathway was significantly associated with a latent variable.
We set an arbitrary "meaningful difference" threshold of `r meaningful_difference` to detect when that proportion was meaningfully greater using full sample size compared to half sample size and require array and RNA-seq data to both to satisfy the condition.
We also arbitrarily limit results to those pathways found in over half of runs using the full size data.
### Load packages and pathways
```{r}
library(tidyverse)
data("oncogenicPathways", package = "PLIER")
```
### Set input file names
```{r}
plots_data_dir <- here::here("plots", "data")
brca_filename <- file.path(plots_data_dir, "BRCA_subtype_PLIER_pathways.tsv")
gbm_filename <- file.path(plots_data_dir, "GBM_subtype_PLIER_pathways.tsv")
```
### Read in data for each cancer type
```{r}
pathways_df <- NULL
if (file.exists(brca_filename)) {
brca_pathways_df <- read_tsv(brca_filename) %>%
mutate(cancer_type = "BRCA")
pathways_df <- bind_rows(pathways_df,
brca_pathways_df)
} else {
message(str_c("BRCA file ", brca_filename, " does not exist."))
}
if (file.exists(gbm_filename)) {
gbm_pathways_df <- read_tsv(gbm_filename) %>%
mutate(cancer_type = "GBM")
pathways_df <- bind_rows(pathways_df,
gbm_pathways_df)
} else {
message(str_c("GBM file ", gbm_filename, " does not exist."))
}
if (!file.exists(brca_filename) & !file.exists(gbm_filename)) {
stop(str_c("Neither BRCA file ", brca_filename,
" nor GBM file ", gbm_filename,
" exists."))
}
```
### Filter data to identify oncogenic pathways detected more frequently in full data
```{r}
pathways_df %>%
filter(FDR < 0.05, # require significant association with an LV
pathway %in% colnames(oncogenicPathways)) %>% # require oncogenic pathways
# for each combination of normalization method, %RNA-seq, pathway, and cancer type
group_by(nmeth, pseq, pathway, cancer_type) %>%
# summarize by finding the count and proportion of repeats in which that
# pathway was significantly associated with at least one latent variable
summarize(n_seeds = length(unique(seed_index)),
prop_seeds = n_seeds/n_repeats,
.groups = "drop") %>%
# clean up normalization method strings (remove parentheses, spaces, and hyphens)
mutate(nmeth = str_remove_all(nmeth, "[\\(\\)]")) %>%
mutate(nmeth = str_replace_all(nmeth, c(" " = "_", "-" = "_"))) %>%
# create new variable combining normalization method and %RNA-seq
mutate(nmeth_pseq = str_c(nmeth, pseq, sep = "_")) %>%
select(cancer_type, pathway, prop_seeds, nmeth_pseq) %>%
# create new columns for each combination of norm method and %RNA-seq
# each row corresponds to a single pathway from a cancer type
pivot_wider(id_cols = c("cancer_type", "pathway"),
names_from = nmeth_pseq,
values_from = prop_seeds) %>%
mutate_if(is.numeric, replace_na, 0) %>% # replace NAs with 0s
# reduce to "meaningful" results
# 1. difference in array only data must be meaningful
# 2. difference in seq only data must be meaningful
# 3. pathway must be detected in over half of full data sets (array only, RNA-seq only, and NPN 50%/50%)
filter(log_0 - array_only_50 >= meaningful_difference,
log_100 - seq_only_50 >= meaningful_difference,
log_0 > 0.5,
log_100 > 0.5,
npn_50 > 0.5) %>%
select(cancer_type, pathway, array_only_50, seq_only_50, log_0, log_100, npn_50) %>%
arrange(cancer_type) %>%
knitr::kable()
```
| variable_name | meaning |
| --- | --- |
| `array_only_50` | LOG transformed array data (half sample size) |
| `seq_only_50` | LOG transformed RNA-seq data (half sample size) |
| `log_0` | LOG transformed array data (full sample size) |
| `log_100` | LOG transformed RNA-seq data (full sample size) |
| `npn_50` | NPN transformed data, 50% array and 50% RNA-seq (full sample size) |
### Citation:
Mao W, Zaslavsky E, Hartmann BM, Sealfon SC, Chikina M. Pathway-level information extractor (PLIER) for gene expression data. Nat Methods. 2019;16: 607–610.