Easily generate differential expression results from SPARC scRNA-seq data in a FAIR manner.
sPARcRNA_Viz is an all-in-one gene expression visualization utility that integrates with o²S²PARC. Using sPARcRNA_Viz, researchers can create an interactive t-SNE plot from single-cell RNA-sequencing data, as well as perform in silico GSEA to determine the most highly expressed genes. From these statistically significant genes, researchers can determine potential gene ontologies arising from their sample(s). In addition, the seamless integration of sPARcRNA_Viz with the o²S²PARC computing platform enables data accessibility concordant with the FAIR Data Principles.
sPARcRNA_Viz provides the user with the ability to fine-tune multiple gene expression parameters:
- Minimum number of cells expressing a gene
- Minimum number of features (genes) per cell
- Maximum number of features (genes) per cell
- Clustering resolution
- Species specification
- Minimum percentage for FindAllMarkers
- Log fold-change threshold for FindAllMarkers
- Minimum gene set size for GSEA
- MSigDB category for GSEA
- o²S²PARC
- R
- GNU Make
- Python3
- Docker
- HTML
- JavaScript
- Tailwind CSS
In recent years, single-cell RNA-sequencing (scRNA-seq) has emerged as a preeminent method for the analysis of gene expression in biological tissue, providing researchers access to genetic data previously inaccessible. This is largely due to advancements in wet lab and dry lab techniques, as well as computing power; these improvements enable the collection of large datasets often spanning hundreds of millions of entries. With this newfound wealth of data, a need has arisen for high-efficiency bioinformatics pipelines and tools that allow for the analysis of scRNA-seq data. One computational method currently in use is differential gene expression (DGE) analysis, which identifies statistically significant genes (i.e., results that are minimally confounded by experimental errors) and determines the expression level of a gene relative to the entire dataset [2]. Using these statistically significant results, it is possible to correlate the most highly expressed genes to their tangible, biological effects through the use of gene ontology databases such as the Gene Ontology Knowledgebase (GO).
The SPARC Portal currently hosts a rich collection of scRNA-seq data across several different tissues and species. Therefore, the SPARC platform could be further enhanced by the inclusion of data visualization and the aforementioned DGE tools. This is achieved in sPARcRNA_Viz through the use of t-SNE plotting and GSEA.
t-distributed Stochastic Neighbor Embedding (t-SNE) is a plotting and visualization technique that focuses on preserving pairwise similarities among data points. Like PCA, it is a dimensionality reduction technique. Because it is useful for comparing large, complex datasets, t-SNE is commonly employed by RNA-seq researchers.
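To make the idea of pairwise similarity concrete, the following minimal sketch computes the low-dimensional similarities that t-SNE assigns to embedded points using a Student-t kernel with one degree of freedom. It is illustrative only and is not taken from the sPARcRNA_Viz code; the function name `student_t_similarities` and the toy coordinates are assumptions.

```python
from itertools import combinations


def student_t_similarities(points):
    """Return the t-SNE low-dimensional similarity q_ij for each unordered pair.

    q_ij is proportional to 1 / (1 + squared distance) and is normalized
    over all ordered pairs of distinct points.
    """
    weights = {}
    for i, j in combinations(range(len(points)), 2):
        sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        weights[(i, j)] = 1.0 / (1.0 + sq_dist)
    # Each unordered pair appears twice (i, j) and (j, i) in the normalizer.
    total = 2.0 * sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical 2D embedding: two nearby cells and one distant cell.
embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
q = student_t_similarities(embedding)
```

Nearby points receive a much larger similarity than distant ones, which is why cells with similar expression profiles appear as tight groups on the plot.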
Gene Set Enrichment Analysis (GSEA) is a popular technique for determining statistically significant genes, as well as those that are upregulated and downregulated [5]. This is achieved through a ranking system whereby genes are ordered by statistical significance.
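The ranking idea can be sketched as a running-sum enrichment score: walk down the ranked gene list, step up when a gene belongs to the gene set and step down otherwise, and report the largest deviation from zero. The version below is the unweighted form of the statistic described in [5]; real GSEA implementations typically weight steps by the ranking metric and estimate significance by permutation, both of which are omitted here. The function name and gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up for genes in the set, down for genes outside it.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking, most significant gene first.
ranked = ["A", "B", "C", "D", "E", "F"]
top_heavy = enrichment_score(ranked, {"A", "B"})     # set concentrated at the top
bottom_heavy = enrichment_score(ranked, {"E", "F"})  # set concentrated at the bottom
```

A positive score indicates that the gene set is concentrated near the top of the ranking, and a negative score indicates concentration near the bottom.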
As of 8/12/24, the Transcriptomic_oSPARC utility [1] appears to be the most prominent SPARC tool relating to the analysis of gene expression. This tool is very effective in displaying industry-standard static graphical outputs, which can prove quite useful to researchers. However, its current level of customization may be a limitation; it may be necessary to edit the code itself to change particular parameters. There was also a niche to explore in adding interactivity to the graphs, further enhancing the user experience.
The gene expression data in SPARC is somewhat limited and is in a raw data format, rendering it less interoperable. Our goal was to make it more interoperable and easier to use. Therefore, our team sought to create an RNA-seq visualization utility that supports user-specified parameters as well as interactivity. There was also room for experimentation in predicting gene ontology with GSEA.
To address this challenge, we present sPARcRNA_Viz, an scRNA-seq visualization tool for potential entry alongside Transcriptomic_oSPARC. By incorporating flexible parameters, interactivity, and an additional DEA metric, sPARcRNA_Viz will complement Transcriptomic_oSPARC as part of a growing SPARC gene expression toolkit. This was our Codeathon team's first-ever attempt at putting this together.
- GNU Make
- Python3
- Docker (if you wish to build and test the service locally)
sPARcRNA_Viz currently supports the following file formats: .csv/.tsv (barcode and feature files) and .mtx (matrix file) single-cell matrices, along with R data. All three files (barcodes, features, and matrix) are required to run the analysis successfully.
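As a rough illustration of checking for the three required files before starting an analysis, the sketch below scans a directory and reports anything missing. The matching rule (file names containing "barcodes", "features", or "matrix") and the helper name `find_inputs` are assumptions made for this example, not the tool's actual validation logic.

```python
import tempfile
from pathlib import Path

# Assumed naming convention: role keyword in the file name plus an allowed suffix.
REQUIRED = {
    "barcodes": (".tsv", ".csv"),
    "features": (".tsv", ".csv"),
    "matrix": (".mtx",),
}


def find_inputs(input_dir):
    """Map each required role to the first matching file, or raise if any is missing."""
    found = {}
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        name = path.name.lower()
        for role, suffixes in REQUIRED.items():
            if role in name and name.endswith(suffixes):
                found.setdefault(role, path)
    missing = sorted(set(REQUIRED) - set(found))
    if missing:
        raise FileNotFoundError(f"missing input file(s): {', '.join(missing)}")
    return found


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("barcodes.tsv", "features.tsv", "matrix.mtx"):
        (Path(tmp) / fname).touch()
    roles = sorted(find_inputs(tmp))
```

Failing early with a list of missing files is usually friendlier than letting a later parsing step fail with a less specific error.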
For details, refer to PIPELINE.md.
Load libraries, set options, validate and prepare the directories; find and read raw data files; configure based on inputs.
Seurat is an R package specially designed for the quality control (QC), analysis, and exploration of single-cell RNA-seq data. Thus, it proved to be a suitable choice for the purposes of sPARcRNA_Viz.
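The QC thresholds exposed by sPARcRNA_Viz (minimum cells per gene, minimum and maximum features per cell) can be illustrated with a small pure-Python filter. This is a sketch of the general idea rather than Seurat's implementation; the order of the two filters, the inclusive bounds, the function name, and the toy counts are all assumptions.

```python
def filter_counts(counts, min_cells=3, min_features=200, max_features=2500):
    """Filter a {cell: {gene: count}} mapping by simple QC thresholds."""
    # Keep cells whose number of detected genes lies within the bounds.
    kept_cells = {
        cell: genes
        for cell, genes in counts.items()
        if min_features <= sum(1 for c in genes.values() if c > 0) <= max_features
    }
    # Count, for each gene, the remaining cells in which it is detected.
    cells_per_gene = {}
    for genes in kept_cells.values():
        for gene, c in genes.items():
            if c > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_cells}
    return {
        cell: {g: c for g, c in genes.items() if g in kept_genes}
        for cell, genes in kept_cells.items()
    }


# Toy data with tiny thresholds so the effect is visible.
toy = {
    "cell1": {"GeneA": 5, "GeneB": 1},
    "cell2": {"GeneA": 2, "GeneB": 3, "GeneC": 1},
    "cell3": {"GeneA": 1},
    "cell4": {"GeneA": 4, "GeneB": 2, "GeneC": 7, "GeneD": 1},
}
filtered = filter_counts(toy, min_cells=2, min_features=2, max_features=3)
```

Here `cell3` is dropped for having too few detected genes, `cell4` for having too many, and `GeneC` because it is then detected in only one remaining cell.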
Normalize (so that data reflects true biological differences); find variable features; scale (to standardize the data); perform PCA (Principal Component Analysis to reduce dimensionality); and cluster cells with similar profiles together.
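The normalization and scaling steps can be sketched in a few lines. The first function follows the log-normalization scheme commonly used for scRNA-seq (divide by the cell's total counts, multiply by a scale factor, apply `log1p`); the second standardizes one gene's values across cells. Both are simplified illustrations with hypothetical names, and the scale factor of 10,000 is an assumption about the configuration rather than something stated in this document.

```python
import math
from statistics import mean, stdev


def log_normalize(cell_counts, scale_factor=10_000):
    """Normalize one cell's counts by its total depth, then log-transform."""
    total = sum(cell_counts.values())
    return {g: math.log1p(c / total * scale_factor) for g, c in cell_counts.items()}


def scale_gene(values):
    """Standardize one gene's values across cells to mean 0 and unit variance."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


# Two hypothetical cells with identical composition but 10x different depth.
shallow = log_normalize({"G1": 10, "G2": 90})
deep = log_normalize({"G1": 100, "G2": 900})

scaled = scale_gene([1.0, 2.0, 3.0])
```

After normalization the shallow and deep cells have the same values, so the remaining differences between cells reflect composition rather than sequencing depth.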
t-SNE allows us to visualize statistically significant genes based on these clusters. From these, researchers can determine potential gene ontologies arising from their sample(s).
Differential gene expression analysis takes the normalized gene read counts and allows researchers to determine quantitative changes in gene expression.
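As a simplified sketch of how the minimum-percentage and log fold-change thresholds narrow down candidate marker genes, the example below compares one cluster against all other cells for a single gene. It is not Seurat's FindAllMarkers: the fold-change formula, the pseudocount, the use of log base 2, and the function names are assumptions, and no statistical test is performed.

```python
import math
from statistics import mean


def marker_stats(cluster_values, other_values, pseudocount=1.0):
    """Return (pct_in, pct_out, log2 fold change) for one gene."""
    pct_in = sum(v > 0 for v in cluster_values) / len(cluster_values)
    pct_out = sum(v > 0 for v in other_values) / len(other_values)
    log2fc = math.log2(
        (mean(cluster_values) + pseudocount) / (mean(other_values) + pseudocount)
    )
    return pct_in, pct_out, log2fc


def is_candidate_marker(cluster_values, other_values,
                        min_pct=0.25, logfc_threshold=0.25):
    """Apply both pre-filters: detection fraction and fold-change magnitude."""
    pct_in, pct_out, log2fc = marker_stats(cluster_values, other_values)
    detected_enough = max(pct_in, pct_out) >= min_pct
    return detected_enough and abs(log2fc) >= logfc_threshold


# Hypothetical expression values for three genes.
strong = is_candidate_marker([4, 5, 0, 6], [1, 0, 0, 1])   # enriched in the cluster
flat = is_candidate_marker([1, 1, 1, 1], [1, 1, 1, 1])     # no difference
rare = is_candidate_marker([0] * 9 + [5], [0] * 10)        # detected in too few cells
```

Raising either threshold reduces the number of genes carried forward, which speeds up the analysis at the cost of possibly missing weaker signals.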
GSEA aids in determining gene groups highly represented in the data.
All the cluster results after running GSEA are saved, and the top pathways are saved as well.
All values from the previous steps, along with the top clusters, pathways, etc., are saved in a Seurat object that is later visualized. The user can optionally convert this data into .csv file format. The results are displayed in interactive charts using d3 and ApexCharts. All the results are fully accessible via the file system and do not require any additional software to view.
sPARcRNA_Viz offers a variety of command options:
Option | Description | Default |
---|---|---|
`-i`, `--input` | Input directory path | 'validation/input/data/' |
`-o`, `--output` | Output directory path | 'tmp_output/' |
`-n`, `--name` | The name of the dataset being analyzed | "sPARcRNA" |
`-c`, `--min_cells` | Minimum number of cells expressing a gene | 3 |
`-f`, `--min_features` | Minimum number of features (genes) per cell | 200 |
`--max_features` | Maximum number of features (genes) per cell | 2500 |
`--resolution` | Resolution parameter for clustering | 0.8 |
`--species` | Species for GSEA | "Mus musculus" |
`--min_pct` | Minimum percentage for FindAllMarkers | 0.25 |
`--logfc_threshold` | Log fold-change threshold for FindAllMarkers | 0.25 |
`--gsea_min_size` | Minimum gene set size for GSEA | 15 |
`--gsea_max_size` | Maximum gene set size for GSEA | 500 |
`--category` | MSigDB category for GSEA | "H" |
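For readers who want to see how these options and defaults fit together, here is a hypothetical Python `argparse` mirror of the table above. It is purely illustrative: the actual entry point of sPARcRNA_Viz may be implemented differently, and the program name `sparcrna_viz_sketch`, the argument types, and the example dataset name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the documented options."""
    p = argparse.ArgumentParser(
        prog="sparcrna_viz_sketch",  # hypothetical name, not the real command
        description="Illustrative mirror of the documented options.",
    )
    p.add_argument("-i", "--input", default="validation/input/data/")
    p.add_argument("-o", "--output", default="tmp_output/")
    p.add_argument("-n", "--name", default="sPARcRNA")
    p.add_argument("-c", "--min_cells", type=int, default=3)
    p.add_argument("-f", "--min_features", type=int, default=200)
    p.add_argument("--max_features", type=int, default=2500)
    p.add_argument("--resolution", type=float, default=0.8)
    p.add_argument("--species", default="Mus musculus")
    p.add_argument("--min_pct", type=float, default=0.25)
    p.add_argument("--logfc_threshold", type=float, default=0.25)
    p.add_argument("--gsea_min_size", type=int, default=15)
    p.add_argument("--gsea_max_size", type=int, default=500)
    p.add_argument("--category", default="H")
    return p


# Override two options and leave the rest at their documented defaults.
args = build_parser().parse_args(["--name", "demo_dataset", "--resolution", "0.5"])
```

Any option not supplied keeps the default from the table, so a run with no arguments would use the bundled validation data and write to `tmp_output/`.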
The scRNA-seq data used in the tutorial is from the SPARC Portal.
1. Log in to o²S²PARC
(Alternatively, drag and drop the needed files into the workspace.)
In the future, sPARcRNA_Viz could be expanded to include other interactive visualizations and API calls to other gene databases. This would provide more ways to analyze genes and integrate with other websites.
Perhaps the most important aspect of sPARcRNA_Viz is its emphasis on the FAIR Data Principles. Summarized below are highlight features of sPARcRNA_Viz supporting the FAIR initiative.
FAIR data is that which is Findable, Accessible, Interoperable, and Reusable [3]. These tenets are highly important within the scientific context, as a lack of clarity surrounding data may introduce confounding variables. The FAIR system was therefore devised as a method to organize data, serving both present and future researchers.
Particularly in the case of scRNA-seq data, which is expensive from both a wet and dry lab standpoint, it is very useful to adhere to FAIR standards. For instance, one particularly common phenomenon with respect to scRNA-seq is dropout [4], where portions of RNA are not captured by experimental techniques. scRNA-seq data can also vary significantly with regard to format; often, differently labeled matrices may contain raw counts data, or data that has been normalized by a method such as CPM, TPM, or RPKM/FPKM. The FAIR article cited on the SPARC website expands upon this idea further: the licensing of data can also pose a challenge for the analysis of gene regulation and expression. Therefore, the intentional categorization and stewardship of data can present a major benefit to transcriptomics researchers, propelling scientific progress.
FAIR Principle | Other Tools | sPARcRNA_Viz |
---|---|---|
Findable | May not be connected to an existing database such as the SPARC Portal, which could hinder the findability of data | sPARcRNA_Viz is connected to o²S²PARC, so it can use the well-organized datasets provided on the SPARC portal, and it is archived on Zenodo with the appropriate metadata |
Accessible | May have a user interface that requires a programming background | sPARcRNA_Viz's friendly user interface and visuals allow researchers to quickly engage with data, and the tool is open, free, and universally implementable |
Interoperable | May not allow for connections between datasets | Through its use of GSEA, sPARcRNA_Viz allows for the meaningful connection of datasets: scRNA-seq data can be used in association with gene ontology. In addition, visualizations generated for each dataset can be compared with each other |
Reusable | May only support the formatting of one dataset | sPARcRNA_Viz can be used with multiple datasets due to the ability to specify parameters. Likewise, sPARcRNA_Viz offers a security benefit through its use of input validation |
To ensure that we are compliant with all the FAIR principles, we have also created a crosswalk between the FAIR principles and the sPARcRNA_Viz tool. This crosswalk can be found in the CROSSWALK.md file.
Please utilize the Issues tab of this repository should you encounter any problems with sPARcRNA_Viz.
Please Fork this repository and submit a Pull Request to contribute.
Samdarshi, M., Soundarajan, S., Simhambhatla, M., Patel, R. C., & Bratton, A. (2024). sPARcRNA_Viz. Zenodo. https://doi.org/10.5281/zenodo.13308297
sPARcRNA_Viz is distributed under the MIT License.
- Mihir Samdarshi (Lead, Sysadmin, Developer)
- Sanjay Soundarajan (Sysadmin, Developer)
- Mahitha Simhambhatla (Developer, Writer)
- Raina Patel (Writer)
- Ayla Bratton (Writer)
[1]
Ben Aribi, H., Ding, M., & Kiran, A. (2023).
Gene expression data visualization tool on the o2S2PARC platform.
F1000Research, 11, 1267.
https://www.pnas.org/doi/abs/10.1073/pnas.0506580102
[2]
EMBL-EBI. (n.d.).
Differential gene expression analysis | Functional genomics II.
https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/data-analysis/differential-gene-expression-analysis/
[3]
GO FAIR. (2017).
FAIR Principles - GO FAIR. GO FAIR.
https://www.go-fair.org/fair-principles/
[4]
Kim, T. H., Zhou, X., & Chen, M. (2020). Demystifying “drop-outs” in single-cell UMI data. Genome Biology, 21(1).
https://doi.org/10.1186/s13059-020-02096-y
[5]
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., & Mesirov, J. P. (2005).
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
https://doi.org/10.1073/pnas.0506580102
[6]
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., & Gonzalez-Beltran, A. (2016).
The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data, 3(1).
https://www.nature.com/articles/sdata201618
Logo and figures were created using Microsoft Word; images were formatted using Canva.
We would like to thank the SPARC Codeathon 2024 team for all their guidance and support.