Single Cell RNA-sequencing differential expression analysis and interactive visualization service on oSparc

SPARC-FAIR-Codeathon/sPARcRNA_Viz

About

Easily generate differential expression results from SPARC scRNA-seq data in a FAIR manner.

Research Purpose

Introduction

sPARcRNA_Viz is an all-in-one gene expression visualization utility that integrates with o²S²PARC. Using sPARcRNA_Viz, researchers can create an interactive t-SNE plot from single-cell RNA-sequencing data, as well as perform in silico GSEA to determine the most highly expressed genes. From these statistically significant genes, researchers can determine potential gene ontologies arising from their sample(s). In addition, the seamless integration of sPARcRNA_Viz with the o²S²PARC computing platform enables data accessibility concordant with the FAIR Data Principles.

Notable Features of sPARcRNA_Viz

sPARcRNA_Viz provides the user with the ability to fine-tune multiple gene expression parameters:

  • Minimum number of cells expressing a gene
  • Minimum number of features (genes) per cell
  • Maximum number of features (genes) per cell
  • Clustering resolution
  • Species specification
  • Minimum percentage for FindAllMarkers
  • Log fold-change threshold for FindAllMarkers
  • Minimum gene set size for GSEA
  • MSigDB category for GSEA

Technology Stack

  • o²S²PARC
  • R
  • GNU Make
  • Python3
  • Docker
  • HTML
  • JavaScript
  • Tailwind CSS

Background

In recent years, single-cell RNA-sequencing (scRNA-seq) has emerged as a preeminent method for the analysis of gene expression in biological tissue, providing researchers access to genetic data that was previously inaccessible. This is largely due to advancements in wet lab and dry lab techniques, as well as in computing power; together, these improvements enable the collection of large datasets, often spanning hundreds of millions of entries. With this newfound wealth of data, a need has arisen for high-efficiency bioinformatics pipelines and tools that allow for the analysis of scRNA-seq data. One computational method currently in use is differential gene expression (DGE) analysis, which identifies statistically significant genes (i.e., results that are minimally confounded by experimental errors) and determines the expression level of a gene relative to the entire dataset [2]. Using these statistically significant results, it is possible to correlate the most highly expressed genes to their tangible, biological effects through the use of gene ontology databases such as the Gene Ontology Knowledgebase (GO).

The SPARC Portal currently hosts a rich collection of scRNA-seq data across several different tissues and species. Therefore, the SPARC platform could be further enhanced by the inclusion of data visualization and the aforementioned DGE tools. This is achieved in sPARcRNA_Viz through the use of t-SNE plotting and GSEA.

About t-SNE Plots

t-distributed Stochastic Neighbor Embedding (t-SNE) is a plotting and visualization technique that focuses on pairwise similarities among data points. Like PCA, it is a dimensionality reduction technique. Because it is useful for comparing large, complex datasets, t-SNE is commonly employed by RNA-seq researchers.
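For readers who want the "pairwise similarities" made concrete, the standard t-SNE formulation can be summarized as follows. This is the textbook definition, not a description of how sPARcRNA_Viz configures its t-SNE step; here x_i are the high-dimensional points (for example, cells in PCA space), y_i are their 2-D coordinates, n is the number of points, and sigma_i is a per-point bandwidth set from the chosen perplexity.

```latex
% Similarity of point j to point i in the original space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarity in the low-dimensional embedding (Student t, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The embedding is chosen to minimize the Kullback-Leibler divergence
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed q_{ij} is what lets dissimilar cells sit far apart in the plot while similar cells stay in tight clusters.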

About GSEA

Gene Set Enrichment Analysis (GSEA) is a popular technique for determining statistically significant genes, as well as those that are upregulated and downregulated [5]. This is achieved through a ranking system whereby genes are ordered by statistical significance.
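To illustrate the ranking idea, here is a minimal Python sketch of the running-sum enrichment score described in [5]. It is an illustration only and is not taken from the sPARcRNA_Viz code base; the function name `enrichment_score`, the toy gene names, and the toy scores are all made up for this example.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes: gene names sorted from most to least strongly ranked.
    scores:       ranking metric for each gene, in the same order.
    gene_set:     set of gene names whose enrichment is being tested.
    weight:       exponent applied to |score| when a gene is in the set.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up on a gene in the set, step down on a gene outside it.
        running += abs(s) ** weight / hit_total if h else -1.0 / n_miss
        # The score is the largest deviation of the running sum from zero.
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (hypothetical genes and scores).
genes = ["A", "B", "C", "D", "E"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0]
es_top = enrichment_score(genes, metric, {"A", "B"})      # set sits at the top
es_bottom = enrichment_score(genes, metric, {"D", "E"})   # set sits at the bottom
```

A gene set concentrated at the top of the ranking yields a positive score, and one concentrated at the bottom yields a negative score. Significance is then normally assessed by permutation, which this sketch omits.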

Current SPARC Portal Tools

As of 8/12/24, the Transcriptomic_oSPARC utility [1] appears to be the most prominent SPARC tool for the analysis of gene expression. This tool is very effective at displaying industry-standard static graphical outputs, which can prove quite useful to researchers. However, its level of customization may be a limitation: changing particular parameters may require editing the code itself. There was also a niche to explore in adding interactivity to the graphs, further enhancing the user experience.

The Problem

The gene expression data in SPARC is somewhat limited and is in a raw data format, rendering it less interoperable. Our goal was to make it more interoperable and easier to use. Therefore, our team sought to create an RNA-seq visualization utility that supports user-specified parameters as well as interactivity. There was also room for experimentation in predicting gene ontology with GSEA.

Our Solution: sPARcRNA_Viz

To address this challenge, we present sPARcRNA_Viz, an scRNA-seq visualization tool for potential entry alongside Transcriptomic_oSPARC. By incorporating flexible parameters, interactivity, and an additional DEA metric, sPARcRNA_Viz will complement Transcriptomic_oSPARC as part of a growing SPARC gene expression toolkit. This was our Codeathon team's first attempt at putting such a tool together.

Using sPARcRNA_Viz

sPARcRNA_Viz Requirements

  • GNU Make
  • Python3
  • Docker (if you wish to build and test the service locally)

Required Input Format

sPARcRNA_Viz currently supports the following file formats: .csv/.tsv (barcode and feature files) and .mtx (matrix file) single-cell matrices, along with R data. All three files (barcodes, features, and matrix) are required to run the analysis successfully.
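As a rough pre-flight check, a user could confirm that an input directory holds the three required files before uploading it. The Python sketch below is an assumption-laden illustration and not the validation that sPARcRNA_Viz itself performs: it assumes the barcode and feature files have "barcode" and "feature" in their names, and the helper `find_input_files` is hypothetical.

```python
import tempfile
from pathlib import Path

TABLE_SUFFIXES = {".csv", ".tsv"}


def find_input_files(input_dir):
    """Return the matrix, barcode, and feature files found in input_dir.

    Naming assumption (not taken from the tool): barcode and feature
    files contain "barcode" / "feature" in their file names.
    """
    files = [p for p in Path(input_dir).iterdir() if p.is_file()]
    candidates = {
        "matrix": [p for p in files if p.suffix == ".mtx"],
        "barcodes": [p for p in files
                     if p.suffix in TABLE_SUFFIXES and "barcode" in p.name.lower()],
        "features": [p for p in files
                     if p.suffix in TABLE_SUFFIXES and "feature" in p.name.lower()],
    }
    found = {}
    for role, matches in candidates.items():
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one {role} file, found {len(matches)}"
            )
        found[role] = matches[0]
    return found


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("barcodes.tsv", "features.tsv", "matrix.mtx"):
        Path(tmp, name).touch()
    found_names = {role: path.name for role, path in find_input_files(tmp).items()}
```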

sPARcRNA_Viz Pipeline Workflow

For details, refer to PIPELINE.md.

1. Setup

Load libraries, set options, validate and prepare the directories; find and read raw data files; configure based on inputs.

2. Create Seurat object

Seurat is an R package specially designed for the quality control (QC), analysis, and exploration of single-cell RNA-seq data. Thus, it proved to be a suitable choice for the purposes of sPARcRNA_Viz.
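The QC thresholds exposed by sPARcRNA_Viz (minimum cells per gene, minimum and maximum features per cell) can be pictured with a small Python sketch. This is not the Seurat implementation; the function `qc_filter`, the order of the two filters, the inclusive upper bound, and the toy matrix are all assumptions made for illustration.

```python
def qc_filter(counts, min_cells=3, min_features=200, max_features=2500):
    """Return (kept_gene_indices, kept_cell_indices).

    counts: list of rows, one row per gene, one column per cell.
    A gene is "detected" in a cell when its count is above zero.
    """
    n_genes, n_cells = len(counts), len(counts[0])

    # Keep cells whose number of detected genes lies within the bounds.
    features_per_cell = [
        sum(1 for g in range(n_genes) if counts[g][c] > 0) for c in range(n_cells)
    ]
    keep_cells = [
        c for c in range(n_cells)
        if min_features <= features_per_cell[c] <= max_features
    ]

    # Keep genes detected in at least min_cells of the remaining cells.
    keep_genes = [
        g for g in range(n_genes)
        if sum(1 for c in keep_cells if counts[g][c] > 0) >= min_cells
    ]
    return keep_genes, keep_cells


# Toy 4-gene x 4-cell matrix with thresholds scaled down to match.
toy_counts = [
    [1, 0, 2, 1],
    [0, 0, 3, 1],
    [5, 0, 1, 1],
    [0, 1, 0, 1],
]
kept_genes, kept_cells = qc_filter(toy_counts, min_cells=2, min_features=2, max_features=3)
```

In the toy example, the second cell is dropped for having too few detected genes and the fourth for having too many, after which only the first and third genes are still detected in two cells.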

3. Normalize and preprocess the data

Normalize (so that data reflects true biological differences); find variable features; scale (to standardize the data); perform PCA (Principal Component Analysis to reduce dimensionality); and cluster cells with similar profiles together.
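As an illustration of the normalization step, the sketch below applies log-normalization to one cell's counts: scale to a fixed total, then take log(1 + x). The scale factor of 10,000 matches Seurat's documented default for its "LogNormalize" method, but whether sPARcRNA_Viz uses that default is an assumption here, and `log_normalize` is a hypothetical helper.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Log-normalize the raw counts of a single cell.

    Each count is divided by the cell's total, multiplied by
    scale_factor, and transformed with log(1 + x).
    """
    total = sum(cell_counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0] * len(cell_counts)
    return [math.log1p(count / total * scale_factor) for count in cell_counts]


# Two cells with the same expression profile but different sequencing depth.
shallow = log_normalize([1, 1, 2])
deep = log_normalize([10, 10, 20])
```

Because both cells have the same relative profile, they end up with the same normalized values, which is the sense in which normalization removes depth differences so that the data reflects biological ones.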

4. t-SNE

t-SNE allows us to visualize statistically significant genes based on these clusters. From these, researchers can determine potential gene ontologies arising from their sample(s).

5. Differential Gene Expression Analysis

Differential gene expression analysis takes the normalized gene read counts and allows researchers to determine quantitative changes in gene expression.
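The two FindAllMarkers thresholds exposed by sPARcRNA_Viz (minimum percentage and log fold-change) can be illustrated with a simplified Python sketch. Seurat's own calculation differs in its details, so treat this as a conceptual approximation only: `marker_stats`, the pseudocount of 1, and the toy expression values are assumptions.

```python
import math


def marker_stats(cluster_expr, rest_expr, min_pct=0.25, logfc_threshold=0.25,
                 pseudocount=1.0):
    """Compare one gene's expression in a cluster against all other cells."""
    # Fraction of cells in each group where the gene is detected.
    pct_in = sum(1 for v in cluster_expr if v > 0) / len(cluster_expr)
    pct_out = sum(1 for v in rest_expr if v > 0) / len(rest_expr)

    # Log2 fold change of the group means, with a pseudocount to avoid log(0).
    mean_in = sum(cluster_expr) / len(cluster_expr)
    mean_out = sum(rest_expr) / len(rest_expr)
    log2fc = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))

    # Keep the gene only if it is detected often enough and changes enough.
    passes = max(pct_in, pct_out) >= min_pct and abs(log2fc) >= logfc_threshold
    return {"pct_in": pct_in, "pct_out": pct_out, "log2fc": log2fc, "passes": passes}


# Hypothetical genes: one clearly enriched in the cluster, one barely detected.
strong = marker_stats([4, 0, 6, 2], [0, 0, 1, 1])
sparse = marker_stats([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
```

Raising either threshold shortens the marker list, which is why both are exposed as user-tunable options.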

6. GSEA

GSEA aids in determining gene groups highly represented in the data.

7. Combine t-SNE and GSEA results

All the cluster results after running GSEA are saved, and the top pathways are saved as well.

8. Export and Display Results

All values from the previous steps, along with the top clusters, pathways, etc., are saved in a Seurat object that is later visualized. The user can optionally convert this data into .csv file format. The results are displayed in interactive charts using d3 and ApexCharts. All the results are fully accessible via the file system and do not require any additional software to view.

Configuring sPARcRNA_Viz

sPARcRNA_Viz offers a variety of command options:

| Option | Description | Default |
| --- | --- | --- |
| -i, --input | Input directory path | 'validation/input/data/' |
| -o, --output | Output directory path | 'tmp_output/' |
| -n, --name | The name of the dataset being analyzed | "sPARcRNA" |
| -c, --min_cells | Minimum number of cells expressing a gene | 3 |
| -f, --min_features | Minimum number of features (genes) per cell | 200 |
| --max_features | Maximum number of features (genes) per cell | 2500 |
| --resolution | Resolution parameter for clustering | 0.8 |
| --species | Species for GSEA | "Mus musculus" |
| --min_pct | Minimum percentage for FindAllMarkers | 0.25 |
| --logfc_threshold | Log fold-change threshold for FindAllMarkers | 0.25 |
| --gsea_min_size | Minimum gene set size for GSEA | 15 |
| --gsea_max_size | Maximum gene set size for GSEA | 500 |
| --category | MSigDB category for GSEA | "H" |
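To show how these options and defaults fit together, the following Python sketch mirrors the table with `argparse`. It is not the tool's actual entry point, which this README does not show; the program name `sparcrna-viz-sketch`, the argument types, and the example value "Homo sapiens" are assumptions.

```python
import argparse


def build_parser():
    """Mirror of the option table above; types are inferred from the defaults."""
    parser = argparse.ArgumentParser(prog="sparcrna-viz-sketch")
    parser.add_argument("-i", "--input", default="validation/input/data/")
    parser.add_argument("-o", "--output", default="tmp_output/")
    parser.add_argument("-n", "--name", default="sPARcRNA")
    parser.add_argument("-c", "--min_cells", type=int, default=3)
    parser.add_argument("-f", "--min_features", type=int, default=200)
    parser.add_argument("--max_features", type=int, default=2500)
    parser.add_argument("--resolution", type=float, default=0.8)
    parser.add_argument("--species", default="Mus musculus")
    parser.add_argument("--min_pct", type=float, default=0.25)
    parser.add_argument("--logfc_threshold", type=float, default=0.25)
    parser.add_argument("--gsea_min_size", type=int, default=15)
    parser.add_argument("--gsea_max_size", type=int, default=500)
    parser.add_argument("--category", default="H")
    return parser


# Override two options and leave the rest at their defaults.
args = build_parser().parse_args(["--resolution", "1.2", "--species", "Homo sapiens"])
```

Any option not passed on the command line keeps the default listed in the table, so a run with no arguments analyzes the bundled validation data as a mouse dataset.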

Tutorial

The scRNA-seq data used in the tutorial is from the SPARC Portal.

1. Log in to o²S²PARC

2. Open a new Study

3. Add 3 File Picker Nodes and upload the required data

(Alternatively, drag and drop the needed files into the workspace.)

4. Add sPARcRNA_Viz Node

5. Connect the Nodes

6. Optionally run outputs through JupyterLab R for further analysis

Future Vision

sPARcRNA_Viz would be expanded to include other interactive visualizations and API calls to other gene databases. This would provide more ways to analyze genes and integrate with other websites.

FAIR-Centered Design

Perhaps the most important aspect of sPARcRNA_Viz is its emphasis on the FAIR Data Principles. Summarized below are the key features of sPARcRNA_Viz that support the FAIR initiative.

Importance of FAIR Data Principles

FAIR data is that which is Findable, Accessible, Interoperable, and Reusable [3]. These tenets are highly important within the scientific context, as a lack of clarity surrounding data may introduce confounding variables. The FAIR system was therefore devised as a method for organizing data, serving both present and future researchers.

Particularly in the case of scRNA-seq data, which is expensive from both a wet and dry lab standpoint, it is very useful to adhere to FAIR standards. For instance, one particularly common phenomenon with respect to scRNA-seq is dropout [4], where portions of RNA are not captured by experimental techniques. scRNA-seq data can also vary significantly with regard to format; often, differently labeled matrices may contain raw counts data, or data that has been normalized by a method such as CPM, TPM, or RPKM/FPKM. The FAIR article cited on the SPARC website expands upon this idea further: the licensing of data can also pose a challenge for the analysis of gene regulation and expression. Therefore, the intentional categorization and stewardship of data can present a major benefit to transcriptomics researchers, propelling scientific progress.
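To make the format problem concrete, the Python sketch below shows how the same raw counts produce different numbers under CPM and TPM, using the standard definitions of both. The gene counts and lengths are invented for illustration, and the helpers `cpm` and `tpm` are not part of sPARcRNA_Viz.

```python
def cpm(counts):
    """Counts per million: scale each count by the sample's total."""
    total = sum(counts)
    return [count / total * 1e6 for count in counts]


def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


# Three hypothetical genes in one sample; lengths are in kilobases.
raw_counts = [10, 20, 70]
gene_lengths_kb = [1.0, 2.0, 7.0]

cpm_values = cpm(raw_counts)
tpm_values = tpm(raw_counts, gene_lengths_kb)
```

Here CPM ranks the third gene far above the first, while TPM assigns all three genes the same value because their counts are proportional to their lengths. A matrix that does not state which of these it contains is therefore hard to reuse, which is the stewardship problem FAIR metadata addresses.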

Summary of FAIR Principles Application

| FAIR Principle | Other Tools | sPARcRNA_Viz |
| --- | --- | --- |
| Findable | May not be connected to an existing database such as the SPARC Portal, which could hinder the findability of data | sPARcRNA_Viz is connected to o²S²PARC, so it can use the well-organized datasets provided on the SPARC Portal, and it is archived on Zenodo with the appropriate metadata |
| Accessible | May have a user interface that requires a programming background | sPARcRNA_Viz's friendly user interface and visuals allow researchers to quickly engage with data, and the tool is open, free, and universally implementable |
| Interoperable | May not allow for connections between datasets | Through its use of GSEA, sPARcRNA_Viz allows for the meaningful connection of datasets: scRNA-seq data can be used in association with gene ontology. In addition, visualizations generated for each dataset can be compared with each other |
| Reusable | May only support the formatting of one dataset | sPARcRNA_Viz can be used with multiple datasets due to the ability to specify parameters. Likewise, sPARcRNA_Viz offers a security benefit through its use of input validation |

To ensure that we are compliant with all the FAIR principles, we have also created a crosswalk between the FAIR principles and the sPARcRNA_Viz tool. This crosswalk can be found in the CROSSWALK.md file.

Additional Information

Issue Reporting

Please utilize the Issues tab of this repository should you encounter any problems with sPARcRNA_Viz.

How to Contribute

Please Fork this repository and submit a Pull Request to contribute.

Cite Us

Samdarshi, M., Soundarajan, S., Simhambhatla, M., Patel, R. C., & Bratton, A. (2024). sPARcRNA_Viz. Zenodo. https://doi.org/10.5281/zenodo.13308297

License

sPARcRNA_Viz is distributed under the MIT License.

Team

  • Mihir Samdarshi (Lead, Sysadmin, Developer)
  • Sanjay Soundarajan (Sysadmin, Developer)
  • Mahitha Simhambhatla (Developer, Writer)
  • Raina Patel (Writer)
  • Ayla Bratton (Writer)

Materials Cited

[1] Ben Aribi, H., Ding, M., & Kiran, A. (2023). Gene expression data visualization tool on the o2S2PARC platform. F1000Research, 11, 1267.
[2] EMBL-EBI. (n.d.). Differential gene expression analysis | Functional genomics II. https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/data-analysis/differential-gene-expression-analysis/
[3] GO FAIR. (2017). FAIR Principles - GO FAIR. GO FAIR. https://www.go-fair.org/fair-principles/
[4] Kim, T. H., Zhou, X., & Chen, M. (2020). Demystifying “drop-outs” in single-cell UMI data. Genome Biology, 21(1). https://doi.org/10.1186/s13059-020-02096-y
[5] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., & Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102
[6] Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., & Gonzalez-Beltran, A. (2016). The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data, 3(1). https://www.nature.com/articles/sdata201618


Logo and figures were created using Microsoft Word; images were formatted using Canva.

Acknowledgements

We would like to thank the SPARC Codeathon 2024 team for all their guidance and support.
