Commit
Almost all tests working
jfnavarro committed Jan 12, 2025
1 parent 0d2a869 commit 141f344
Showing 25 changed files with 215 additions and 257 deletions.
3 changes: 0 additions & 3 deletions AUTHORS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,2 @@
## Author:
- Jose Fernandez Navarro <jc.fernandez.navarro@gmail.com>

## Contributors:
- Erik Borgström <erik.borgstrom@scilifelab.se>
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,9 @@
* Added Docker container
* Added tox
* Updated versions of dependencies
* Perform code optimizations
* Add tests for full coverage
* Bump taggd to 0.4.0

## Version 1.8.2
* Added annotation (htseq) feature type as parameter
54 changes: 29 additions & 25 deletions README.md
@@ -1,14 +1,14 @@
# Spatial Transcriptomics Pipeline

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-310/)
[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-311/)
[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-312/)
[![PyPI version](https://badge.fury.io/py/stpipeline.svg)](https://badge.fury.io/py/stpipeline)
[![Build Status](https://github.com/jfnavarro/st_pipeline/actions/workflows/dev.yml/badge.svg)](https://github.com/jfnavarro/st_pipeline/actions/workflows/dev.yml)

The ST Pipeline contains the tools and scripts needed to process and analyze the raw
files generated with Spatial Transcriptomics and Visium raw data in FASTQ format to generate datasets for down-stream analysis.
files generated with Spatial Transcriptomics and Visium in FASTQ format to generate datasets for downstream analysis.
The ST pipeline can also be used to process single cell RNA-seq data as long as a
file with barcodes identifying each cell is provided (same template as the files in the folder "ids").

@@ -22,15 +22,15 @@ The following files/parameters are commonly required:
- FASTQ files (Read 1 containing the spatial information and the UMI and read 2 containing the genomic sequence)
- A genome index generated with STAR
- An annotation file in GTF or GFF3 format (optional when using a transcriptome)
- The file containing the barcodes and array coordinates (look at the folder "ids" to use it as a reference).
- A file containing the barcodes and array coordinates (look at the folder "ids" to use it as a reference).
Basically this file contains 3 columns (BARCODE, X and Y), so if you provide this
file with barcodes identifying cells (for example), the ST pipeline can be used for single cell data.
This file is also optional if the data is not barcoded (for example RNA-Seq data).
- A name for the dataset
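
For reference, the ids file is plain tab-separated text with one barcode per row. A minimal, invented example (these barcode sequences and coordinates are made up for illustration, not taken from the real "ids" folder):

```text
AAACCTGAGAAACCAT	1	1
AAACCTGAGAAACCGC	2	1
AAACCTGAGAAACGAG	3	1
```

Each row maps one barcode to an X/Y array (or cell) coordinate.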

The ST pipeline has multiple parameters mostly related to trimming, mapping and annotation
but generally the default values are good enough. You can see a full
description of the parameters typing "st_pipeline_run.py --help" after you have installed the ST pipeline.
description of the parameters typing `st_pipeline_run --help` after you have installed the ST pipeline.

The input FASTQ files can be given in gzip/bzip format as well.

@@ -57,9 +57,9 @@ The ST pipeline will also output a log file with useful stats and information.

## Installation

We recommend you install a virtual environment like Pyenv or Anaconda before you install the pipeline.
We recommend you install a virtual environment like `pyenv` or `Anaconda` before you install the pipeline.

The ST Pipeline works with python 3.9, 3.10 and 3.11.
The ST Pipeline works with Python 3.10, 3.11 and 3.12.

You can install the ST Pipeline using PyPI:

@@ -90,23 +90,34 @@ cd stpipeline
To install the pipeline type:

```bash
python setup.py build
python setup.py install
pip install .
```

You can also use `poetry` to install the pipeline:

```bash
poetry build
poetry install
```

To see the different options type:

```bash
st_pipeline_run.py --help
st_pipeline_run --help
```

## Testing

To run the tests, type (make sure you are inside the source folder):

```bash
python setup.py test
python -m unittest testrun.py
pytest
```

or

```bash
poetry run pytest
```

## Requirements
@@ -120,13 +131,6 @@ If you use anaconda you can install STAR with:
conda install -c bioconda star
```

The ST Pipeline requires `samtools` installed in the system
If you use anaconda you can install samtools with:

```bash
conda install -c bioconda samtools openssl=1.0
```

The ST Pipeline needs a computer with at least 32GB of RAM (depending on the size of the genome) and 8 CPU cores.

## Dependencies
@@ -140,7 +144,7 @@ You can see them in the file `requirements.txt`
An example run would be:

```bash
st_pipeline_run.py --expName test --ids ids_file.txt \
st_pipeline_run --expName test --ids ids_file.txt \
--ref-map path_to_index --log-file log_file.txt --output-folder /home/me/results \
--ref-annotation annotation_file.gtf file1.fastq file2.fastq
```
@@ -164,7 +168,7 @@ the output file so it contains gene ids/names instead of Ensembl ids.
You can use this tool that comes with the ST Pipeline

```bash
convertEnsemblToNames.py --annotation path_to_annotation_file --output st_data_updated.tsv st_data.tsv
convertEnsemblToNames --annotation path_to_annotation_file --output st_data_updated.tsv st_data.tsv
```

## Merge demultiplexed FASTQ files
@@ -173,7 +177,7 @@ If you used different indexes to sequence and need to merge the files
you can use the script `merge_fastq` that comes with the ST Pipeline

```bash
merge_fastq.py --run-path path_to_run_folder --out-path path_to_output --identifiers S1 S2 S3 S4
merge_fastq --run-path path_to_run_folder --out-path path_to_output --identifiers S1 S2 S3 S4
```

Where `--identifiers` will be strings that identify each demultiplexed sample.
@@ -185,7 +189,7 @@ to certain gene types (for instance to keep only protein_coding). You can do
so with the script `filter_gene_type_matrix` that comes with the ST Pipeline

```bash
filter_gene_type_matrix.py --gene-types-keep protein-coding --annotation path_to_annotation_file stdata.tsv
filter_gene_type_matrix --gene-types-keep protein-coding --annotation path_to_annotation_file stdata.tsv
```

You may include the parameter `--ensembl-ids` if your genes are represented as Ensembl ids instead.
@@ -197,7 +201,7 @@ to keep only spots inside the tissue. You can do so with the script `adjust_matrix_coordinates`
that comes with the ST Pipeline

```bash
adjust_matrix_coordinates.py --outfile new_stdata.tsv --coordinates-file coordinates.txt stdata.tsv
adjust_matrix_coordinates --outfile new_stdata.tsv --coordinates-file coordinates.txt stdata.tsv
```

Where `coordinates.txt` will be a tab delimited file with 6 columns:
@@ -215,13 +219,13 @@ The ST Pipeline generates useful stats/QC information in the LOG file but if you
want to obtain more detailed information about the quality of the data, you can run the following script:

```bash
st_qa.py stdata.tsv
st_qa stdata.tsv
```

If you want to perform quality stats on multiple datasets you can run:

```bash
multi_qa.py stdata1.tsv stadata2.tsv stdata3.tsv stdata4.tsv
multi_qa stdata1.tsv stadata2.tsv stdata3.tsv stdata4.tsv
```

multi_qa generates violin plots, correlation plots/tables and more useful information and
42 changes: 0 additions & 42 deletions README_SHORT.md

This file was deleted.

2 changes: 2 additions & 0 deletions data/README.md
@@ -1,3 +1,5 @@
# Test dataset

These datasets were generated
from the publicly available raw FASTQ files
of the Mouse Olfactory Bulb replicates number 4 and 9
5 changes: 4 additions & 1 deletion docsrc/changes.rst
@@ -6,10 +6,13 @@ Changes
* Refactor code to modern Python (black, mypy, isort)
* Added Github Actions
* Added pre-commit hooks
* Added Docker container
* Change the build configuration to Poetry
* Added Docker container
* Added tox
* Updated versions of dependencies
* Perform code optimizations
* Add tests for full coverage
* Bump taggd to 0.4.0

**Version 1.8.2**
* Added annotation (htseq) feature type as parameter
2 changes: 1 addition & 1 deletion docsrc/example.rst
@@ -25,7 +25,7 @@ The following is an example of a BASH file to run the ST pipeline.
EXP=YOUR_EXP_NAME
# Running the pipeline
st_pipeline_run.py \
st_pipeline_run \
--output-folder $OUTPUT \
--ids $ID \
--ref-map $MAP \
8 changes: 5 additions & 3 deletions docsrc/installation.rst
@@ -10,7 +10,7 @@ We recommend downloading and installing Anaconda (https://www.anaconda.com/products
We then create a virtual environment in which we will run the pipeline.
Type the following command:

``conda create -n stpipeline python=3.9``
``conda create -n stpipeline python=3.10``

The name for the virtual environment that we have just created is specified by
the -n flag. Here it is called stpipeline, but this can be anything that you want
@@ -46,9 +46,11 @@ Activate the virtual environment (if not already active)

Install the pipeline

``python setup.py build``
``pip install .``

``python setup.py install``
or using poetry:

``poetry install``

Alternatively, you can simply install the pipeline using PyPI:

2 changes: 1 addition & 1 deletion docsrc/license.rst
@@ -2,7 +2,7 @@ License
-------

The MIT License (MIT)
Copyright (c) 2020 Jose Fernandez Navarro.
Copyright (c) 2024 Jose Fernandez Navarro.
All rights reserved.

* Jose Fernandez Navarro <jc.fernandez.navarro@gmail.com>
2 changes: 1 addition & 1 deletion docsrc/manual.rst
@@ -13,7 +13,7 @@ the path to a STAR genome index, the path to an annotation file in GTF
format and a dataset name.

The ST Pipeline has many parameters, you can see a description of them
by typing : st_pipeline_run.py --help
by typing : st_pipeline_run --help

Note that the minimum read length is dependent on the type of kit used, and
should be adjusted accordingly, i.e. a 150bp kit should have a different
20 changes: 9 additions & 11 deletions pyproject.toml
@@ -4,7 +4,7 @@ version = "2.0.0"
description = "ST Pipeline: An automated pipeline for spatial mapping of unique transcripts"
authors = ["Jose Fernandez Navarro <jc.fernandez.navarro@gmail.com>"]
license = "MIT"
readme = "README_SHORT.md"
readme = "README.md"
keywords = ["visium", "analysis", "pipeline", "spatial", "transcriptomics", "toolkit"]
repository = "https://github.com/jfnavarro/st_pipeline"
classifiers = [
@@ -22,10 +22,8 @@ classifiers = [
]
include = [
{ path = "README.md" },
{ path = "README_SHORT.md" },
{ path = "LICENSE" },
{ path = "doc/**" },
{ path = "scripts/**" }
{ path = "doc/**" }
]

[tool.poetry.dependencies]
@@ -47,13 +45,13 @@ dnaio = "^1.2.3"
distance = "^0.1.3"

[tool.poetry.scripts]
st_qa = "scripts.st_qa:main"
st_pipeline_run = "scripts.st_pipeline_run:main"
multi_qa = "scripts.multi_qa:main"
merge_fastq = "scripts.merge_fastq:main"
filter_gene_type_matrix = "scripts.filter_gene_type_matrix:main"
convertEnsemblToNames = "scripts.convertEnsemblToNames:main"
adjust_matrix_coordinates = "scripts.adjust_matrix_coordinates:main"
st_qa = "stpipeline.scripts.st_qa:main"
st_pipeline_run = "stpipeline.scripts.st_pipeline_run:main"
multi_qa = "stpipeline.scripts.multi_qa:main"
merge_fastq = "stpipeline.scripts.merge_fastq:main"
filter_gene_type_matrix = "stpipeline.scripts.filter_gene_type_matrix:main"
convertEnsemblToNames = "stpipeline.scripts.convertEnsemblToNames:main"
adjust_matrix_coordinates = "stpipeline.scripts.adjust_matrix_coordinates:main"
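
The `stpipeline.scripts.<name>:main` targets above assume each script module exposes a `main()` callable that Poetry wires up as a console command. A minimal sketch of such an entry-point module — the `--expName` argument is illustrative only, not the pipeline's real CLI:

```python
# Sketch of a console-script entry point matching the
# "stpipeline.scripts.<name>:main" pattern in pyproject.toml.
# The --expName flag here is a stand-in, not the real interface.
import argparse
import sys
from typing import List, Optional


def main(argv: Optional[List[str]] = None) -> int:
    # argv defaults to None so argparse falls back to sys.argv[1:]
    # when the function is invoked as a console script.
    parser = argparse.ArgumentParser(description="Example ST Pipeline entry point")
    parser.add_argument("--expName", required=True, help="Name of the dataset")
    args = parser.parse_args(argv)
    print(f"Running experiment: {args.expName}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Passing an explicit `argv` list makes the entry point easy to test without touching `sys.argv`.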

[tool.poetry.extras]
test = [
2 changes: 1 addition & 1 deletion stpipeline/common/filter.py
@@ -110,7 +110,7 @@ def filter_input_data(
dropped_adaptor = 0
too_short_after_trimming = 0

bam_file = pysam.AlignmentFile(out_file, "wb")
bam_file = pysam.AlignmentFile(out_file, "wb", header=bam_header)
if keep_discarded_files:
out_writer_discarded = dnaio.open(out_file_discarded, mode="w") # type: ignore

34 changes: 8 additions & 26 deletions stpipeline/common/utils.py
@@ -5,39 +5,21 @@
from datetime import datetime
import os
import subprocess
from typing import Optional, Generator, IO, Any
from typing import Optional, IO, Any
import shutil


def which_program(program: str) -> Optional[str]:
def which_program(program: str) -> bool:
"""
Checks if a program exists and is executable.
Check if a program is installed and available in the system's PATH.
Args:
program: The program name.
program: The name of the program to check.
Returns:
The full path to the program if found, otherwise None.
"""

def is_exe(fpath: str) -> bool:
return fpath is not None and os.path.exists(fpath) and os.access(fpath, os.X_OK)

def ext_candidates(fpath: str) -> Generator[str, None, None]:
yield fpath
for ext in os.environ.get("PATHEXT", "").split(os.pathsep):
yield fpath + ext

fpath, _ = os.path.split(program)
if fpath:
if is_exe(program):
return program
else:
for path in os.environ["PATH"].split(os.pathsep):
exe_file = os.path.join(path, program)
for candidate in ext_candidates(exe_file):
if is_exe(candidate):
return candidate
return None
True if the program is found and executable, False otherwise.
"""
return shutil.which(program) is not None
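
The refactored `which_program` above replaces the hand-rolled PATH walk with `shutil.which`. A standalone sketch of the same approach, runnable in isolation:

```python
import shutil


def which_program(program: str) -> bool:
    """Return True if `program` is found on PATH and is executable."""
    # shutil.which returns the full path to the executable, or None
    # if the program is not found on PATH.
    return shutil.which(program) is not None


# A clearly nonexistent program is reported as missing.
print(which_program("no-such-program-xyz-123"))  # → False
```

Note the return type changed from `Optional[str]` (full path) to `bool`; callers that used the returned path need updating accordingly.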


class TimeStamper: