Update

jfnavarro committed Jan 15, 2025
1 parent ba34653 commit 30a1cd6
Showing 9 changed files with 70 additions and 27 deletions.
9 changes: 5 additions & 4 deletions CHANGELOG.md
@@ -4,14 +4,15 @@
* Refactor code to modern Python (black, mypy, isort)
* Added GitHub Actions
* Added pre-commit hooks
* Change the build configuration to Poetry
* Added Docker container
* Changed the build configuration to Poetry
* Added a Docker container
* Changed tests to pytest
* Updated versions of dependencies
* Perform code optimizations
* Addded tests for full coveragge
* Performed code optimizations and cleanups
* Added more tests for almost full coverage
* Bump taggd to 0.4.0
* Changed documentation
* Added --demultiplexing-chunk-size option

## Version 1.8.2
* Added annotation (htseq) feature type as parameter
4 changes: 3 additions & 1 deletion CONTRIBUTING.md
@@ -42,7 +42,7 @@ If you are proposing a feature:
* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.

## Get Started!
## Get Started

Ready to contribute? Here's how to set up `ST Pipeline` for local development.

@@ -167,3 +167,5 @@ You can also create the documentation manually by running:
``` console
poetry run mkdocs build
```

## Publish package (TODO)
13 changes: 10 additions & 3 deletions Dockerfile
@@ -6,7 +6,7 @@ ENV POETRY_VERSION=1.7.2 \
POETRY_NO_INTERACTION=1 \
PATH="/root/.local/bin:$PATH"

# Install system dependencies and Poetry
# Install system dependencies, Poetry, and STAR
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
@@ -16,9 +16,16 @@ RUN apt-get update \
libssl-dev \
git \
gcc \
&& curl -sSL https://install.python-poetry.org | python3 - \
wget \
unzip \
&& wget https://github.com/alexdobin/STAR/archive/refs/tags/2.7.10b.zip \
&& unzip 2.7.10b.zip \
&& cd STAR-2.7.10b/source \
&& make STAR \
&& mv STAR /usr/local/bin/ \
&& cd / \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
&& rm -rf /var/lib/apt/lists/* 2.7.10b.zip STAR-2.7.10b

# Set working directory
WORKDIR /app
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -30,7 +30,7 @@ Basically what the ST pipeline does (default mode) is:
- Check for AT and GC content
- Discard reads with a minimum number of bases that failed any of the checks above
- Contaminant filter step (e.g. rRNA genome) (Optional)
- Mapping with STAR step (only read 2)
- Mapping with STAR step (only read 2) (Optional)
- Demultiplexing with [Taggd](https://github.com/jfnavarro/taggd) step (only read 1) (Optional)
- Keep reads (read 2) that contain a valid barcode and are correctly mapped
- Annotate the reads with htseq-count (custom version)
8 changes: 4 additions & 4 deletions docs/installation.md
@@ -53,7 +53,7 @@ Install the package:
poetry install
```

Now you can run ST Pipeline:
Now you can run the ST Pipeline:

``` console
poetry run st_pipeline_run --help
@@ -70,7 +70,7 @@ Install the package:
pip install .
```

You can also use the official PYPY repositories:
You can also use the official PyPI repository:

``` console
pip install stpipeline
@@ -95,7 +95,7 @@ docker buildx build --platform linux/amd64 -t stpipeline .

Then, you can run ST Pipeline using Docker:

To run `ST Pipeline` command:
To run `ST Pipeline` commands:

``` console
docker run --rm stpipeline st_pipeline_run --help
@@ -109,7 +109,7 @@ or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) installed in your
First, create the environment:

``` console
conda env create -f environment.yml
conda create -n stpipeline python=3.10
```

Then, activate the environment:
36 changes: 23 additions & 13 deletions docs/usage.md
@@ -171,6 +171,9 @@ st_pipeline_run [options] fastq_file_fw fastq_file_rv
strict}
--htseq-no-ambiguous When using htseq-count discard reads annotating
ambiguous genes (default False)
--htseq-features HTSEQ_FEATURES [HTSEQ_FEATURES ...]
Which feature types to use from the GTF/GFF annotation file.
More than one type can be given (default exon)
--strandness [STRING]
What strandness mode to use when annotating with
htseq-count [no, yes(default), reverse]
@@ -254,20 +257,25 @@ To process Visium datasets it is recommended to use these options:
--umi-end-position 28
```

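For context, these positions refer to the standard Visium read 1 layout: a 16 bp spot barcode followed by a 12 bp UMI. A minimal sketch of what that slicing means, assuming that layout; the helper name is illustrative and not part of the ST Pipeline API:

```python
# Minimal sketch: slicing the spot barcode and UMI out of a Visium read 1.
# Assumes the standard Visium layout (16 bp barcode, then 12 bp UMI);
# the helper name is illustrative, not part of the ST Pipeline API.
def split_visium_read1(seq: str, umi_start: int = 16, umi_end: int = 28) -> tuple:
    barcode = seq[:umi_start]     # positions 0-15: spot barcode
    umi = seq[umi_start:umi_end]  # positions 16-27: UMI
    return barcode, umi

read1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC" + "AAAA"
print(split_visium_read1(read1))  # ('ACGTACGTACGTACGT', 'TTTTGGGGCCCC')
```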
## Emsembl ids
## Extra tools

The ST Pipeline ships with many scripts that are installed automatically and that
can be very useful to pre/post-process the data and to generate QC stats and plots.

### Ensembl ids

Suppose you used an Ensembl annotation file and would like to change
the output file so it contains gene ids/names instead of Ensembl ids.
You can use this tool that comes with the ST Pipeline
You can use the script `convertEnsemblToNames` that comes with the ST Pipeline

```bash
convertEnsemblToNames --annotation path_to_annotation_file --output st_data_updated.tsv st_data.tsv
```
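For intuition, a rough sketch of the kind of id-to-name mapping this performs, assuming a GTF with `gene_id` and `gene_name` attributes and a TSV matrix with genes as columns; this is not the actual `convertEnsemblToNames` implementation, and its attribute parsing is deliberately simplistic:

```python
# Rough sketch of an Ensembl-id -> gene-name conversion on a TSV matrix
# (genes as columns). Not the actual convertEnsemblToNames implementation.
import re
import pandas as pd

def build_id_to_name(gtf_path: str) -> dict:
    id_to_name = {}
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            gene_id = re.search(r'gene_id "([^"]+)"', line)
            gene_name = re.search(r'gene_name "([^"]+)"', line)
            if gene_id and gene_name:
                id_to_name[gene_id.group(1)] = gene_name.group(1)
    return id_to_name

mapping = build_id_to_name("annotation.gtf")
matrix = pd.read_csv("st_data.tsv", sep="\t", index_col=0)
# Fall back to the original id when the annotation has no name for it
matrix.columns = [mapping.get(g, g) for g in matrix.columns]
matrix.to_csv("st_data_updated.tsv", sep="\t")
```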

## Merge demultiplexed FASTQ files
### Merge demultiplexed FASTQ files

If you used different indexes to sequence and need to merge the files
you can use the script `merge_fastq.py` that comes with the ST Pipeline
If you used different indexes to sequence and need to merge the FASTQ files,
you can use the script `merge_fastq` that comes with the ST Pipeline

```bash
merge_fastq --run-path path_to_run_folder --out-path path_to_output --identifiers S1 S2 S3 S4
@@ -279,25 +287,27 @@ Where `--identifiers` will be strings that identify each demultiplexed sample.

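As a loose illustration of what such a merge amounts to, a sketch assuming one gzipped FASTQ per sample part whose filename contains the identifier (the layout and glob pattern are assumptions, not `merge_fastq` behaviour). Concatenating gzip members byte-wise keeps the result a valid `.gz` file:

```python
# Loose sketch: concatenate gzipped FASTQ parts per sample identifier.
# File layout and names are assumptions for illustration only.
import glob
import shutil

run_path = "path_to_run_folder"
out_path = "path_to_output"
for sample in ["S1", "S2", "S3", "S4"]:
    parts = sorted(glob.glob(f"{run_path}/*{sample}*_R1_*.fastq.gz"))
    with open(f"{out_path}/{sample}_R1.fastq.gz", "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
```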
If you want to remove genes of certain gene types from the dataset
(matrix in TSV), for instance to keep only protein_coding genes, you can do
so with the script `filter_gene_type_matrix.py` that comes with the ST Pipeline
so with the script `filter_gene_type_matrix` that comes with the ST Pipeline

```bash
filter_gene_type_matrix --gene-types-keep protein-coding --annotation path_to_annotation_file stdata.tsv
```

You may include the parameter `--ensembl-ids` if your genes are represented as emsembl ids instead.
You may include the parameter `--ensembl-ids` if your genes are represented as Ensembl ids instead.

The value of `--gene-types-keep` must match the annotation file provided.
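A simplified sketch of this kind of filtering, assuming a GTF whose lines carry `gene_name` and `gene_type` attributes (Ensembl GTFs use `gene_biotype` instead) and a matrix with genes as columns; not the actual `filter_gene_type_matrix` code:

```python
# Simplified sketch: keep only matrix genes whose annotated type matches.
# Attribute names depend on the annotation source; not the actual
# filter_gene_type_matrix implementation.
import re
import pandas as pd

keep_types = {"protein_coding"}
gene_types = {}
with open("annotation.gtf") as handle:
    for line in handle:
        name = re.search(r'gene_name "([^"]+)"', line)
        gtype = re.search(r'gene_type "([^"]+)"', line)
        if name and gtype:
            gene_types[name.group(1)] = gtype.group(1)

matrix = pd.read_csv("stdata.tsv", sep="\t", index_col=0)
keep = [g for g in matrix.columns if gene_types.get(g) in keep_types]
matrix[keep].to_csv("stdata_filtered.tsv", sep="\t")
```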

## Remove spots from dataset
### Remove spots from dataset

If you want to remove spots from a dataset (matrix in TSV), for instance
to keep only spots inside the tissue. You can do so with the script `adjust_matrix_coordinates.py`
to keep only spots inside the tissue, you can do so with the script `adjust_matrix_coordinates`
that comes with the ST Pipeline

```bash
adjust_matrix_coordinates --outfile new_stdata.tsv --coordinates-file coordinates.txt stdata.tsv
```

Where `coordinates.txt` will be a tab delimited file with 6 columns:
Where `coordinates.txt` must be a tab-delimited file with 6 columns:

```console
orig_x orig_y new_x new_y new_pixel_x new_pixel_y
@@ -306,7 +316,7 @@ orig_x orig_y new_x new_y new_pixel_x new_pixel_y
Only spots whose coordinates are present in the file will be kept, and then optionally you
can update the coordinates in the matrix, choosing either the new array or the new pixel coordinates.
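A bare-bones sketch of that filter-and-relabel logic, assuming spots are matrix rows indexed as `XxY` strings (this naming is an assumption for illustration, not necessarily what `adjust_matrix_coordinates` uses):

```python
# Bare-bones sketch of the spot filtering/update logic. Assumes spots are
# matrix rows indexed as "XxY" strings; an assumption for illustration.
import pandas as pd

cols = ["orig_x", "orig_y", "new_x", "new_y", "new_pixel_x", "new_pixel_y"]
coords = pd.read_csv("coordinates.txt", sep="\t", names=cols)

matrix = pd.read_csv("stdata.tsv", sep="\t", index_col=0)
new_index = {
    f"{int(r.orig_x)}x{int(r.orig_y)}": f"{int(r.new_x)}x{int(r.new_y)}"
    for r in coords.itertuples()
}
# Keep only spots listed in the coordinates file, then relabel them
matrix = matrix.loc[matrix.index.intersection(list(new_index))]
matrix.index = [new_index[s] for s in matrix.index]
matrix.to_csv("new_stdata.tsv", sep="\t")
```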

## Quality stats
### Quality stats

The ST Pipeline generates useful stats/QC information in the log file, but if you
want to obtain more detailed information about the quality of the data, you can run the following script:
@@ -321,5 +331,5 @@ If you want to perform quality stats on multiple datasets you can run:
multi_qa stdata1.tsv stdata2.tsv stdata3.tsv stdata4.tsv
```

Multi_qa.py generates violing plots, correlation plots/tables and more useful information and
it allows to log the counts for the correlation.
`multi_qa` generates violin plots, correlation plots/tables and other useful information, and
it allows log-transforming the counts for the correlation analysis.
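As a loose illustration of the log-counts correlation idea (not `multi_qa`'s actual code; gene matching is simplified to the shared column names):

```python
# Loose sketch: Pearson correlation of two ST datasets on their shared
# genes, using log-transformed counts. Not multi_qa's actual code.
import numpy as np
import pandas as pd

a = pd.read_csv("stdata1.tsv", sep="\t", index_col=0)
b = pd.read_csv("stdata2.tsv", sep="\t", index_col=0)

shared = a.columns.intersection(b.columns)
# Sum counts per gene over spots, then log-transform (log1p handles zeros)
gene_a = np.log1p(a[shared].sum(axis=0))
gene_b = np.log1p(b[shared].sum(axis=0))
print(f"Pearson r over {len(shared)} shared genes: {gene_a.corr(gene_b):.3f}")
```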
5 changes: 5 additions & 0 deletions stpipeline/core/mapping.py
@@ -191,6 +191,7 @@ def barcodeDemultiplexing(
cores: int,
outputFilePrefix: str,
keep_discarded_files: bool = False,
taggd_chunk_size: int = 10000,
) -> int:
"""
Perform demultiplexing using Taggd.
@@ -207,6 +208,7 @@ cores: Number of subprocesses for Taggd.
cores: Number of subprocesses for Taggd.
outputFilePrefix: Prefix for output files.
keep_discarded_files: If True, generate files for unmatched reads.
taggd_chunk_size: Number of reads to process in each thread.
Returns:
The total number of reads demultiplexed.
@@ -230,6 +232,7 @@
# --estimate-min-edit-distance is set estimate the min edit distance among true barcodes
# --no-offset-speedup turns off speed up, it might yield more hits (exactly as findIndexes)
# --homopolymer-filter if set excludes reads where the barcode
#   contains a homopolymer of the given length (0 no filter), default 8
# --chunk-size number of reads to use in each thread for parallel processing

args = [
Expand All @@ -246,6 +249,8 @@ def barcodeDemultiplexing(
str(cores),
"--metric",
taggd_metric,
"--chunk-size",
str(taggd_chunk_size),
"--overhang",
str(over_hang if taggd_metric != "Hamming" else 0),
]
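For intuition, a minimal generic sketch of the chunked parallelism behind `--chunk-size` (an illustration of the idea, not taggd's implementation): reads are dispatched to workers in fixed-size batches, so larger chunks mean less dispatch overhead while smaller chunks balance load better across threads.

```python
# Minimal sketch of chunked parallel processing, the idea behind
# --chunk-size. A generic illustration, not taggd's implementation.
from itertools import islice
from multiprocessing import Pool

def chunks(iterable, size):
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def demultiplex_batch(batch):
    # Stand-in for matching each read's barcode against the index
    return len(batch)

if __name__ == "__main__":
    reads = range(100_000)  # stand-in for FASTQ records
    with Pool(4) as pool:
        done = sum(pool.imap_unordered(demultiplex_batch, chunks(reads, 10_000)))
    print(done)  # 100000
```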
17 changes: 16 additions & 1 deletion stpipeline/core/pipeline.py
@@ -118,6 +118,7 @@ def __init__(self) -> None:
self.taggd_metric = "Subglobal"
self.taggd_multiple_hits_keep_one = False
self.taggd_trim_sequences = None
self.taggd_chunk_size = 10000
self.adaptor_missmatches = 0
self.star_genome_loading = "NoSharedMemory"
self.star_sort_mem_limit = 0
@@ -284,6 +285,11 @@ def sanityCheck(self) -> None:
logger.error(error)
raise RuntimeError(error)

if self.taggd_chunk_size < 100:
error = "Error starting the pipeline.\n" "The chunk size for the demultiplexing step is too small"
logger.error(error)
raise RuntimeError(error)

# Test the presence of the required tools
required_scripts = ["STAR"]
unavailable_scripts = set()
@@ -621,6 +627,13 @@ def __call__(self, parser, namespace, values, option_string=None) -> None:  # ty
"or when the barcode needs to be trimmed out.\n"
"Trimmng sequences can be given several times.",
)
parser.add_argument(
"--demultiplexing-chunk-size",
default=10000,
metavar="[INT]",
type=int,
help="Chunk size for parallel processing (number of reads assigned to each thread) (default: %(default)s)",
)
parser.add_argument(
"--htseq-mode",
default="intersection-nonempty",
@@ -851,6 +864,7 @@ def load_parameters(self, options: argparse.Namespace) -> None:
self.taggd_metric = options.demultiplexing_metric
self.taggd_multiple_hits_keep_one = options.demultiplexing_multiple_hits_keep_one
self.taggd_trim_sequences = options.demultiplexing_trim_sequences
self.taggd_chunk_size = options.demultiplexing_chunk_size
self.adaptor_missmatches = options.homopolymer_mismatches
self.star_genome_loading = options.star_genome_loading
self.star_sort_mem_limit = options.star_sort_mem_limit
@@ -908,7 +922,7 @@ def createLogger(self) -> None:
logger.info("TaggD multiple hits keep one (random) is enabled")
if self.taggd_trim_sequences is not None:
logger.info(f"TaggD trimming from the barcodes: {'-'.join(str(x) for x in self.taggd_trim_sequences)}")

logger.info(f"TaggD chunk size: {self.taggd_chunk_size }")
if not self.disable_mapping:
logger.info(f"Mapping reverse trimming: {self.trimming_rv}")
logger.info(f"Mapping inverse reverse trimming: {self.inverse_trimming_rv}")
@@ -1147,6 +1161,7 @@ def run(self) -> None:
self.threads,
FILENAMES["demultiplexed_prefix"], # Prefix for output files
self.keep_discarded_files,
self.taggd_chunk_size,
)
# Update qa_stats
self.qa_stats.reads_after_demultiplexing = stats # type: ignore
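The new value flows from argparse through `load_parameters` into `barcodeDemultiplexing`. As a quick standalone sanity sketch of the argparse piece (a mirror of the definition above, not the pipeline's full parser), note how the dashes in `--demultiplexing-chunk-size` become underscores on the parsed namespace:

```python
# Standalone mirror of the new option, showing dash-to-underscore mapping.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--demultiplexing-chunk-size",
    default=10000,
    metavar="[INT]",
    type=int,
    help="Chunk size for parallel processing (default: %(default)s)",
)
opts = parser.parse_args(["--demultiplexing-chunk-size", "20000"])
assert opts.demultiplexing_chunk_size == 20000
```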
3 changes: 3 additions & 0 deletions tests/mapping_test.py
@@ -144,6 +144,7 @@ def test_barcodeDemultiplexing():
cores=4,
outputFilePrefix="output/test",
keep_discarded_files=False,
taggd_chunk_size=100,
)

expected_args = [
@@ -160,6 +161,8 @@
"4",
"--metric",
"Levenshtein",
"--chunk-size",
"100",
"--overhang",
"2",
"--trim-sequences",
