wombat-p pipelines is a bioinformatics analysis pipeline that bundles different workflows for the analysis of label-free proteomics data, with the purpose of comparison and benchmarking. It allows using files following the proteomics metadata standard SDRF.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. We used one of the nf-core templates.
This work contains four different workflows for the analysis of label-free proteomics data originating from LC-MS experiments:
- MaxQuant + NormalyzerDE
- SearchGui + Proline + PolySTest
- Compomics tools + FlashLFQ + MSqRob
- Tools from the Trans-Proteomic Pipeline + ROTS
Initialization and parameterization of the workflows are based on tools from the SDRF pipelines and the ThermoRawFileParser (with our own contributions), as well as additional programs from the wombat-p organization [https://github.com/wombat-p/Utilities] and our fork. This includes setting a generalized set of data analysis parameters and the calculation of multiple benchmarks.
- Install Nextflow (`>=21.10.3`)
- Install Docker or Singularity (you can follow this tutorial). You can use Conda both to install Nextflow itself and also to manage software within pipelines. Please only use it within pipelines as a last resort; see docs.
- Download the pipeline and test it on a minimal dataset with a single command:

  ```bash
  wget https://github.com/wombat-p/WOMBAT-Pipelines
  nextflow run main.nf -profile test,YOURPROFILE
  ```

  Substitute `wget` with `curl` or alike.
- Set up your system for running the analysis

  Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string.

  - The pipeline comes with config profiles called `docker`, `singularity` and `conda` which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
  - If you are using `docker`, your host system might need to set a parameter that stops the mono-based programs from failing when running large data sets on multiple threads. For that, please set `sudo sysctl -w vm.max_map_count=262144`.
  - If you are using `singularity`, setting the `NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir` Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
  - If you are using `conda`, it is highly recommended to use the `NXF_CONDA_CACHEDIR` or `conda.cacheDir` settings to store the environments in a central location for future pipeline runs. A sketch of both cache settings follows this list.
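Purely as an illustration, the shell snippet below shows one way to point both caches at a shared location before launching a run; the directory paths are hypothetical and not part of the pipeline.

```bash
# Hypothetical cache locations -- adjust to your system.
export NXF_SINGULARITY_CACHEDIR=/data/nxf_cache/singularity   # re-used Singularity images
export NXF_CONDA_CACHEDIR=/data/nxf_cache/conda               # re-used Conda environments
mkdir -p "$NXF_SINGULARITY_CACHEDIR" "$NXF_CONDA_CACHEDIR"

# Launch the test profile with, for example, Singularity:
nextflow run main.nf -profile test,singularity
```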
- Start running your own analysis!
For a detailed explanation of the parameters, see below. Not all parameters are needed.
```bash
nextflow run main.nf --sdrf experimental_metadata.sdrf --fasta your_fasta_file.fasta --parameters your_parameters_yaml --raws thermo_raw_files --exp_design simple_experimental_design --workflow [other more specific parameters] -profile <docker/singularity/conda>
```
WOMBAT-P can run the workflows from different (minimal) inputs:

1. With an SDRF file (raw files can be given as a parameter or are downloaded from the locations specified in the SDRF file):
   - a) SDRF file + FASTA file
   - b) SDRF file + FASTA file + experimental design file (will overwrite the experimental design in the SDRF)
   - c) SDRF file + FASTA file + experimental design file + YAML parameter file (will overwrite default and SDRF parameters)
2. Without an SDRF file:
   - a) Raw files + FASTA file + YAML parameter file
   - b) Raw files + FASTA file + YAML parameter file + experimental design file
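Purely as an illustration, the calls below sketch input combinations 1a and 2a; the file names (`sdrf.tsv`, `uniprot_human.fasta`, `params.yml`) and the raw-file pattern are placeholders rather than files shipped with the pipeline.

```bash
# 1a) SDRF file + FASTA file (raw files are downloaded from the locations given in the SDRF)
nextflow run main.nf --sdrf sdrf.tsv --fasta uniprot_human.fasta -profile docker

# 2a) Raw files + FASTA file + YAML parameter file (no SDRF)
nextflow run main.nf --raws 'data/*.raw' --fasta uniprot_human.fasta --parameters params.yml -profile docker
```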
- `-profile` Set the profile and environment as described above.
- `--sdrf` A tab-delimited file containing details about the experimental design; it can also include all parameters given in the `--parameters` YAML file. Several data sets in the PRIDE repository come with an SDRF file, which can then be found together with the other deposited files. For the data set PXD001819, this would be https://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/sdrf.tsv
  See also the collection of SDRF files at https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects and the description of the extended SDRF including data analysis parameters at https://github.com/bigbio/proteomics-metadata-standard/blob/master/sdrf-proteomics/Data-analysis-metadata.adoc
- `--fasta` You also need a FASTA database to run the database search in the workflows. Standard databases can be downloaded from UniProt.
- `--parameters` When deviating from the standard settings, use a YAML file containing the new parameter settings. For more details about the different parameters and an example file, see https://github.com/bigbio/proteomics-metadata-standard/blob/master/sdrf-proteomics/Data-analysis-metadata.adoc As not all of these parameters are available for all workflows, see the table below for an overview (an illustrative example file is sketched after that table).
- `--raws` Without an SDRF file containing the paths to the raw data files (Thermo raw format), or if you have the files already downloaded, specify a wildcard pattern (e.g. "*" or "?") to access the files on your system. We recommend putting this parameter in 'single quotes', as you might otherwise run into an error when using wildcards.
- `--exp_design` An experimental design is automatically calculated from differences between the samples in the SDRF file. Alternatively, provide a tab-separated file with the two columns raw_file and exp_condition. raw_file: raw file names without path; incorrect or incomplete names will lead to errors. exp_condition: arbitrary names for the sample groups; files with the same sample group name will be considered replicates. See example (an illustrative sketch also follows this parameter list).
- `--workflow` Instead of running 'all' workflows (default), run only one of 'maxquant', 'proline', 'compomics' or 'tpp'.

Other parameters:

- `--comps` (maxquant workflow only): Provide contrasts (specific comparisons) for the statistical tests. This is a list of comma-separated group names, e.g. "B-A,C-A" when having the three sample groups A, B and C.
- `--proline_engine` (proline workflow only): Define the search engine for the database search. Can be one or several of "xtandem", "msgf", "ms-amanda", "tide", "comet", "myrimatch", "meta_morpheus" and "andromeda". Note that not all engines necessarily work well with every data set.

You can add other Nextflow parameters as described extensively here.
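As a purely illustrative sketch, the snippet below writes a minimal experimental design file in the two-column format described above and passes it to a maxquant-only run with an explicit contrast; all file names, group names and the raw-file pattern are hypothetical.

```bash
# Hypothetical experimental design: two tab-separated columns, raw_file and exp_condition.
printf 'raw_file\texp_condition\n'  > exp_design.tsv
printf 'sample_A1.raw\tA\n'        >> exp_design.tsv
printf 'sample_A2.raw\tA\n'        >> exp_design.tsv
printf 'sample_B1.raw\tB\n'        >> exp_design.tsv
printf 'sample_B2.raw\tB\n'        >> exp_design.tsv

# Run only the MaxQuant-based workflow and test the contrast B versus A
# (input mode 2b: raw files + FASTA + YAML parameters + experimental design).
nextflow run main.nf --raws 'data/*.raw' --fasta uniprot_human.fasta \
  --parameters params.yml --exp_design exp_design.tsv \
  --workflow maxquant --comps "B-A" -profile docker
```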
In the table below, TRUE means that the parameter can be changed in the respective workflow.
parameter | type | sdrf name | default | maxquant | proline | compomics | tpp |
---|---|---|---|---|---|---|---|
fixed_mods | ontology | modification parameters | NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4 | TRUE | TRUE | TRUE | TRUE |
variable_mods | ontology | modification parameters | NT=oxidation;MT=variable;TA=M;AC=UNIMOD:35 | TRUE | TRUE | TRUE | TRUE |
precursor_mass_tolerance | string | precursor mass tolerance | 30 ppm | TRUE | TRUE | TRUE | TRUE |
fragment_mass_tolerance | string | fragment mass tolerance | 0.05 Da | TRUE | TRUE | TRUE | TRUE |
enzyme | ontology | cleavage agent details | Trypsin | TRUE | TRUE | TRUE | TRUE |
fions | class | forward ions | b | FALSE | TRUE | TRUE | TRUE |
rions | class | reverse ions | y | FALSE | TRUE | TRUE | TRUE |
isotope_error_range | integer | isotope error range | 0 | FALSE | TRUE | TRUE | TRUE |
add_decoys | boolean | add decoys | true | FALSE | TRUE | TRUE | TRUE |
num_hits | integer | num peptide hits | 1 | FALSE | FALSE | FALSE | FALSE |
allowed_miscleavages | integer | allowed miscleavages | 1 | TRUE | TRUE | TRUE | TRUE |
min_precursor_charge | integer | minimum precursor charge | 2 | FALSE | TRUE | TRUE | TRUE |
max_precursor_charge | integer | maximum precursor charge | 3 | TRUE | TRUE | TRUE | TRUE |
min_peptide_length | integer | minimum peptide length | 8 | TRUE | TRUE | TRUE | TRUE |
max_peptide_length | integer | maximum peptide length | 12 | FALSE | TRUE | TRUE | TRUE |
max_mods | integer | maximum allowed modification | 4 | TRUE | TRUE | TRUE | TRUE |
ident_fdr_psm | float | fdr on psm level | 0.01 | TRUE | TRUE | TRUE | TRUE |
ident_fdr_peptide | float | fdr on peptide level | 0.01 | TRUE | TRUE | TRUE | TRUE |
ident_fdr_protein | float | fdr on protein level | 0.01 | TRUE | TRUE | Not clear | Not clear |
match_between_runs | boolean | run match between runs | true | TRUE | FALSE | TRUE | Not available |
protein_inference | class | protein inference method | unique | TRUE | FALSE | TRUE | TRUE |
quantification_method | class | quantification method | intensity | FALSE | FALSE | FALSE | FALSE |
summarization_proteins | class | summarization of proteins method | sum_abs | FALSE | FALSE | FALSE | FALSE |
min_num_peptides | integer | minimum number of peptides per protein | 2 | TRUE | TRUE | TRUE | TRUE |
summarization_psms | class | summarization of psms method | sum_abs | FALSE | FALSE | FALSE | FALSE |
quant_transformation | class | transformation of quantitative values | log | FALSE | FALSE | FALSE | FALSE |
normalization_method | class | normalization method | median | TRUE | FALSE | FALSE | FALSE |
run_statistics | boolean | run statistical tests | true | TRUE | TRUE | TRUE | TRUE |
fdr_method | class | method for correction of multiple testing | benjamini-hochberg | FALSE | FALSE | FALSE | FALSE |
fdr_threshold | float | threshold for statistical test fdr | 0.01 | By filtering the results | By filtering the results | By filtering the results | By filtering the results |
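For orientation only, the snippet below sketches what a small `--parameters` YAML file could look like, using parameter names and units from the table above; the exact file layout expected by the pipeline is described in the Data-analysis-metadata document linked earlier, so treat this flat key/value form as an assumption rather than a definitive template.

```bash
# Hypothetical params.yml -- check the Data-analysis-metadata description for the authoritative format.
cat > params.yml <<'EOF'
precursor_mass_tolerance: 10 ppm
fragment_mass_tolerance: 0.02 Da
enzyme: Trypsin
allowed_miscleavages: 2
min_peptide_length: 7
ident_fdr_protein: 0.01
match_between_runs: true
min_num_peptides: 2
EOF

# Pass it to a run:
nextflow run main.nf --sdrf sdrf.tsv --fasta uniprot_human.fasta --parameters params.yml -profile docker
```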
Intermediate and final files are provided in the `results` folder or the folder specified via the `outdir` parameter.
On top of the workflow-specific output, a standardized tabular format on both peptide (stand_pep_quant_merged.csv) and protein (stand_prot_quant_merged.csv) level is given.
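As a small illustration, and assuming the merged tables end up directly in the default output folder, you could peek at them like this:

```bash
# Inspect the standardized peptide- and protein-level tables (paths assume the default results folder).
head results/stand_pep_quant_merged.csv
head results/stand_prot_quant_merged.csv
```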
For each of the workflows, WOMBAT-Pipelines calculates the same set of benchmarks for a more systematic and thorough comparison between workflows and/or between different values of the data analysis parameters. For details about the benchmarks, see the following table:
Category | Aspect | Subgroup | Name | Name in JSON file | Definition | Value |
---|---|---|---|---|---|---|
Functionality | Traceability | Spectra | Traceable spectra | TraceableSpectra | Results traceable to original spectra | Y/N |
Functionality | Traceability | Spectra | Universal spectrum identifiers | UniversalSpectumIdentifiers | Workflow generates USIs (Universal Spectrum Identifier) | Y/N |
Functionality | Traceability | Spectra | Peptide to spectra | PeptideToSpectra | Corresponding spectrum numbers/ids available from peptide level | Y/N |
Functionality | Traceability | Spectra | Protein to spectra | ProteinToSpectra | Corresponding spectrum numbers/ids available from protein level | Y/N |
Functionality | Traceability | File names | Results to raw files | ResultsToRawFiles | Raw input file names preserved in tables on PSM/peptide/protein level | Y/N |
Functionality | Traceability | File names | Public raw files | PublicRawFiles | Raw files publicly available | Y/N |
Functionality | Traceability | Parameters | Experimental design | ExperimentalDesign | Biological and technical replicates can be identified in results | Y/N |
Functionality | Performance | Identification | PSM number | PSMNumber | Number of identified PSMs passing preset FDR | Integer |
Functionality | Performance | Identification | Peptide number | PeptideNumber | Number of uniquely identified peptide identifications passing preset FDR | Integer |
Functionality | Performance | Identification | Protein number | ProteinNumber | Number of uniquely identified protein identifications passing preset FDR | Integer |
Functionality | Performance | Identification | Protein group number | ProteinGroupNumber | Number of different protein groups passing preset FDR | Integer |
Functionality | Performance | Identification | Peptide coverage | PeptideCoverage | Percentage of peptides identified in all samples | Double |
Functionality | Performance | Identification | Protein coverage | ProteinCoverage | Percentage of proteins identified in all samples | Double |
Functionality | Performance | Identification | Peptides per protein | PeptidesPerProtein | Distribution of peptides per protein group | Set of Integer |
Functionality | Performance | Quantification | Correlation peptides | CorrelationPeptides | Mean of Pearson correlation of peptide abundances between replicates (log2-scale) | Double |
Functionality | Performance | Quantification | Correlation proteins | CorrelationProteins | Mean of Pearson correlation of protein abundances between replicates (log2-scale) | Double |
Functionality | Performance | Quantification | Number peptides | NumberOfPeptides | Number of quantified peptides with at least 50% coverage | Integer |
Functionality | Performance | Quantification | Number protein groups | NumberOfProteinGroups | Number of quantified protein groups with at least 50% coverage | Integer |
Functionality | Performance | Quantification | Dynamic peptide range | DynamicPeptideRange | Difference of peptide abundance (top 5% versus bottom 5% quantile) | Double |
Functionality | Performance | Quantification | Dynamic protein range | DynamicProteinRange | Difference of protein abundance (top 5% versus bottom 5% quantile) | Double |
Functionality | Performance | Statistics | Differentially regulated peptides 5% | DifferentialRegulatedPeptides5Perc | Number of differentially regulated peptides with FDR below 5% | Set of Double |
Functionality | Performance | Statistics | Differentially regulated proteins 5% | DifferentialRegulatedProteins5Perc | Number of differentially regulated proteins with FDR below 5% | Set of Double |
Functionality | Performance | Statistics | Differentially regulated peptides 1% | DifferentialRegulatedPeptides1Perc | Number of differentially regulated peptides with FDR below 1% | Set of Double |
Functionality | Performance | Statistics | Differentially regulated proteins 1% | DifferentialRegulatedProteins1Perc | Number of differentially regulated proteins with FDR below 1% | Set of Double |
Functionality | Performance | Statistics | Missing peptide values | MissingPeptideValues | Percentage of missing values in entire peptide set | Double |
Functionality | Performance | Statistics | Missing protein values | MissingProteinValues | Percentage of missing values in entire protein set | Double |
Functionality | Performance | Digestion | Digestion efficiency | Efficiency | Distribution of number of miscleavages | Set of Double |
Functionality | Performance | PTMs | PTM Distribution | PTMDistribution | Percentage of peptides with PTM xyz | Set of Double |
Functionality | Performance | PTMs | PTM Occupancy | PTMOccupancy | Distribution of peptides with 1,2,... PTMs | Set of Double |
Functionality | Parameter | Identification | Database size | DatabaseSize | Number of entries in fasta file | Integer |
Functionality | Parameter | Identification | Canonical sequences | CanonicalSequences | Database includes canonical sequences | Y/N |
Functionality | Parameter | Identification | PTM localization | PTMLocalization | Is PTM localization scoring software included in the workflow | Y/N |
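The benchmarks are reported under the names in the "Name in JSON file" column. Assuming they are written to a JSON file in the output folder (the file name and location below are assumptions, not fixed by this README), individual values can be pulled out with `jq`:

```bash
# Hypothetical path to a workflow's benchmark file -- adjust to your output folder.
BENCH=results/benchmarks.json

# Extract a few identification and quantification benchmarks.
jq '{PSMNumber, PeptideNumber, ProteinGroupNumber, MissingProteinValues}' "$BENCH"
```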
nf-core/wombat was originally written by the members of the ELIXIR implementation study "Comparison, benchmarking and dissemination of proteomics data analysis pipelines", under the lead of Veit Schwämmle and with major participation of David Bouyssié and Fredrik Levander.
Manuscript in preparation
As the workflows are using an nf-core template, we refer to the publication:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.