Steps to implement checkm2 quality report to dRep #220

Open
mpdoane2 opened this issue Dec 15, 2023 · 13 comments

Comments

@mpdoane2

A CheckM2 quality report can be used with dRep. You will need to convert the CheckM2 quality report to a .csv file using:

awk -F'\t' 'BEGIN {OFS=","} {print $1, $2, $3}' quality_report.tsv > new_file_name.csv

In the new file, change the headings to: genome,completeness,contamination

The dRep command for using the CheckM2 output instead of checkm_genome, which is currently the default, is:

dRep dereplicate output --genomeInfo new_file_name.csv -g bins/*.fna

Just thought I would write it out in case others were facing similar issues.

-Mike
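
The two steps above (extracting the columns and renaming the header) can be folded into one helper. This is only a minimal sketch: the function name checkm2_to_genomeinfo is made up for illustration, and it assumes, as the awk command above does, that the first three tab-separated columns of quality_report.tsv are the genome name, completeness and contamination.

```shell
# Convert a CheckM2 quality_report.tsv into the three-column CSV passed to
# dRep's --genomeInfo option (genome,completeness,contamination).
# Assumption: columns 1-3 of the report are name, completeness, contamination.
# checkm2_to_genomeinfo is a hypothetical helper name, not part of dRep or CheckM2.
checkm2_to_genomeinfo() {
    # $1 = path to quality_report.tsv, $2 = path of the CSV to write
    awk -F'\t' '
        BEGIN  { OFS = ","; print "genome,completeness,contamination" }
        NR > 1 { print $1, $2, $3 }   # skip the original header row
    ' "$1" > "$2"
}
```

It would be called as checkm2_to_genomeinfo quality_report.tsv new_file_name.csv, and the resulting file passed to dRep dereplicate with --genomeInfo exactly as in the command above.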

@MrOlm
Owner

MrOlm commented Dec 15, 2023

Thanks for this, @mpdoane2! When I get a chance, I'll add this to the documentation and cite you / this issue.

@achenderson

Hi, thanks for the information for converting the checkm output to a csv!

I tried running drep with the --genomeInfo flag and the csv, but when I check the log for the job, it is still running checkm. Is there another flag I need to add? Thanks!

@MrOlm
Owner

MrOlm commented May 8, 2024

Hi @achenderson - my guess is that there is a mismatch between the "genome" names provided in the genomeInfo file and the "genome" names loaded by dRep. If you check the file Bdb.csv in the dRep run (even if the run isn't complete) you'll see the names that dRep wants you to use.

Best,
Matt
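
One way to spot such a mismatch before a long run is to compare the names in the genomeInfo file with the genome file names. The sketch below is an illustration only: genomeinfo_unmatched is a made-up helper name, and it assumes the name dRep expects is the file's basename including its extension, which is what was reported to work later in this thread. Once Bdb.csv exists, it remains the more reliable list to check against.

```shell
# List genome names in a genomeInfo CSV that do not match the basename of any
# genome file given on the command line. Prints nothing when every name matches.
# Assumption: dRep's genome name is the file basename, extension included.
# genomeinfo_unmatched is a hypothetical helper name.
genomeinfo_unmatched() {
    csv=$1
    shift
    names=$(mktemp)
    files=$(mktemp)
    # First CSV column with the header skipped, sorted for comm.
    awk -F',' 'NR > 1 { print $1 }' "$csv" | sort -u > "$names"
    # Basenames (extension included) of the genome files.
    for f in "$@"; do basename "$f"; done | sort -u > "$files"
    # Keep only the lines that appear in the CSV but not among the files.
    comm -23 "$names" "$files"
    rm -f "$names" "$files"
}
```

For example, genomeinfo_unmatched new_file_name.csv bins/*.fna prints each CSV name that has no matching genome file.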

@achenderson

That's it! Thank you :)

@CJREID

CJREID commented Jul 25, 2024

Hi there,

I think I've run into this issue, but as I have ~60k MAGs, CheckM is taking a long time and the Bdb.csv file is not present.

As an example, my genome files look like /scratch/usr/SBsP_T2_sr_metabat2_refined.002.fna and the corresponding cell in the .csv is /scratch/usr/SBsP_T2_sr_metabat2_refined.002 but it is still running checkM. Is it possibly due to the '.' before 002? Are the full paths unnecessary? Or is the .fna extension an issue?

Thanks,
Cam

@MrOlm
Owner

MrOlm commented Jul 25, 2024

Hi @CJREID - are you running checkM within dRep, or are you running checkM2 outside of dRep?

@CJREID

CJREID commented Jul 25, 2024

Hi Matt,

I ran checkM2 outside of dRep and formatted it as described above for dRep. It worked once I added the .fna extension to the names in the genomeInfo file. I was confused because the help message says this file must contain "genome" (basename of .fasta file of that genome), so I assumed this meant the name without the .fna extension. Perhaps the 'basename' bit could be changed in the help message?

Thanks,
Cam
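
For anyone hitting the same thing, the extension can be added to an existing genomeInfo file without redoing the conversion. This is a rough sketch, not part of dRep: genomeinfo_add_ext is a made-up name, and the extension to append is whatever your genome files actually use (.fna here).

```shell
# Append a file extension to every genome name in a genomeInfo CSV and write
# the result to standard output. The header line is left unchanged, and names
# that already end in the extension are not modified.
# genomeinfo_add_ext is a hypothetical helper name.
genomeinfo_add_ext() {
    # $1 = genomeInfo CSV, $2 = extension to append, e.g. .fna
    awk -F',' -v ext="$2" '
        BEGIN   { OFS = "," }
        NR == 1 { print; next }          # keep the header as it is
        {
            n = length($1); e = length(ext)
            # Append only when the name does not already end in ext.
            if (n < e || substr($1, n - e + 1) != ext) $1 = $1 ext
            print
        }
    ' "$1"
}
```

Usage would look like genomeinfo_add_ext new_file_name.csv .fna > new_file_name_ext.csv, with the new file then passed to --genomeInfo.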

@MrOlm
Owner

MrOlm commented Jul 25, 2024

Hi @CJREID - thanks for the update and for the suggestion. I'll update in the next version of dRep.

Best,
Matt

@paulzierep

> Hi @CJREID - thanks for the update and for the suggestion. I'll update in the next version of dRep.
>
> Best, Matt

While you are at it, it would be nice to add checkM2 directly to dRep.

@paulzierep

> Checkm2 quality report can be used. You will need to convert the Checkm2 quality report to a .csv file using:
>
> awk -F'\t' 'BEGIN {OFS=","} {print $1, $2, $3}' quality_report.tsv > new_file_name.csv
>
> In the new file convert headings to: genome,completeness,contamination
>
> dRep command for using checkm2 output instead of checkm_genome which is default currently,
>
> dRep dereplicate output --genomeInfo new_file_name.csv -g bins/*.fna
>
> Just thought I would write it out in case others were facing similar issues.
>
> -Mike

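How about awk -F'\t' 'BEGIN {OFS=","; print "genome,completeness,contamination"} NR > 1 {print $1, $2, $3}' quality_report.tsv > new_file_name.csv for a one-liner? (The NR > 1 condition skips the original header row so it is not written out as a data line.)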

@MrOlm
Owner

MrOlm commented Mar 24, 2025

>> Hi @CJREID - thanks for the update and for the suggestion. I'll update in the next version of dRep.
>> Best, Matt
>
> While you are at it, it would be nice to add checkM2 directly to dRep.

I looked into it and there were dependency clashes. I believe checkM2 requires older versions of python or something like that.

Matt

@paulzierep

paulzierep commented Mar 25, 2025

I have a similar issue with:

genome,completeness,contamination
50contig_reads_bin_1.fasta,5.68,0.0
50contig_reads_bin_2.fasta,6.41,0.0
50contig_reads_bin_28.fasta,16.77,0.02
50contig_reads_bin_8.fasta,6.65,0.0

Error: KeyError: 'completeness missing from GenomeInfo'

Log:

03-25 09:50 DEBUG    Validating filter arguments
03-25 09:50 INFO     Will filter the genome list
03-25 09:50 INFO     4 genomes were input to dRep
03-25 09:50 INFO     Calculating genome info of genomes
03-25 09:50 DEBUG    Loading provided genome quality information
03-25 09:50 DEBUG    HERE IS GENOME INFO:
03-25 09:50 DEBUG    
        genome,completeness,contamination
0     50contig_reads_bin_1.fasta,5.68,0.0
1     50contig_reads_bin_2.fasta,6.41,0.0
2  50contig_reads_bin_28.fasta,16.77,0.02
3     50contig_reads_bin_8.fasta,6.65,0.0
03-25 09:50 DEBUG    There are the columns: ['genome,completeness,contamination']

Command:
dRep dereplicate outdir -g '50contig_reads_bin_1.fasta' '50contig_reads_bin_2.fasta' '50contig_reads_bin_28.fasta' '50contig_reads_bin_8.fasta' --length 1 --completeness 0 --contamination 100 --genomeInfo '/data/dnb11/galaxy_db/files/1/7/4/dataset_17457a32-8727-423e-a06c-2f7cee3189be.dat' --MASH_sketch '1000' --P_ani 0.9 --primary_chunksize 5000 --S_algorithm 'ANImf' --n_PRESET 'normal' --coverage_method 'larger' --S_ani 0.95 --cov_thresh 0.1 --clusterAlg 'average' --completeness_weight 1.0 --contamination_weight 5.0 --strain_heterogeneity_weight 1.0 --N50_weight 0.5 --size_weight 0.0 --centrality_weight 1.0 --warn_dist 0.25 --warn_sim 0.98 --warn_aln 0.25

It seems that dRep is parsing the header as one column: ['genome,completeness,contamination']
Any idea?
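
The thread does not record a resolution for this error. One thing that can be ruled out quickly is a hidden character in the header line (a byte-order mark or a Windows carriage return), which would not be visible in the log above. The check below is a sketch with a made-up name, genomeinfo_check_header, and it only tests for the exact three-column header used in this thread. Separately, note that the file passed to --genomeInfo in the command above ends in .dat rather than .csv; whether dRep's parsing depends on the file name is not established here, so treat that as something to test rather than as a known cause.

```shell
# Check that the first line of a genomeInfo file is exactly
# "genome,completeness,contamination". On a mismatch, dump the raw bytes of
# the header so that a BOM, a carriage return or a tab becomes visible.
# genomeinfo_check_header is a hypothetical helper name.
genomeinfo_check_header() {
    header=$(head -n 1 "$1")
    if [ "$header" = "genome,completeness,contamination" ]; then
        echo "header OK"
        return 0
    fi
    echo "unexpected header; raw bytes:"
    head -n 1 "$1" | od -c
    return 1
}
```

Running genomeinfo_check_header on the file given to --genomeInfo prints either a confirmation or the byte dump to inspect.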

@paulzierep

>>> Hi @CJREID - thanks for the update and for the suggestion. I'll update in the next version of dRep.
>>> Best, Matt
>>
>> While you are at it, it would be nice to add checkM2 directly to dRep.
>
> I looked into it and there were dependency clashes. I believe checkM2 requires older versions of python or something like that.
>
> Matt

Running mamba create --dry-run -n test_env checkm2 drep works and the environment can be installed, so I assume the dependency clash is solved.
