The MIDAS2 DB has gene content for UHGG genomes aggregated to the centroid_99 through centroid_75 levels. One useful database artifact would be a list of "species genes" that are found across (nearly) all high-quality genomes of each species.
This list of centroids for each species would allow for more precise quantification of species abundance by taking the mean or median depth across them. Since gene copy number is estimated relative to species depth, copy number estimation would also be more accurate.
While the granularity of c99 clusters may make it challenging to find such genes, you could select a list of centroids at each ANI threshold, letting users pick for themselves which set to use for quantification.
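To make the idea concrete, here is a minimal sketch of the selection and quantification steps. It is not the linked script and not part of MIDAS2: the function names, the input shapes (a dict of genome ID to centroid IDs, and a dict of centroid ID to depth), and the 0.95 default prevalence cutoff are all assumptions for illustration. It also does not check that the selected genes are single-copy.

```python
from statistics import median


def select_species_genes(genome_to_centroids, min_prevalence=0.95):
    """Return centroids found in at least `min_prevalence` of the genomes.

    `genome_to_centroids` is a hypothetical mapping of genome ID to the
    centroid IDs (at one ANI threshold) observed in that genome.
    """
    n_genomes = len(genome_to_centroids)
    if n_genomes == 0:
        return set()
    counts = {}
    for centroids in genome_to_centroids.values():
        # Count each centroid once per genome, even if it has several copies.
        for centroid in set(centroids):
            counts[centroid] = counts.get(centroid, 0) + 1
    return {c for c, n in counts.items() if n / n_genomes >= min_prevalence}


def species_depth(gene_depth, species_genes):
    """Median depth across the species genes; unobserved genes count as 0."""
    depths = [gene_depth.get(gene, 0.0) for gene in species_genes]
    return median(depths) if depths else 0.0


def copy_number(gene_depth, species_genes):
    """Depth of every gene divided by the species depth."""
    depth = species_depth(gene_depth, species_genes)
    if depth == 0:
        return {}
    return {gene: d / depth for gene, d in gene_depth.items()}


# Toy data: four genomes of one species, centroids at a single ANI threshold.
genomes = {
    "g1": ["a", "b", "c"],
    "g2": ["a", "b"],
    "g3": ["a", "b", "d"],
    "g4": ["a", "c"],
}
species_genes = select_species_genes(genomes, min_prevalence=0.75)
sample_depth = {"a": 10.0, "b": 12.0, "c": 22.0}
estimated_depth = species_depth(sample_depth, species_genes)
estimated_copies = copy_number(sample_depth, species_genes)
```

Running the selection once per ANI threshold (c99 through c75) would give one gene list per level, which is the set of options I have in mind.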
For an example of what I mean, see the script
select_high_prevalence_species_genes.py
and its usage in my Snakemake pipeline: L57-67.