Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand bin/rake metrics:species to include detailed report statistics #17

Open
johrstrom opened this issue Nov 8, 2021 · 0 comments
Open

Comments

@johrstrom
Copy link
Contributor

Previously we had a script that produced detailed metrics utilizing descriptive_statistics gem to provide stats that Bryan asked for here:

Here are the stats that we'd like.

Breakdown of the number of species represented in each Kingdom, Phylum, and Class.
For each of the above, Mean, Medium, Mode on the number of sequences per entry
Find species with the largest number of sequences in each of these categories and report that number.

Here, "entry" is a single .fa file. So imagine for Phylum “Foo” there were only 2 species, each with two genes (thus two fa files). Species A having 10,12 sequences, species B having 14,16 sequences. The results would be:

mean: 13
mode: 10
median: 13
species with largest number of sequences in Foo: B

This is a script that generated that (though below currently needs access to the un-tarred gene file data).
require 'csv'

records = []

puts Benchmark.measure {
  Occurrence.where(species_aligned: true).distinct.pluck(:taxon_class).each do |taxon_class|
    species = Occurrence.where(taxon_class: taxon_class).where(species_aligned: true).distinct.pluck(:species_path)
    seqs = species.map {|p| Species.new(Configuration.genbank_root.join(p)).unaligned_fasta_sequence_counts }.flatten
    stats = DescriptiveStatistics::Stats.new(seqs.sort)
    species_max = species.map {|p| Species.new(Configuration.genbank_root.join(p)) }.max { |a,b|
      a.unaligned_fasta_sequence_counts.sum <=> b.unaligned_fasta_sequence_counts.sum
    }
    records << [
      "taxon_class",
      taxon_class,
      species.count,
      stats.mean,
      stats.mode,
      stats.median,
      species_max.name,
      species_max.unaligned_fasta_sequence_counts.sum
    ]
  end
}



puts Benchmark.measure {
  Occurrence.where(species_aligned: true).distinct.pluck(:taxon_phylum).each do |taxon_phylum|
    species = Occurrence.where(taxon_phylum: taxon_phylum).where(species_aligned: true).distinct.pluck(:species_path)
    seqs = species.map {|p| Species.new(Configuration.genbank_root.join(p)).unaligned_fasta_sequence_counts }.flatten
    stats = DescriptiveStatistics::Stats.new(seqs.sort)
    species_max = species.map {|p| Species.new(Configuration.genbank_root.join(p)) }.max { |a,b|
      a.unaligned_fasta_sequence_counts.sum <=> b.unaligned_fasta_sequence_counts.sum
    }
    records << [
      "taxon_phylum",
      taxon_phylum,
      species.count,
      stats.mean,
      stats.mode,
      stats.median,
      species_max.name,
      species_max.unaligned_fasta_sequence_counts.sum
    ]
  end
}
puts Benchmark.measure {
  Occurrence.where(species_aligned: true).distinct.pluck(:taxon_kingdom).each do |taxon_kingdom|
    species = Occurrence.where(taxon_kingdom: taxon_kingdom).where(species_aligned: true).distinct.pluck(:species_path)
    seqs = species.map {|p| Species.new(Configuration.genbank_root.join(p)).unaligned_fasta_sequence_counts }.flatten
    stats = DescriptiveStatistics::Stats.new(seqs.sort)
    species_max = species.map {|p| Species.new(Configuration.genbank_root.join(p)) }.max { |a,b|
      a.unaligned_fasta_sequence_counts.sum <=> b.unaligned_fasta_sequence_counts.sum
    }
    records << [
      "taxon_kingdom",
      taxon_kingdom,
      species.count,
      stats.mean,
      stats.mode,
      stats.median,
      species_max.name,
      species_max.unaligned_fasta_sequence_counts.sum
    ]
  end
}

CSV.open("report.csv", "wb") do |csv|
  records.each do |record|
    csv << record
  end
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant