different_genbank_species column in a genes index file doesn't capture all the variations #15

Open
johrstrom opened this issue Nov 8, 2021 · 0 comments

[duplicated the UCR repo]

Right now the pipeline writes the "different_genbank_species" attribute to a Species at the first instance where a difference is observed between an Occurrence and a Genbank record. A difference here means Genbank's taxonomy has a Species name that differs from the GBIF Occurrence taxonomy, which is the taxonomy the pipeline uses. This is by design.

However, it has been observed that some Species have multiple variations, though some of those variations could be collapsed with a clever rule (one case had 70 variations).
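
As a rough idea of what such a collapsing rule might look like, here is a minimal plain-Ruby sketch. It assumes the variations differ only in case, spacing, or trailing tokens (subspecies, variety, authorship), which has not been checked against the real data. `collapse_name` and `collapsed_variations` are hypothetical helpers, not part of the pipeline, and the sample names are made up.

```ruby
# Hypothetical rule: reduce a name to a lowercased "genus species" binomial.
# Assumes variations differ only in case, spacing, or trailing tokens;
# the real data may need a different rule.
def collapse_name(name)
  name.to_s.strip.split(/\s+/).first(2).join(" ").downcase
end

# Collapse a list of names and return the distinct results.
def collapsed_variations(names)
  names.compact.map { |n| collapse_name(n) }.reject(&:empty?).uniq
end

# Made-up sample values, for illustration only.
names = ["Quercus alba", "quercus  alba", "Quercus alba var. repanda", "Quercus rubra", nil]
p collapsed_variations(names)
```

A rule this aggressive would also merge genuinely distinct subspecies, so it is only a starting point for discussion.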

I left different_genbank_species captured on each Occurrence, so in the Rails console you can run something like:

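# For every Species flagged with a difference, list the distinct values
# recorded on its Occurrences (compact skips Occurrences with no difference).
Species.where.not(different_genbank_species: nil).each do |s|
  x = s.occurrences.pluck(:different_genbank_species).compact.uniq
  puts x if x.count > 1
end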

to see where this is a problem.

Another issue is that we record "different_genbank_species" at the species level rather than per gene.
It's likely good enough for the user to know there are differences and see an example difference, though, since they can go back to the original records and view the details if they are really interested.
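
If per-gene detail ever turns out to be needed, the grouping could be sketched roughly as below. This is plain Ruby over an in-memory stand-in for Occurrence rows. The `Row` struct, the `gene` field name, and the `variations_by_gene` helper are all assumptions for illustration, not the actual schema, and the sample rows are made up.

```ruby
# Hypothetical in-memory stand-in for Occurrence rows; `gene` is an assumed
# field name, not confirmed by the real schema.
Row = Struct.new(:species, :gene, :different_genbank_species)

# Group the recorded differences by [species, gene] and keep distinct names.
def variations_by_gene(rows)
  rows
    .reject { |r| r.different_genbank_species.nil? }
    .group_by { |r| [r.species, r.gene] }
    .transform_values { |rs| rs.map(&:different_genbank_species).uniq }
end

# Made-up sample rows, for illustration only.
rows = [
  Row.new("Quercus alba", "rbcL", "Quercus bicolor"),
  Row.new("Quercus alba", "rbcL", "Quercus bicolor"),
  Row.new("Quercus alba", "matK", "Quercus montana"),
  Row.new("Quercus alba", "matK", nil),
]

variations_by_gene(rows).each do |(species, gene), names|
  puts "#{species} / #{gene}: #{names.join(', ')}"
end
```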
