-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
only fetch UIDs and not full records from WoS #1660
Conversation
@@ -6,7 +6,7 @@ module WebOfScience | |||
# - the "next_batch?" is like "next?" | |||
# - the "next_batch" is like "next" | |||
class BaseRestRetriever | |||
attr_reader :records_found, :records_retrieved | |||
attr_reader :records_found, :records_retrieved, :query_id | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need the query_id to be available to our new UidsQueryRestsRetriever
class
uids += next_batch.uids while next_batch? | ||
uids | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this inefficient method is going away, to be supplanted by our new more efficient method in UidsQueryRestsRetriever
the only consumer of it will now use our new method
|
||
delegate :client, to: WebOfScience | ||
|
||
# @return [Boolean] all records retrieved? | ||
def records_retrieved? | ||
@batch_one.nil? ? false : records_retrieved == records_found | ||
@batch_one.blank? ? false : records_retrieved == records_found | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching the nil?
to blank?
since our new method deals with arrays, and we want the same logic that applies here to work for an empty array too
@@ -45,13 +37,13 @@ def next_batch | |||
# this is the maximum number that can be returned in a single query by WoS | |||
MAX_RECORDS = 100 | |||
|
|||
attr_reader :batch_size, :query, :query_id, :path, :params | |||
attr_reader :batch_size, :query, :path, :params | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made this attr_reader public
WebOfScience::UidsQueryRestRetriever.new(query, query_params:, batch_size:) | ||
end | ||
|
||
# Convenience method, does the params_for_search expansion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our new method to query just for uids (we will use it from QueryName
and QueryOrcid
)
describe WebOfScience::RestRetriever, :vcr do | ||
let(:retriever) { described_class.new(params) } | ||
describe WebOfScience::BaseRestRetriever, :vcr do | ||
let(:retriever) { described_class.new('/', params) } | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I can tell, the RestRetriever
class is not used anywhere and instead we use the BaseRestRetriever
class ... just updating the existing spec for the unused class for the one we use
5684bb5
to
478c992
Compare
@@ -16,7 +16,7 @@ def initialize(author, options = {}) | |||
def uids | |||
return [] unless valid? | |||
|
|||
queries.user_query(name_query, query_params:).merged_uids | |||
queries.user_query_uids(name_query, query_params:).merged_uids | |||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hits our new efficient method instead of the old one (which is now removed)
@@ -16,7 +16,7 @@ def initialize(author, options = {}) | |||
def uids | |||
return [] unless valid? | |||
|
|||
queries.user_query(orcid_query, query_params:).merged_uids | |||
queries.user_query_uids(orcid_query, query_params:).merged_uids | |||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hits our new efficient method instead of the old one (which is now removed)
lib/web_of_science/rest_retriever.rb
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class deleted, didn't seem to be used anywhere.
records | ||
end | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our new query class whose job it is to return just UIDs efficiently (by only requesting the UIDs and not the full records).
605e20a
to
97693ed
Compare
Why was this change made?
To start dealing with issues in #1642
Basically, it is very inefficient to request the entire record from WoS when we only need the WOS_UIDs to determine if we have new publications or not. This alters the logic so we use a different endpoint provided by WoS when harvesting (because for each harvest, we really only need to request UIDs to verify if we have new publications to add or not).
This won't fix the problem in #1642 completely, since if you have new publications to add, you still need to fetch the full records (which will still blow up if there are many many authors for many publications in the set of 100 you can request at once), but not only should it greatly increase the performance of each harvest but should also cut down on these exceptions, since those will only occur if and when we need to fetch actual new publications (instead of happening on each harvest as they do now).
It also eliminates what I believe to be an unused class (
WebOfScience::RestRetriever
) and it moves it's spec to theWebOfScience::BaseRestRetriever
class (which is very similar, and is subclassed and thus used). While theWebOfScience::RestRetriever
class was added in the PR that switched APIs #1612, it doesn't appear to be used or subclassed anywhere.Finally, I removed (and thus allowed to be recreated) all of the VCR cassettes for the WoS API to be sure they the tests would still pass given these code changes. Note that a few expectations had to be updated because of the new cassettes (i.e. publication counts, which increased since the last time the VCR cassettes were created).
You may ask...why is this is a new problem given the switch the REST based API. And the answer is that it is not entirely a new problem. The old SOAP based API has similar issues when requesting full publication records for publications with lots and lots of authors. BUT: the older SOAP based API had an easier mechanism to request only UIDs on harvesting, and that's what we were doing. The new REST based API makes a bit more work to replicate this functionality. So this PR basically makes the new REST based API code work more like the older SOAP based API code worked. Namely, it only requests UIDs on each harvest instead of full publication records, and then goes back to fetch the full records only for any publications we actually need to add to our database. Which should reduce the occurrence of the problem.
How was this change tested?
Existing specs (allowing VCR cassettes to be re-created)