Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

only fetch UIDs and not full records from WoS #1660

Merged
merged 3 commits into from
Nov 15, 2023
Merged

Conversation

peetucket
Copy link
Member

@peetucket peetucket commented Oct 19, 2023

Why was this change made?

To start dealing with issues in #1642

Basically, it is very inefficient to request the entire record from WoS when we only need the WOS_UIDs to determine if we have new publications or not. This alters the logic so we use a different endpoint provided by WoS when harvesting (because for each harvest, we really only need to request UIDs to verify if we have new publications to add or not).

This won't fix the problem in #1642 completely, since if you have new publications to add, you still need to fetch the full records (which will still blow up if there are many many authors for many publications in the set of 100 you can request at once), but not only should it greatly increase the performance of each harvest but should also cut down on these exceptions, since those will only occur if and when we need to fetch actual new publications (instead of happening on each harvest as they do now).

It also eliminates what I believe to be an unused class (WebOfScience::RestRetriever) and it moves it's spec to the WebOfScience::BaseRestRetriever class (which is very similar, and is subclassed and thus used). While the WebOfScience::RestRetriever class was added in the PR that switched APIs #1612, it doesn't appear to be used or subclassed anywhere.

Finally, I removed (and thus allowed to be recreated) all of the VCR cassettes for the WoS API to be sure they the tests would still pass given these code changes. Note that a few expectations had to be updated because of the new cassettes (i.e. publication counts, which increased since the last time the VCR cassettes were created).

You may ask...why is this is a new problem given the switch the REST based API. And the answer is that it is not entirely a new problem. The old SOAP based API has similar issues when requesting full publication records for publications with lots and lots of authors. BUT: the older SOAP based API had an easier mechanism to request only UIDs on harvesting, and that's what we were doing. The new REST based API makes a bit more work to replicate this functionality. So this PR basically makes the new REST based API code work more like the older SOAP based API code worked. Namely, it only requests UIDs on each harvest instead of full publication records, and then goes back to fetch the full records only for any publications we actually need to add to our database. Which should reduce the occurrence of the problem.

How was this change tested?

Existing specs (allowing VCR cassettes to be re-created)

@peetucket peetucket changed the title [WIP] remove unused class, only fetch UIDs and not full records from WoS [WIP] only fetch UIDs and not full records from WoS Oct 19, 2023
@@ -6,7 +6,7 @@ module WebOfScience
# - the "next_batch?" is like "next?"
# - the "next_batch" is like "next"
class BaseRestRetriever
attr_reader :records_found, :records_retrieved
attr_reader :records_found, :records_retrieved, :query_id

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need the query_id to be available to our new UidsQueryRestsRetriever class

uids += next_batch.uids while next_batch?
uids
end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this inefficient method is going away, to be supplanted by our new more efficient method in UidsQueryRestsRetriever

the only consumer of it will now use our new method


delegate :client, to: WebOfScience

# @return [Boolean] all records retrieved?
def records_retrieved?
@batch_one.nil? ? false : records_retrieved == records_found
@batch_one.blank? ? false : records_retrieved == records_found
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching the nil? to blank? since our new method deals with arrays, and we want the same logic that applies here to work for an empty array too

@@ -45,13 +37,13 @@ def next_batch
# this is the maximum number that can be returned in a single query by WoS
MAX_RECORDS = 100

attr_reader :batch_size, :query, :query_id, :path, :params
attr_reader :batch_size, :query, :path, :params

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this attr_reader public

WebOfScience::UidsQueryRestRetriever.new(query, query_params:, batch_size:)
end

# Convenience method, does the params_for_search expansion
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our new method to query just for uids (we will use it from QueryName and QueryOrcid)

describe WebOfScience::RestRetriever, :vcr do
let(:retriever) { described_class.new(params) }
describe WebOfScience::BaseRestRetriever, :vcr do
let(:retriever) { described_class.new('/', params) }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I can tell, the RestRetriever class is not used anywhere and instead we use the BaseRestRetriever class ... just updating the existing spec for the unused class for the one we use

@@ -16,7 +16,7 @@ def initialize(author, options = {})
def uids
return [] unless valid?

queries.user_query(name_query, query_params:).merged_uids
queries.user_query_uids(name_query, query_params:).merged_uids
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hits our new efficient method instead of the old one (which is now removed)

@@ -16,7 +16,7 @@ def initialize(author, options = {})
def uids
return [] unless valid?

queries.user_query(orcid_query, query_params:).merged_uids
queries.user_query_uids(orcid_query, query_params:).merged_uids
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hits our new efficient method instead of the old one (which is now removed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class deleted, didn't seem to be used anywhere.

records
end
end
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our new query class whose job it is to return just UIDs efficiently (by only requesting the UIDs and not the full records).

@peetucket peetucket marked this pull request as ready for review October 31, 2023 18:17
@peetucket peetucket changed the title [WIP] only fetch UIDs and not full records from WoS only fetch UIDs and not full records from WoS Nov 15, 2023
@peetucket peetucket merged commit d07e5ab into main Nov 15, 2023
@peetucket peetucket deleted the retrieve-ids-only branch November 15, 2023 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants