Needed: Map missing user_id -> user_handle with remote services #79
Comments
The remote lookup is of course much slower than any local operation, so I would suggest making it an optional preprocessing step. My suggestion is a separate remote lookup script with arguments for what you want to retrieve, depending on your use case, available bandwidth, disk space, time, etc.
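To make the "separate script with arguments" idea concrete, here is a minimal sketch of what its command line could look like. Every name in it (`build_arg_parser`, `archive_path`, `--handles`, `--tweets`, `--max-requests`) is hypothetical and not taken from the project; it only illustrates letting the user opt in to each kind of remote retrieval.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build the CLI for a hypothetical stand-alone remote lookup script."""
    parser = argparse.ArgumentParser(
        description="Optional preprocessing: fetch remote data for a Twitter archive."
    )
    parser.add_argument("archive_path", help="Path to the unpacked archive folder")
    # Each flag switches on one kind of remote retrieval (hypothetical names).
    parser.add_argument("--handles", action="store_true",
                        help="Resolve missing user IDs to handles")
    parser.add_argument("--tweets", action="store_true",
                        help="Download full tweet data")
    # Lets users with little time or bandwidth cap the remote work.
    parser.add_argument("--max-requests", type=int, default=None,
                        help="Stop after this many remote requests")
    return parser
```

Calling `build_arg_parser().parse_args()` in such a script would then yield the chosen options, and nothing remote happens unless a flag asks for it.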
We currently have four available ways to resolve user IDs to user handles:
Footnotes:
My intuition would be to focus on Search in existing archive data plus via Guest API, and keep the other two approaches as a possible fallback if the Guest API should fail in the future. The Guest API seems extremely powerful, since it gives us access to full profile data as well as to full tweets, which is also useful for #6, #20, #22, #39, #72, #73.
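One plausible reading of "search in existing archive data" is sketched below: collect every ID/handle pair that already appears in the archived tweets before going remote. The function name `handles_from_tweets` is hypothetical, and the field names (`entities.user_mentions`, `id_str`, `screen_name`, `in_reply_to_user_id_str`, `in_reply_to_screen_name`) are assumptions based on the usual shape of Twitter tweet objects, not something confirmed in this thread.

```python
from typing import Dict, Iterable


def handles_from_tweets(tweets: Iterable[dict]) -> Dict[str, str]:
    """Collect user_id -> handle pairs that already occur in archived tweets."""
    known: Dict[str, str] = {}
    for tweet in tweets:
        # Mentions carry both the numeric ID and the screen name (assumed fields).
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            if "id_str" in mention and "screen_name" in mention:
                known[mention["id_str"]] = mention["screen_name"]
        # Replies name the user being replied to (assumed fields).
        reply_id = tweet.get("in_reply_to_user_id_str")
        reply_handle = tweet.get("in_reply_to_screen_name")
        if reply_id and reply_handle:
            known[reply_id] = reply_handle
    return known
```

Only IDs that are still unknown after this local pass would need a remote lookup.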
(Thank you @lenaschimmel! I was just about to ask these questions and try to collate everything, and you've done it much better than I would have.) Is the Guest API documented somewhere? @press-rouch's branch seems to have an entire module for twitter_guest_api - is that published as a package?
I couldn't find much information about the Twitter Guest API, but @nogira has a TypeScript and a Rust implementation of it. Interesting bit in their README, written three months ago:
I did not check if this is still true, but that would be huge!
twitter_guest_api is my heavily adapted version of twitter-video-dl (kudos to @inteoryx) - I stuck it in a module so it wouldn't clutter the root. I believe it is reverse engineered from the website's JavaScript, so it is likely to be fragile in the medium to long term (although the first version was written over a year ago, so there hasn't been much churn in that period).
@lenaschimmel Confirmed - |
I'd be interested to see a minimal code snippet for retrieving a handle.
Here's my latest twitter_guest_api implementation. Example:
I was just about to push the change into my fork but then I noticed the metadata for tweets was missing image URLs and the alt text, so I'm looking into that now. User account data seems to be complete though.
Hooray for undocumented features! Adding |
As stand-alone code, with some debug prints:
Output shows 7 get/posts are needed:
Is this as simple as it gets, do we think? I don't have much experience in this area.
The first 6 requests are one-offs; if you're getting multiple users, each further user only requires one more request. I think we could hard-code the bearer token. It looks like it hasn't changed in 2 years. That would mean we then only have to make one initialisation request (getting the guest token), then one request per user. I've just noticed that we could batch up 100 user lookups at a time via this endpoint. There's a matching one for tweets.
Okay, that lookup endpoint is a game-changer. I can get 250 accounts in about a second. Will try the same for tweets and get it all pushed.
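The batching idea can be sketched without touching the network. This is not the code from the PR, only an illustration under assumptions: `fetch_batch` is a hypothetical callable standing in for the real remote request (guest token, headers, and endpoint are deliberately left out), and the response fields `id_str` and `screen_name` are assumed from the usual Twitter user object.

```python
from typing import Callable, Dict, Iterator, List, Sequence

BATCH_SIZE = 100  # batch size mentioned in the thread


def chunked(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def lookup_handles(
    user_ids: Sequence[str],
    fetch_batch: Callable[[List[str]], List[dict]],
) -> Dict[str, str]:
    """Resolve IDs to handles with one fetch_batch call per batch of IDs.

    fetch_batch is a hypothetical stand-in for the remote lookup: it takes
    up to 100 IDs and returns one dict per user that could be resolved.
    """
    handles: Dict[str, str] = {}
    for batch in chunked(user_ids):
        for user in fetch_batch(batch):
            handles[user["id_str"]] = user["screen_name"]
    return handles
```

With 250 IDs this makes three calls (100, 100, 50) instead of 250 single-user requests, which is where the speed-up comes from.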
With a hard-coded bearer token (and single-user per query access) as minimal stand-alone code:
@press-rouch If we can keep the code in the PR down to this vague sort of size (all in parser.py) then I will be very happy. Let's take small steps and not have more sophistication than we need.
Here's the batch version:
which is a bit longer, but orders of magnitude faster when |
Curious if the work being done in this issue will affect how many of my followers in Example:
This is the account for thoughtbot: https://twitter.com/thoughtbot
@press-rouch It says it's rate limited to 900 (or 300?) requests per 15-minute window. Does that mean with the batch call this limits us to 90,000 users (or 30,000) per 15-minute window? For many of us that will be plenty and for now it's fine but might have to consider that later. (I only mention because Neil Gaiman likely had lots of followers and is looking for a tool to parse his Twitter archive. :) )
@stepheneb Yes! That and DMs.md. We should be able to resolve many of those
When I was making thousands of individual tweet requests, I found I would eventually hit a 429 "rate limit exceeded" status code. The trivial workaround for this was to just request a new guest token (which seems to make guest access superior to a registered user!).
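The arithmetic behind the rate-limit question can be checked with a few lines. This is only a worked calculation, assuming 100 users per batch request and the 900 or 300 requests per 15-minute window quoted in the comment; the function names are made up for illustration.

```python
import math


def users_per_window(requests_per_window: int, batch_size: int = 100) -> int:
    """Upper bound on users resolvable in one rate-limit window."""
    return requests_per_window * batch_size


def windows_needed(n_users: int, requests_per_window: int, batch_size: int = 100) -> int:
    """Number of 15-minute windows needed to look up n_users in batches."""
    requests = math.ceil(n_users / batch_size)
    return math.ceil(requests / requests_per_window)


print(users_per_window(900))            # 90000
print(users_per_window(300))            # 30000
print(windows_needed(1_000_000, 300))   # 34
```

So an account with a million followers would need 10,000 batch requests, or 34 windows (about eight and a half hours) at the lower limit, if the limit applies as quoted.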
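The workaround described above (on a 429, get a fresh guest token and try again) could look roughly like this. It is a sketch, not the code from the fork: `do_request` and `get_guest_token` are hypothetical callables that wrap the actual HTTP calls, and the cap of three refreshes is an arbitrary choice.

```python
from typing import Callable, Tuple


def fetch_with_token_refresh(
    do_request: Callable[[str], Tuple[int, object]],
    get_guest_token: Callable[[], str],
    max_refreshes: int = 3,
) -> Tuple[int, object]:
    """Call do_request(token); on HTTP 429, request a new guest token and retry.

    do_request returns (status_code, payload). Any status other than 429 is
    handed back to the caller unchanged.
    """
    token = get_guest_token()
    refreshes = 0
    while True:
        status, payload = do_request(token)
        if status != 429:
            return status, payload
        if refreshes >= max_refreshes:
            raise RuntimeError("still rate limited after refreshing the guest token")
        # Rate limited: swap in a fresh guest token and try the same request again.
        token = get_guest_token()
        refreshes += 1
```

Capping the number of refreshes keeps the loop from spinning forever if the limit turns out to be tied to something other than the guest token.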
to be clearer, when i say guest token i mean the guest bearer token rather than the
this is the user timeline endpoint i tested it on

    const token = "AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA";
    const fetchTweetsFromUser = async (screenName: string, count: number) => {
      const response = await fetch(
        `https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=${screenName}&count=${count}`,
        {
          headers: {
            Authorization: `Bearer ${token}`,
          },
        }
      );
      const json = await response.json();
      return json;
    };
    await fetchTweetsFromUser("elonmusk", 10).then(console.log);
@nogira Interesting, thanks. I just had a quick try and it appears that the user timeline endpoint works using only the bearer token, but the user and tweet lookup endpoints return a 429 without |
This will improve followers.txt, following.txt, and DMs.md, where currently many handles are missing.
Suggestions and initial work on how to retrieve these handles have happened in several places recently. Thank you!
Ping: @flauschzelle, @lenaschimmel, @press-rouch, @n1ckfg (but of course anyone can contribute)
[Edit: the PRs that were mentioned are now merged]