Needed: Map missing user_id -> user_handle with remote services #79

Closed
timhutton opened this issue Nov 19, 2022 · 19 comments · Fixed by #87


@timhutton
Owner

timhutton commented Nov 19, 2022

This will improve followers.txt, following.txt, DMs.md where currently many handles are missing.

Suggestions and initial work on how to retrieve these handles have appeared in several places recently. Thank you!

Ping: @flauschzelle, @lenaschimmel, @press-rouch, @n1ckfg (but of course anyone can contribute)

[Edit: the PRs that were mentioned are now merged]

timhutton added the "help wanted" label Nov 19, 2022
@press-rouch
Collaborator

press-rouch commented Nov 19, 2022

The remote lookup is of course much slower than any local operation, so I would suggest that it happen as an optional preprocessing step. My suggestion is a separate remote lookup script with arguments for what you want to retrieve, depending on your use case, available bandwidth, disk space, time, etc.
e.g.
remote_lookup.py --dm_users --following --likes --liked_photos --liked_videos --thread_replies
This would use the guest api to build a database of all the data on disk, which parser.py can then optionally use to populate its output.
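A rough sketch of what the command-line side of such a script might look like. Only the flag names come from the example above; the function name, help texts, and the idea of printing the selection are placeholders for illustration, not an existing implementation.

```python
import argparse

def build_arg_parser():
    """Build a parser for a hypothetical remote_lookup.py script."""
    parser = argparse.ArgumentParser(
        prog="remote_lookup.py",
        description="Optionally fetch remote data before running parser.py (sketch).")
    # Flag names are taken from the example command line; everything else is assumed.
    for flag in ("dm_users", "following", "likes", "liked_photos",
                 "liked_videos", "thread_replies"):
        parser.add_argument(f"--{flag}", action="store_true",
                            help=f"retrieve remote data for {flag}")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_arg_parser().parse_args(["--dm_users", "--following"])
selected = [name for name, enabled in vars(args).items() if enabled]
print("Would look up:", ", ".join(selected))
```

Every flag defaults to off, so anyone who runs the script without arguments would make no remote requests at all.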

@lenaschimmel
Collaborator

lenaschimmel commented Nov 19, 2022

We currently have four available ways to resolve user IDs to user handles:

| Method | Tested | Reliability | Available data | Standalone script | Integration into main script | Speed |
|---|---|---|---|---|---|---|
| Search in existing archive data | yes | works offline -> perfect | handle, sometimes display name | yes | yes * | instant |
| via tweeterid.com | yes | frequent outages **, *** | only handle | yes | yes * | 1.5 sec per id |
| via Standard API | no | should be good *** | probably full profile data | yes | no | >= 1 sec per id |
| via Guest API | yes | should be good *** | full profile data | yes | no | <= 0.5 sec per id |

Footnotes:

  • * in my fork, already functioning well, but needs some cleanup
  • ** often works for a few minutes, fails for a few minutes, works...
  • *** as long as twitter.com still works, and as they don't change or switch off the API

My intuition would be to focus on Search in existing archive data plus via Guest API, and keep the other two approaches as a possible fallback if the Guest API should fail in the future.

The Guest API seems extremely powerful, since it gives us access to full profile data as well as to full tweets, which is also useful for #6, #20, #22, #39, #72, #73.
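The "archive first, Guest API second, others as fallback" idea could be sketched as a simple chain of lookup functions tried in order. All names here (`resolve_handle`, `lookup_in_archive`, `lookup_via_guest_api`) are hypothetical stand-ins, and the remote lookup is faked so the sketch runs offline.

```python
def resolve_handle(user_id, lookups):
    """Return the first handle found by trying each lookup in order, else None."""
    for lookup in lookups:
        try:
            handle = lookup(user_id)
        except Exception:
            # A failing remote service should not stop the chain; try the next one.
            continue
        if handle:
            return handle
    return None

# Hypothetical stand-in for handles already present in the archive data.
known_from_archive = {"14114392": "thoughtbot"}

def lookup_in_archive(user_id):
    """Offline lookup: instant, but only knows ids seen in the archive."""
    return known_from_archive.get(user_id)

def lookup_via_guest_api(user_id):
    """Placeholder for a remote lookup; always fails in this offline sketch."""
    raise ConnectionError("remote lookup not available in this sketch")

print(resolve_handle("14114392", [lookup_in_archive, lookup_via_guest_api]))
print(resolve_handle("1", [lookup_in_archive, lookup_via_guest_api]))
```

Keeping the lookups as plain functions in a list would make it cheap to append tweeterid.com or the Standard API later if the Guest API stops working.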

@timhutton
Owner Author

(Thank you @lenaschimmel! I was just about to ask these questions and try to collate everything, and you've done it much better than I would have.)

Is the Guest API documented somewhere? @press-rouch's branch seems to have an entire module for twitter_guest_api - is that published as a package?

@lenaschimmel
Collaborator

I couldn't find much information about the Twitter Guest API, but @nogira has a TypeScript and a Rust implementation of it. Interesting bit in their README, written three months ago:

-- iT SEEMS THE TWITTER STANDARD V1.1 API ACTUALLY WORKS WITH GUEST TOKEN TOO --

I did not check if this is still true, but that would be huge!

@press-rouch
Collaborator

press-rouch commented Nov 19, 2022

twitter_guest_api is my heavily adapted version of twitter-video-dl (kudos to @inteoryx) - I stuck it in a module so it wouldn't clutter the root. I believe it is reverse engineered from the website javascript, so it is likely to be fragile in the medium to long term (although the first version was written over a year ago, so there hasn't been much churn in that period).
The original code did a one-shot query - I refactored it to do one-time initialisation of the headers, refreshing of the guest token when it expires, and multiple query types.

@press-rouch
Collaborator

@lenaschimmel Confirmed - https://api.twitter.com/1.1/users/show.json?user_id=<id> works with the same bearer token and guest token headers as my current implementation! It gets more fields, including the most recent post, but not such an overwhelming amount that it hits bandwidth too badly. This should allow me to strip down the twitter_guest_api implementation significantly. It still needs to get hold of the tokens, but it doesn't have to muck around with the exploratory requests and query mapping.

@timhutton
Owner Author

I'd be interested to see a minimal code snippet for retrieving a handle.

@press-rouch
Collaborator

press-rouch commented Nov 19, 2022

Here's my latest twitter_guest_api implementation.

Example:

import requests
import twitter_guest_api

with requests.Session() as session:
    api = twitter_guest_api.TwitterGuestAPI(session)
    user_id = "1389666600341544960"
    user = api.get_account(session, user_id)
    print(f"{user_id} mapped to {user['name']} ({user['screen_name']})")

I was just about to push the change into my fork but then I noticed the metadata for tweets was missing image URLs and the alt text, so I'm looking into that now. User account data seems to be complete though.

@press-rouch
Collaborator

Hooray for undocumented features! Adding tweet_mode=extended appears to get me the image URLs and alt text.
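For illustration, a minimal sketch of how the extra parameter might be appended to a single-tweet query. The endpoint URL is the statuses/show one used elsewhere in this thread; `build_tweet_url` is a hypothetical helper, and whether `tweet_mode=extended` keeps returning image URLs and alt text is an observation from this thread, not documented behaviour.

```python
from urllib.parse import urlencode

SHOW_STATUS_ENDPOINT = "https://api.twitter.com/1.1/statuses/show.json"

def build_tweet_url(tweet_id, extended=True):
    """Build the query URL for one tweet, optionally asking for extended mode."""
    params = {"id": tweet_id}
    if extended:
        # Observed in this thread to add image URLs and alt text to the response.
        params["tweet_mode"] = "extended"
    return f"{SHOW_STATUS_ENDPOINT}?{urlencode(params)}"

print(build_tweet_url("20"))
```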

@timhutton
Owner Author

As stand-alone code, with some debug prints:

"""Utilities for downloading from Twitter"""

import json
import logging
import re
import requests

# https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-show-id
SHOW_STATUS_ENDPOINT = "https://api.twitter.com/1.1/statuses/show.json"
# https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-users-show
SHOW_USER_ENDPOINT = "https://api.twitter.com/1.1/users/show.json"
# Undocumented!
GUEST_TOKEN_ENDPOINT = "https://api.twitter.com/1.1/guest/activate.json"
BEARER_TOKEN_PATTERN = re.compile(r'"(AAA\w+%\w+)"')

def send_request(url, session_method, headers):
    """Attempt an http request"""
    print('Sending request:', url, session_method.__name__, headers)
    response = session_method(url, headers=headers, stream=True)
    if response.status_code != 200:
        raise Exception(f"Failed request to {url}: {response.status_code} {response.reason}")
    return response.content.decode("utf-8")

def get_guest_token(session, headers):
    """Request a guest token and add it to the headers"""
    print('post:', GUEST_TOKEN_ENDPOINT, headers)
    guest_token_response = session.post(GUEST_TOKEN_ENDPOINT, headers=headers, stream=True)
    guest_token_json = json.loads(guest_token_response.content)
    # Use .get() so a missing key reaches the check below instead of raising KeyError
    guest_token = guest_token_json.get('guest_token')
    if not guest_token:
        raise Exception("Failed to retrieve guest token")
    logging.info("Retrieved guest token %s", guest_token)
    headers['x-guest-token'] = guest_token

def get_response(url, session, headers):
    """Attempt to get the requested url. If the guest token has expired, get a new one and retry."""
    print('get:', url, headers)
    response = session.get(url, headers=headers, stream=True)
    if response.status_code == 429:
        # rate limit exceeded?
        logging.warning("Error %i: %s", response.status_code, response.text.strip())
        logging.info("Trying new guest token")
        get_guest_token(session, headers)
        print('get:', url, headers)
        response = session.get(url, headers=headers, stream=True)
    return response

def initialise_headers(session, url):
    """Populate http headers with necessary information for Twitter queries"""
    headers = {}

    # One of the js files from original url holds the bearer token and query id.
    container = send_request(url, session.get, headers)
    js_files = re.findall("src=['\"]([^'\"()]*js)['\"]", container)

    bearer_token = None
    # Search the javascript files for a bearer token and query ids
    for jsfile in js_files:
        logging.debug("Processing %s", jsfile)
        file_content = send_request(jsfile, session.get, headers)
        find_bearer_token = BEARER_TOKEN_PATTERN.search(file_content)

        if find_bearer_token:
            bearer_token = find_bearer_token.group(1)
            logging.info("Retrieved bearer token: %s", bearer_token)
            break

    if not bearer_token:
        raise Exception("Did not find bearer token.")

    headers['authorization'] = f"Bearer {bearer_token}"

    get_guest_token(session, headers)
    return headers

class TwitterGuestAPI:
    """Class to query Twitter API without a developer account"""
    def __init__(self, session):
        self.headers = initialise_headers(session, "https://www.twitter.com")

    def get_account(self, session, account_id):
        """Get the json metadata for a user account"""
        query_url = f"{SHOW_USER_ENDPOINT}?user_id={account_id}"
        response = get_response(query_url, session, self.headers)
        if response.status_code == 200:
            status_json = json.loads(response.content)
            return status_json
        logging.error("Failed to get account %s: (%i) %s",
                      account_id, response.status_code, response.reason)
        return None

    def get_tweet(self, session, tweet_id, include_user=True, include_alt_text=True):
        """
        Get the json metadata for a single tweet.
        If include_user is False, you will only get a numerical id for the user.
        """
        query_url = f"{SHOW_STATUS_ENDPOINT}?id={tweet_id}"
        if not include_user:
            query_url += "&trim_user=1"
        if include_alt_text:
            query_url += "&include_ext_alt_text=1"
        response = get_response(query_url, session, self.headers)
        if response.status_code == 200:
            status_json = json.loads(response.content)
            return status_json
        logging.error("Failed to get tweet %s: (%i) %s",
                      tweet_id, response.status_code, response.reason)
        return None


with requests.Session() as session:
    api = TwitterGuestAPI(session)
    user_id = "1389666600341544960"
    user = api.get_account(session, user_id)
    print(f"{user_id} mapped to {user['name']} ({user['screen_name']})")

Output shows 7 get/posts are needed:

Sending request: https://www.twitter.com get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.c7dfc719.js get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/vendor.d9a7d629.js get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/i18n/en.89426ec9.js get {}
Sending request: https://abs.twimg.com/responsive-web/client-web-legacy/main.6de340c9.js get {}
post: https://api.twitter.com/1.1/guest/activate.json {'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'}
get: https://api.twitter.com/1.1/users/show.json?user_id=1389666600341544960 {'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA', 'x-guest-token': '1594029370401869825'}
1389666600341544960 mapped to Dan Luu (altluu)

Is this as simple as it gets, do we think? I don't have much experience in this area.

@press-rouch
Collaborator

The first 6 requests are one-offs; if you're getting multiple users, each further one only requires one more request.

I think we could hard-code the bearer token. It looks like it hasn't changed in 2 years. That would mean we then only have to make one initialisation request (getting the guest token), then a request per user.

I've just noticed that we could batch up 100 user lookups at a time via the users/lookup endpoint. There's a matching one for tweets.

@press-rouch
Collaborator

Okay, that lookup endpoint is a game-changer. I can get 250 accounts in about a second. Will try the same for tweets and get it all pushed.

@timhutton
Owner Author

timhutton commented Nov 19, 2022

With a hard-coded bearer token (and single-user per query access) as minimal stand-alone code:

import json
import requests


def get_twitter_api_guest_token(session, bearer_token):
    """Returns a Twitter API guest token for the current session."""
    guest_token_response = session.post("https://api.twitter.com/1.1/guest/activate.json",
                                        headers={'authorization': f'Bearer {bearer_token}'})
    if not guest_token_response.status_code == 200:
        raise Exception(f'Failed to retrieve guest token from Twitter API: {guest_token_response}')
    return json.loads(guest_token_response.content)['guest_token']


def get_twitter_user(session, bearer_token, guest_token, user_id):
    """Asks the Twitter API for the user details associated with this user_id."""
    query_url = f"https://api.twitter.com/1.1/users/show.json?user_id={user_id}"
    response = session.get(query_url,
                           headers={'authorization': f'Bearer {bearer_token}', 'x-guest-token': guest_token})
    if not response.status_code == 200:
        raise Exception(f'Failed to retrieve user from Twitter API: {response}')
    return json.loads(response.content)


with requests.Session() as session:
    bearer_token = 'AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
    guest_token = get_twitter_api_guest_token(session, bearer_token)
    user_ids = ['1389666600341544960', '2246902119']
    for user_id in user_ids:
        user = get_twitter_user(session, bearer_token, guest_token, user_id)
        print(f"{user_id} = {user['screen_name']}")

@press-rouch If we can keep the code in the PR down to this vague sort of size (all in parser.py) then I will be very happy. Let's take small steps and not have more sophistication than we need.

@press-rouch
Collaborator

Here's the batch version:

import json
import requests

def get_twitter_api_guest_token(session, bearer_token):
    """Returns a Twitter API guest token for the current session."""
    guest_token_response = session.post("https://api.twitter.com/1.1/guest/activate.json",
                                        headers={'authorization': f'Bearer {bearer_token}'})
    # Use .get() so a missing key reaches the check below instead of raising KeyError
    guest_token = json.loads(guest_token_response.content).get('guest_token')
    if not guest_token:
        raise Exception("Failed to retrieve guest token")
    return guest_token

def get_twitter_users(session, bearer_token, guest_token, user_ids):
    """Asks Twitter for all metadata associated with user_ids."""
    users = {}
    while user_ids:
        max_batch = 100
        user_id_batch = user_ids[:max_batch]
        user_ids = user_ids[max_batch:]
        user_id_list = ",".join(user_id_batch)
        query_url = f"https://api.twitter.com/1.1/users/lookup.json?user_id={user_id_list}"
        response = session.get(query_url,
                               headers={'authorization': f'Bearer {bearer_token}', 'x-guest-token': guest_token})
        if not response.status_code == 200:
            raise Exception(f'Failed to get user handle: {response}')
        response_json = json.loads(response.content)
        for user in response_json:
            users[user["id_str"]] = user
    return users

with requests.Session() as session:
    bearer_token = 'AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'
    guest_token = get_twitter_api_guest_token(session, bearer_token)
    user_ids = ['1389666600341544960', '2246902119']
    users = get_twitter_users(session, bearer_token, guest_token, user_ids)
    for user_id in user_ids:
        print(f"{user_id} = {users[user_id]['screen_name']}")

which is a bit longer, but orders of magnitude faster when len(user_ids) > 100, and less over-engineered than my module.

@stepheneb

Curious if the work being done in this issue will affect how many of my followers in followers.txt resolve to ~unknown~handle~ (perhaps they are protected accounts) -- or if resolvable should I document in a new issue?

Example:

~unknown~handle~ https://twitter.com/i/user/14114392

This is the account for thoughtbot: https://twitter.com/thoughtbot

@timhutton
Owner Author

timhutton commented Nov 19, 2022

@press-rouch It says it's rate limited to 900 (or 300?) requests per 15-minute window. Does that mean with the batch call this limits us to 90,000 users (or 30,000) per 15-minute window? For many of us that will be plenty and for now it's fine but might have to consider that later. (I only mention because Neil Gaiman likely had lots of followers and is looking for a tool to parse his Twitter archive. :) )
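The arithmetic behind those numbers, as a small sketch. The batch size of 100 and the 900/300 limits are the figures quoted in this thread; the one-million follower count is a made-up example, and the helper names are hypothetical.

```python
import math

BATCH_SIZE = 100       # users per lookup request, as in the batch version above
WINDOW_MINUTES = 15    # length of one rate-limit window

def users_per_window(requests_per_window):
    """How many users can be resolved in one window at full batch size."""
    return requests_per_window * BATCH_SIZE

def windows_needed(user_count, requests_per_window):
    """How many rate-limit windows it takes to resolve user_count users."""
    return math.ceil(user_count / users_per_window(requests_per_window))

for limit in (900, 300):
    windows = windows_needed(1_000_000, limit)  # hypothetical follower count
    print(f"{limit} requests/window: {users_per_window(limit)} users per window, "
          f"{windows} windows ({windows * WINDOW_MINUTES} minutes) for 1,000,000 users")
```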

@stepheneb Yes! That and DMs.md. We should be able to resolve many of those ~unknown~handle~ placeholders with this work. It's not that they're protected; the archive itself simply doesn't contain that information. I suspect protected accounts will still not be resolved by this change, though.
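As a rough sketch of what resolving a placeholder could look like once an id-to-handle mapping is available: the line format is taken from the followers.txt example quoted above, while `fill_in_handle` and the output format are assumptions, not what parser.py actually does.

```python
import re

PLACEHOLDER = "~unknown~handle~"
USER_URL_PATTERN = re.compile(r"https://twitter\.com/i/user/(\d+)")

def fill_in_handle(line, handles):
    """Replace the placeholder with a handle if the user id in the line is known."""
    match = USER_URL_PATTERN.search(line)
    if PLACEHOLDER in line and match and match.group(1) in handles:
        return line.replace(PLACEHOLDER, handles[match.group(1)], 1)
    # Unknown id (for example a protected or deleted account): leave the line alone.
    return line

handles = {"14114392": "thoughtbot"}  # would come from a remote lookup
print(fill_in_handle("~unknown~handle~ https://twitter.com/i/user/14114392", handles))
```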

@press-rouch
Collaborator

When I was making thousands of individual tweet requests, I found I would eventually hit a 429 "rate limit exceeded" status code. The trivial workaround for this was to just request a new guest token (which seems to make guest access superior to a registered user!).
I would suggest that users with massive followings might want the option to skip processing their follower lists, as I expect they are probably not particularly interested in the names of every one of their fans.
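A minimal sketch of that workaround in the style of the stand-alone snippets above: on a 429, fetch a fresh guest token and retry. `get_with_token_refresh` and `max_refreshes` are hypothetical names; `fetch_guest_token` is assumed to have the same `(session, bearer_token)` signature as `get_twitter_api_guest_token`, and `session` is assumed to behave like a `requests.Session`.

```python
def get_with_token_refresh(session, url, bearer_token, guest_token,
                           fetch_guest_token, max_refreshes=1):
    """GET url; on HTTP 429, fetch a fresh guest token and retry.

    Returns (response, guest_token) so the caller can keep using the latest token.
    """
    for attempt in range(max_refreshes + 1):
        headers = {"authorization": f"Bearer {bearer_token}",
                   "x-guest-token": guest_token}
        response = session.get(url, headers=headers)
        # Stop on anything other than "rate limit exceeded", or when out of retries.
        if response.status_code != 429 or attempt == max_refreshes:
            break
        guest_token = fetch_guest_token(session, bearer_token)
    return response, guest_token
```

Returning the token alongside the response means a loop over many batches carries on with the refreshed token rather than hitting the same 429 on every request.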

timhutton added the "in progress" label and removed the "help wanted" label Nov 19, 2022
@ghost

ghost commented Nov 20, 2022

I couldn't find much information about the Twitter Guest API, but @nogira has a TypeScript and a Rust implementation of it. Interesting bit in their README, written three months ago:

-- iT SEEMS THE TWITTER STANDARD V1.1 API ACTUALLY WORKS WITH GUEST TOKEN TOO --

I did not check if this is still true, but that would be huge!

@press-rouch

to be clearer, when i say guest token i mean the guest bearer token rather than the x-guest-token. i don't believe the x-guest-token is needed for the standard V1.1 API

this is the user timeline endpoint i tested it on

const token = "AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA";
const fetchTweetsFromUser = async (screenName: string, count: number) => {
  const response = await fetch(
    `https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=${screenName}&count=${count}`,
    {
      headers: {
        Authorization: `Bearer ${token}`,
      },
    }
  );
  const json = await response.json();
  return json;
}
await fetchTweetsFromUser("elonmusk", 10).then(console.log);

@press-rouch
Collaborator

@nogira Interesting, thanks. I just had a quick try and it appears that the user timeline endpoint works using only the bearer token, but the user and tweet lookup endpoints return a 429 without x-guest-token.
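One way to capture that difference in code is a small header helper where the guest token is optional. This is only a sketch based on the behaviour observed in this thread (timeline worked with the bearer token alone, user and tweet lookups did not); `build_headers` is a hypothetical name.

```python
def build_headers(bearer_token, guest_token=None):
    """Build request headers; include x-guest-token only when one is supplied."""
    headers = {"authorization": f"Bearer {bearer_token}"}
    if guest_token is not None:
        # Observed here to be required for the user and tweet lookup endpoints.
        headers["x-guest-token"] = guest_token
    return headers

print(build_headers("EXAMPLE_BEARER"))
print(build_headers("EXAMPLE_BEARER", guest_token="EXAMPLE_GUEST"))
```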
