TweetSets Data Dictionary

Laura Wrubel edited this page Sep 28, 2021 · 6 revisions

For more information, see the full version of the schema used by TweetSets in parsing the tweet JSON.

When store_tweet is True, all fields present in the tweet's JSON representation are available in the JSON extract. A subset of those fields is exposed in the CSV extract; an overlapping but slightly different set of fields is searchable in the Elasticsearch (ES) index.

CSV fields are derived (with some exceptions, as noted below) from the twarc.json2csv tool, v. 1.12.1.

In the following definitions, elements are presumed to lie at the root of the Tweet schema if not prefixed by another element.

CSV ElasticSearch Tweet JSON or description
id tweet_id id_str
tweet_url persistent URL to the Tweet, using the format https://twitter.com/{user.screen_name}/status/{id_str}
created_at the unparsed timestamp of the Tweet
parsed_created_at created_at the timestamp of the Tweet, parsed according to [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators). The TweetSets (TS) representation differs slightly from twarc's, though both are valid implementations: twarc 1.12.1: 2020-03-05 01:54:04+00:00; TweetSets 2.2: 2020-03-05T01:54:04Z
user_screen_name user_screen_name from user.screen_name
text text Uses extended_tweet.full_text (or retweeted_status.extended_tweet.full_text for retweets) when present; otherwise, defaults to the text field. CSV field has been normalized by removal of newline characters.
tweet_type tweet_type one of ['reply', 'retweet', 'quote', 'original']. The first three values depend on the presence of the in_reply_to_status_id, retweeted_status, or quoted_status fields, respectively; 'original' applies when none of these fields is present.
coordinates from coordinates.coordinates.
hashtags hashtags derived from one of the following fields (in order of preference): retweeted_status.extended_tweet.entities.hashtags, extended_tweet.entities.hashtags, retweeted_status.entities.hashtags, entities.hashtags. Follows json2csv 1.12.1 in the inclusion of retweeted_status but adds the extended_tweet field to retrieve hashtags that would otherwise be truncated. (ES field is normalized to lowercase.)
media a list of the media_url_https elements from either extended_entities.media (if present) or entities.media.
urls urls a list of the expanded_url elements from both the extended_tweet.entities.urls and entities.urls fields. These fields, when both are present, do not always have the same contents. The TS field represents the union of both (when both are present), or else the expanded_url elements of entities.urls. In ES, the elements are normalized to lowercase and have https:// replaced with http://.
favorite_count favorite_count from either retweeted_status.favorite_count or favorite_count.
in_reply_to_user_id in_reply_to_user_id in_reply_to_user_id_str
in_reply_to_screen_name in_reply_to_screen_name
in_reply_to_status_id in_reply_to_status_id in_reply_to_status_id_str
lang language lang
place place.full_name
possibly_sensitive possibly_sensitive
retweet_count retweet_count
retweet_or_quote_id retweet_quoted_status_id either retweeted_status.id_str (if retweet), quoted_status.id_str (if quote), or null.
retweet_or_quote_screen_name retweeted_quoted_screen_name either retweeted_status.user.screen_name (if retweet), quoted_status.user.screen_name (if quote), or null.
retweet_or_quote_user_id retweeted_quoted_user_id either retweeted_status.user.id_str (if retweet), quoted_status.user.id_str (if quote), or null.
source source
user_id user_id user.id_str
user_created_at user.created_at
user_default_profile_image user.default_profile_image
user_description user.description, with newline characters removed.
user_favourites_count user.favourites_count
user_followers_count user_follower_count user.followers_count
user_friends_count user.friends_count
user_listed_count user.listed_count
user_location user.location CSV field has been normalized by removal of newline characters.
user_name user.name, with newline characters removed.
user_statuses_count user.statuses_count
user_verified user_verified user.verified
user_utc_offset user_utc_offset
user_time_zone user_time_zone user.time_zone
user_language user.lang
mention_user_ids the id_str elements from either extended_tweet.entities.user_mentions (if present) or entities.user_mentions.
mention_screen_names the screen_name elements from either extended_tweet.entities.user_mentions (if present) or entities.user_mentions.
has_geo Boolean field; true if any of the following fields are present: geo, place, coordinates.
has_media Boolean; true if any of the following are present: entities.media, extended_tweet.entities.media.
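
A few of the derived fields defined above (tweet_type, text, tweet_url, and the ES created_at timestamp) can be sketched in Python as follows. This is an illustration only, not the TweetSets or twarc source. The function names, the order in which the tweet_type checks are applied, and the choice to replace newlines with a space are assumptions; the sample tweet is made up.

```python
from datetime import datetime, timezone


def tweet_type(tweet):
    """Classify a tweet. The precedence of the checks is an assumption."""
    if tweet.get("in_reply_to_status_id"):
        return "reply"
    if "retweeted_status" in tweet:
        return "retweet"
    if "quoted_status" in tweet:
        return "quote"
    return "original"


def tweet_text(tweet):
    """Prefer the extended full_text (the retweeted status first), else text."""
    retweet = tweet.get("retweeted_status") or {}
    for candidate in (retweet, tweet):
        extended = candidate.get("extended_tweet")
        if extended and "full_text" in extended:
            return extended["full_text"]
    return tweet["text"]


def csv_text(tweet):
    """CSV variant without newlines (replacing with a space is an assumption)."""
    return tweet_text(tweet).replace("\n", " ")


def tweet_url(tweet):
    """Build the persistent URL from user.screen_name and id_str."""
    return "https://twitter.com/{}/status/{}".format(
        tweet["user"]["screen_name"], tweet["id_str"]
    )


def parsed_created_at(tweet):
    """Render created_at in the TweetSets 2.2 style shown above."""
    # Twitter's raw format, e.g. "Thu Mar 05 01:54:04 +0000 2020".
    # %a and %b assume an English (C) locale.
    parsed = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Made-up sample tweet containing only the fields used above.
sample = {
    "id_str": "1234567890",
    "created_at": "Thu Mar 05 01:54:04 +0000 2020",
    "text": "Truncated text",
    "extended_tweet": {"full_text": "Full text,\nsecond line"},
    "in_reply_to_status_id": None,
    "user": {"screen_name": "example_user"},
}

print(tweet_type(sample))
print(csv_text(sample))
print(tweet_url(sample))
print(parsed_created_at(sample))
```

Calling str() on the parsed datetime instead of strftime yields the twarc-style form with a space separator and a +00:00 offset.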