TweetSets Data Dictionary
Laura Wrubel edited this page Sep 28, 2021 · 6 revisions
For more information, see the full version of the schema used by TweetSets in parsing the tweet JSON.
When `store_tweet` is `True`, all fields present in the tweet's JSON representation are available in the JSON extract. A subset of those fields is exposed in the CSV extract; an overlapping but slightly different set of fields is searchable in the Elasticsearch (ES) index.
CSV fields are derived (with some exceptions, as noted below) from the twarc.json2csv tool, v. 1.12.1.
In the following definitions, elements are presumed to lie at the root of the Tweet schema if not prefixed by another element.
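As an illustration of this path notation, here is a minimal Python sketch of how a dotted path such as `user.screen_name` could be resolved against a parsed tweet. The helper name `get_path` and the sample tweet are hypothetical and are not part of TweetSets or twarc.

```python
from typing import Any


def get_path(tweet: dict, path: str, default: Any = None) -> Any:
    """Follow a dotted path (e.g. 'user.screen_name') from the root of a tweet dict.

    Hypothetical helper for illustration only.
    """
    current: Any = tweet
    for key in path.split("."):
        # Stop as soon as a level is missing or is not a JSON object.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up sample data, trimmed to the fields used here.
sample_tweet = {"id_str": "1", "user": {"screen_name": "example_user"}}

root_value = get_path(sample_tweet, "id_str")               # unprefixed: at the root
nested_value = get_path(sample_tweet, "user.screen_name")   # prefixed by "user"
missing_value = get_path(sample_tweet, "place.full_name")   # absent, so the default
```

Returning a default instead of raising keeps lookups simple for optional elements such as `place`, which many tweets do not carry.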
CSV | Elasticsearch | Tweet JSON or description |
---|---|---|
id | tweet_id | id_str |
tweet_url | | persistent URL to the Tweet, using the format `https://twitter.com/{user.screen_name}/status/{id_str}` |
created_at | | the unparsed timestamp of the Tweet |
parsed_created_at | created_at | the timestamp of the Tweet, parsed according to [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators). The TweetSets (TS) representation differs slightly from twarc's, though both are valid implementations: twarc 1.12.1: `2020-03-05 01:54:04+00:00`; TweetSets 2.2: `2020-03-05T01:54:04Z` |
user_screen_name | user_screen_name | from `user.screen_name` |
text | text | Uses `extended_tweet.full_text` (or `retweeted_status.extended_tweet.full_text` for retweets) when present; otherwise, defaults to the `text` field. CSV field has been normalized by removal of newline characters. |
tweet_type | tweet_type | one of `['reply', 'retweet', 'quote', 'original']`, depending on the presence of the `in_reply_to_status_id`, `retweeted_status`, or `quoted_status` fields, respectively; `original` applies when none of these is present. |
coordinates | | from `coordinates.coordinates`. |
hashtags | hashtags | derived from one of the following fields (in order of preference): `retweeted_status.extended_tweet.entities.hashtags`, `extended_tweet.entities.hashtags`, `retweeted_status.entities.hashtags`, `entities.hashtags`. Follows json2csv 1.12.1 in the inclusion of `retweeted_status` but adds the `extended_tweet` field to retrieve hashtags that would otherwise be truncated. (ES field is normalized to lowercase.) |
media | | from either `extended_entities.media` (if present) or `entities.media`, a list of `media_url_https` elements. |
urls | urls | a list of the `expanded_url` elements from both the `extended_tweet.entities.urls` and `entities.urls` fields. These fields, when both are present, do not always have the same contents. The TS field represents the union of both (when both are present), or else the `expanded_url` elements of `entities.urls`. In ES, the elements are normalized to lowercase and have `https://` replaced with `http://`. |
favorite_count | favorite_count | from either `retweeted_status.favorite_count` or `favorite_count`. |
in_reply_to_user_id | in_reply_to_user_id | in_reply_to_user_id_str |
in_reply_to_screen_name | in_reply_to_screen_name | |
in_reply_to_status_id | in_reply_to_status_id | in_reply_to_status_id_str |
lang | language | lang |
place | | place.full_name |
possibly_sensitive | possibly_sensitive | |
retweet_count | retweet_count | |
retweet_or_quote_id | retweet_quoted_status_id | either `retweeted_status.id_str` (if retweet), `quoted_status.id_str` (if quote), or null. |
retweet_or_quote_screen_name | retweeted_quoted_screen_name | either `retweeted_status.user.screen_name` (if retweet), `quoted_status.user.screen_name` (if quote), or null. |
retweet_or_quote_user_id | retweeted_quoted_user_id | either `retweeted_status.user.id_str` (if retweet), `quoted_status.user.id_str` (if quote), or null. |
source | source | |
user_id | user_id | user.id_str |
user_created_at | | user.created_at |
user_default_profile_image | | user.default_profile_image |
user_description | | `user.description`, with newline characters removed. |
user_favourites_count | | user.favourites_count |
user_followers_count | user_follower_count | user.followers_count |
user_friends_count | | user.friends_count |
user_listed_count | | user.listed_count |
user_location | user.location | CSV field has been normalized by removal of newline characters. |
user_name | | `user.name`, with newline characters removed. |
user_statuses_count | | user.statuses_count |
user_verified | user_verified | user.verified |
user_utc_offset | user_utc_offset | |
user_time_zone | user_time_zone | user.time_zone |
user_language | | user.lang |
mention_user_ids | | the `id_str` elements from either `extended_tweet.entities.user_mentions` (if present) or `entities.user_mentions`. |
mention_screen_names | | the `screen_name` elements from either `extended_tweet.entities.user_mentions` (if present) or `entities.user_mentions`. |
has_geo | | Boolean field; true if any of the following fields are present: `geo`, `place`, `coordinates`. |
has_media | | Boolean; true if any of the following are present: `entities.media`, `extended_tweet.entities.media`. |
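
Several of the derived fields above can be sketched in Python. This is an illustrative reading of the table, not the actual TweetSets or twarc code. All function names (`dig`, `format_created_at`, `tweet_type`, `tweet_text`, `es_urls`, `has_geo`) are hypothetical. The sketch assumes that a field counts as "present" when it exists and is non-null, that `retweeted_status.extended_tweet.full_text` is checked before `extended_tweet.full_text`, and that `https://` is replaced only at the start of a URL.

```python
from datetime import datetime, timezone
from typing import Any, List


def dig(obj: Any, *keys: str) -> Any:
    """Walk nested dicts; return None as soon as a level is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def format_created_at(created_at: str) -> str:
    """Render a raw created_at value in the TweetSets 2.2 style shown above."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def tweet_type(tweet: dict) -> str:
    """Classify a tweet by which reply/retweet/quote field is present."""
    if tweet.get("in_reply_to_status_id"):
        return "reply"
    if tweet.get("retweeted_status"):
        return "retweet"
    if tweet.get("quoted_status"):
        return "quote"
    return "original"


def tweet_text(tweet: dict) -> str:
    """Prefer an extended full_text when present; otherwise fall back to text."""
    return (
        dig(tweet, "retweeted_status", "extended_tweet", "full_text")
        or dig(tweet, "extended_tweet", "full_text")
        or tweet.get("text", "")
    )


def es_urls(tweet: dict) -> List[str]:
    """Union of expanded_url values from both URL lists, normalized as for ES."""
    entities = (dig(tweet, "extended_tweet", "entities", "urls") or []) + (
        dig(tweet, "entities", "urls") or []
    )
    urls: List[str] = []
    for entity in entities:
        url = (entity.get("expanded_url") or "").lower()
        if url.startswith("https://"):
            url = "http://" + url[len("https://"):]
        if url and url not in urls:  # keep first occurrence, drop duplicates
            urls.append(url)
    return urls


def has_geo(tweet: dict) -> bool:
    """True if any of geo, place, or coordinates is non-null."""
    return any(tweet.get(field) for field in ("geo", "place", "coordinates"))
```

For comparison, calling `str()` on the parsed `datetime` yields the space-separated form with a `+00:00` offset, which matches the twarc 1.12.1 style shown in the `parsed_created_at` row.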