In a Twitter archive, the presence of files like `tweets.js`, `tweets-part1.js`, `tweets-part2.js`, etc., indicates that your tweet data has been split into multiple parts. Twitter splits large archives so that the data stays manageable and individual files avoid size limits.
Here's a brief explanation of these files:
- **`tweets.js`**: Contains the first portion of your tweet data, assigned to `window.YTD.tweets.part0` inside the file. The first part is always named `tweets.js`; there is no separate `tweets-part0.js`.
- **`tweets-part1.js`**, **`tweets-part2.js`**, etc.: These are additional parts containing subsequent tweets. They follow the same structure as `tweets.js` but contain different tweet data.
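As a quick way to see which part files your own archive contains, a small sketch like this lists them (the archive directory name is a placeholder for your actual unpacked archive):

```python
import glob
import os

# List all tweet part files in an unpacked archive.
# "my-twitter-archive" is a placeholder for your actual archive directory.
archive_dir = "my-twitter-archive"
for path in sorted(glob.glob(os.path.join(archive_dir, "data", "tweets*.js"))):
    print(os.path.basename(path))  # e.g. tweets.js, tweets-part1.js, ...
```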
### Understanding File Structure
Each file contains a single JavaScript assignment. The variable name `window.YTD.tweets.partX` indicates which part of the tweet data the file holds, and the JSON array assigned to it contains the tweet objects:
```js
// tweets.js
window.YTD.tweets.part0 = [ ... ];

// tweets-part1.js
window.YTD.tweets.part1 = [ ... ];

// tweets-part2.js
window.YTD.tweets.part2 = [ ... ];
```
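Each element of that array wraps the tweet's fields in a nested `tweet` object. Here is a trimmed, hypothetical entry (field names are those commonly found in archive exports; the values are invented) along with the reply check that the script below relies on:

```python
# A trimmed, hypothetical entry from the parsed array. Real entries
# contain many more fields; the values here are invented.
sample = {
    "tweet": {
        "id_str": "1593999999999999999",
        "full_text": "@someone Agreed!",
        "in_reply_to_status_id": "1593999999999999998",
        "in_reply_to_user_id": "12345",
    }
}

# Replies carry non-empty in_reply_to_* fields, which is what the
# extraction script below keys off.
is_conversation = bool(
    sample.get("tweet", {}).get("in_reply_to_status_id")
    or sample.get("tweet", {}).get("in_reply_to_user_id")
)
print(is_conversation)  # True
```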
### Updated Script to Handle Multiple Files
Here’s an updated Python script to process all tweet parts (`tweets.js`, `tweets-part1.js`, `tweets-part2.js`, etc.) and extract only conversation tweets:
```python
import os
import json
import glob

def extract_conversations(archive_dir, output_file):
    # Match tweets.js, tweets-part1.js, tweets-part2.js, etc.
    # sorted() keeps the parts in a deterministic order.
    tweet_files = sorted(glob.glob(os.path.join(archive_dir, 'data', 'tweets*.js')))
    conversations = []

    for tweet_file in tweet_files:
        with open(tweet_file, 'r', encoding='utf-8') as f:
            content = f.read()

        # Strip the JavaScript variable assignment and keep the JSON array
        json_content = content[content.index('['): content.rindex(']') + 1]
        tweets = json.loads(json_content)

        # Keep only tweets that are part of conversations (i.e. replies)
        conversation_tweets = [
            tweet for tweet in tweets
            if tweet.get('tweet', {}).get('in_reply_to_status_id')
            or tweet.get('tweet', {}).get('in_reply_to_user_id')
        ]
        conversations.extend(conversation_tweets)
        print(f"Processed {tweet_file}: Found {len(conversation_tweets)} conversation tweets")

    # Save all conversation tweets to a single consolidated file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(conversations, f, indent=2)
    print(f"Extracted a total of {len(conversations)} conversation tweets to {output_file}")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    output_filename = "conversations.json"

    if os.path.isdir(archive_directory):
        output_path = os.path.join(archive_directory, 'data', output_filename)
        extract_conversations(archive_directory, output_path)
    else:
        print("Invalid directory path")
```
### Explanation
1. **Using `glob`**: The script uses the `glob` module to find all files matching the pattern `tweets*.js` in the archive's data directory. This ensures it processes all tweet parts, including `tweets.js`, `tweets-part1.js`, `tweets-part2.js`, etc.
2. **Parsing JSON Content**: The script reads each file's content, strips out the JavaScript variable assignment, and extracts the JSON array.
3. **Identifying Conversations**: It filters tweets that are part of conversations. This includes:
- Tweets with a non-null `in_reply_to_status_id` (indicating they are replies).
- Tweets with a non-null `in_reply_to_user_id` (indicating they are replies to another user's tweet).
4. **Storing Conversations**: The filtered conversation tweets are appended to a list, which is then saved to `conversations.json`.
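To sanity-check the result, a minimal sketch like the following loads the output and prints a few reply relationships (the path assumes the script's default output location in the archive's `data` directory):

```python
import json

# Load the extracted conversations and show a small sample.
# Adjust the path if your archive directory differs.
with open("conversations.json", "r", encoding="utf-8") as f:
    conversations = json.load(f)

print(f"Loaded {len(conversations)} conversation tweets")
for entry in conversations[:3]:
    tweet = entry.get("tweet", {})
    print(tweet.get("id_str"), "replied to status", tweet.get("in_reply_to_status_id"))
```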
### How to Run
1. Save the script as `extract_conversations.py`.
2. Open a terminal or command prompt.
3. Navigate to the directory where you saved the script.
4. Run the script using `python extract_conversations.py` (or `python3` on systems where that is the Python 3 command).
5. Enter the path to your Twitter archive directory when prompted.
This script extracts all conversation tweets from every part of your Twitter archive, giving you a single consolidated list across all your tweet data files. A sample run looks like this:
```
% python3 extract_conversations.py
Enter the path to your Twitter archive directory: twitter-2022-11-19-3fb11190ae1570bef909209bdfa34ff2ed541ca6129356332c3f89b904f771ad
Processed twitter-2022-11-19-3fb11190ae1570bef909209bdfa34ff2ed541ca6129356332c3f89b904f771ad/data/tweets.js: Found 12931 conversation tweets
Processed twitter-2022-11-19-3fb11190ae1570bef909209bdfa34ff2ed541ca6129356332c3f89b904f771ad/data/tweets-part1.js: Found 1487 conversation tweets
Extracted a total of 14418 conversation tweets to twitter-2022-11-19-3fb11190ae1570bef909209bdfa34ff2ed541ca6129356332c3f89b904f771ad/data/conversations.json
```