diff --git a/conf/helix.yml b/conf/helix.yml index 779d4b52..3a049411 100644 --- a/conf/helix.yml +++ b/conf/helix.yml @@ -51,33 +51,25 @@ mailing_list: mod_mbox: project_key_1: mailing_list: https://lists.apache.org/list.html?announce@apache.org - start_year_month: 202310 - end_year_month: 202405 - save_folder_path: "../extdata/save_mbox_mail" - # mbox_path is for use only with parse_mbox() function. It is the file to parse. - mbox_file_path: "../extdata/save_mbox_mail/kaiaulu_202410.mbox" + save_folder_path: ../../rawdata/kaiaulu/mod_mbox/save_mbox_mail/ + # mbox_file_path is for use only with parse_mbox() function. It is the file to parse + mbox_file_path: ../../rawdata/kaiaulu/mod_mbox/save_mbox_mail/kaiaulu.mbox project_key_2: mailing_list: https://lists.apache.org/list.html?dev@felix.apache.org - start_year_month: 202201 - end_year_month: 202401 - save_folder_path: "../extdata/save_mbox_mail" - # mbox_path is for use only with parse_mbox() function. It is the file to parse. - mbox_file_path: "../extdata/save_mbox_mail/kaiaulu_202210.mbox" + save_folder_path: ../../rawdata/kaiaulu/mod_mbox/save_mbox_mail_2/ + # mbox_file_path is for use only with parse_mbox() function. It is the file to parse + mbox_file_path: ../../rawdata/kaiaulu/mod_mbox/save_mbox_mail_2/kaiaulu.mbox pipermail: project_key_1: mailing_list: https://mta.openssl.org/pipermail/openssl-users/ - start_year_month: 202310 - end_year_month: 202405 - save_folder_path: "../extdata/save_folder_mail" - # mbox_file_path is for use only with parse_mbox() function. It is the file to parse. - mbox_file_path: "../extdata/save_mbox_mail/kaiaulu_202310.mbox" + save_folder_path: ../../rawdata/kaiaulu/pipermail/save_mbox_mail/ + # mbox_file_path is for use only with parse_mbox() function. 
It is the file to parse + mbox_file_path: ../../rawdata/kaiaulu/pipermail/save_mbox_mail/kaiaulu.mbox project_key_2: mailing_list: https://mta.openssl.org/pipermail/openssl-project/ - start_year_month: 202203 - end_year_month: 202303 - save_folder_path: "../extdata/save_folder_mail_2" - # mbox_file_path is for use only with parse_mbox() function. It is the file to parse. - mbox_file_path: "../extdata/save_mbox_mail/kaiaulu_202210.mbox" + save_folder_path: ../../rawdata/kaiaulu/pipermail/save_mbox_mail_2/ + # mbox_file_path is for use only with parse_mbox() function. It is the file to parse + mbox_file_path: ../../rawdata/kaiaulu/pipermail/save_mbox_mail_2/kaiaulu.mbox issue_tracker: jira: diff --git a/vignettes/download_mail.Rmd b/vignettes/download_mail.Rmd index 28d135a6..194c3d82 100644 --- a/vignettes/download_mail.Rmd +++ b/vignettes/download_mail.Rmd @@ -31,30 +31,7 @@ set.seed(seed) Open source projects require a means for developers to communicate. These may include mailing lists, issue trackers, discord, etc. This notebooks showcases how to download data from mailing list archives. Two often used archive types are [mod_mbox](https://httpd.apache.org/mod_mbox/) and [pipermail](https://en.wikipedia.org/wiki/GNU_Mailman#cite_note-9), which Kaiaulu offer functions to download data from. The former is commonly used by the Apache Software Foundation projects. The latter, is more commonly use in GNU related projects, but this can vary. -# Project Configuration File - -Mailing List archives are hosted by their respective open source projects. Therefore, in order to use Kaiaulu downloaders to obtain mail data, you will need to access the respective open source project, and find out the URL tied to the archive you are interested. Generally, that is the developer mailing list, if your interest is to understand communication patterns among developers. Alternatively, if the focus of the research is Q&A from the userbase, then a user mailing list may make more sense. 
- -Because project lifetime can go as far as a few decades, to have the full picture of what communication took place in the project, if your analysis include a long period of time, you may need to download multiple archives to combine them after turning them into tables using Kaiaulu parser. - -The information you need to find out for each open source project is documented in Kaiaulu using a project configuration file format. For pipermail and mod_mbox this is as follows: - -``` -# top-level key for mailing list config -mailing_list: - # for pipermail - pipermail: - project_key_1: - mailing_list: https://mta.openssl.org/pipermail/openssl-users/ - start_year_month: 202310 - end_year_month: 202405 - save_folder_path: "../extdata/save_folder_mail" - -``` - -Regardless of which mail archive you choose, the downloaders will store the mail data in monthly files, in a `.mbox` format. This is a simple text file that contains some markings to identify the header of the e-mail containing title, authors, etc. You can open any of the .mbox downloaded files with any text editor. - -#### Edit below +# Mailing List Organization Mailing list data is stored in a variety of archives. See: - Mod Mbox: [Apache Geronimo](https://geronimo.apache.org/mailing-lists.html)). @@ -72,12 +49,13 @@ Mod Mbox archives also organize mailing lists by topic. The apache mailing list Each mailing list maintains archives of past messages, often organized by month and year. These archives can be accessed and downloaded for analysis. However, it is important to note that mailing list archives may be split into multiple formats or locations, and not all archives contain the same information. Different archives can differ in completeness, date ranges, and the data they contain. Some archives might lack important fields like "In-Reply-To," which is important for reconstructing message threads. 
It is, therefore, important the archive being used is carefully selected, since this effects the quality and completeness of analysis. -# Pipermail +# Project Configuration File + +Mailing List archives are hosted by their respective open source projects. Therefore, in order to use Kaiaulu downloaders to obtain mail data, you will need to access the respective open source project, and find out the URL tied to the archive you are interested in. Generally, that is the developer mailing list, if your interest is to understand communication patterns among developers. Alternatively, if the focus of the research is Q&A from the user base, then a user mailing list may make more sense. -## Project Configuration File +Because a project's lifetime can span decades, getting the full picture of the communication that took place may require downloading multiple archives and combining them after turning them into tables with the Kaiaulu parser. -To start, we load the project configuration file, which contains parameters for downloading the mailing list archives. Instead of hard-coding these values in the notebook, we store them in a project configuration file in YAML format. This makes the parameters easier to manage. -Here is an example of the pipermail mailing list section from the configuration file (conf/helix.yml): +The information you need to find out for each open source project is documented in Kaiaulu using a project configuration file format. For pipermail and mod_mbox this is as follows: ``` # top-level key for mailing list config @@ -86,24 +64,48 @@ mailing_list: pipermail: project_key_1: mailing_list: https://mta.openssl.org/pipermail/openssl-users/ - save_folder_path: "../extdata/save_folder_mail" - + start_year_month: 202310 + end_year_month: 202405 + save_folder_path: ../../rawdata/kaiaulu/pipermail/save_mbox_mail/ + # mbox_file_path is for use only with parse_mbox() function. 
It is the file to parse + mbox_file_path: ../../rawdata/kaiaulu/pipermail/save_mbox_mail/kaiaulu.mbox + # for mod mbox + mod_mbox: + apache_announce: + mailing_list: https://lists.apache.org/list.html?announce@apache.org + start_year_month: 202310 + end_year_month: 202405 + save_folder_path: ../../rawdata/kaiaulu/mod_mbox/save_mbox_mail/ + # mbox_file_path is for use only with parse_mbox() function. It is the file to parse + mbox_file_path: ../../rawdata/kaiaulu/mod_mbox/save_mbox_mail/kaiaulu.mbox ``` +Explanation: -The configuration file contains the following parameters for each mailing list archive: - +- mailing_list: The top-level key for mailing list configurations. - project_key_1: A unique key for the project. There can be multiple projects in both the pipermail and mod mbox sections. - pipermail/ mod_mbox: Indicates whether the setting are for pipermail or mod mbox. Although the parameters are the same, this helps to differentiate between the two types of mailing list archives. - mailing_list: The URL of the mailing list archive page. Note that this URL should point to the page containing links to the monthly archives (e.g. https://mta.openssl.org/pipermail/openssl-users/), not the top-level mailing list page that contains all the different types of archives (e.g. https://mta.openssl.org/mailman/listinfo/). - start_year_month: The starting date for downloading archives (in YYYYMM format). - end_year_month: The ending date for downloading archives (in YYYYMM format). - save_folder_path: The local directory where the downloaded archives will be saved (if you run the code in this notebook, the archives will be saved in a folder 'extdata', located in the parent directory of kaiaulu (wherever your kaiaulu folder is kept)). +- mbox_file_path: The path to the .mbox file used by the parse_mbox() function. + +Note: It is important that the paths specified in save_folder_path and mbox_file_path are accurate and do not conflict between projects. 
By organizing the configuration in this way, you can manage multiple projects and mailing lists easily. The notebook reads these parameters and uses them to download and process the archives. -## Pipermail Downloader +Regardless of which mail archive you choose, the downloaders will store the mail data in monthly files, in a `.mbox` format. This is a simple text file that contains some markings to identify the header of the e-mail containing title, authors, etc. You can open any of the .mbox downloaded files with any text editor. + +## Pipermail Configuration -The following code reads the configuration parameters for project_key_1 of pipermail: +For Pipermail, we need to specify the project key, which is used to retrieve the configuration parameters for the specific project. The project key is used to identify the project in the configuration file. + +```{r} +# Define the project key +project_key <- "project_key_1" +``` + +Now, we can use the getter functions to retrieve the configuration parameters for the specified project key. ```{r} conf <- parse_config("conf/helix.yml") @@ -113,7 +115,86 @@ end_year_month <- 202405 save_folder_path <- get_pipermail_path(conf, "project_key_1") ``` -After setting the configurations above, you can download the archives using the download_pipermail() function, which downloads and saves .mbox files to the specified directory (save_folder_path). The .mbox files are named with the format kaiaulu_YYYYMM.mbox, where YYYYMM refers to the year and month of the archive. +Note that the date range is not set with a getter. The range for downloads changes often, and should be set manually using the YYYYMM format. + +Explanation of Getters: + +- get_pipermail_domain(config_file, project_key_index): Retrieves the mailing list URL. +- get_pipermail_path(config_file, project_key_index): Retrieves the local folder path for saving archives. 
+- get_pipermail_input_file(config_file, project_key_index): Retrieves the .mbox file path for parsing (parse_mbox function). + +## Mbox Configuration + +Similarly to Pipermail, we need to specify the project key for Mod Mbox. The project key is used to retrieve the configuration parameters for the specific project. + +```{r} +# Define the project key +project_key <- "project_key_1" +``` + +Use the getters to extract the parameters: + +```{r eval=FALSE} +conf <- parse_config("conf/helix.yml") +mailing_list <- get_mbox_domain(conf, "project_key_1") +start_year_month <- 202310 +end_year_month <- 202405 +save_folder_path <- get_mbox_path(conf, "project_key_1") +``` + +Explanation of Getters: + +- get_mbox_domain(config_file, project_key_index): Retrieves the mailing list URL. +- get_mbox_path(config_file, project_key_index): Retrieves the local folder path for saving archives. +- get_mbox_input_file(config_file, project_key_index): Retrieves the .mbox file path for parsing. + +start_year_month and end_year_month should be set manually, as with pipermail. + +## Tools Configuration + +In addition to the mailing list configurations, you need to specify the path to the perceval binary in tools.yml, which is used by the parse_mbox() function to parse .mbox files. It should look something like this: + +``` +perceval: /usr/local/bin/perceval +``` + +Now, you can load the configurations in your R script or notebook using the following code: + +```{r} +# Load tools configuration +tools <- parse_config("tools.yml") +parse_perceval_path <- get_tool("perceval", tools) + +# Load project configuration +conf <- parse_config("conf/helix.yml") +mbox_file_path <- get_mbox_input_file(conf, "project_key_1") +``` + +Explanation of Getters: + +- parse_config(): Function to parse the YAML configuration files. +- get_tool("perceval", tools): Retrieves the Perceval path from the tools configuration. 
+- get_mbox_input_file(conf, "project_key_1"): Retrieves the .mbox file path for project_key_1 from the helix configuration. + +Now that you have loaded the configurations, you can proceed to use them in downloading and parsing the mailing list archives. + +# Downloaders and Refreshers + +## Pipermail Downloader + +### How download_pipermail() Works +The download_pipermail() function downloads Pipermail archives from a specified mailing list within a given date range. Here's how it operates: + +- Archive Index Retrieval: It begins by downloading an HTML page that lists the URLs for the monthly archives, which are typically available in .txt or .gz formats. +- File Downloading: The function attempts to download the .txt file for each month. If the .txt file is unavailable, it falls back to downloading the .gz (gzipped) file. +- File Processing: If a .gz file is downloaded, the function unzips it and converts it into an .mbox file. The original .gz file is deleted after extraction to save space. +- File Saving: The downloaded .mbox files are saved in the specified folder with the naming convention kaiaulu_YYYYMM.mbox, where YYYYMM represents the year and month. +- Date Range Filtering: Only files within the specified start_year_month and end_year_month are downloaded. +- Error Handling: If both .txt and .gz formats fail to download for a particular month, a warning is issued indicating the missing month. +- Summary Output: At the end of the process, the function summarizes the downloads, indicating the range of dates present and any missing months. +- Set verbose to TRUE to see status updates and detailed output. + +### Example Usage ```{r eval=FALSE} # Download archives @@ -137,7 +218,8 @@ How refresh_pipermail Works 1. Checks if the folder is empty: If the folder is empty, it downloads archives starting from start_year_month to the current month using download_pipermail(). 2. 
Finds the most recent file: If the folder is not empty, the function checks for the most recent month’s file (based on the filename) and deletes it. 3. Redownloads from the most recent month: The function then redownloads the archive from the most recent month up to the current month. -# add warning for files do not exist + +### Example Usage ```{r eval=FALSE} # Refresh archives @@ -152,46 +234,22 @@ refresh_pipermail( This function will ensure that the most recent archives are always up-to-date by redownloading the current month's archive and adding any new months that have been added to the mailing list. -# Mod Mbox - -## Project Configuration File - -Like in Pipermail, we load the configuration for Mod Mbox from the YAML file, which includes the mailing list URL, the date range, and the save folder path. - -Here's an example of the relevant section in the configuration file (conf/helix.yml): - -``` -# top-level key for mailing list config -mailing_list: - # for mod mbox - mod_mbox: - project_key_1: - mailing_list: https://lists.apache.org/list.html?announce@apache.org - save_folder_path: "../../extdata/save_mbox_mail" - -``` - -The configuration parameters are the same as the ones explained in the section at the top of this notebook, except that the mailing_list should point to a Mod Mbox mailing list URL. - -The following code reads the configuration parameters: - -```{r eval=FALSE} -conf <- parse_config("conf/helix.yml") -mailing_list <- get_mbox_domain(conf, "project_key_1") -start_year_month <- 202310 -end_year_month <- 202405 -save_folder_path <- get_mbox_path(conf, "project_key_1") -``` +## Mod Mbox Downloader -- mailing_list: The URL of the Mod Mbox mailing list (e.g., https://lists.apache.org/list.html?announce@apache.org). -- start_year_month: The first month to download (format: YYYYMM). -- end_year_month: The last month to download (format: YYYYMM). -- save_folder_path: The directory where the downloaded .mbox files will be saved. 
+### How download_mod_mbox() Works +The download_mod_mbox() function downloads Mod Mbox archives from a specified Apache Pony Mail mailing list over a given date range: -## Mod Mbox Downloader +- URL Construction: It constructs the download URLs for each month based on the mailing list URL and the date range. +- File Downloading: Downloads the .mbox file for each month in the format "YYYY-MM". +- File Saving: Saves the downloaded .mbox files in the specified folder with the naming convention kaiaulu_YYYYMM.mbox. +- Date Range Looping: Iterates through each month between start_year_month and end_year_month. +- Error Handling: Issues a warning if a download fails for a specific month, indicating that the month's data may not exist. +- Summary Output: Provides a summary of the downloads, including any missing months. The download_mod_mbox() function downloads Mod Mbox archives by constructing URLs based on the mailing list and date range, saving them as .mbox files named kaiaulu_YYYYMM.mbox. +### Example Usage + ```{r eval=FALSE} download_mod_mbox( mailing_list = mailing_list, @@ -214,6 +272,8 @@ How refresh_mod_mbox Works 1. Checks if the folder is empty and, if so, downloads the archives starting from start_year_month to the current month by calling download_mod_mbox(). 2. If the folder contains files, it identifies the most recent one using the YYYYMM found in the filename. This file is deleted, and then redownloaded along with all future months. +### Example Usage + ```{r eval=FALSE} refresh_mod_mbox( mailing_list = mailing_list, @@ -225,46 +285,20 @@ refresh_mod_mbox( This ensures your archive is up-to-date, accounting for new data that may have been added to the mailing list since the last download. -# Parser +# Parsers After downloading the mailing list archives as .mbox files, the next step is to parse these files to extract meaningful information for analysis. 
The parse_mbox() function utilizes the Perceval library to parse .mbox files and convert them into structured data tables. This enables easier manipulation and analysis of mailing list data. ## Mbox Parser -The parse_mbox() function takes an .mbox file and parses it into a structured data.table using the Perceval library. +### How parse_mbox() Works +- Perceval Integration: Interfaces with the Perceval library to parse the .mbox file. +- Flexible Parsing: Handles variations in .mbox file structures, which may have inconsistent fields due to different email headers. +- Data Extraction: Extracts key information such as email content, sender, recipients, dates, and threading information. +- Consistent Column Naming: Ensures that columns of interest are consistently renamed for clarity, even if the raw data varies. -For the configuration, make sure you have the correct path to the Perceval library in the conf file. -Here's an example of the relevant section in the tools.yml file: - -``` -perceval: /usr/local/bin/perceval -``` - -And in the helix.yml configuration file: - -``` -mailing_list: - # for mod mbox - mod_mbox: - project_key_1: - mbox_file_path: "../../extdata/save_mbox_mail.kaiaulu_202310.mbox" -``` - -perceval: found in tools.yml, this should be set to your local path to the perceval binary (use > which perceval to locate the path). -mbox_file_path: should point to the saved .mbox file that will be parsed. See the mbox_path in the mailing_list sections of helix.yml. - -Load the configuration: - -```{r eval=FALSE} -tools <- parse_config("tools.yml") -parse_perceval_path <- get_tool_project("perceval", tools) - -conf <- parse_config("conf/helix.yml") -mbox_file_path <- get_mbox_input_file(conf, "project_key_1") -``` - -Run the parser: +### Example Usage ```{r eval=FALSE} parsed_mail <- parse_mbox( @@ -273,7 +307,7 @@ parsed_mail <- parse_mbox( ) ``` -This will store the parsed data into the parsed_mail variable. 
To view the table, use: +This will store the parsed data into the parsed_mail variable. You can use the gt package to display the parsed data in a readable format: ```{r eval=FALSE} # Display the first 10 rows of the parsed data using gt @@ -283,20 +317,19 @@ parsed_mail %>% gt() ``` +Note: Displaying the entire dataset may not be practical if it's large. Showing a sample provides a glimpse of the structure. + ## Retrieve the Latest Mbox File We can use the parse_mbox_latest_date() function to identify the most recent .mbox file in the specified folder. This can be useful when you want to automate the parsing of the latest data without manually specifying the file name. First, make sure that the save_folder_path is correctly set to the directory where your .mbox files are stored. -```{r eval=FALSE} -# Get the latest mbox file -latest_mbox_file <- parse_mbox_latest_date(save_folder_path = save_folder_path) -print(latest_mbox_file) -``` This will output the name of the latest .mbox file based on the YYYYMM pattern in the filename. We can use this to update mbox_file_path to point to the latest file, and call the parse_mbox() function to parse the latest data. +### Example Usage + ```{r eval=FALSE} # Update mbox_file_path to use the latest file mbox_file_path <- file.path(save_folder_path, latest_mbox_file)