i #284 Refactored parse_mbox_latest_date and Fixed Roxygen Errors
- parse_mbox_latest_date() now uses the new naming convention for files
- Added a section on parse_mbox_latest_date() to download_mail.Rmd
- Fixed documentation for download_pipermail()

Signed-off-by: Dao McGill <dmcgill@hawaii.edu>
daomcgill committed Oct 3, 2024
1 parent 2a1ba98 commit 7bf8ba6
Showing 8 changed files with 73 additions and 37 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -50,4 +50,4 @@ Imports:
 VignetteBuilder: knitr
 URL: https://github.com/sailuh/kaiaulu
 BugReports: https://github.com/sailuh/kaiaulu/issues
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.2
45 changes: 27 additions & 18 deletions R/mail.R
@@ -18,9 +18,9 @@
 #' The downloaded .mbox files are saved in the specified folder following the naming convention kaiaulu_YYYYMM.mbox.
 #' The function only downloads files that fall between the specified start_year_month and end_year_month.
 #'
-#' @param mailing_list The name of the mailing list being downloaded (e.g. "https://mta.openssl.org/pipermail/openssl-announce/")
-#' @param start_year_month The year and month of the first file to be downloaded (format: 'YYYYMM')
-#' @param end_year_month The year and month of the last file to be downloaded (format: 'YYYYMM', or use 'format(Sys.Date(), "%Y%m")' for the current month)
+#' @param mailing_list The name of the mailing list being downloaded e.g. "https://mta.openssl.org/pipermail/openssl-announce/"
+#' @param start_year_month The year and month of the first file to be downloaded format: 'YYYYMM'
+#' @param end_year_month The year and month of the last file to be downloaded format: 'YYYYMM', or use Sys.Date
 #' @param save_folder_path The folder path in which all the downloaded pipermail files will be stored
 #' @param verbose if TRUE, prints diagnostic messages during the download process
#' @return Returns `downloaded_files`, a vector of the downloaded files in the current working directory
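
The `start_year_month` and `end_year_month` parameters documented above take 'YYYYMM' strings, and the saved files follow the `kaiaulu_YYYYMM.mbox` convention. A minimal sketch of how such a range maps to file names, assuming one file per calendar month; `expected_mbox_files()` is a hypothetical helper for illustration, not a kaiaulu function, and it does not claim to reproduce how download_pipermail() iterates internally:

```r
# Hypothetical helper (not part of kaiaulu): list the monthly file names
# implied by the kaiaulu_YYYYMM.mbox convention for a given range.
expected_mbox_files <- function(start_year_month, end_year_month) {
  # Anchor each YYYYMM value to the first day of its month
  start_date <- as.Date(paste0(start_year_month, "01"), format = "%Y%m%d")
  end_date <- as.Date(paste0(end_year_month, "01"), format = "%Y%m%d")
  if (is.na(start_date) || is.na(end_date) || start_date > end_date) {
    stop("Expected 'YYYYMM' values with start_year_month <= end_year_month")
  }
  # One entry per calendar month, both endpoints included
  months <- seq(start_date, end_date, by = "month")
  paste0("kaiaulu_", format(months, "%Y%m"), ".mbox")
}

# The current month in the 'YYYYMM' format the parameters expect
current_year_month <- format(Sys.Date(), "%Y%m")

# Same range as the vignette configuration in this commit
files <- expected_mbox_files("202310", "202405")
print(files)
```

Passing `current_year_month` as the end of the range is one way to cover everything up to the present month.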
@@ -501,25 +501,34 @@ parse_mbox <- function(perceval_path, mbox_path){

 #' Parse mbox latest date
 #'
-#' Returns the name of the latest mod_mbox file downloaded in the specified folder
+#' @description This function returns the name of the latest mod_mbox file downloaded in the specified folder
+#' based on the naming convention `kaiaulu_YYYYMM.mbox`. For example: `kaiaulu_202401.mbox`.
 #'
-#' The folder assumes the following convention: "(mailing_list)_(archive_type)_yearmonth.mbox"
-#' For example: "geronimo-dev_apache_202401.mbox". This nomenclature is defined by \code{\link{download_mod_mbox_per_month}}
-#'
-#' @param mbox path to mbox archive file (ends in .mbox)
-#' @return Returns the name of the latest mod_mbox file
+#' @param save_folder_path path to the folder containing the mbox files
+#' @return `latest_mbox_file` the name of the latest mod_mbox file
 #' @export
 #' @family parsers
-parse_mbox_latest_date <- function(mbox) {
-  file_list <- list.files(mbox)
-  date_list <- list()
-  for(i in file_list){
-    i <- sub(".mbox", "", i)
-    i <- sub("[^_]*_[^_]*_", "", i)
-    date_list <- append(date_list, i)
+parse_mbox_latest_date <- function(save_folder_path) {
+  # List all .mbox files in the folder with the expected naming pattern
+  file_list <- list.files(save_folder_path, pattern = "kaiaulu_\\d{6}\\.mbox$")
+
+  if (length(file_list) == 0) {
+    warning("No .mbox files found in the folder.")
+    return(invisible(NULL))
   }
-  latest_date <- as.character(max(unlist(date_list)))
-  latest_mbox_file <- grep(latest_date, file_list, value = TRUE)
+
+  # Extract the dates from the filenames
+  date_list <- sub("kaiaulu_(\\d{6})\\.mbox$", "\\1", file_list)
+
+  # Convert dates to numeric for comparison
+  date_numeric <- as.numeric(date_list)
+
+  # Find the latest date
+  latest_date <- max(date_numeric, na.rm = TRUE)
+
+  # Find the file corresponding to the latest date
+  latest_mbox_file <- file_list[date_numeric == latest_date]
+
   return(latest_mbox_file)
 }

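The refactored lookup in the R/mail.R diff above can be exercised without kaiaulu installed. The sketch below condenses the same steps (filter by pattern, extract the six-digit date, compare numerically) into a stand-in function and runs it against a throwaway folder. The name `latest_mbox_sketch()` and the demo file names are assumptions made for illustration only:

```r
# Condensed stand-in for the refactored lookup shown in the diff
# (hypothetical name; the real function is parse_mbox_latest_date()).
latest_mbox_sketch <- function(save_folder_path) {
  # Keep only files that follow the kaiaulu_YYYYMM.mbox convention
  file_list <- list.files(save_folder_path, pattern = "kaiaulu_\\d{6}\\.mbox$")
  if (length(file_list) == 0) {
    warning("No .mbox files found in the folder.")
    return(invisible(NULL))
  }
  # Pull out the YYYYMM part and compare it as a number
  date_numeric <- as.numeric(sub("kaiaulu_(\\d{6})\\.mbox$", "\\1", file_list))
  file_list[date_numeric == max(date_numeric, na.rm = TRUE)]
}

# Throwaway folder with two matching files and one in the old naming scheme
demo_folder <- file.path(tempdir(), "mbox_demo")
dir.create(demo_folder, showWarnings = FALSE)
invisible(file.create(file.path(demo_folder, c("kaiaulu_202312.mbox",
                                               "kaiaulu_202401.mbox",
                                               "geronimo-dev_apache_202405.mbox"))))

latest <- latest_mbox_sketch(demo_folder)
print(latest)  # "kaiaulu_202401.mbox"
```

The file in the old `(mailing_list)_(archive_type)_yearmonth.mbox` scheme is ignored even though its date is later, because it does not match the new pattern.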
2 changes: 1 addition & 1 deletion man/commit_message_id_coverage.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/download_mod_mbox.Rd

6 changes: 3 additions & 3 deletions man/download_pipermail.Rd

4 changes: 2 additions & 2 deletions man/parse_mbox.Rd

17 changes: 7 additions & 10 deletions man/parse_mbox_latest_date.Rd

32 changes: 31 additions & 1 deletion vignettes/download_mail.Rmd
@@ -57,7 +57,7 @@ mailing_list:
 mailing_list: https://mta.openssl.org/pipermail/openssl-users/
 start_year_month: 202310
 end_year_month: 202405
-save_folder_path: "../../extdata/save_folder_mail"
+save_folder_path: "../extdata/save_folder_mail"
```

@@ -240,3 +240,33 @@ This will store the parsed data into the parsed_mail variable. To view the table
```{r}
View(parsed_mail)
```

## Retrieve the Latest Mbox File
We can use the parse_mbox_latest_date() function to identify the most recent .mbox file in the specified folder. This can be useful when you want to automate the parsing of the latest data without manually specifying the file name.

First, make sure that the save_folder_path is correctly set to the directory where your .mbox files are stored.
```{r}
# Get the latest mbox file
latest_mbox_file <- parse_mbox_latest_date(save_folder_path = save_folder_path)
print(latest_mbox_file)
```
This prints the name of the latest .mbox file, selected by the YYYYMM date in the kaiaulu_YYYYMM.mbox filename. If the folder contains no matching file, the function issues a warning and returns NULL.
We can use this name to update mbox_path so that it points to the latest file, then call parse_mbox() to parse the latest data.
```{r}
# Update mbox_path to use the latest file
mbox_path <- file.path(save_folder_path, latest_mbox_file)
print(mbox_path)
```
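
Since parse_mbox_latest_date() returns NULL when the folder has no matching file, it may be worth checking the result before building the path. A minimal sketch of such a guard, assuming you would rather stop with an explicit error than pass an empty path on to parse_mbox(); `build_latest_mbox_path()` is a hypothetical helper, not part of kaiaulu:

```r
# Hypothetical helper (not part of kaiaulu): build the full path only when
# a latest file name was actually found.
build_latest_mbox_path <- function(save_folder_path, latest_mbox_file) {
  if (is.null(latest_mbox_file) || length(latest_mbox_file) == 0) {
    stop("No kaiaulu_YYYYMM.mbox file found in: ", save_folder_path)
  }
  file.path(save_folder_path, latest_mbox_file)
}

# A file name in the kaiaulu_YYYYMM.mbox convention yields a usable path
mbox_path <- build_latest_mbox_path("../extdata/save_folder_mail",
                                    "kaiaulu_202405.mbox")
print(mbox_path)

# The empty-folder case (NULL result) is turned into an explicit error
result <- tryCatch(
  build_latest_mbox_path("../extdata/save_folder_mail", NULL),
  error = function(e) conditionMessage(e)
)
print(result)
```
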
To parse this file:
```{r}
# Parse the latest mbox file
parsed_mail <- parse_mbox(
  perceval_path = parse_perceval_path,
  mbox_path = mbox_path
)
```
Now, parsed_mail contains the parsed data from the latest .mbox file.
```{r}
# View the parsed data
View(parsed_mail)
```
