Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

284 MBox Refresher #295

Open
wants to merge 86 commits into
base: master
Choose a base branch
from
Open

284 MBox Refresher #295

wants to merge 86 commits into from

Conversation

ian-lastname
Copy link
Collaborator

… helix config in accordance to new save file structure

I have created the parse_mbox_latest_date and refresh_mbox functions. The latter function deletes the latest year and month mbox file that is currently downloaded (identified by parse_mbox_latest_date), and redownloads that along with any file beyond up until the current year. The naming convention of the downloaded files are also changed to what we have agreed on. Just to note, download_mod_mbox REMAINS UNCHANGED since I'm only using download_mod_mbox_per_month.

… helix config in accordance to new save file structure

I have created the parse_mbox_latest_date and refresh_mbox functions. The latter function deletes the latest year and month mbox file that is currently downloaded (identified by parse_mbox_latest_date), and redownloads that along with any file beyond up until the current year. The naming convention of the downloaded files are also changed to what we have agreed on. Just to note, download_mod_mbox REMAINS UNCHANGED since I'm only using download_mod_mbox_per_month.
Copy link

codecov bot commented Apr 19, 2024

Codecov Report

Attention: Patch coverage is 0% with 206 lines in your changes are missing coverage. Please review.

Project coverage is 36.42%. Comparing base (2bc8d14) to head (b5be04e).
Report is 3 commits behind head on master.

Files Patch % Lines
R/mail.R 0.00% 206 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
- Coverage   39.79%   36.42%   -3.37%     
==========================================
  Files          20       20              
  Lines        3091     3495     +404     
==========================================
+ Hits         1230     1273      +43     
- Misses       1861     2222     +361     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@carlosparadis
Copy link
Member

Thank you @ian-lastname. I will try to make a pass before our meeting!

…ted refresh_pipermail, updated news

Found out that the pipermail downloader function already downloads the files by month and year, so all I really needed to do was change it so that it downloads the files as mbox files (change the extension from .txt to .mbox). Created the refresher for pipermail. I had no need to create a parse latest pipermail since they were mbox files anyway.
@ian-lastname ian-lastname changed the title Created parse_mbox_latest_date and refresh_mbox functions and updated… 284 MBox Refresher Apr 24, 2024
ian-lastname and others added 4 commits April 24, 2024 18:23
…to ensure it does not download files past current year and month

Added checks in the aforementioned functions so that the refreshers won't download "mail from the future"
@carlosparadis
Copy link
Member

@ian-lastname thanks!

- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_'YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Copy link
Collaborator

daomcgill commented Sep 15, 2024

Hi @carlosparadis,

I've refactored the download_pipermail() function.

Proposed Changes

  • Removed archive_url and archive_type parameters to simplify the function interface.
  • Added start_year_month and end_year_month parameters for date filtering.
  • Removed convert_pipermail_to_mbox(), since download_pipermail() handles file conversion automatically.
  • Changed the file naming convention to ''kaiaulu_'YYYYMM.mbox' for consistency.

I've added temporary configuration entries in helix.yml for testing purposes:

conf <- yaml::read_yaml("conf/helix.yml")
mailing_list <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["mailing_list"]]
start_year_month <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["start_year_month"]]
end_year_month <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["end_year_month"]]
save_folder_path <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["save_folder_path"]]

And this function call:

download_pipermail(
  mailing_list = mailing_list,
  start_year_month = start_year_month,
  end_year_month = end_year_month,
  save_folder_path = save_folder_path
)

Testing Results

  • Retrieves links according to date range
  • Downloads and renames files from links
  • Unzips and cleans gz files
  • Saves files as mbox with new naming convention
  • Note: I attempted to download and decompress the files directly without saving the .gz to disk, but could not establish a valid connection. The current implementation still downloads the gzipped file and then removes it after unzipping.

…mail()

- Modified helix.yml to use [[“mailing_list”]][[“pipermail”]][[“project_key_1”]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
@carlosparadis
Copy link
Member

@daomcgill I made an inline comment to reply to your question of changes, since there may be a bit of misunderstanding. Let me know if you can't find it.

Two sanity checks:

  • Did you try running parse_mbox() to see if it works on the renamed files yet?
  • Does running the refresh function pointed at a empty folder downloads the files up to date?
  • Does deleting the most recent mbox file download only the newest file back?
  • Does running the refresh without deleting any file also update the more recent file only? (to add more recent day's worth of data to the current file)
  • Suppose you ran the current refresh today, Sep 16. Then also supposed the next time you executed again the refresh was Oct 2. Would it overwrite the Sep file, and then download a Oct file?

…process_gz_to_mbox_in_folder()

- download_pipermail: Attempts to download .txt file first. If unavailable fallback to .gz. If using .gz file, unzips and writes output in .mbox
- Added log messages
- download_pipermail: Added timeout parameter to deal with case that server takes too long to respond
- Added refresh_pipermail function
- Updated vignettes/download_mail.Rmd to include refresh_pipermail
- Added process_gz_to_mbox_in_folder function
Copy link
Member

@carlosparadis carlosparadis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daomcgill i've added some inline comments, for sanity sake you can reply to each comment directly in line on these since they are more specific.

R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Show resolved Hide resolved
daomcgill and others added 5 commits September 18, 2024 14:18
…il refresher.

- Replaced paste0 with stringi::stri_c
- Removed create directory if does not exist
- Added more verbose descriptions/comments
- Added dividers within functions
- Added verbose parameter
- Added else block for refresher
- Added call to process_gz_to_mbox_in_folder at end of refresher
- parse_mbox: stri_replace_last was not working, changed it to stringi::stri_replace_last_regex
- Tested parse_mbox. Perceval was not returning any output. I will look further into why this is happening.
…il refresher.

- Replaced paste0 with stringi::stri_c
- Removed create directory if does not exist
- Added more verbose descriptions/comments
- Added dividers within functions
- Added verbose parameter
- Added else block for refresher
- Added call to process_gz_to_mbox_in_folder at end of refresher
- parse_mbox: stri_replace_last was not working, changed it to stringi::stri_replace_last_regex
- Tested parse_mbox. Perceval was not returning any output. I will look further into why this is happening.

Signed-off-by: Dao McGill <[email protected]>
Updated parameters for download_mod_mbox to use Apache Pony Mail links as Apache lists now redirect there
- Modified downloads to use YYYYMM  instead of YYYY
- Removed the option for downloading by year for clearer functionality.
- Updated vignette/download_mail.Rmd

Signed-off-by: Dao McGill <[email protected]>
- Created `refresh_mod_mbox` function to automatically refresh mailing list archives downloaded using Mod Mbox.
- The function checks for the latest downloaded file, deletes it, and redownloads the archive from that month to the current date.
- Added documentation for `refresh_mod_mbox` to the notebook.

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this pull request Oct 1, 2024
- Updated vignettes/download_mail.Rmd to working version
- Fixed errors in helix.yml
- Minor edits in mail.R
- Updated vignettes/download_mail.Rmd to working version
- Fixed errors in helix.yml
- Minor edits in mail.R

Signed-off-by: Dao McGill <[email protected]>
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
R/mail.R Outdated Show resolved Hide resolved
the flag caused errors of perceval being unable
to parse json files.

Signed-off-by: Carlos Paradis <[email protected]>
@carlosparadis
Copy link
Member

@daomcgill i am still editing so don't commit anything, but this is what I meant: d3dd232

@carlosparadis
Copy link
Member

Note: Documentation and all notebooks of master compile on commit d3dd232.

Simplified some of the notebook language, reduced
title of functions, removed some of the sub-headers
pound symbols as it was creating too many sections
on the code blocks.

Added parser tables after downloaders and remove their
eval so example tables of what can be downloaded are
shown on the generated notebook.

Commit passes check, tests, and downloaders, refresh
and parsers work.

Signed-off-by: Carlos Paradis <[email protected]>
@carlosparadis
Copy link
Member

@daomcgill hi Dao,

I think this is mostly ready for merge, but there is one thing missing I will need to pass back to you: The execs require a start_date. This will basically kill its main purpose, which is to run server-side on a cron job.

The logic of the refresh should make the start_date optional (which means the refresh function itself should make the parameter optional). If no start_date is specified, then an attempt to read an existing file to the folder should be made (since there is no other way to know the start date). IF that fails, then an error should issue saying "No existing data is located. Please specify a start date". Otherwise, the function should proceed as usual.

We need this because server-side the start date is not going to be specified manually every time: The first is specified with a date, but thereafter it is the server itself that will call the script every day to keep the data up to date. If I got confused and this is already implemented let me know.

Please feel free to commit to this PR, I will not make any other changes to avoid conflict.

You should only need to edit the refresh of both pipermaill and mod mbox (do not pass the start_date optional logic to the actual downloader, this is refresher behavior only). Avoid editing anything else to keep the last revision focused so I can merge this right after.

I made a couple comments on the commits if you would like to see what I modified and what I was hoping for in the end. Note I edited the file names too, they are no longer kaiaulu_*. The prefix was confusing since it makes look like every mbox is for Kaiaulu. Since there is no way to obtain the project name here without passing more parameters, only YYYYMM.mbox remains.

Thank you for your hard work on this too. I am glad to finally have downloaders and refreshers for these.

@carlosparadis
Copy link
Member

carlosparadis commented Dec 8, 2024

@crepesAlot @beydlern please confirm if this file path was added to every config project_key. I am referring to get_mbox_input_file is there any equivalent on the other downloaders? If so, they should be removed. The folder path should be the one used instead. I will eventually remove the functions, so I need to know them.

Similar question for the project config file, are they specified anywhere else other htan openssl.yml and helix.yml?

@daomcgill
Copy link
Collaborator

@carlosparadis
So far, I have changed the refreshers to look for a start_year_month in the existing files, and return an error if there are no existing files and no start_year_month.

I have a question about what you said here:
"Your exec and parallel should be file only, or you can't parallelize so that is correct."
and here:
"To be clear: The exec/parallel also should only assume the folder containing the files on the folder, and then figure out from there how to access the files. No single filepath should be hardcoded to the config."

In the Python script for parallelization, I call the parse functions (parse exec) one time per file, since this is the easiest way to separate jobs. If I were to pass a folder to the parse exec, and handle the logic for parsing each file within there, obviously this would not work. How should I deal with this?

@daomcgill
Copy link
Collaborator

@carlosparadis Only solution I can think of is to have the exec check whether the input is file or folder. Would that work for you?

- Make start_year_month optional
- Determine start_year_month from existing files if they exist
- Return error if no existing files, and no date specified

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis
Copy link
Member

@daomcgill Sorry for the confusion, what I meant to say was that I agreed with you: The parallel should be file only. It is already correct, I worded it wrongly. The parse function should be at file level too. The only thing I want to remove is the hardcoded file from the config, that was the part that did not make sense to me.

The notebook should list the files and parse. One thing I'd appreciate you checking is the behavior of parse_mbox when pointed to a folder: I have added code in one of the notebooks that properly look through each file and call parse_mbox() on each file. I have also left a few notebooks that call it just calling the function on the folder. In this latter case, when I inspected the table it seemed to me it did not parse all files. Could you check what it is doing?

@daomcgill
Copy link
Collaborator

@carlosparadis when I call parse_mbox from a loop in the notebook, all files appear to be parsed. I could not run the loop in the social smells notebook you edited, since I am missing data for it. Would you like me to add a loop example in the mail notebook?

- Takes file path for mbox file to parse
- No longer need to pass project_conf

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis
Copy link
Member

@daomcgill In the new notebook download_mail I included the parse functions to showcase the downloader results through the parser. If you confirm pointing parse_mbox() to a folder lead to weird results not taking all files, I would recommend you add the loop not only to both parsers, but also to the other notebooks that use parse_mbox() in Kaiaulu, as they will be abiding by the new format. See the loop I added on d3dd232.

For the notebooks, i'd say search parse_mbox() on GitHub Kaiaulu repo for hits. The notebooks that had _ are a good start (I changed in one of them already).

@daomcgill
Copy link
Collaborator

@carlosparadis Oh! Now I understand. I just tried passing a folder into parse_mbox() from the mail notebook, and it does actually parse all files. I checked the parsed rows against the number of unique reply IDs, and they were equal.

Copy link
Collaborator

@beydlern beydlern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Barring the minor changes requested, the notebook is clear in its presentation and runs as expected.

vignettes/download_mail.Rmd Outdated Show resolved Hide resolved
vignettes/download_mail.Rmd Show resolved Hide resolved
@carlosparadis
Copy link
Member

@daomcgill could you resolve the merge conflicts?

@daomcgill
Copy link
Collaborator

@carlosparadis resolved!

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Copy link
Collaborator

@carlosparadis fixed a few small issues I overlooked. I think this one should be good?

Copy link
Collaborator

@beydlern beydlern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants