324 GitHub discussions with GraphQL API #332

Open
wants to merge 44 commits into base: master
Conversation

crepesAlot
Collaborator

No description provided.

anthonyjlau and others added 30 commits March 22, 2024 17:47
Added get functions for mbox, JIRA, and GitHub.
Functions that require an element from a list (e.g. conf[["version_control"]][["branch"]][4]) now take an index parameter (e.g. conf[["version_control"]][["branch"]][branch_index]). Also finished the rest of the get methods for all notebooks.
I normalized the comments by having them follow a consistent format. I added if wrappers to catch NULL values.
I changed the documentation of the JIRA functions to return a list of project keys and storage paths (HADOOP case).

I also changed the documentation of the mbox function to return a list of storage paths.
Confirming setup for committing and PR reflection.
Confirming setup for committing and PR reflection.
In each notebook, the working directory is the vignettes/ directory; however, getwd() reports the working directory as the kaiaulu/ folder itself, because each R code chunk treats its own directory as the working directory. To test the functionality of the get functions, I edited the config file to reflect the relative directory of vignettes/, so parent-directory (../) file path changes were added to each function.
The source("../R/config.R") reference was previously used to import the get() functions from config.R; however, I wasn't building the project correctly. Now this incorrect reference can be removed.
Due to a minor misunderstanding on my part, I added relative paths in the config.R file from the notebook's directory. This was because the RStudio IDE changes the working directory to the currently running notebook's directory rather than staying at the project directory.
- Function titles, descriptions, and parameter documentation were reviewed and redefined.
- Removed inefficient hardcoding of tools.yml and duplicated "config" path code.
- Appropriate warning messages were added to each function for the case where the return value is NULL.
- Notebook blamed_line_types_showcase.Rmd was updated to follow the redefined R/config.R get functions.
- config.R was given descriptive get function titles and descriptions, and its getter functions were separated into groups for clarity.
- Some appropriate notebooks (every notebook in vignettes/ that doesn't have an "_" prefix) were decoupled to use the get() functions from config.R in lieu of hard-coded paths to the configuration files.
- Updated mailing_list for conf/ files
- Updated issue_tracker for conf/ files
- Refactored most of the notebooks to replace the hard coded paths with the getter functions in R/config.R
- Updated a few of the configuration files (.yml extension) in conf/ fixing some syntax and indentation errors
- Added another getter function in R/config.R called get_github_issue_event_path
- Edited DESCRIPTION to incorporate myself as a contributor
- Edited NEWS.md to describe refactoring getter function feature.
- The config_file parameter description was incorrect; a reference to how it is obtained was added to its description.
- renamed get_parsed() to parse_config() and changed the appropriate getter calls in all of the notebooks.
- reverted RoxygenNote version in DESCRIPTION
- config.R functions are organized to follow the same formatting as the conf files
- All conf files issue_trackers have been updated.
- get_mbox_mailing_list() and get_mbox_archive_type() were removed as the keys that they are grabbing are scheduled for removal.
- added two new bugzilla getter functions get_bugzilla_issue_comment_path and get_bugzilla_issue_path
- Implemented two pipermail mailing list getters get_pipermail_domain and get_pipermail_path
- Implemented two more getter functions: get_mbox_input_file() and get_pipermail_input_file()
- Added Scitools understand section to tools
- Updated mailing_list
- Implemented 4 Getter functions: get_understand_code_language(), get_understand_keep_dependencies_type(), get_understand_project_path(), and get_understand_output_path().
- Changed get_code_language() and get_keep_dependencies_type() to get_depends_code_language() and get_depends_keep_dependencies_type(), respectively, and updated the appropriate notebooks that used these getters.
beydlern and others added 12 commits October 10, 2024 11:20
- Updated NEWS.md
- Minor change to mailing_list
- The project configuration sections of each notebook were incorrectly using the project directory (kaiaulu/) as their working directory rather than the directory they reside in (vignettes/).
- Uncommented the output_filepath line for pattern4 tool in junit5.yml configuration file.
- Uncommented the mod_mbox mailing list section for mailing_list in openssl.yml configuration file.
For github.R:
- github_api_discussions
- github_parse_discussions
- github_parse_discussion_comments
For config.R:
- get_github_discussions_path
For conf/kaiaulu.yml:
- Added new discussion field in github issue_tracker for a save filepath for discussions JSON file
For github_api_showcase.Rmd:
- Demonstrating usage for the newly added functions.
Changed the function to paginate as needed to download all available entries.
- Added refresher function for discussions
- Known issue (not yet fixed): the refresher function could error if a JSON file is improperly named.
- Added `create_file_directory` function
- Added `create_file_path` helper function
- Added @export to functions
- Added some more detail to function description
- Added discussion filepath under github issue tracker
- Moved the Download Discussions function and explanations from `vignettes/github_api_showcase.Rmd` to the `vignettes/download_github_comments.Rmd` notebook
…' into 324-github-discussions-with-GraphQL-API
- Updated filepaths to unify formatting
- Added discussion save filepath to github issue tracker
- Updated `vignettes/download_github_comments.Rmd` notebook
- Added documentation for refresh function
- Set create_file_directory function verbose param to FALSE as default
@carlosparadis
Member

@crepesAlot You mentioned a few times that something was not working on this PR, but I don't remember ever going into detail with you. Do you want to share what you found was not working, to see if I can help?

@crepesAlot
Collaborator Author

crepesAlot commented Dec 8, 2024

@carlosparadis Yes, I'm having difficulties getting the refresher to work properly.
The main problem is this line:

  # Read the JSON file
  json_data <- fromJSON(file.path(save_folder_path, latest_discussion), simplifyVector = FALSE)

The error I get is

Error: lexical error: invalid char in json text.
                                       NA
                     (right here) ------^

The problem seems to stem from the way I check the folder the downloaded JSON files should be in.

  # If there are no json files, download all discussions
  if (length(contents) == 0 ) {
    # Run the regular downloader
    discussions <- github_api_discussions(token, owner, repo, save_folder_path)
    return (discussions)
  }

However, if I delete the JSON files in the save_folder_path (thus making it empty) it does not run the regular downloader.
running:

contents <- list.files(save_folder_path, all.files = TRUE)
print(contents)
print(length(contents)==0)
[1] "."         ".."        ".DS_Store"
[1] FALSE

I believe I need to somehow make the function explicitly check for JSON files, rather than if the folder is empty.
My attempts at this were:

json_contents <- list.files(save_folder_path, pattern= "\\.json$", full.names = TRUE, recursive= TRUE)

However this also causes some problems as it adds an extra slash to the folder_path like so:

"../../rawdata/deep-live-cam/github/hacksider_deep-live-cam/discussion//hacksider_Deep-Live-Cam_1714152212_1732289412.json"

I know this isn't caused by an incorrect save_folder_path, because printing the save_folder_path returns

"../../rawdata/deep-live-cam/github/hacksider_deep-live-cam/discussion/"

Explicitly checking for the JSON files solves the issue of the refresh not running the regular downloader when the folder is empty, but it throws the same error when the folder is not empty (thus necessitating a check that the downloaded files are up to date).
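Both symptoms above can be handled together. A minimal sketch (list_json_files is illustrative, not this PR's code; the only assumption is that save_folder_path may or may not end in a trailing slash): list.files() with its default all.files = FALSE already skips ".", "..", and hidden files such as .DS_Store, the pattern keeps only JSON files, and trimming the trailing slash avoids the doubled "//" that full.names = TRUE produces when the folder path already ends in "/".

```r
# Sketch: list only .json files in the save folder. The default
# all.files = FALSE already excludes ".", "..", and hidden files such as
# ".DS_Store"; trimming a trailing "/" prevents full.names = TRUE from
# emitting "folder//file.json".
list_json_files <- function(save_folder_path) {
  folder <- sub("/+$", "", save_folder_path)
  list.files(folder, pattern = "\\.json$", full.names = TRUE)
}
```

With this, `length(list_json_files(save_folder_path)) == 0` is a reliable "no downloads yet" test even when the folder contains a stray .DS_Store.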

@carlosparadis
Member

@crepesAlot have you checked how the JIRA downloader handles this? It also downloads json files. You may even be able to reuse the entire code logic from there.

@crepesAlot
Collaborator Author

@carlosparadis I found out the solution was simple all along: I just needed to remove all.files = TRUE from the list.files() call.
Unfortunately, I found some new errors that I need to fix as well.
When I delete one of the JSON files (to simulate the folder being out of date), it doesn't properly update and download the most recent discussions.
I tried to follow the JIRA and GitHub refresh functions as closely as I could, but because GraphQL operates off of a query, the current problem with the code is the way I am forming that query, which is something I cannot copy from the other refresh functions.

@carlosparadis
Member

Just so I am clear, when you say it does not update: are you trying to edit the existing file? You don't need to edit the existing file; it is okay for the new changes to be added as a new file, provided the timestamp of the file is correct.

The files do not need to be precisely per month. I do not believe JIRA or GitHub obeys that either. Dao's is per month only because mbox files are downloaded per month.

@crepesAlot
Collaborator Author

@carlosparadis No, I am testing with a repository for which the downloader creates 2 new JSON files, then deleting the JSON file with the more recent discussions. I then run the refresher to see if it will pick up that there are more recent discussions than in the remaining JSON file.
But the refresh function doesn't detect this and reports that there are no new discussions to download.

I strongly believe this is because I structured the query in the refresh incorrectly.
I am having some difficulties fixing this due to some confusion about how GraphQL retrieves the discussions from GitHub.

discussions(
  after: String,
  before: String,
  first: Int,
  last: Int,
  categoryId: ID = null,
  answered: Boolean = null,
  orderBy: DiscussionOrder = {field: UPDATED_AT, direction: DESC}
) : Discussion

I had thought the query would work by providing the date in the "after" field, but that doesn't seem to work.
The queries instead rely on a cursor, a type of string retrieved from a previous query.

@carlosparadis
Member

@crepesAlot You should not have delete-file logic! This will overly complicate things. Just have it so the new file starts off where the last file ended, which information can be obtained from the file name. Trying to edit an existing JSON to add more data is the road to madness!

I'm still trying to understand the cursor part. Are you saying we can't query discussions starting from a createdAt date? So long as that is possible, you can just create the new file. Please remove the delete-file logic; it will be hard to maintain with file deletion, which is why GitHub, JIRA, and Bugzilla just add new files.

@crepesAlot
Collaborator Author

@carlosparadis The function doesn't delete or modify any JSON files. The downloader acts exactly as you say, it downloads 1 page, saves it as a JSON, then goes to page 2 and saves that as a new, separate JSON file.

The act of manually deleting a file (me going into Finder and deleting it) was my attempt to force a check of whether the dates of the JSON files were behind the latest discussion, and then download the discussions from where the remaining JSON left off, which is what didn't work.
I wasn't sure of another way to test it without posting random things in this repo's discussions, but I was able to find that the query in the refresh function doesn't work the way I thought it did regardless.

We cannot directly form a query using a createdAt date.
I am currently trying to see if there's still a way to implement a refresher function via the way the query orders the discussions and saves the cursor.
From my understanding, the cursor acts as a sort of page number for the query (since the number of discussions a query can get per page can vary from 1 to 100). But it is currently up in the air whether I can get it working.
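The cursor mechanics described above can be sketched as query construction in R (a hedged illustration; build_discussions_query and its parameters are hypothetical, not this PR's actual code). In GraphQL connection pagination, the after argument takes an opaque cursor string from the previous page's pageInfo.endCursor, not a date, so a refresher has to order by a timestamp field and stop paging once it reaches entries already saved to disk:

```r
# Sketch: build a discussions query that orders by CREATED_AT descending
# and can resume from a cursor. "after" takes an opaque cursor string
# (the previous page's pageInfo.endCursor), not a date.
build_discussions_query <- function(owner, repo, after_cursor = NULL) {
  after_part <- if (is.null(after_cursor)) "" else sprintf(', after: "%s"', after_cursor)
  sprintf('
query {
  repository(owner: "%s", name: "%s") {
    discussions(first: 100%s, orderBy: {field: CREATED_AT, direction: DESC}) {
      pageInfo { endCursor hasNextPage }
      nodes { number title createdAt updatedAt }
    }
  }
}', owner, repo, after_part)
}
# A caller would loop: POST the query, save the page as its own JSON file,
# then, while pageInfo.hasNextPage is TRUE and the nodes are newer than the
# latest createdAt already on disk, re-issue with after = pageInfo.endCursor.
```

Ordering newest-first lets the loop terminate as soon as it pages past the newest timestamp already saved, without any delete-file logic.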

- Reconfigured how `create_file_directory` function obtains paths from config file