Expanding the Syntax Extractor #313

Open
7 of 8 tasks
daomcgill opened this issue Oct 9, 2024 · 10 comments · May be fixed by #320
@daomcgill
Collaborator

daomcgill commented Oct 9, 2024


Purpose

The Syntax Extractor in Kaiaulu is used to extract meaningful information from source code using srcML. The purpose of this task is to extend the syntax extraction capabilities by adding new functions to extract file-level and class-level documentation using XPath queries. The extracted data will be used in future stages, where the goal is to combine the data with NLP to create semantic representations of the code.

Process

  1. Understand the current syntax extractor functions, and how annotations are represented in srcML XML files.
  2. Understand XPath queries, and create queries for the new functions (file and class level documentation).
  3. Implement new functions with the custom XPath queries.
  4. Test on various examples.
  5. Consider other syntactic elements that could be useful for the bigger picture. Example: Functions in git.R that retrieve commit messages or issue discussions tied to a file.
  6. Create a notebook for Syntax Extraction, and keep it up to date with existing and new functionality.

Existing Functions

New Functions

  • query_src_text_file_documentation(): Extract file-level documentation (e.g. comments in the file header).
  • query_src_text_class_documentation(): Extract class-level documentation (e.g. comments before class declaration).
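
To make the intent of the two new functions concrete, here is a minimal Python sketch. It is not the Kaiaulu implementation (Kaiaulu is written in R and queries XML produced by the srcML tool). The XML below is hand-written to imitate srcML output for a small Java file, so the element names and the namespace URI are assumptions to check against real output, and `file_documentation()` and `class_documentation()` are hypothetical names. Python's `xml.etree.ElementTree` supports only a subset of XPath (no `preceding-sibling` axis), so the class-level lookup walks sibling elements instead of using a single query.

```python
import xml.etree.ElementTree as ET

# Assumed srcML namespace; check it against the XML your srcML version emits.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written stand-in for srcML output of one Java file (an assumption, not real tool output).
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src" language="Java" filename="src/Shape.java">
<comment type="block">/* File header: geometry helpers. */</comment>
<import>import <name>java.util.List</name>;</import>
<comment type="block" format="javadoc">/** A two-dimensional shape. */</comment>
<class><specifier>public</specifier> class <name>Shape</name> <block>{ }</block></class>
</unit>"""


def local_name(element):
    """Return the tag without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def file_documentation(unit):
    """Comments that appear before any other element in the file."""
    docs = []
    for child in unit:
        if local_name(child) != "comment":
            break
        docs.append(child.text)
    return docs


def class_documentation(unit):
    """Map each top-level class name to the comment directly preceding it."""
    docs = {}
    previous = None
    for child in unit:
        is_class = local_name(child) == "class"
        if is_class and previous is not None and local_name(previous) == "comment":
            docs[child.find("src:name", NS).text] = previous.text
        previous = child
    return docs


unit = ET.fromstring(SAMPLE)
print(file_documentation(unit))
print(class_documentation(unit))
```

With a full XPath engine, the class-level case could probably be expressed as one query using the `preceding-sibling` axis; the sibling walk above is only a workaround for the standard library's limits.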

Task List

  • Install and set up srcML. This takes source code and adds XML annotations (e.g. classes, methods, variables, etc.). Include the path to srcML in tools.yml.
  • Use the existing functions to generate XML files. Inspect the files to understand the output.
  • Gain an understanding of how XPath queries are formed, and figure out how to create queries for the new functions.
  • Write the new XPath queries.
  • Write the new functions.
  • Verify that the functions work correctly.
  • Consider what other functions could be useful, and implement those.
  • Maintain a notebook for Syntax Extraction.

References


@daomcgill daomcgill self-assigned this Oct 9, 2024
@daomcgill
Collaborator Author

@carlosparadis here is my first pass at the issue for part I.

@carlosparadis
Member

Yes, this is great!

The one thing I would emphasize is:

Understand XPath queries, and create queries for the new functions (file and class level documentation).

The better mindset is to consider:

  1. Imagine a table in which every row is a file in the project. The first column is the filepath; the second column contains, in one cell, all of the file's content as words.
  2. If we wrote programs purely in English, then every word would be meaningful. But the words "for", "if", etc., do not carry a lot of meaning on their own. So, in another column, we could consider creating a cell without them (let's skip this for now).
  3. Then in a subsequent column we could say, "let's just focus on docstrings since that's actual English". Now you have another column that only contains words coming from docstrings.
  4. You could also have another column that, in addition to the docstrings, includes the variable names, since they carry meaning. Or function names, etc.

Every column is a different "extract" from all the words in each file. The functions you will add will extract these words from a project's files. This would enable the rest of the pipeline to experiment with which set of words is more meaningful.
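
The table described above can be sketched in a few lines of Python. This is only an illustration of the idea: the input dictionary, the keyword list, and `build_extract_table()` are hypothetical, and in the real pipeline the docstrings and identifiers would come from the syntax extractor rather than being typed in by hand.

```python
import re

# Words treated as carrying little meaning on their own (an illustrative, incomplete list).
KEYWORDS = {"for", "if", "else", "while", "return", "public", "class", "int", "void"}

# Hypothetical per-file inputs: raw source plus pieces already pulled out by the extractor.
FILES = {
    "src/Counter.java": {
        "source": "/** Counts visits. */ public class Counter { int total; void add() { total++; } }",
        "docstrings": ["Counts visits."],
        "identifiers": ["Counter", "total", "add"],
    },
}


def words(text):
    """Lowercased alphabetic tokens of a string."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def build_extract_table(files):
    """One row per file; each extra column is a different word extract."""
    table = []
    for filepath, parts in files.items():
        all_words = words(parts["source"])
        doc_words = words(" ".join(parts["docstrings"]))
        table.append({
            "filepath": filepath,
            "all_words": all_words,
            "no_keywords": [w for w in all_words if w not in KEYWORDS],
            "docstrings": doc_words,
            "docstrings_and_names": doc_words + words(" ".join(parts["identifiers"])),
        })
    return table


table = build_extract_table(FILES)
for row in table:
    for column, value in row.items():
        print(column, "->", value)
```

Keeping every extract as its own column means a later stage can swap one set of words for another without re-running the extraction.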

daomcgill added a commit that referenced this issue Oct 15, 2024
- Documented current syntax extraction functions
- Overview on syntax extraction and XPath
- Placeholder for syntax.yml config file

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator Author

@carlosparadis I added a notebook for syntax extraction. Could you skim it and see if this is what you are looking for? The section on XPath queries is definitely a WIP; it was more to gather my own understanding. It will be updated as I work on getting the actual queries needed for the new functions. The new config file is a placeholder for now.

@carlosparadis
Member

@daomcgill did you open a PR for it?

@daomcgill
Collaborator Author

daomcgill commented Oct 15, 2024

@carlosparadis I had not. I just opened one here.

daomcgill added a commit that referenced this issue Oct 18, 2024
- Added new functions
- New configuration file
- Updated documentation

Signed-off-by: Dao McGill <[email protected]>
@daomcgill daomcgill linked a pull request Oct 18, 2024 that will close this issue
daomcgill added a commit that referenced this issue Oct 18, 2024
- Remove unused settings
- Change ../ to ../../
- Update notebook to reflect changes

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Oct 24, 2024
- Added parameter for excluding licenses in class and file-level comment extraction
- Implemented function extraction for function names with optional parameters
- Implemented variable extraction with optional types
- Added examples for removing empty comments and/or comment delimiters

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator Author

@carlosparadis I made the changes we discussed. I also added functions for extracting function names (opt. params) and variable names (opt. types). I do need to do a more thorough manual check to make sure the results are accurate, but it appears to be working so far.
Should I continue adding functions--maybe imports or some way of determining inheritance structure (I would need to think about this some more)--or move on and come back once the other pieces start to come together?
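
For readers following along, the comment clean-up options mentioned in the commit above (excluding licenses, removing empty comments, removing comment delimiters) could look roughly like this in Python. This is a sketch under assumptions, not the code in the PR: `strip_delimiters()` and `clean_comments()` are hypothetical names, and spotting a license header by looking for the words "license" or "copyright" is a simple heuristic chosen for illustration.

```python
import re

# Hypothetical marker words used to guess that a comment is a license header.
LICENSE_MARKERS = ("license", "copyright")


def strip_delimiters(comment):
    """Remove /* */, /** */ and // delimiters plus the leading '*' on each line."""
    body = comment.strip()
    if body.startswith("/*") and body.endswith("*/") and len(body) >= 4:
        body = body[2:-2].strip("*")
    lines = [re.sub(r"^\s*(//+|\*+)\s?", "", line) for line in body.splitlines()]
    return " ".join(line.strip() for line in lines if line.strip())


def clean_comments(comments, exclude_licenses=True):
    """Strip delimiters, drop empty comments, and optionally drop license headers."""
    cleaned = []
    for comment in comments:
        text = strip_delimiters(comment)
        if not text:
            continue  # nothing left once the delimiters are gone
        if exclude_licenses and any(marker in text.lower() for marker in LICENSE_MARKERS):
            continue  # heuristic match on license wording
        cleaned.append(text)
    return cleaned


comments = [
    "/**\n * Licensed under the Apache License.\n */",
    "/** */",
    "/** A shape. */",
    "// TODO: rename",
]
print(clean_comments(comments))
print(clean_comments(comments, exclude_licenses=False))
```

A keyword heuristic like this can also drop a genuine docstring that happens to mention a license, which is one reason to keep the filter optional.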

@carlosparadis
Member

@daomcgill the code that @RavenMarQ did already determines dependency types, so no need to go after that. If you already covered variables, functions, classes, is there anything else in the source code you think carries semantic meaning?

Also, you did this for Java, right? What were the other available languages srcML covered again?

And lastly, did you do the parser functions for these too or just the command to create the XML?

@daomcgill
Collaborator Author

@carlosparadis I do not think operators, specifiers, or control statements would be particularly meaningful. Do you have any ideas of what might be useful? The only thought I had was imports.

I did this for Java. It looks like srcML is also available for C, C++ and C#.

I am not sure what you mean by parser functions vs command to create the XML. I created functions that query the XML file generated by the preexisting function, and extract the class/file level comments, functions and variables.
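
As an illustration of what such query functions do, here is a short Python sketch that pulls function names (optionally with parameters) and variable names (optionally with types) out of srcML-style XML. The XML is hand-written to resemble srcML output for Java, so the element layout and namespace URI are assumptions, and `function_names()` and `variable_names()` are hypothetical names rather than the functions in the PR.

```python
import xml.etree.ElementTree as ET

# Assumed srcML namespace; check it against the XML your srcML version emits.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written stand-in for srcML output (an assumption, not real tool output).
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src" language="Java" filename="src/Counter.java">
<class>class <name>Counter</name> <block>{
<decl_stmt><decl><type><name>int</name></type> <name>total</name></decl>;</decl_stmt>
<function><type><name>void</name></type> <name>add</name><parameter_list>(<parameter><decl><type><name>int</name></type> <name>amount</name></decl></parameter>)</parameter_list> <block>{ }</block></function>
}</block></class>
</unit>"""


def function_names(unit, with_parameters=False):
    """One row per function: filepath, name, and optionally parameter names."""
    rows = []
    for fn in unit.iterfind(".//src:function", NS):
        row = {"filepath": unit.get("filename"), "function": fn.find("src:name", NS).text}
        if with_parameters:
            params = fn.findall("src:parameter_list/src:parameter/src:decl/src:name", NS)
            row["parameters"] = [p.text for p in params]
        rows.append(row)
    return rows


def variable_names(unit, with_types=False):
    """One row per declared variable: filepath, name, and optionally its type."""
    rows = []
    # Only declarations inside <decl_stmt> are visited, so parameters are not included.
    for decl in unit.iterfind(".//src:decl_stmt/src:decl", NS):
        row = {"filepath": unit.get("filename"), "variable": decl.find("src:name", NS).text}
        if with_types:
            row["type"] = "".join(decl.find("src:type", NS).itertext()).strip()
        rows.append(row)
    return rows


unit = ET.fromstring(SAMPLE)
print(function_names(unit, with_parameters=True))
print(variable_names(unit, with_types=True))
```

Every row carries the filepath, so the output of different queries can be connected later.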

@carlosparadis
Member

Hi Dao,

I honestly can't think of anything else. I would say go ahead and add the imports. We can just stick with Java for now. Python would have been nice if they covered it.

Once this issue is done, I suggest we also skip ahead to play with the Python Notebook that processes this data on the other repo, since this will give you a better idea of how this data is being used, and then we can cycle back to the issue that represents files as the commits and comments associated with them.

Have you been using your functions to parse a table of 1 file, or the entire project? Ideally, we will want a table that contains the filepaths that the classes, methods, and docstrings belong to, so it helps with subsetting when we move to the Python script.

I recall in the Text GoF notebook I had to create a filepath and a classpath to be able to connect both together, but maybe the xpath can give you both.

@daomcgill
Collaborator Author

daomcgill commented Oct 24, 2024

@carlosparadis I will add the imports. It looks like srcML should be adding support for more languages soon, but we do not know when.

Also, I realized what you meant by parser functions. I did create those as well.

So far, I have been using the functions on one specific src folder in the Maven repo. It contains multiple files, but not the entire project. I can work on that. The tables generated show the filepath plus the element that specific function is querying, e.g. filepath and classes. I could make a function that calls each of the query functions and compiles the extracted data into a single table.
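
A rough Python sketch of that combining step, assuming each query function returns rows of the form filepath plus one extracted element. The row data and `combine_by_filepath()` are hypothetical and only illustrate the shape of the joined table.

```python
from collections import defaultdict

# Hypothetical outputs of individual query functions: one row per extracted element.
class_rows = [{"filepath": "src/A.java", "class": "A"}]
function_rows = [
    {"filepath": "src/A.java", "function": "run"},
    {"filepath": "src/A.java", "function": "stop"},
    {"filepath": "src/B.java", "function": "main"},
]
import_rows = [{"filepath": "src/B.java", "import": "java.util.List"}]


def combine_by_filepath(tables):
    """Collect each table's values under its column name, one row per filepath."""
    combined = defaultdict(lambda: {column: [] for column in tables})
    for column, rows in tables.items():
        for row in rows:
            combined[row["filepath"]][column].append(row[column])
    # Sort by filepath so the output order does not depend on the input order.
    return [{"filepath": path, **columns} for path, columns in sorted(combined.items())]


joined = combine_by_filepath({
    "class": class_rows,
    "function": function_rows,
    "import": import_rows,
})
for row in joined:
    print(row)
```

Files that have no match for a given query keep an empty list in that column, so every file appears exactly once and can be subset by filepath.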

daomcgill added a commit that referenced this issue Oct 25, 2024
- Added function for imports
- Reformatted new query functions
- Added Notebook Example for Joined Queries

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Oct 25, 2024
daomcgill added a commit that referenced this issue Oct 29, 2024
- Fix for issue with namespaces in certain queries
- TO DO: Package function currently missing filepath

Signed-off-by: Dao McGill
daomcgill added a commit that referenced this issue Oct 30, 2024
- Now displays filenames correctly

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Oct 30, 2024
daomcgill added a commit that referenced this issue Oct 31, 2024
daomcgill added a commit that referenced this issue Oct 31, 2024
daomcgill added a commit that referenced this issue Oct 31, 2024
daomcgill added a commit that referenced this issue Nov 1, 2024
daomcgill added a commit that referenced this issue Nov 2, 2024
- TO DO: Cheatsheet for this work thread

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Nov 5, 2024
@carlosparadis carlosparadis added this to the ics496-fall24-m3 milestone Nov 11, 2024
daomcgill added a commit that referenced this issue Dec 9, 2024
- Added getter for src_folder
- Updated notebook to use getters

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 9, 2024
Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 9, 2024
Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 9, 2024
- remove print statement
- gt displays head(10)

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 9, 2024
Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 9, 2024
- Added back filters using get()

Signed-off-by: Dao McGill <[email protected]>

3 participants