Expanding the Syntax Extractor #313

daomcgill · 2024-10-09T23:54:28Z

Purpose

The Syntax Extractor in Kaiaulu is used to extract meaningful information from source code using srcML. The purpose of this task is to extend the syntax extraction capabilities by adding new functions to extract file-level and class-level documentation using XPath queries. The extracted data will be used in future stages, where the goal is to combine the data with NLP to create semantic representations of the code.

Process

Understand the current syntax extractor functions, and how annotations are represented in srcML XML files.
Understand XPath queries, and create queries for the new functions (file and class level documentation).
Implement new functions with the custom XPath queries.
Test on various examples.
Consider other syntactic elements that could be useful for the bigger picture. Example: Functions in git.R that retrieve commit messages or issue discussions tied to a file.
Create a notebook for Syntax Extraction, and keep it up to date with both existing and new functionality.

Existing Functions

annotate_src_text(): Runs srcML on a folder of source code and outputs the annotated XML.
query_src_text(): Runs an XPath query on the annotated XML generated from annotate_src_text().
query_src_text_class_names(): Extracts class names from the annotated source code.
query_src_text_namespace(): Extracts the namespace (file path) from the annotated source code.

New Functions

query_src_text_file_documentation(): Extract file-level documentation (e.g. comments in the file header).
query_src_text_class_documentation(): Extract class-level documentation (e.g. comments before class declaration).
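
As a rough illustration of what the two new functions need to locate, here is a minimal Python sketch (not Kaiaulu's R implementation). It assumes srcML marks comments as `comment` elements and classes as `class` elements in the `http://www.srcML.org/srcML/src` namespace. The XML string is hand-written to resemble srcML output, and the helper names `file_documentation` and `class_documentation` are hypothetical. The standard-library parser used here does not support sibling axes, so the sketch walks the children of the file's `unit`. With a full XPath 1.0 engine, an expression such as `//src:class/preceding-sibling::*[1][self::src:comment]` could play a similar role for class-level comments.

```python
import xml.etree.ElementTree as ET

SRC_NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written approximation of srcML output for one Java file
# (an assumption for illustration, not real tool output).
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src" language="Java" filename="src/Greeter.java"><comment type="block">/* File header: greeting utilities. */</comment>
<package>package <name>demo</name>;</package>
<comment type="block" format="javadoc">/** Builds greetings. */</comment>
<class><specifier>public</specifier> class <name>Greeter</name> <block>{ }</block></class>
</unit>"""


def tag(name):
    """Return the namespace-qualified tag name used by ElementTree."""
    return "{%s}%s" % (SRC_NS["src"], name)


def file_documentation(unit):
    """Return the comments that appear before the first non-comment element."""
    docs = []
    for child in unit:
        if child.tag != tag("comment"):
            break
        docs.append("".join(child.itertext()))
    return docs


def class_documentation(unit):
    """Return (class_name, comment) pairs where the comment directly precedes the class."""
    pairs = []
    previous = None
    for child in unit:
        if (child.tag == tag("class") and previous is not None
                and previous.tag == tag("comment")):
            name = child.find("src:name", SRC_NS)
            pairs.append(("".join(name.itertext()), "".join(previous.itertext())))
        previous = child
    return pairs


unit = ET.fromstring(SAMPLE)
file_docs = file_documentation(unit)
class_docs = class_documentation(unit)
```

In this sketch a header comment counts as file-level documentation only if it comes before any other element, and a comment counts as class-level documentation only if it is the element immediately before the class. Both rules are simplifying assumptions.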

Task List

Install and set up srcML. This takes source code and adds XML annotations (e.g. classes, methods, variables, etc.). Include the path to srcML in tools.yml.
Use the existing functions to generate XML files. Inspect the files to understand the output.
Gain an understanding of how XPath queries are formed, and figure out how to create queries for the new functions.
Write the new XPath queries.
Write the new functions.
Verify that the functions work correctly.
Consider what other functions could be useful, and implement those.
Maintain a notebook for Syntax Extraction.

References

daomcgill · 2024-10-09T23:56:16Z

@carlosparadis here is my first pass at the issue for part I.

carlosparadis · 2024-10-10T00:37:49Z

Yes, this is great!

The one thing I would emphasize is:

Understand XPath queries, and create queries for the new functions (file and class level documentation).

The better mindset is to consider:

Imagine a table in which every row is a file in the project. The first column is the filepath, and the second column contains, in a single cell, all of the file's content as words.
If we wrote programs purely in English, then every word would be meaningful. But words such as "for", "if", etc., do not carry a lot of meaning on their own. So, in another column, we could consider creating a cell without them (let's skip this for now).
Then in a subsequent column we could say, "let's just focus on docstrings since that's actual English". Now you have another column that only contains words coming from docstrings.
You could also have another column that, in addition to the docstrings, includes the variable names, since they carry meaning. Or function names, etc.

Every column is a different "extract" from all the words in each file. The functions you will add will extract these words from a project's files. This would enable the rest of the pipeline to experiment with which set of words is more meaningful.
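
To make the table idea concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the project content is made up, the comment detection is a plain regular expression rather than srcML, and the names `build_extract_table`, `all_words`, and `docstring_words` are hypothetical.

```python
import re

# Hypothetical project: filepath -> file content (made-up data for illustration).
FILES = {
    "src/Greeter.java": "/** Builds greetings. */\npublic class Greeter { int count; }",
}

WORD = re.compile(r"[A-Za-z_]+")
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def words(text):
    """Split text into lowercase word tokens."""
    return [w.lower() for w in WORD.findall(text)]


def build_extract_table(files):
    """Build one row per file, with one column per kind of extract."""
    table = []
    for filepath, content in sorted(files.items()):
        comments = " ".join(BLOCK_COMMENT.findall(content))
        table.append({
            "filepath": filepath,
            "all_words": words(content),         # every word in the file
            "docstring_words": words(comments),  # only words inside block comments
        })
    return table


table = build_extract_table(FILES)
```

Further columns (for example docstrings plus variable names) would follow the same pattern: one extraction function per column, all keyed by the same filepath.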

- Documented current syntax extraction functions - Overview on syntax extraction and XPath - Placeholder for syntax.yml config file Signed-off-by: Dao McGill <[email protected]>

daomcgill · 2024-10-15T02:27:37Z

@carlosparadis I added a notebook for syntax extraction. Could you skim it and see if this is what you are looking for? The section on XPath queries is definitely a WIP, it was more to gather my own understanding. It will be updated as I work on getting the actual queries needed for the new functions. New config file is a placeholder as of now.

carlosparadis · 2024-10-15T03:44:17Z

@daomcgill did you open a PR for it?

daomcgill · 2024-10-15T03:52:38Z

@carlosparadis I had not. I just opened one here.

- Added parameter for excluding licenses in class and file-level comment extraction - Implemented function extraction for function names with optional parameters - Implemented variable extraction with optional types - Added examples for removing empty comments and/or comment delimiters Signed-off-by: Dao McGill <[email protected]>

daomcgill · 2024-10-24T01:25:55Z

@carlosparadis I made the changes we discussed. I also added functions for extracting function names (opt. params) and variable names (opt. types). I do need to do a more thorough manual check to make sure the results are accurate, but it appears to be working so far.
Should I continue adding functions--maybe imports or some way of determining inheritance structure (I would need to think about this some more)--or move on and come back once the other pieces start to come together?

carlosparadis · 2024-10-24T02:40:42Z

@daomcgill the code that @RavenMarQ did already determines dependency types, so no need to go after that. If you already covered variables, functions, classes, is there anything else in the source code you think carries semantic meaning?

Also, you did this for Java, right? What were the other available languages srcML covered again?

And lastly, did you do the parser functions for these too or just the command to create the XML?

daomcgill · 2024-10-24T07:24:40Z

@carlosparadis I do not think operator, specifier or control statements would be particularly meaningful. Do you have any ideas of what might be useful? The only thought I had was imports.

I did this for Java. It looks like srcML is also available for C, C++ and C#.

I am not sure what you mean by parser functions vs command to create the XML. I created functions that query the XML file generated by the preexisting function, and extract the class/file level comments, functions and variables.
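
For readers following along, here is a rough Python sketch of the kind of query described above: pulling function names (with parameters) and variable names (with types) out of srcML-style XML. It is not the Kaiaulu code. The XML is a hand-written approximation of srcML output, and the helper names `function_signatures` and `variable_declarations` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written approximation of srcML output (an assumption, not real tool output).
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src" language="Java" filename="src/Calc.java"><class>class <name>Calc</name> <block>{
<decl_stmt><decl><type><name>int</name></type> <name>total</name></decl>;</decl_stmt>
<function><type><name>int</name></type> <name>add</name><parameter_list>(<parameter><decl><type><name>int</name></type> <name>a</name></decl></parameter>, <parameter><decl><type><name>int</name></type> <name>b</name></decl></parameter>)</parameter_list> <block>{<block_content> </block_content>}</block></function>
}</block></class></unit>"""


def text_of(element):
    """Concatenate all text inside an element."""
    return "".join(element.itertext())


def function_signatures(unit):
    """Return (function_name, [parameter_text, ...]) for every function."""
    result = []
    for fn in unit.findall(".//src:function", NS):
        name = text_of(fn.find("src:name", NS))
        params = [text_of(p) for p in fn.findall("src:parameter_list/src:parameter", NS)]
        result.append((name, params))
    return result


def variable_declarations(unit):
    """Return (variable_name, type_text) for declaration statements, excluding parameters."""
    result = []
    for decl in unit.findall(".//src:decl_stmt/src:decl", NS):
        result.append((text_of(decl.find("src:name", NS)),
                       text_of(decl.find("src:type", NS))))
    return result


unit = ET.fromstring(SAMPLE)
functions = function_signatures(unit)
variables = variable_declarations(unit)
```

Restricting the variable query to `decl_stmt/decl` is a design choice in this sketch so that function parameters are not reported as variables.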

carlosparadis · 2024-10-24T10:59:23Z

Hi Dao,

I honestly can't think of anything else. I would say go ahead and add the imports. We can just stick with Java for now. Python would have been nice if they covered it.

Once this issue is done, I suggest we also skip ahead to play with the Python Notebook that processes this data on the other repo, since this will give you a better idea of how this data is being used, and then we can cycle back to the issue that represents files as the commits and comments associated with them.

Have you been using your functions to parse a table of 1 file, or the entire project? Ideally, we will want a table that contains the filepath that each class, method, and docstring belongs to, so it helps with subsetting when we move to the Python script.

I recall in the Text GoF notebook I had to create a filepath and a classpath to be able to connect both together, but maybe the xpath can give you both.
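
One possible way to get both from the same XML, sketched in Python under assumptions: each per-file `unit` in a srcML archive is assumed to carry a `filename` attribute, and the Java `package` element is assumed to hold the package name. The XML below is hand-written to resemble that layout, and `file_and_class_paths` is a hypothetical helper, not an existing Kaiaulu function.

```python
import xml.etree.ElementTree as ET

NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written approximation of a srcML archive with one Java file (an assumption).
ARCHIVE = """<unit xmlns="http://www.srcML.org/srcML/src"><unit language="Java" filename="src/demo/util/Greeter.java"><package>package <name><name>demo</name><operator>.</operator><name>util</name></name>;</package>
<class>class <name>Greeter</name> <block>{ }</block></class></unit></unit>"""


def text_of(element):
    """Concatenate all text inside an element, or return '' if it is missing."""
    return "".join(element.itertext()) if element is not None else ""


def file_and_class_paths(root):
    """Return one row per class with its filepath and a package-qualified classpath."""
    rows = []
    for file_unit in root.findall("src:unit", NS):
        package = text_of(file_unit.find("src:package/src:name", NS))
        for cls in file_unit.findall(".//src:class", NS):
            name = text_of(cls.find("src:name", NS))
            classpath = package + "." + name if package else name
            rows.append({"filepath": file_unit.get("filename"), "classpath": classpath})
    return rows


rows = file_and_class_paths(ET.fromstring(ARCHIVE))
```

If this layout holds for the real output, the filepath and classpath would come out of a single pass and could be used as join keys later.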

daomcgill · 2024-10-24T20:42:48Z

@carlosparadis I will add the imports. It looks like srcML should be adding support for more languages soon, but we do not know when.

Also, I realized what you meant by parser functions. I did create those as well.

So far, I have been using the functions on one specific src folder in the Maven repo. It contains multiple files, but not the entire project. I can work on that. The tables generated show filepath + the element that specific function is querying, e.g. filepath and classes. I could make a function that calls each of the query functions and compiles that extracted data in a single table.
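
A minimal sketch of that combining step, written in Python purely for illustration. The input rows are made up, and the function name `combine_by_filepath` and the column names are hypothetical. The only assumption carried over from the thread is that every query table has a filepath column.

```python
from collections import defaultdict

# Hypothetical outputs of two query functions: one row per (filepath, element).
class_rows = [
    {"filepath": "src/Greeter.java", "classname": "Greeter"},
    {"filepath": "src/Calc.java", "classname": "Calc"},
]
function_rows = [
    {"filepath": "src/Calc.java", "functionname": "add"},
    {"filepath": "src/Calc.java", "functionname": "sub"},
]


def combine_by_filepath(**tables):
    """Collect the values from each query table into one row per filepath."""
    # Every filepath starts with an empty list for each table name.
    combined = defaultdict(lambda: {name: [] for name in tables})
    for table_name, rows in tables.items():
        for row in rows:
            values = [v for k, v in row.items() if k != "filepath"]
            combined[row["filepath"]][table_name].extend(values)
    return [{"filepath": fp, **cols} for fp, cols in sorted(combined.items())]


combined = combine_by_filepath(classes=class_rows, functions=function_rows)
```

Files that have no rows in a given query table keep an empty list in that column, so every file still appears exactly once in the combined table.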

daomcgill self-assigned this Oct 9, 2024

This was referenced Oct 10, 2024

Making Parsed Source Code Data Available Externally #314

Open

File representation as commit and issue messages #316

Open

carlosparadis mentioned this issue Oct 15, 2024

Extraction of Gang of Four Motifs #318

Open

5 tasks

daomcgill added a commit that referenced this issue Oct 18, 2024

i #313 Added File-Level and Class-Level Doc Functions

1b1c95e

- Added new functions - New configuration file - Updated documentation Signed-off-by: Dao McGill <[email protected]>

daomcgill linked a pull request Oct 18, 2024 that will close this issue

313 Syntax Extraction #320

Open

daomcgill added a commit that referenced this issue Oct 18, 2024

i #313 Update Config File

3b84b8e

- Remove unused settings - Change ../ to ../../ - Update notebook to reflect changes Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Oct 25, 2024

i #313 Add Function for Import Query

fb5ffa4

- Added function for imports - Reformatted new query functions - Added Notebook Example for Joined Queries Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Oct 25, 2024

i #313 Add Rd File

f0f7e7e

daomcgill added a commit that referenced this issue Oct 29, 2024

i #313 Fix for Query Functions

09dbb39

- Fix for issue with namespaces in certain queries - TO DO: Package function currently missing filepath Signed-off-by: Dao McGill

daomcgill added a commit that referenced this issue Oct 30, 2024

i #313 Fix for Package Query

952bd89

- Now displays filenames correctly Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Oct 30, 2024

i #313 Fix Roxygen version

4bb669f

daomcgill added a commit that referenced this issue Oct 31, 2024

i #313 Actions Fix Attempt

22d85f9

daomcgill added a commit that referenced this issue Oct 31, 2024

Revert "i #313 Actions Fix Attempt"

055116f

This reverts commit 22d85f9.

daomcgill added a commit that referenced this issue Oct 31, 2024

i #313 Actions Fix Attempt

2a2be34

daomcgill added a commit that referenced this issue Nov 1, 2024

i #313 Actions Fix Attempt

fc5668c

daomcgill added a commit that referenced this issue Nov 2, 2024

i #313 Notebook Revision

1cc63c8

- TO DO: Cheatsheet for this work thread Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Nov 5, 2024

i #313 Minor Fix: Change srcml_path to use tools.yml

68e2331

Signed-off-by: Dao McGill <[email protected]>

carlosparadis assigned beydlern and daomcgill and unassigned daomcgill and beydlern Nov 11, 2024

carlosparadis added this to the ics496-fall24-m3 milestone Nov 11, 2024

daomcgill added a commit that referenced this issue Dec 9, 2024

i #313 Use getters for config

bcaa6b9

- Added getter for src_folder - Updated notebook to use getters Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Dec 9, 2024

i #313 workflow revert

8ed023c

Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Dec 9, 2024

i #313 Update description

984fec3

Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Dec 9, 2024

i #313 Display tables for notebook

5315903

- remove print statement - gt displays head(10) Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Dec 9, 2024

i #313 Config changes

a20e57e

Signed-off-by: Dao McGill <[email protected]>

daomcgill added a commit that referenced this issue Dec 9, 2024

i #313 Minor fixes for notebook

b73f2c1

- Added back filters using get() Signed-off-by: Dao McGill <[email protected]>
