Expanding the Syntax Extractor #313
Comments
@carlosparadis here is my first pass at the issue for part I.
Yes, this is great! The one thing I would emphasize is the mindset to adopt:
Every column is a different "extract" from all the words in each file. The functions you will add will extract these words from a project's files. This would enable the rest of the pipeline to experiment with which set of words is more meaningful.
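That "each column is an extract" mindset can be sketched as a table with one row per file, where every column holds the words one extractor function pulled from that file. The data, paths, and column names below are hypothetical, purely to illustrate the shape:

```python
# One row per file; each column is a different set of words pulled
# from that file by a dedicated extractor (hypothetical data).
rows = [
    {
        "filepath": "src/main/java/Foo.java",
        "class_comments": ["parses", "configuration"],
        "function_names": ["parseConfig", "reload"],
        "variable_names": ["configPath", "timeoutMs"],
    },
]

# Downstream NLP steps can then experiment with whichever column,
# or combination of columns, is most meaningful.
bag_of_words = rows[0]["class_comments"] + rows[0]["function_names"]
print(bag_of_words)
```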
- Documented current syntax extraction functions
- Overview on syntax extraction and XPath
- Placeholder for syntax.yml config file

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis I added a notebook for syntax extraction. Could you skim it and see if this is what you are looking for? The section on XPath queries is definitely a WIP; it was more to gather my own understanding, and it will be updated as I work on the actual queries needed for the new functions. The new config file is a placeholder for now.
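For readers skimming the thread, the core of such XPath queries is that srcML wraps the source code in XML, with every element in a default namespace that queries must bind explicitly. A minimal sketch using Python's stdlib `ElementTree` (whose path syntax is a subset of full XPath) on a simplified srcML-style fragment:

```python
import xml.etree.ElementTree as ET

# A simplified fragment in the style of srcML's XML output for a
# Java file (real srcML archives carry much richer markup).
srcml = """<unit xmlns="http://www.srcML.org/srcML/src"
  language="Java" filename="Foo.java">
<comment type="block">/** File-level documentation. */</comment>
<class><name>Foo</name><block>{
<comment type="block">/** Class-level doc. */</comment>
}</block></class>
</unit>"""

root = ET.fromstring(srcml)
# srcML places every element in a default namespace, so queries
# must bind a prefix to it explicitly.
ns = {"src": "http://www.srcML.org/srcML/src"}
comments = [c.text for c in root.findall(".//src:comment", ns)]
print(comments)
```

The same namespace-binding step applies whichever XPath engine runs the real queries.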
@daomcgill did you open a PR for it?
@carlosparadis I had not. I just opened one here.
- Added new functions
- New configuration file
- Updated documentation

Signed-off-by: Dao McGill <[email protected]>

- Remove unused settings
- Change ../ to ../../
- Update notebook to reflect changes

Signed-off-by: Dao McGill <[email protected]>

- Added parameter for excluding licenses in class and file-level comment extraction
- Implemented function extraction for function names with optional parameters
- Implemented variable extraction with optional types
- Added examples for removing empty comments and/or comment delimiters

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis I made the changes we discussed. I also added functions for extracting function names (opt. params) and variable names (opt. types). I do need to do a more thorough manual check to make sure the results are accurate, but it appears to be working so far.
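The function-name and variable-name extraction described here can be sketched against srcML's markup, where methods appear as `function` elements and declarations as `decl_stmt`/`decl`. The fragment below is a simplified, hand-written stand-in for real srcML output, and the queries use Python's stdlib `ElementTree` rather than Kaiaulu's actual R functions:

```python
import xml.etree.ElementTree as ET

# Hypothetical srcML-style fragment containing one Java method and
# one variable declaration (simplified from real srcML output).
srcml = """<unit xmlns="http://www.srcML.org/srcML/src" language="Java">
<function><type><name>int</name></type> <name>add</name>
<parameter_list>(<parameter><decl><type><name>int</name></type>
<name>a</name></decl></parameter>)</parameter_list>
<block>{}</block></function>
<decl_stmt><decl><type><name>long</name></type>
<name>total</name></decl></decl_stmt>
</unit>"""

root = ET.fromstring(srcml)
ns = {"src": "http://www.srcML.org/srcML/src"}

# Function names, with parameter names as an optional extra column.
for fn in root.findall(".//src:function", ns):
    name = fn.find("src:name", ns).text
    params = [p.text for p in
              fn.findall(".//src:parameter/src:decl/src:name", ns)]
    print("function:", name, "params:", params)

# Variable names with their (optional) declared types.
for decl in root.findall(".//src:decl_stmt/src:decl", ns):
    var_type = decl.find("src:type/src:name", ns).text
    var_name = decl.find("src:name", ns).text
    print("variable:", var_name, "type:", var_type)
```

Making the parameters and types optional columns, rather than separate tables, keeps each extract self-describing.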
@daomcgill the code that @RavenMarQ did already determines dependency types, so no need to go after that. If you already covered variables, functions, and classes, is there anything else in the source code you think carries semantic meaning? Also, you did this for Java, right? What were the other available languages srcML covered again? And lastly, did you do the parser functions for these too, or just the command to create the XML?
@carlosparadis I do not think operator, specifier or control statements would be particularly meaningful. Do you have any ideas of what might be useful? The only thought I had was imports. I did this for Java. It looks like srcML is also available for C, C++ and C#. I am not sure what you mean by parser functions vs the command to create the XML. I created functions that query the XML file generated by the preexisting function, and extract the class/file level comments, functions and variables.
Hi Dao, I honestly can't think of anything else. I would say go ahead and add the imports. We can just stick with Java for now; Python would have been nice if they covered it. Once this issue is done, I suggest we also skip ahead to play with the Python notebook that processes this data on the other repo, since this will give you a better idea of how this data is being used, and then we can cycle back to the issue that represents files as commits and the comments associated to them. Have you been using your functions to parse a table of one file, or the entire project? Ideally, we will want a table that contains the filepaths the classes, methods, and docstrings belong to, since that helps with subsetting when we move to the Python script. I recall in the Text GoF notebook I had to create a filepath and a classpath to be able to connect both together, but maybe the XPath can give you both.
@carlosparadis I will add the imports. It looks like srcML should be adding support for more languages soon, but we do not know when. Also, I realized what you meant by parser functions; I did create those as well. So far, I have been using the functions on one specific src folder in the Maven repo. It contains multiple files, but not the entire project; I can work on that. The tables generated show the filepath plus the element that the specific function is querying, e.g. filepath and classes. I could make a function that calls each of the query functions and compiles the extracted data into a single table.
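Compiling the per-query tables into one table keyed by filepath, as proposed here, amounts to a join on the shared key. A minimal sketch with hypothetical data (the real Kaiaulu functions return richer tables):

```python
# Each extractor produces rows of (filepath, element); hypothetical data.
comments = [{"filepath": "src/Foo.java", "comment": "/** doc */"}]
functions = [{"filepath": "src/Foo.java", "function": "add"},
             {"filepath": "src/Foo.java", "function": "reload"}]

# Join the individual extract tables on the shared filepath key,
# accumulating every element found for the same file.
joined = {}
for row in comments:
    joined.setdefault(row["filepath"], {}) \
          .setdefault("comments", []).append(row["comment"])
for row in functions:
    joined.setdefault(row["filepath"], {}) \
          .setdefault("functions", []).append(row["function"])

print(joined)
```

Keeping filepath as the join key is what later lets the Python scripts subset by file, class, or method without re-parsing the XML.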
- Added function for imports
- Reformatted new query functions
- Added Notebook Example for Joined Queries

Signed-off-by: Dao McGill <[email protected]>
- Fix for issue with namespaces in certain queries
- TO DO: Package function currently missing filepath

Signed-off-by: Dao McGill
- Now displays filenames correctly

Signed-off-by: Dao McGill <[email protected]>
- TO DO: Cheatsheet for this work thread

Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
- Added getter for src_folder
- Updated notebook to use getters

Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
- remove print statement
- gt displays head(10)

Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
- Added back filters using get()

Signed-off-by: Dao McGill <[email protected]>
Purpose
The Syntax Extractor in Kaiaulu is used to extract meaningful information from source code using srcML. The purpose of this task is to extend the syntax extraction capabilities by adding new functions to extract file-level and class-level documentation using XPath queries. The extracted data will be used in future stages, where the goal is to combine the data with NLP to create semantic representations of the code.
Process
Existing Functions
New Functions
Task List
References