Extract and recover structure of extras #48
Why do the Ruby and Bash grammars use an extra to represent the body of a heredoc?
I thought it would be clear from the example. A heredoc body starts on the line after the line that contains the heredoc opening marker. It does not necessarily start right after the marker, which can be followed by other instructions or fragments of the current instruction as long as we stay on the same line. The Bash grammar doesn't actually do this yet, but it should.
According to https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.heredoc, PHP heredocs are sane, i.e. a newline is required right after the opening marker.
50f910e to 837731c
I updated the PR so it works again. I had to update how we generate dumpers for extras. I'm now looking into regenerating all the grammars in ocaml-tree-sitter-semgrep.
embedded within other extras.
We could add locations to all nodes of the CST but that would require massive (but easy) changes to the legacy code in semgrep.
837731c to 45cb3e5
with the CST as well as allowing tricks such as merging the TypeScript and TSX CST types.
…oprietary#2260) This changes the type of tree-sitter parsing results. This required regenerating all the grammars. Passing CI tests with this pull requests validates the other two pull requests: * semgrep/ocaml-tree-sitter-core#48 * semgrep/ocaml-tree-sitter-semgrep#510 test plan: `make test` synced from Pro bc3fa2e1e2838941e3860d422569ca6a35826b15
Update (2024): we're seeing strange parsing errors in semgrep with the html parser. It seems unrelated to the work on extras but it's blocking us. The semgrep PR is https://github.com/semgrep/semgrep-proprietary/pull/2260
Closes #2
Extras are nodes that can occur anywhere in the CST returned by tree-sitter. They used to be completely ignored. Usually, they are not needed because they're comments. However, the heredoc syntax in Ruby and Bash calls for parsing the body of the template as an extra. The syntax is roughly "the token `<<` indicates that the contents starting on the next line is the body of the heredoc template". The main issue is that some material can follow the marker `<<` on the same line. There can even be multiple `<<` on the same line. In Bash, for example, we can do this:
Simple case:
Extreme case:
This results in the insertion of the heredoc body where it occurs, i.e. not right after the marker `<<` but later, at a spot that can't be specified by a pure tree-sitter grammar.
This PR allows the extraction and structure recovery of extra nodes independently from the grammar. When translating the CST to a more useful tree, the programmer will need to write code that attempts to match the opening heredoc marker with the next heredoc body that makes sense.
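The matching step described above can be pictured as a first-in, first-out pairing: Bash reads heredoc bodies in the order in which their markers appear, so each body belongs to the oldest marker that is still waiting for one. The following is only a rough sketch, written in Python for brevity rather than in the OCaml that this project generates. The `Node` type, the node kinds (`marker`, `body`, `other`) and `pair_heredocs` are hypothetical names chosen for illustration, not part of the ocaml-tree-sitter API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    kind: str  # hypothetical kinds: "marker", "body", or "other"
    text: str


def pair_heredocs(nodes):
    """Pair each heredoc body with the oldest marker still waiting for one.

    `nodes` is assumed to be a flat list of nodes in source order, where
    heredoc bodies were recovered as extras.
    Returns (pairs, unmatched_markers).
    """
    pending = deque()
    pairs = []
    for node in nodes:
        if node.kind == "marker":
            pending.append(node)
        elif node.kind == "body":
            if not pending:
                raise ValueError("heredoc body without an opening marker")
            pairs.append((pending.popleft(), node))
    return pairs, list(pending)


# Illustrative input modelled on `cat <<A; cat <<B` followed by two bodies.
nodes = [
    Node("other", "cat"),
    Node("marker", "<<A"),
    Node("other", ";"),
    Node("other", "cat"),
    Node("marker", "<<B"),
    Node("body", "first\nA\n"),
    Node("body", "second\nB\n"),
]
pairs, unmatched = pair_heredocs(nodes)
```

A queue is used here because the bodies arrive in marker order; a real translation pass would also have to decide what to do with markers that never receive a body, which this sketch simply returns to the caller.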
The following choices were made:
instead of
I think the latter would require either more code generation (generate code to convert the variant list to the record of lists) or would be done less efficiently (multiple passes over the whole tree).
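To make the trade-off concrete, here is a small sketch of the conversion mentioned above, from a single list of tagged extras to one list per kind. It is written in Python purely for illustration (the generated code in this project is OCaml), and it assumes that extras are represented as `(kind, text)` pairs in source order. The kind names and `to_record_of_lists` are hypothetical.

```python
from collections import defaultdict

# Hypothetical extras in source order, each tagged with its kind.
extras = [
    ("comment", "# a"),
    ("heredoc_body", "x\nEOF\n"),
    ("comment", "# b"),
]


def to_record_of_lists(extras):
    """Group tagged extras by kind in one pass, keeping order within a kind."""
    record = defaultdict(list)
    for kind, text in extras:
        record[kind].append(text)
    return dict(record)


grouped = to_record_of_lists(extras)
```

The single list keeps the relative order of extras of different kinds, which the grouped form loses. Anyone who needs the grouped form can derive it with one pass like this, whereas generating it for every grammar would mean emitting this kind of conversion code per grammar.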
Here's what we have for the test `extras`:
Test input:
Test output (original tree-sitter output followed by the recovered typed representation):
test plan: `make && make install && make test` will catch most of the problems that are caused by bad code generation.