wgsl: discover template lists early (Lookahead disambiguation of less-than vs template argument list (v2)) (gpuweb#3803)

* Implement a tree-sitter scanner for template disambiguation

Use a custom scanner to disambiguate between template argument lists and the less-than / greater-than operators.
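
For example, in WGSL the tokens a<b>(c) can be read either as a call of a<b>
with argument (c), or as two comparisons (a<b) > (c). A minimal Python sketch
of the lookahead decision the scanner makes at a candidate '<' (a hypothetical
helper, much simplified; the full discovery algorithm is sketched further
below):

    # Decide whether a '<' begins a template argument list: scan forward
    # for a matching '>' at the same ()/[] nesting depth, giving up at a
    # code point that cannot appear inside a template list.
    def starts_template_list(src: str, i: int) -> bool:
        assert src[i] == '<'
        depth = 0
        for j in range(i + 1, len(src)):
            c = src[j]
            if c in '([':
                depth += 1
            elif c in ')]':
                if depth == 0:
                    return False  # closes a bracket opened before the '<'
                depth -= 1
            elif c == '>' and depth == 0:
                return True       # '<...>' is a template argument list
            elif c in ';{}' or src[j:j+2] in ('&&', '||'):
                return False      # cannot occur inside a template list
        return False

    print(starts_template_list('vec3<f32>', 4))  # True: template list
    print(starts_template_list('a < b;', 2))     # False: comparison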

* Build the treesitter shared library on our own

The py-tree-sitter compilation doesn't work on macOS
because it doesn't know to use -std=c++17 when compiling C++ code.
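
A sketch of the replacement build step (paths and flags are assumptions based
on this commit's Makefile and commit notes, not the exact extract-grammar.py
logic): compile the generated C parser and the C++ scanner separately, then
link them into grammar/build/wgsl.so.

    # Build the Treesitter shared library without py-tree-sitter's
    # build_library helper, so we control the C++ standard flag.
    import subprocess

    def build_parser(out='grammar/build/wgsl.so'):
        subprocess.check_call(['cc', '-c', '-fPIC', '-Igrammar/src',
                               'grammar/src/parser.c', '-o', 'parser.o'])
        subprocess.check_call(['c++', '-c', '-fPIC', '-std=c++17', '-Igrammar/src',
                               'grammar/src/scanner.cc', '-o', 'scanner.o'])
        # GCC wants -shared and -fPIC at link time, after the inputs.
        subprocess.check_call(['c++', 'parser.o', 'scanner.o',
                               '-shared', '-fPIC', '-o', out])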

* Grammar analyzer understands many extra external tokens

* Only do slow actions when data is newer.

* Grammar.py: Generate syntax_sym references for extra external tokens

* Allow syntactic tokens to be named without backticks

* Regenerate recursive grammar

* type_specifier is fully_qualified_ident

This is much simpler than using "expression".

But note that in the future we may want to have type expressions like
unions, as in TypeScript. That door is still open: the grammar was
unambiguous (and amenable to recursive descent) even with type being "expression".

* Add TODOs

* Remove extraneous grammar grouping

* analyze/Grammar.py:  Make _disambiguate_template an empty token

In the template-matching scheme, it doesn't appear in source text.
Make it empty so that first and follow sets are computed correctly.
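
To see why emptiness matters: in FIRST/FOLLOW computation, a terminal that
matches no source text must act like the empty string, so the symbols after
it stay visible. A toy sketch (the grammar representation and names are
illustrative, not analyze/Grammar.py's actual API):

    # FIRST of a sequence, where "empty" terminals contribute nothing
    # of their own and never block the symbols that follow them.
    EMPTY = {'_disambiguate_template'}   # terminals matching no text

    def first_of_seq(seq, first):
        out = set()
        for sym in seq:
            if sym not in EMPTY:
                out |= first.get(sym, {sym}) - {None}  # terminal or known nonterminal
                if None not in first.get(sym, set()):  # not nullable: stop here
                    return out
        out.add(None)   # the whole sequence can derive the empty string
        return out

    # template_list = _disambiguate_template '<' ...
    print(first_of_seq(['_disambiguate_template', '<'], first={}))
    # {'<'}: the disambiguation token is invisible, as intended.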

* Explain the custom tokens

Delete redundant syntactic token definitions

* Add more TODOs in scan, and more comments

* Explain why "var" is special-cased

Ben answered

* disambiguate token does not have to be optional

* Refactor extract-grammar.py

Add argument parsing.
Add a --flow argument to specify which steps to run.
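
A sketch of the resulting interface, matching the invocations that appear in
the Makefile below (the option handling is illustrative, and the letter
meanings for --flow are an assumption):

    # Command line of the refactored extract-grammar.py, e.g.
    #   extract-grammar.py --spec index.bs --scanner scanner.cc \
    #                      --tree-sitter-dir grammar --flow xb
    import argparse

    def parse_args(argv=None):
        p = argparse.ArgumentParser(description='Extract and validate the WGSL grammar')
        p.add_argument('--spec', default='index.bs')
        p.add_argument('--scanner', default='scanner.cc')
        p.add_argument('--tree-sitter-dir', default='grammar')
        # One letter per step, e.g. x=extract, b=build, e=parse examples.
        p.add_argument('--flow', default='xbe')
        return p.parse_args(argv)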

* Add WGSL parsing unit tests
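
The tests drive the compiled parser through py-tree-sitter. A minimal sketch
of their shape (the API shown is py-tree-sitter's Language/Parser interface
of that era; the test cases are illustrative):

    from tree_sitter import Language, Parser

    WGSL = Language('grammar/build/wgsl.so', 'wgsl')
    parser = Parser()
    parser.set_language(WGSL)

    def parses_cleanly(source):
        tree = parser.parse(source.encode('utf-8'))
        return not tree.root_node.has_error

    # '<' must resolve as a template list in one case, comparison in the other.
    assert parses_cleanly('var x : array<i32, 4>;')
    assert parses_cleanly('fn f() { let y = a < b; }')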

* Fix tree-sitter parsing of bitcast

Need to add _disambiguate_template before the template list
so it is parsed as a template list, e.g. the <f32> in bitcast<f32>(x).

* Add more unit tests

* analyze/Grammar.py: disambiguate token treated as nonempty

* scanner.cc: better comments, and dump "valids" array when dumping

* "var" is followed by template list

scanner.cc: Remove the "var" exception

* Better explanation of synthetic token

* Fix comment from merge conflict

* Add explicit dependencies on the fixed wgsl include files

* Support elements that are hidden spans.

* Change _disambiguate_template to a hidden span token

It's a Treesitter implementation detail, so it doesn't belong in the
spec.

* The relational and shift tokens are displayed plainly in the HTML

They are remapped to custom tokens by the extract-grammar process.

Pretty printing in analyze/Grammar.py has to remove leading underscores
so they link correctly to definitions.

* Fix formatting of bitcast rule

* Start writing template disambiguation

* Add TODOs in the custom scanner

* Describe the template list discovery algorithm.
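
In outline (a simplified Python rendering; the spec text and scanner.cc are
authoritative, and also handle '=', splitting of '>=' and '>>' tokens,
comparison operators, and blankspace between the identifier and '<'):

    # Discover template lists in one pass: keep a stack of pending
    # candidates (an identifier followed by '<'), resolve a candidate
    # when a '>' appears at the same ()/[] nesting depth, and drop
    # candidates that can no longer be closed.
    def discover_template_lists(src):
        discovered = []   # (start, end) index pairs of '<' ... '>'
        pending = []      # stack of (index_of_lt, depth_at_lt)
        depth = 0
        i = 0
        while i < len(src):
            c = src[i]
            if c == '<' and i > 0 and (src[i-1].isalnum() or src[i-1] == '_'):
                pending.append((i, depth))
            elif c == '>':
                if pending and pending[-1][1] == depth:
                    start, _ = pending.pop()
                    discovered.append((start, i))
            elif c in '([':
                depth += 1
            elif c in ')]':
                while pending and pending[-1][1] >= depth:
                    pending.pop()   # candidates opened inside these brackets die
                depth -= 1
            elif c in ';{}' or src[i:i+2] in ('&&', '||'):
                pending.clear()     # these cannot occur inside a template list
            i += 1
        return discovered

    print(discover_template_lists('array<vec2<u32>, 4>'))  # [(10, 14), (5, 18)]
    print(discover_template_lists('a < b; c > d'))         # []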

* Add missing disambiguation for fully-qualified-ident

* Custom scanner: correctly mark an ordinary greater-than code point

* Better wording of CurrentPosition

* scanner.cc: Show more details about expected valid tokens

* Add another disambiguation token at the ident use in core_lhs_expression

This makes simple assignments work.

* scanner.cc: Add more logging

* Lots more logging

* Many more unit tests

* Make types and type-generating names keywords again

Insert disambiguating spans and template_args_start and _end where needed.

Fixes: gpuweb#3770

* scanner.cc: Better comment about the sentinel value

* extract-grammar.py: GCC wants -shared and -fPIC link flags

* extract-grammar.py: GCC requires link flags after the inputs, not before

---------

Co-authored-by: Ben Clayton <[email protected]>
dneto0 and ben-clayton authored Feb 13, 2023
1 parent ae8b4a2 commit 6679a89
Showing 8 changed files with 2,037 additions and 410 deletions.
2 changes: 2 additions & 0 deletions .clang-format
@@ -0,0 +1,2 @@
+# http://clang.llvm.org/docs/ClangFormatStyleOptions.html
+BasedOnStyle: Chromium
32 changes: 23 additions & 9 deletions wgsl/Makefile
@@ -1,14 +1,14 @@
 .PHONY: all clean nfkc validate lalr validate-examples
 
 all: index.html nfkc validate test diagrams
-validate: lalr validate-examples
-validate-examples: grammar/grammar.js
+validate: lalr unit_tests validate-examples
 
 clean:
-	rm -f index.html index.pre.html grammar/grammar.js wgsl.lalr.txt
+	rm -f index.html index.pre.html index.bs.pre grammar/grammar.js grammar/build wgsl.lalr.txt
 
 
 # Generate spec HTML from Bikeshed source.
-WGSL_SOURCES:=index.bs $(wildcard wgsl.*.bs.include)
+WGSL_SOURCES:=index.bs scanner.cc wgsl.recursive.bs.include wgsl.reserved.bs.include
 index.pre.html: $(WGSL_SOURCES)
 	DIE_ON=everything bash ../tools/invoke-bikeshed.sh $@ $(WGSL_SOURCES)

@@ -23,14 +23,28 @@ diagrams: $(MERMAID_OUTPUTS)
 img/%.mmd.svg: diagrams/%.mmd ../tools/invoke-mermaid.sh ../tools/mermaid.json
 	sh ../tools/invoke-mermaid.sh -i $< -o $@
 
-# Extract WGSL grammar from the spec, validate it with Treesitter,
-# and use Treesitter to parse many code examples in the spec.
-grammar/grammar.js: index.bs extract-grammar.py
-	python3 ./extract-grammar.py index.bs grammar/grammar.js
+TREESITTER_GRAMMAR_INPUT := grammar/grammar.js
+TREESITTER_PARSER := grammar/build/wgsl.so
+
+# Extract WGSL grammar from the spec, validate it by building a Treesitter parser from it.
+$(TREESITTER_GRAMMAR_INPUT) $(TREESITTER_PARSER): index.bs scanner.cc extract-grammar.py
+	python3 ./extract-grammar.py --spec index.bs --scanner scanner.cc --tree-sitter-dir grammar --flow xb
+
+.PHONY: validate-examples
+# Use Treesitter to parse many code examples in the spec.
+validate-examples: $(TREESITTER_PARSER)
+	python3 ./extract-grammar.py --flow e
+
+.PHONY: unit_tests
+# Use Treesitter to parse code samples
+unit_tests: $(TREESITTER_PARSER) wgsl_unit_tests.py
+	python3 wgsl_unit_tests.py --parser $(TREESITTER_PARSER)
 
 # The grammar in JSON form, emitted by Treesitter.
 WGSL_GRAMMAR=grammar/src/grammar.json
-$(WGSL_GRAMMAR) : grammar/grammar.js
+$(WGSL_GRAMMAR) : $(TREESITTER_GRAMMAR_INPUT)
 
+wgsl_unit_tests:
+
 .PHONY: nfkc
 nfkc:
56 changes: 54 additions & 2 deletions wgsl/analyze/Grammar.py
@@ -44,6 +44,7 @@

 import json
 import functools
+import sys
 from ObjectRegistry import RegisterableObject, ObjectRegistry
 from collections import defaultdict

@@ -323,8 +324,25 @@ def with_meta(phrase,metachar,print_option):
         # Print ourselves
         if print_option.bikeshed:
             context = 'recursive descent syntax'
-            if print_option.grammar.rules[name].is_token():
+            g = print_option.grammar
+            if g.rules[name].is_token():
                 context = 'syntax'
+            if name in g.extra_externals:
+                context = 'syntax_sym'
+                if name == '_disambiguate_template':
+                    # This is an implementation detail, so make it invisible.
+                    return ''
+                else:
+                    without_underscore = ['_less_than',
+                                          '_less_than_equal',
+                                          '_greater_than',
+                                          '_greater_than_equal',
+                                          '_shift_left',
+                                          '_shift_left_assign',
+                                          '_shift_right',
+                                          '_shift_right_assign']
+                    if name in without_underscore:
+                        name = name[1:]
             return "[={}/{}=]".format(context,name)
         return name
     if isinstance(rule,Choice):
@@ -350,7 +368,7 @@ def with_meta(phrase,metachar,print_option):
             # If it's not canonical, then it can have nesting.
             return "(" + inside + nl + ")"
     if isinstance(rule,Seq):
-        return " ".join([i.pretty_str(print_option) for i in rule])
+        return " ".join(filter(lambda i: len(i)>0, [i.pretty_str(print_option) for i in rule]))
     if isinstance(rule,Repeat1):
         return "( " + "".join([i.pretty_str(print_option) for i in rule]) + " )+"
     raise RuntimeError("unexpected node: {}".format(str(rule)))
@@ -859,6 +877,21 @@ def is_accepting(self):
     def at_end(self):
         return self.position == len(self.items())
 
+def json_externals(json):
+    """
+    Returns the set of names of symbols in the "externals" section of the
+    Treesitter JSON grammar.
+    The data looks like this, for section "externals":
+    {
+      "externals": [
+        { "type": "SYMBOL", "name": "_block_comment" },
+        { "type": "SYMBOL", "name": "_error_sentinel" }
+      ]
+    }
+    """
+    return set([ x["name"] for x in json.get("externals",[]) ])
+
+
 def json_hook(grammar,memo,tokens_only,dct):
     """
@@ -1801,6 +1834,22 @@ def __init__(self, json_text, start_symbol, ignore='_reserved'):

         # First decode it without any interpretation.
         pass0 = json.loads(json_text)
+
+        # Get the external tokens; these are not necessarily represented in the rules.
+        external_tokens = json_externals(pass0)
+        #print(external_tokens,file=sys.stderr)
+        defined_rules = set(pass0["rules"].keys())
+        # The set of external tokens that don't have an ordinary definition in the grammar.
+        self.extra_externals = external_tokens - defined_rules
+        for e in self.extra_externals:
+            content = "\\u200B{}".format(e)
+            if e == '_disambiguate_template':
+                # This is a zero-width token used for Treesitter's benefit
+                #content = ''
+                pass
+            # Create a placeholder definition
+            pass0["rules"][e] = {"type":"TOKEN","content":{"type":"PATTERN","value":content}}
+
         # Remove any rules that should be ignored
         # The WGSL grammar has _reserved, which includes 'attribute' but
         # that is also the name of a different grammar rule.
@@ -1922,6 +1971,7 @@ def pretty_str(self,print_option=PrintOption()):

         token_rules = set()
 
+        # Look for defined rules that look better when absorbed into their uses.
         for name, rule in self.rules.items():
             # Star-able is also optional-able, so starrable must come first.
             starred_phrase = rule.as_starred(name)
@@ -1938,6 +1988,8 @@ def pretty_str(self,print_option=PrintOption()):
             if len(phrase)==1 and phrase[0].is_token():
                 token_rules.add(name)
 
+        # A rule that was generated to satisfy canonicalization is better
+        # presented as absorbed into its original parent.
         for name, rule in self.rules.items():
             # We only care about rules generated during canonicalization
             if name.find('.') > 0 or name.find('/') > 0: