wgsl: discover template lists early (Lookahead disambiguation of less-than vs template argument list (v2)) (gpuweb#3803)

* Implement a tree-sitter scanner for template disambiguation

Use a custom scanner to disambiguate between template argument lists and the less-than / greater-than operators.
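
For example, in WGSL the tokens a<b>(c) can be read either as a call of a<b>
with argument (c), or as two comparisons (a<b) > (c). A minimal Python sketch
of the lookahead decision the scanner makes at a candidate '<' (a hypothetical
helper, much simplified; the full discovery algorithm is sketched further
below):

    # Decide whether a '<' begins a template argument list: scan forward
    # for a matching '>' at the same ()/[] nesting depth, giving up at a
    # code point that cannot appear inside a template list.
    def starts_template_list(src: str, i: int) -> bool:
        assert src[i] == '<'
        depth = 0
        for j in range(i + 1, len(src)):
            c = src[j]
            if c in '([':
                depth += 1
            elif c in ')]':
                if depth == 0:
                    return False  # closes a bracket opened before the '<'
                depth -= 1
            elif c == '>' and depth == 0:
                return True       # '<...>' is a template argument list
            elif c in ';{}' or src[j:j+2] in ('&&', '||'):
                return False      # cannot occur inside a template list
        return False

    print(starts_template_list('vec3<f32>', 4))  # True: template list
    print(starts_template_list('a < b;', 2))     # False: comparison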

* Build the treesitter shared library on our own

The py-tree-sitter compilation doesn't work on macOS
because it doesn't know to use -std=c++17 when compiling C++ code.
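
A sketch of the replacement build step (paths and flags are assumptions based
on this commit's Makefile and commit notes, not the exact extract-grammar.py
logic): compile the generated C parser and the C++ scanner separately, then
link them into grammar/build/wgsl.so.

    # Build the Treesitter shared library without py-tree-sitter's
    # build_library helper, so we control the C++ standard flag.
    import subprocess

    def build_parser(out='grammar/build/wgsl.so'):
        subprocess.check_call(['cc', '-c', '-fPIC', '-Igrammar/src',
                               'grammar/src/parser.c', '-o', 'parser.o'])
        subprocess.check_call(['c++', '-c', '-fPIC', '-std=c++17', '-Igrammar/src',
                               'grammar/src/scanner.cc', '-o', 'scanner.o'])
        # GCC wants -shared and -fPIC at link time, after the inputs.
        subprocess.check_call(['c++', 'parser.o', 'scanner.o',
                               '-shared', '-fPIC', '-o', out])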

* Grammar analyzer understands many extra external tokens

* Only do slow actions when data is newer.

* Grammar.py: Generate syntax_sym references for extra external tokens

* Allow syntactic tokens to be named without backticks

* Regenerate recursive grammar

* type_specifier is fully_qualified_ident

This is much simpler than using "expression".

But note that in the future we may want to have type expressions like
unions, as in TypeScript. That door is still open: the grammar was
unambiguous (and amenable to recursive descent) even with type being "expression".

* Add TODOs

* Remove extraneous grammar grouping

* analyze/Grammar.py:  Make _disambiguate_template an empty token

In the template-matching scheme, it doesn't appear in source text.
Make it empty so that first and follow sets are computed correctly.
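
To see why emptiness matters: in FIRST/FOLLOW computation, a terminal that
matches no source text must act like the empty string, so the symbols after
it stay visible. A toy sketch (the grammar representation and names are
illustrative, not analyze/Grammar.py's actual API):

    # FIRST of a sequence, where "empty" terminals contribute nothing
    # of their own and never block the symbols that follow them.
    EMPTY = {'_disambiguate_template'}   # terminals matching no text

    def first_of_seq(seq, first):
        out = set()
        for sym in seq:
            if sym not in EMPTY:
                out |= first.get(sym, {sym}) - {None}  # terminal or known nonterminal
                if None not in first.get(sym, set()):  # not nullable: stop here
                    return out
        out.add(None)   # the whole sequence can derive the empty string
        return out

    # template_list = _disambiguate_template '<' ...
    print(first_of_seq(['_disambiguate_template', '<'], first={}))
    # {'<'}: the disambiguation token is invisible, as intended.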

* Explain the custom tokens

Delete redundant syntactic token definitions

* Add more TODOs in scan, and more comments

* Explain why "var" is special-cased

Ben answered

* disambiguate token does not have to be optional

* Refactor extract-grammar.py

Add argument parsing.
Add a --flow argument to specify which steps to run.
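
A sketch of the resulting interface, matching the invocations that appear in
the Makefile below (the option handling is illustrative, and the letter
meanings for --flow are an assumption):

    # Command line of the refactored extract-grammar.py, e.g.
    #   extract-grammar.py --spec index.bs --scanner scanner.cc \
    #                      --tree-sitter-dir grammar --flow xb
    import argparse

    def parse_args(argv=None):
        p = argparse.ArgumentParser(description='Extract and validate the WGSL grammar')
        p.add_argument('--spec', default='index.bs')
        p.add_argument('--scanner', default='scanner.cc')
        p.add_argument('--tree-sitter-dir', default='grammar')
        # One letter per step, e.g. x=extract, b=build, e=parse examples.
        p.add_argument('--flow', default='xbe')
        return p.parse_args(argv)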

* Add WGSL parsing unit tests
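
The tests drive the compiled parser through py-tree-sitter. A minimal sketch
of their shape (the API shown is py-tree-sitter's Language/Parser interface
of that era; the test cases are illustrative):

    from tree_sitter import Language, Parser

    WGSL = Language('grammar/build/wgsl.so', 'wgsl')
    parser = Parser()
    parser.set_language(WGSL)

    def parses_cleanly(source):
        tree = parser.parse(source.encode('utf-8'))
        return not tree.root_node.has_error

    # '<' must resolve as a template list in one case, comparison in the other.
    assert parses_cleanly('var x : array<i32, 4>;')
    assert parses_cleanly('fn f() { let y = a < b; }')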

* Fix tree-sitter parsing of bitcast

Need to add _disambiguate_template before the template list
so it is parsed as a template list, e.g. the <f32> in bitcast<f32>(x).

* Add more unit tests

* analyze/Grammar.py: disambiguate token treated as nonempty

* scanner.cc: better comments, and dump "valids" array when dumping

* "var" is followed by template list

scanner.cc: Remove the "var" exception

* Better explanation of synthetic token

* Fix comment from merge conflict

* Add explicit dependencies on the fixed wgsl include files

* Support elements that are hidden spans.

* Change _disambiguate_template to a hidden span token

It's a Treesitter implementation detail, so it doesn't belong in the
spec.

* The relational and shift tokens are displayed plainly in the HTML

They are remapped to custom tokens by the extract-grammar process.

Pretty printing in analyze/Grammar.py has to remove leading underscores
so they link correctly to definitions.

* Fix formatting of bitcast rule

* Start writing template disambiguation

* Add TODOs in the custom scanner

* Describe the template list discovery algorithm.
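
In outline (a simplified Python rendering; the spec text and scanner.cc are
authoritative, and also handle '=', splitting of '>=' and '>>' tokens,
comparison operators, and blankspace between the identifier and '<'):

    # Discover template lists in one pass: keep a stack of pending
    # candidates (an identifier followed by '<'), resolve a candidate
    # when a '>' appears at the same ()/[] nesting depth, and drop
    # candidates that can no longer be closed.
    def discover_template_lists(src):
        discovered = []   # (start, end) index pairs of '<' ... '>'
        pending = []      # stack of (index_of_lt, depth_at_lt)
        depth = 0
        i = 0
        while i < len(src):
            c = src[i]
            if c == '<' and i > 0 and (src[i-1].isalnum() or src[i-1] == '_'):
                pending.append((i, depth))
            elif c == '>':
                if pending and pending[-1][1] == depth:
                    start, _ = pending.pop()
                    discovered.append((start, i))
            elif c in '([':
                depth += 1
            elif c in ')]':
                while pending and pending[-1][1] >= depth:
                    pending.pop()   # candidates opened inside these brackets die
                depth -= 1
            elif c in ';{}' or src[i:i+2] in ('&&', '||'):
                pending.clear()     # these cannot occur inside a template list
            i += 1
        return discovered

    print(discover_template_lists('array<vec2<u32>, 4>'))  # [(10, 14), (5, 18)]
    print(discover_template_lists('a < b; c > d'))         # []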

* Add missing disambiguation for fully-qualified-ident

* Custom scanner: correctly mark an ordinary greater-than code point

* Better wording of CurrentPosition

* scanner.cc: Show more details about expected valid tokens

* Add another disambiguation token at the ident use in core_lhs_expression

This makes simple assignments work.

* scanner.cc: Add more logging

* Lots more logging

* Many more unit tests

* Make types and type-generating names keywords again

Insert disambiguating spans and template_args_start and _end where needed.

Fixes: gpuweb#3770

* scanner.cc: Better comment about the sentinel value

* extract-grammar.py: GCC wants -shared and -fPIC link flags

* extract-grammar.py: GCC requires link flags after the inputs, not before

---------

Co-authored-by: Ben Clayton <[email protected]>
dneto0 and ben-clayton authored Feb 13, 2023
1 parent ae8b4a2 commit 6679a89
Showing 8 changed files with 2,037 additions and 410 deletions.
2 changes: 2 additions & 0 deletions .clang-format
@@ -0,0 +1,2 @@
+# http://clang.llvm.org/docs/ClangFormatStyleOptions.html
+BasedOnStyle: Chromium
32 changes: 23 additions & 9 deletions wgsl/Makefile
@@ -1,14 +1,14 @@
 .PHONY: all clean nfkc validate lalr validate-examples
 
 all: index.html nfkc validate test diagrams
-validate: lalr validate-examples
-validate-examples: grammar/grammar.js
+validate: lalr unit_tests validate-examples
 
 clean:
-	rm -f index.html index.pre.html grammar/grammar.js wgsl.lalr.txt
+	rm -f index.html index.pre.html index.bs.pre grammar/grammar.js grammar/build wgsl.lalr.txt
 
 
 # Generate spec HTML from Bikeshed source.
-WGSL_SOURCES:=index.bs $(wildcard wgsl.*.bs.include)
+WGSL_SOURCES:=index.bs scanner.cc wgsl.recursive.bs.include wgsl.reserved.bs.include
 index.pre.html: $(WGSL_SOURCES)
 	DIE_ON=everything bash ../tools/invoke-bikeshed.sh $@ $(WGSL_SOURCES)

@@ -23,14 +23,28 @@ diagrams: $(MERMAID_OUTPUTS)
 img/%.mmd.svg: diagrams/%.mmd ../tools/invoke-mermaid.sh ../tools/mermaid.json
 	sh ../tools/invoke-mermaid.sh -i $< -o $@
 
-# Extract WGSL grammar from the spec, validate it with Treesitter,
-# and use Treesitter to parse many code examples in the spec.
-grammar/grammar.js: index.bs extract-grammar.py
-	python3 ./extract-grammar.py index.bs grammar/grammar.js
+TREESITTER_GRAMMAR_INPUT := grammar/grammar.js
+TREESITTER_PARSER := grammar/build/wgsl.so
+
+# Extract WGSL grammar from the spec, validate it by building a Treesitter parser from it.
+$(TREESITTER_GRAMMAR_INPUT) $(TREESITTER_PARSER): index.bs scanner.cc extract-grammar.py
+	python3 ./extract-grammar.py --spec index.bs --scanner scanner.cc --tree-sitter-dir grammar --flow xb
+
+.PHONY: validate-examples
+# Use Treesitter to parse many code examples in the spec.
+validate-examples: $(TREESITTER_PARSER)
+	python3 ./extract-grammar.py --flow e
+
+.PHONY: unit_tests
+# Use Treesitter to parse code samples
+unit_tests: $(TREESITTER_PARSER) wgsl_unit_tests.py
+	python3 wgsl_unit_tests.py --parser $(TREESITTER_PARSER)
 
 # The grammar in JSON form, emitted by Treesitter.
 WGSL_GRAMMAR=grammar/src/grammar.json
-$(WGSL_GRAMMAR) : grammar/grammar.js
+$(WGSL_GRAMMAR) : $(TREESITTER_GRAMMAR_INPUT)
 
+wgsl_unit_tests:
+
 .PHONY: nfkc
 nfkc:
56 changes: 54 additions & 2 deletions wgsl/analyze/Grammar.py
@@ -44,6 +44,7 @@

 import json
 import functools
+import sys
 from ObjectRegistry import RegisterableObject, ObjectRegistry
 from collections import defaultdict

@@ -323,8 +324,25 @@ def with_meta(phrase,metachar,print_option):
         # Print ourselves
         if print_option.bikeshed:
             context = 'recursive descent syntax'
-            if print_option.grammar.rules[name].is_token():
+            g = print_option.grammar
+            if g.rules[name].is_token():
                 context = 'syntax'
+            if name in g.extra_externals:
+                context = 'syntax_sym'
+                if name == '_disambiguate_template':
+                    # This is an implementation detail, so make it invisible.
+                    return ''
+                else:
+                    without_underscore = ['_less_than',
+                                          '_less_than_equal',
+                                          '_greater_than',
+                                          '_greater_than_equal',
+                                          '_shift_left',
+                                          '_shift_left_assign',
+                                          '_shift_right',
+                                          '_shift_right_assign']
+                    if name in without_underscore:
+                        name = name[1:]
             return "[={}/{}=]".format(context,name)
         return name
     if isinstance(rule,Choice):
@@ -350,7 +368,7 @@ def with_meta(phrase,metachar,print_option):
             # If it's not canonical, then it can have nesting.
             return "(" + inside + nl + ")"
     if isinstance(rule,Seq):
-        return " ".join([i.pretty_str(print_option) for i in rule])
+        return " ".join(filter(lambda i: len(i)>0, [i.pretty_str(print_option) for i in rule]))
     if isinstance(rule,Repeat1):
         return "( " + "".join([i.pretty_str(print_option) for i in rule]) + " )+"
     raise RuntimeError("unexpected node: {}".format(str(rule)))
@@ -859,6 +877,21 @@ def is_accepting(self):
     def at_end(self):
         return self.position == len(self.items())
 
+def json_externals(json):
+    """
+    Returns the set of names of symbols in the "externals" section of the
+    Treesitter JSON grammar.
+    The data looks like this, for section "externals":
+    {
+      "externals": [
+        { "type": "SYMBOL", "name": "_block_comment" },
+        { "type": "SYMBOL", "name": "_error_sentinel" }
+      ]
+    }
+    """
+    return set([ x["name"] for x in json.get("externals",[]) ])
+
+
 def json_hook(grammar,memo,tokens_only,dct):
     """
@@ -1801,6 +1834,22 @@ def __init__(self, json_text, start_symbol, ignore='_reserved'):

         # First decode it without any interpretation.
         pass0 = json.loads(json_text)
+
+        # Get the external tokens; these are not necessarily represented in the rules.
+        external_tokens = json_externals(pass0)
+        #print(external_tokens,file=sys.stderr)
+        defined_rules = set(pass0["rules"].keys())
+        # The set of external tokens that don't have an ordinary definition in the grammar.
+        self.extra_externals = external_tokens - defined_rules
+        for e in self.extra_externals:
+            content = "\\u200B{}".format(e)
+            if e == '_disambiguate_template':
+                # This is a zero-width token used for Treesitter's benefit
+                #content = ''
+                pass
+            # Create a placeholder definition
+            pass0["rules"][e] = {"type":"TOKEN","content":{"type":"PATTERN","value":content}}
+
         # Remove any rules that should be ignored
         # The WGSL grammar has _reserved, which includes 'attribute' but
         # that is also the name of a different grammar rule.
@@ -1922,6 +1971,7 @@ def pretty_str(self,print_option=PrintOption()):

         token_rules = set()
 
+        # Look for defined rules that look better when absorbed into their uses.
         for name, rule in self.rules.items():
             # Star-able is also optional-able, so starrable must come first.
             starred_phrase = rule.as_starred(name)
@@ -1938,6 +1988,8 @@ def pretty_str(self,print_option=PrintOption()):
             if len(phrase)==1 and phrase[0].is_token():
                 token_rules.add(name)
 
+        # A rule that was generated to satisfy canonicalization is better
+        # presented as absorbed into its original parent.
         for name, rule in self.rules.items():
             # We only care about rules generated during canonicalization
             if name.find('.') > 0 or name.find('/') > 0: