[css-values] Automatic parsing of value definitions #2921

Open
tidoust opened this issue Jul 13, 2018 · 18 comments

@tidoust
Member

tidoust commented Jul 13, 2018

Context for this issue: @dontcallmedom and I spent some time integrating CSS specs into the list of specs crawled by Reffy. Our goal was to extract and parse value definitions for CSS properties and descriptors from all CSS specs, as a first step that could then perhaps be used to detect potential anomalies, automate the creation of parsing tests, or create tools that list CSS properties. Apart from detecting a few anomalies, which led to the issues I reported yesterday and today on individual specs, we haven't had time yet to look into using the result of this parsing.

This exercise was also an occasion for us to take a deeper look at how CSS specs are written. It is quite possible that we misunderstood a few things, as we're much more familiar with API specs in practice. Also, unlike API specs, where the automatic extraction of IDL content makes it possible to create tests and actual stubs for implementation, the automatic extraction and parsing of value definitions of CSS properties may not be seen as an interesting goal or a priority, because it does not trigger major interop issues in practice.

That being said, taking for granted that the goal of the Value Definition Syntax is to ease the automatic parsing of values, we noted a few potential issues:

  1. Keyword values do not allow for some of the values actually used in specs. The syntax defines keyword values as identifiers which conform to the <ident-token> grammar. Unless we read that definition incorrectly, this does not allow keywords to start with a digit or with an @. Some values need that, such as glyph-orientation-vertical, font-weight (although the value was replaced by <number> in Level 4), or <feature-type> (e.g. @stylistic).

  2. The syntax does not describe the use of = to define expansion rules of non-terminals. Most specs use <non-terminal> = <actual-dfn> equations, but the parsing of that equation is not defined anywhere as far as we can tell. In practice, the <> are sometimes omitted on the left side of the equation, as in the inset() definition and content-list definition. In other cases, the = is not used at all, as in the fade() definition. Some definitions also use a final semicolon, as in CSS Display, CSS Box Alignment and CSS Counter Styles.

  3. Some specs extend the definition of a property with a "New value" field, which we understand must be combined with the actual definition with a | combinator. Unless we missed something, this is not described anywhere though.

  4. The syntax talks about quotes, but does not explicitly define which quotes to use. Curly quotes get used in practice in most specs. We were rather reading "single quotes" as meaning the apostrophe ' character, still used in other specs (such as in CSS 2.2). Anything is fine, but it would be good to make that explicit in the Value Syntax Definition.

  5. Some specs mix actual value definitions with prose. I reported the fill property in [fill-stroke] Missing quotes around property ref, and not a real "value" fxtf-drafts#300 for instance. Another example is the use of dagger characters to reference footnotes in the <an+b> definition. These values are not ambiguous for human beings, but they make automated parsing more challenging.

  6. From time to time, some rules that could be defined with an = construct are actually defined in prose. For instance, <basic-shape> could perhaps be defined as <inset()> | <circle()> | <ellipse()> | <polygon()>, or <border-style> as none | hidden | dotted | dashed | solid | double | groove | ridge | inset | outset. The generic question is whether that is something that you'd like to encourage.

  7. The syntax does not allow defining apparently "simple" things such as ranges or regular expressions. We noted the discussion in [css-values] Define value syntax that limits <integer>, <number>, <length>, etc. to ranges #355, so that's probably under way.
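
To illustrate point 2, here is a minimal Python sketch of a tolerant extractor for `<non-terminal> = <actual-dfn>` equations. It assumes one equation per line, accepts a left side without the <> and an optional final semicolon, and gives up (returns None) when there is no = at all. The `parse_rule` name and the regular expression are hypothetical; this is not how Reffy does it.

```python
import re

# Hypothetical pattern: optional < >, a type or function name, "=",
# the definition, then an optional trailing semicolon.
RULE = re.compile(r"^\s*<?([A-Za-z][\w()-]*)>?\s*=\s*(.+?)\s*;?\s*$")

def parse_rule(line):
    """Return (name, definition), or None if the line is not an equation."""
    m = RULE.match(line)
    if m is None:
        return None
    name, definition = m.groups()
    # Normalize the name so that "foo()" and "<foo()>" end up identical.
    return f"<{name}>", definition

print(parse_rule("<display-box> = contents | none"))
print(parse_rule("foo() = foo( <length> );"))  # foo() is a made-up function
```

A definition written without = still needs a human (or a per-spec patch) to say which name it defines, which is the part that cannot be guessed from the text alone.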

All in all, what we're wondering is whether it could be useful to end up with a CSS Value Definition Syntax that would allow the creation of a dump similar to the IDL index that appears at the end of API specs. For instance, for CSS Display Module Level 3, this dump could be:

display = [ <display-outside> || <display-inside> ] | <display-listitem> | <display-internal> | <display-box> | <display-legacy>

<display-outside> = block | inline | run-in
<display-inside> = flow | flow-root | table | flex | grid | ruby
<display-listitem> = <display-outside>? && [ flow | flow-root ]? && list-item
<display-internal> = table-row-group | table-header-group | table-footer-group | table-row | table-cell | table-column-group | table-column | table-caption | ruby-base | ruby-text | ruby-base-container | ruby-text-container
<display-box> = contents | none
<display-legacy> = inline-block | inline-table | inline-flex | inline-grid

CSS Flexbox would then complete the definition of display with:

display |= flex | inline-flex
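
A minimal Python sketch of how a tool could consume such a dump, assuming one rule per line and assuming the `|=` notation proposed above for "New value" extensions (it is not part of the Value Definition Syntax). Since | has the lowest precedence of all combinators, appending alternatives needs no extra brackets. The `build_index` name is hypothetical and the first input line is abbreviated for illustration.

```python
def build_index(lines):
    """Map each property or type name to its (possibly extended) definition."""
    index = {}
    for line in lines:
        if "|=" in line:
            # Extension: append the new alternatives with the | combinator.
            name, extra = (part.strip() for part in line.split("|=", 1))
            index[name] = f"{index[name]} | {extra}" if name in index else extra
        elif "=" in line:
            name, definition = (part.strip() for part in line.split("=", 1))
            index[name] = definition
    return index

index = build_index([
    "display = <display-box> | <display-legacy>",  # abbreviated
    "display |= flex | inline-flex",
])
print(index["display"])
```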
@tabatkins
Member

Some values need that such as glyph-orientation-vertical

Wow, this is just written very badly. That's totally invalid syntax, and should instead be auto | <angle> | <number>, with prose manually limiting the values to just 0/90/0deg/90deg. (I have no idea why they're trying to outlaw using the other angle units, as that should actually be harder to do than just accepting them. Without some specified rounding behavior there's no way to specify 90deg precisely in rad, but grad and turn are fine.) The whole property's legacy syntax is kinda a garbage fire, tho.

font-weight

This was fixed, as you noted.

<feature-type>

This is valid; it's using the extended Value Definition Syntax for rules defined in https://drafts.csswg.org/css-syntax/#rule-defs.

The syntax does not describe the use of = to define expansion rules of non-terminals.

Yes, this is hand-wavey. I'd be happy to work out a more precise syntax for it. Updating every single spec is more work, of course. ^_^

Some specs extend the definition of a property with a "New value" field, which we understand must be combined with the actual definition with a | combinator. Unless we missed something, this is not described anywhere though.

Correct on both counts.

The syntax talks about quotes, but does not explicitly define which quotes to use. Curly quotes get used in practice in most specs. We were rather reading "single quotes" as meaning the apostrophe ' character, still used in other specs (such as in CSS 2.2). Anything is fine, but it would be good to make that explicit in the Value Syntax Definition.

They'll be single quotes in the source; curly quotes are probably coming from Bikeshed's formatting of the output.

Some specs mix actual value definitions with prose.

Hm. 'fill' is just us sketching; that line wouldn't make it into a professional spec. I suppose I can move the footnote markers out of the grammar and just refer to the productions more explicitly in prose.

From time to time, some rules that could be defined with an = construct are actually defined in prose.

I think we should encourage using =, yeah. While technically not necessary for linking purposes (Bikeshed takes care of things already), it makes things a bit clearer, I think.

The syntax does not allow to define apparently "simple" things such as ranges or regular expressions. We noted the discussion in #355, so that's probably under way.

Ranges, yeah. Regexes unlikely to show up - what's the use-case for them?

All in all, what we're wondering is whether it could be useful to end up with a CSS Value Definition Syntax that would allow the creation of a dump similar to the IDL index that appear at the end of API specs.

Hmm, maybe. It seems less directly useful than the IDL index, but I'm not opposed to such a thing.

@tidoust
Member Author

tidoust commented Jul 17, 2018

[<feature-type>] is valid; it's using the extended Value Definition Syntax for rules defined in https://drafts.csswg.org/css-syntax/#rule-defs.

Ah, got it! We did not pay much attention to these rules because we restricted ourselves to properties and descriptors for now. From an automated parsing perspective, it's not entirely clear how to distinguish between the two when parsing a <pre> tag. The initial rule starts with an @, that's easy, but the remaining rules look like value definitions.

I guess another thing that I'm wondering about is the intended scope of these definitions. For instance, that spec contains two references to <family-name>: one in the definition of font-family, and one in the definition of @font-feature-values. The syntax of the first one is defined in prose. The syntax of the second one is defined in the at-rule definition. Automated tools are essentially dumb and will happily think that the second <family-name> definition applies to both cases. Given that the second definition links back to the first, I suspect that Bikeshed thinks equally. Using the same name in both situations seems totally fine here, but then it's a bit strange to apply different parsing rules depending on how you reach that definition. Anyway, that example is totally moot in practice, because the outcome of both parsing rules is exactly the same, I'm just trying to keep the grammar clear and simple so that tools can remain as dumb as possible.

I think we should encourage using =, yeah. While technically not necessary for linking purposes (Bikeshed takes care of things already), it makes things a bit clearer, I think.

OK. FWIW, we created a list of "missing" rules with possible values for definitions that we could not extract automatically. Time permitting, we'll report these to individual specs or prepare appropriate PR.

Ranges, yeah. Regexes unlikely to show up - what's the use-case for them?

I guess the use cases that we had in mind for regexes were:

  1. the definitions of tokens such as <ident-token> to complete the railroad diagrams.
  2. the definitions of a few constructs that currently use prose to constrain values such as <custom-property-name>, <hex-color> or <signed-integer>.

That may not be compelling enough use cases to warrant the introduction of regexes.

@tabatkins
Member

I guess another thing that I'm wondering about is the intended scope of these definitions. For instance, that spec contains two references to <family-name>:

That's a spec bug, yes. The one for the font-family property isn't marked up quite right, so it's not registering as a "type" definition (just a "value" definition); if it was, it would have already shown up as a fatal error due to duplicate definitions.

OK. FWIW, we created a list of "missing" rules

Nice. You can just drop that into a single issue and tag all the specs that are mentioned.

I guess the use cases that we had in mind for regexes were:

Yeah, I value machine-readability here much less than I value an understandable and readable grammar definition. ^_^ The prose definitions are much more acceptable, imo, as a definition for these things, and the number of times we'd actually want to do some sophisticated matching on token representations is so small in the first place.

@AmeliaBR
Contributor

AmeliaBR commented May 6, 2019

I'm starting to implement the changes resolved in w3c/csswg-drafts#355, to use the bracketed range notation to indicate values with constraints on them. That's relevant to this discussion in two ways:

  • Please make sure your tools are updated to be able to parse the constrained values in syntax productions! Examples: <integer [1, 10]>, <length-percentage [0, ∞]>

  • Some values which are currently defined in non-standard ways may be able to be defined with the new syntax. E.g. glyph-orientation-vertical could be
    auto | <angle [0deg,0deg]> | <angle [90deg,90deg]> | <number [0,0]> | <number [90,90]>
    But the question is: is the standardization of syntax worth the obfuscation this syntax creates?
    Does the fact that this is an ugly legacy property make a difference one way or the other?
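
For tool authors, a minimal Python sketch of reading the bracketed range notation from a single production such as `<integer [1, 10]>`. It assumes bounds are plain numbers with an optional unit, or ∞; the function names and the returned dictionary shape are hypothetical and do not reflect Reffy's output schema.

```python
import math
import re

# <name [low, high]> with optional whitespace around the bounds.
RANGED_TYPE = re.compile(r"^<([a-z-]+)\s*\[\s*([^,\]]+?)\s*,\s*([^,\]]+?)\s*\]>$")
BOUND = re.compile(r"^([+-]?\d*\.?\d+)([a-z%]*)$")

def parse_bound(text):
    """Return a (number, unit) pair; infinity has an empty unit."""
    if text in ("∞", "+∞"):
        return (math.inf, "")
    if text == "-∞":
        return (-math.inf, "")
    m = BOUND.match(text)
    if m is None:
        raise ValueError(f"unrecognized bound: {text!r}")
    return (float(m.group(1)), m.group(2))

def parse_ranged_type(production):
    """Return name and bounds, or None when the production has no range."""
    m = RANGED_TYPE.match(production.strip())
    if m is None:
        return None
    name, low, high = m.groups()
    return {"name": name, "min": parse_bound(low), "max": parse_bound(high)}

print(parse_ranged_type("<integer [1, 10]>"))
print(parse_ranged_type("<length-percentage [0, ∞]>"))
```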

@tabatkins
Member

But the question is: is the standardization of syntax worth the obfuscation this syntax creates?
Does the fact that this is an ugly legacy property make a difference one way or the other?

I don't think glyph-orientation-vertical should be switched over. It's a bizarro legacy thing, leave it as such.

dontcallmedom added a commit to w3c/reffy that referenced this issue May 8, 2019
tidoust pushed a commit to w3c/reffy that referenced this issue May 9, 2019
* Addition to the CSS Grammar in https://drafts.csswg.org/css-values-3/#numeric-ranges
See also w3c/csswg-drafts#2921 (comment)
* Add range to CSS Grammar Parsing output schema
* Test support for range restriction grammar parsing
@gsnedders gsnedders added the css-values-4 Current Work label Jul 9, 2020
@cdoublev
Collaborator

cdoublev commented Aug 18, 2021

If you allow me to report some issues I had while implementing a library for automatic parsing of CSS values...

Context: this library aims at replacing jsdom/cssstyle (an implementation of CSSStyleDeclaration), and I only discovered csstree yesterday, after almost all the work for this library was done. I didn't look at csstree in depth, but its author might have come across the same issues.

I will only report the issues that are related to automatic parsing of definitions (of types and properties, ie. the title of this issue), which is defined in a comment above as the goal of the CSS value syntax, defined in the specification of CSS values, because I think the issues related to parsing a CSS value by using a parse function generated from the corresponding property definition, deserve their own issues.

I was not sure where to submit the following issues. Some are more directly related to webref/css and others to the CSS specifications of the W3C.

Types written in prose and not reported by w3c/webref (ie. reffy)

I have not yet been able to run all existing tests for many CSS properties and CSS types, but examples include <dimension> and <rounding-strategy>, which are not listed in the document linked in a comment above (the list of "missing" rules, which I take to mean a list of missing definitions).

I report this issue with the idea in mind that it can be fixed either by preventing a property/type definition written in prose, which I think is unlikely to happen, or that it can be fixed in w3c/webref.

Consistency in the initial values of shorthand properties

The "initial value" fields of the shorthand properties have values written in prose that vary across the different specifications: most often it is see individual properties, but also N/A (see individual properties), not defined for shorthand properties, etc. The issue can be easily fixed but it's a pebble in the road towards automatic parsing using @webref/css.
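
Until the wording is made consistent, a consumer can normalize these fields itself. A minimal Python sketch, assuming the three phrasings quoted above are the ones to catch; the function name and the marker list are hypothetical and not part of @webref/css.

```python
# Phrasings quoted above; illustrative, not an exhaustive list.
SHORTHAND_MARKERS = (
    "see individual properties",
    "not defined for shorthand properties",
)

def is_shorthand_placeholder(initial):
    """Return True when an "initial value" field holds prose, not a value."""
    text = initial.strip().lower()
    return any(marker in text for marker in SHORTHAND_MARKERS)

print(is_shorthand_placeholder("N/A (see individual properties)"))
print(is_shorthand_placeholder("0%"))
```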

To be honest, this is more about an opportunity for me to ask the following two questions:

  • are shorthand properties the only properties that have no initial value?
  • does the value not applicable (initial value comes from physical property) for the properties background-position-block and background-position-inline correspond to auto?

Definition of terminal types with token types

w3c/webref reports the types <EOF-token>, <length>, and <percentage> with definitions written in prose, but these do not allow expanding those types to explicit values, because obviously they are tokens or terminal CSS types. The component values are often identical to the tokens that result from tokenization, which makes me think that the terminal CSS types could be defined with the corresponding type of tokens: <dimension-token>, <percentage-token>, etc...

This may be a dumb idea. I'm not sure whether another reason exists for this, but it makes sense to me.

Order of "xor" combinations

I remember that I read a property or type definition before that defines a "greedy" parsing behavior. I can't remember which one, but perhaps the following issue is not an issue at all. If so please forgive me.

To give an example, the value left center, when parsed for a property defined with <position>, will match only the component value left if the parse function is not greedy and if the definition of <position> lists the combination that allows a single value before the other combinations that allow two or more values.

Either this greedy or non-greedy behavior is defined explicitly somewhere, or the "xor" combinations must be ordered from the combination that allows the highest number of component values to the combination that allows the lowest. Implementing the former solution looks more difficult to me, but the second solution might have several drawbacks.

EDIT: I forgot to thank tidoust and his collaborator(s) for the work they have done on w3c/reffy and w3c/webref, because manually picking all types and properties would have been a pain in the a**!

@tabatkins
Member

but for example <dimension> and <rounding-strategy>,

<dimension> is necessarily a prose definition; there's too many things that could potentially resolve to a <dimension> to be reasonably listed in an explicit grammar.

<rounding-strategy> is kinda in prose, but it's also got definition metadata identifying its components - any <dfn data-dfn-type=value data-dfn-for="<some-nonterminal>">one</dfn> is an arm of an implicit <some-nonterminal> = one | two | three grammar. Adding an explicit grammar-production block would not improve the readability of that section, so hopefully reffy can infer grammar from dfns when necessary instead.
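
A minimal Python sketch of that inference, using only the standard library's html.parser. It assumes every arm is marked up exactly as described above (data-dfn-type=value plus a data-dfn-for naming the non-terminal) and joins the arms with | in document order. The class and function names are hypothetical; this is not Reffy's implementation.

```python
from html.parser import HTMLParser

class DfnCollector(HTMLParser):
    """Collect the text of <dfn data-dfn-type=value data-dfn-for=...>."""

    def __init__(self):
        super().__init__()
        self.values = {}      # non-terminal -> list of value names
        self._current = None  # data-dfn-for of the <dfn> being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "dfn" and attrs.get("data-dfn-type") == "value"
                and attrs.get("data-dfn-for")):
            self._current = attrs["data-dfn-for"]
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "dfn" and self._current is not None:
            name = "".join(self._text).strip()
            self.values.setdefault(self._current, []).append(name)
            self._current = None

def infer_grammar(html):
    """Return {non-terminal: "one | two | ..."} inferred from the dfns."""
    collector = DfnCollector()
    collector.feed(html)
    collector.close()
    return {name: " | ".join(arms) for name, arms in collector.values.items()}

sample = (
    '<dfn data-dfn-type=value data-dfn-for="&lt;some-nonterminal&gt;">one</dfn>, '
    '<dfn data-dfn-type=value data-dfn-for="&lt;some-nonterminal&gt;">two</dfn>'
)
print(infer_grammar(sample))
```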

are [shorthand] properties the only properties that have no initial value?

Yes. (Tho possibly logicals with physical equivalents also have this.)

does the value not applicable (initial value comes from physical property) for the properties background-position-block and background-position-inline correspond to auto?

Unclear. The logicals probably are the same as shorthands, and don't have an initial value at all; this isn't consistent between css-logical and css-backgrounds currently.

The component values are often identical to the tokens that result from tokenization, which makes me think that the terminal CSS types could be defined with the corresponding type of tokens: <dimension-token>, <percentage-token>, etc...

No, they can't be - there are often many ways to produce a particular terminal value that are not just the literal tokens. For example, calc(1%) is a <percentage>, but is definitely not a <percentage-token>. The token productions should almost never be used in ordinary specs; they only deserve mention when you're discussing very low-level details of CSS, such as the definition of "dimension".

I remember that I read a property or type definition before that defines a "greedy" parsing behavior. I can't remember which one, but perhaps the following issue is not an issue at all. If so please forgive me.

You may have heard that CSS tokenization is greedy, which is true (for example, 1px is guaranteed to parse as a dimension, not as a number followed by an ident). But parsing is definitely non-greedy.

(Technically "greedy" doesn't apply to tokenization, because it's specified with an algorithm rather than as a grammar. But the algo is designed to implement "longest-match" greedy semantics for a theoretical equivalent grammar, because that's the semantics that CSS2 had when it did specify tokenization with a grammar.)
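
A toy Python illustration of the longest-match behavior for 1px. It is a heavily simplified stand-in, not the CSS Syntax tokenization algorithm: it handles only a few token types, and because Python's regex alternation is ordered rather than longest-match, the dimension pattern is listed first to emulate the effect.

```python
import re

# Simplified token patterns; the order emulates longest-match.
TOKEN = re.compile(r"""
    (?P<dimension>[+-]?\d*\.?\d+[A-Za-z][\w-]*)
  | (?P<percentage>[+-]?\d*\.?\d+%)
  | (?P<number>[+-]?\d*\.?\d+)
  | (?P<ident>-?[A-Za-z_][\w-]*)
  | (?P<whitespace>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Split text into (type, value) pairs, failing on unknown input."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("1px"))   # a single dimension
print(tokenize("1 px"))  # a number, whitespace, then an ident
```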

@tidoust
Member Author

tidoust commented Aug 19, 2021

Types written in prose and not reported by w3c/webref (ie. reffy)

I have not yet been able to run all existing tests for many CSS properties and CSS types, but examples include <dimension> and <rounding-strategy>, which are not listed in the document linked in a comment above (the list of "missing" rules, which I take to mean a list of missing definitions).

I report this issue with the idea in mind that it can be fixed either by preventing a property/type definition written in prose, which I think is unlikely to happen, or that it can be fixed in w3c/webref.

The missing-css-rules.json file is indeed outdated. It has not been maintained since we performed the initial analysis. We're slowly looking at improving support and checks for CSS properties, descriptors and value space definitions in Reffy and webref, but that remains very much in flux for now... and messy. Even the tracking is a bit messy, here are some updates that we have in mind:

@cdoublev
Collaborator

<dimension> is necessarily a prose definition; there's too many things that could potentially resolve to a <dimension> to be reasonably listed in an explicit grammar.

Currently I define <dimension> = <angle> | <frequency> | <length> | <time> | <resolution> in order to be able to check if nodes contains any dimensions in the procedure to sort a calculation’s children nodes, and to check If the root of the calculation tree fn represents is a numeric value (number, percentage, or dimension) in the procedure to serialize a math function.

Does the value not applicable (initial value comes from physical property) for the properties background-position-block and background-position-inline correspond to auto?

Unclear. The logicals probably are the same as shorthands, and don't have an initial value at all; this isn't consistent between css-logical and css-backgrounds currently.

Sorry. I was (confusedly) thinking that auto could mean whatever the initial value of the corresponding physical property is. Anyway, the initial value of both background-position-x and background-position-y is 0%, but I guess that at some point, I may have to resolve the writing mode of a DOM element first, to resolve the initial value, instead of resolving a static initial value defined in the property definition.

the terminal CSS types could be defined with the corresponding type of tokens: <dimension-token>, <percentage-token>, etc...

No, they can't be - there are often many ways to produce a particular terminal value that are not just the literal tokens. For example, calc(1%) is a <percentage>, but is definitely not a <percentage-token>. The token productions should almost never be used in ordinary specs; they only deserve mention when you're discussing very low-level details of CSS, such as the definition of "dimension".

Ok. I guess that the many ways to produce a particular terminal value arise when consuming a function or a simple block. I didn't really understand that a math function can be one of many possible types, such as <length>, <number>, etc., depending on the calculations it contains, because this quote is followed by the procedure to determine the type of a calculation. I only assign a numeric type on calculations. Obviously, calc(1%) should not be serialized as a <percentage>.

Either this greedy or non-greedy behavior is defined explicitly somewhere, or the order of the "xor" combinations must be to start from the combination that allows the highest number of component values first, towards the combination that allows the lowest.

CSS tokenization is greedy [...] but parsing is definitely non-greedy.

The most up to date definition of background-position is <position>#, which is expanded to
[ [left | ...] | [left | ...] [... | <length-percentage>] ] (truncated for brevity). Because parsing is non-greedy, left 10% will match the first combination, [left | ...], and <length-percentage> will be left unparsed. To work around this, I rewrote the definition by changing the order of the combinations to [ [left | ...] [... | <length-percentage>] | [left | ...] ]. I can figure out another workaround when parsing background-position, but it will be harder or even impossible when parsing background, conic-gradient(), radial-gradient(), or mask-layer, which contain a <position> between other types.

Thank you tidoust for giving me these links. I will certainly read everything because I'm sure to learn other things that will help me to save time.

@tabatkins
Member

Re: parsing greediness, I'm not sure I understand your response.

As I said, parsing is non-greedy; if the first branch that starts to match eventually fails, you just move on to the second branch and try again. There's no need to order the branches in any particular way to accommodate this, so we order them for readability usually.

If you're trying to match CSS grammars against values using a greedy (non-backtracking) parser, you're gonna have a bad time. You have to be able to backtrack.
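
A minimal Python sketch of what backtracking means here, using generators so that every branch can offer each of its possible end positions in turn. The grammar is a simplified stand-in for <position> (a handful of keywords and a crude <length-percentage> test), and all helper names are hypothetical.

```python
def keyword(name):
    def match(values, i):
        if i < len(values) and values[i] == name:
            yield i + 1
    return match

def length_percentage(values, i):
    # Crude stand-in: accept anything ending in % or px.
    if i < len(values) and values[i].endswith(("%", "px")):
        yield i + 1

def alt(*branches):
    """The | combinator: offer every way any branch can match."""
    def match(values, i):
        for branch in branches:
            yield from branch(values, i)
    return match

def seq(*parts):
    """Juxtaposition: try each end of the first part, then the rest."""
    def match(values, i):
        if not parts:
            yield i
            return
        for j in parts[0](values, i):
            yield from seq(*parts[1:])(values, j)
    return match

def matches(grammar, values):
    """True if some path through the grammar consumes every value."""
    return any(end == len(values) for end in grammar(values, 0))

horizontal = alt(keyword("left"), keyword("center"), keyword("right"))
second = alt(keyword("top"), keyword("center"), keyword("bottom"), length_percentage)
# Single-value branch deliberately listed first, as in the spec's ordering.
position = alt(horizontal, seq(horizontal, second))

print(matches(position, ["left", "10%"]))
print(matches(seq(position, keyword("repeat")), ["center", "100%", "repeat"]))
```

The first branch does match left on its own, but since that leaves 10% unconsumed the matcher moves on to the second branch, so the order of the branches does not change the result.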

@cdoublev
Collaborator

cdoublev commented Aug 20, 2021

I wanted to know the expected behavior when the first branch successfully matches the first component value in the input list, if that branch expects a single component value while other branches accept multiple component values. When parsing center 100% repeat against <bg-position> <repeat-style>, the first branch in the expanded definition of <bg-position> accepts a single component value, which center matches successfully. What should I do next with 100% repeat? I think I will come back to this specific issue later, as I feel like I've misunderstood something obvious! 🙃

EDIT: oh ok I got it, it means that I should move "one step backwards" and try with another branch, if any, even if it means matching a branch of a type for which (my current implementation of) the parser thought that it already had a match.

@tabatkins
Copy link
Member

Right, that's a backtracking parser vs a greedy/first-match parser. CSS grammars are intended for use with a backtracking parser.

@cdoublev
Collaborator

cdoublev commented Jun 28, 2022

Could I please have a clarification on definitions marked up with data-dfn-type="function"? I would like confirmation that this annotation allows the function to be referenced as a production, ie. with its name followed by (), both wrapped between < and >.

I'm asking this because I see url() and src() marked up with this attribute in CSS Values and extracted as follows in w3c/webref:

  "<url()>": {
    "prose": "Typically, a <url> is written with the url() or src() functional notations:"
  },
  "<src()>": {
    "prose": "Typically, a <url> is written with the url() or src() functional notations:"
  },

But there is no reference of these productions in any definition, not even in the definition of <url>, eg. with <url> = <url()> | <src()> instead of url(<string> <url-modifier>*) | src(<string> <url-modifier>*).

They may be isolated cases, but it makes me wonder if the only difference with data-dfn-type="type" is that data-dfn-type="function" is used to define the syntax of a component value that is a function, whereas the former defines the syntax of one (but not a function) or more component values. But then why not mark up eg. <supports-decl> with data-dfn-type="simple-block"? Why does this distinction exist between type and function?

While our functions are often uniquely named, they should in fact be treated the same as keywords - potentially context-sensitive.

I believe that the only difference with a function definition marked up with data-dfn-type="value" is that a function marked up with data-dfn-type="function" should be unique, whereas the former is similar to a keyword, as indicated by the above comment quoted from an issue about multiple definitions of fit-content(), marked up with data-dfn-type="value". Is that correct?

@fantasai
Collaborator

This is the function definition pattern I think we should be aiming for:

The <dfn>foo()</dfn> function blah bla.

<pre class=prod>
  <<foo()>> = foo( ... )
</pre>

@tabatkins
Member

The syntax does not describe the use of = to define expansion rules of non-terminals.

This is now defined by a4bfe38

@cdoublev
Collaborator

This is the function definition pattern I think we should be aiming for:

It would fix all of the issues I have with extracting function value definitions (that are not inlined in a context value definition) from the data exposed by w3c/webref. I am just a little worried that this makes the spec authors' job more difficult.

Actually, I do not really understand why the function data type is required (if its grammar cannot be assumed to be context-free), instead of only using value as a general term for a symbol included in a value definition and further defined in relation to this context.

For example, when following the reasoning justifying the need of function, if I wanted to further define (calc-sum) in <calc-value> = <number> | ... | (<calc-sum>), I would need to define it with simple-block as its dfn-type, but there is no such type in Bikeshed.

@tabatkins
Member

Actually, I do not really understand why the function data type is required (if its grammar cannot be assumed to be context-free), instead of only using value as a general term for a symbol included in a value definition and further defined in relation to this context.

Well, I split up the CSS and IDL dfn types too finely when I started Bikeshed originally. Ideally they'd be organized by whether or not they could potentially have name clashes. (For example, in IDL the attribute and method types could/should be combined, but interface has to be separate from those.)

I would need to define it with simple-block as its dfn-type,

This definitely isn't correct, tho. The definition types aren't meant to map to any particular token type. <calc-sum> is a production (dfn-type="type").

@cdoublev
Collaborator

I see. function is intended to avoid a name clash between <type()> and <type>, or value() and value.

w3c/reffy extracts <type()> as a type, not as a function like Bikeshed does. But it is not affected by name clashes.


When you define minmax(min, max) as a function, it is interpreted and extracted as if minmax() = minmax(min, max) were present in the specification, whereas the appropriate value definition depends on the context (<track-size> or <fixed-size>).

However, some functions are context-free but are only defined with fn() = fn(...), or just fn(...). They are extracted as functions by w3c/reffy, which means it is impossible (for me) to differentiate them from minmax(min, max).

I need to resolve <fn()> into fn(...) when I am parsing <fn()>, ie. I need to find a type definition in w3c/reffy data, whose name is <fn()>. I never need to resolve minmax().


6 participants