
GSoC 2025: Investigating Schema Normalization #857

Open
@Julian

Brief Description
JSON Schema is a rich language for expressing constraints on JSON data. If we strictly consider JSON Schema validation (rather than any other use of JSON Schema), in many cases there are multiple ways to express the same constraints. For example, the schema:

{
  "oneOf": [
    {"const": "foo"},
    {"const": "bar"}
  ]
}

will have the same validation outcome on all instances as the schema:

{"enum": ["foo", "bar"]}

One might say that the second schema is "better" than the first, in a sense that could be made precise.
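As a rough illustration of what one such rewrite rule could look like, here is a minimal sketch in Python. The function name `one_of_consts_to_enum` and the helper `_key` are made up for this example and are not taken from any existing tool. The sketch assumes the schema consists of nothing but a `oneOf` whose branches are bare `const` schemas, and it deliberately leaves the schema alone when two branches carry the same value, since `oneOf` rejects a value matched by more than one branch and the `enum` form would accept it.

```python
import json


def _key(value):
    """Serialize a JSON value so that JSON-equal values share a key.

    Integral floats are folded into ints so that 1 and 1.0 compare equal,
    while true and 1 stay distinct (they serialize differently).
    """
    def canon(v):
        if isinstance(v, float) and v.is_integer():
            return int(v)
        if isinstance(v, list):
            return [canon(x) for x in v]
        if isinstance(v, dict):
            return {k: canon(x) for k, x in v.items()}
        return v

    return json.dumps(canon(value), sort_keys=True)


def one_of_consts_to_enum(schema):
    """Hypothetical rule: rewrite a oneOf of distinct bare consts as an enum.

    Any schema that does not match this exact shape is returned unchanged.
    """
    if not isinstance(schema, dict) or set(schema) != {"oneOf"}:
        return schema
    branches = schema["oneOf"]
    if not isinstance(branches, list) or not branches:
        return schema
    if not all(isinstance(b, dict) and set(b) == {"const"} for b in branches):
        return schema
    keys = [_key(b["const"]) for b in branches]
    if len(set(keys)) != len(keys):
        # A repeated value matches two branches, so oneOf rejects it;
        # rewriting to enum would change the validation outcome.
        return schema
    return {"enum": [b["const"] for b in branches]}
```

Applied to the first schema above, this returns the second one. Whether `enum` really is the preferred target form is exactly the kind of decision the normalization rules would need to pin down.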

The same is true for the schemas {"required": ["foo"]} and {"title": "My Schema", "required": ["foo"]}: title is an annotation with no effect on validation, so one might say the first is "better" than the second for the purpose of validation.
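A second rule in the same spirit could drop keywords that never affect validation. The sketch below is only one possible shape for it: the name `strip_annotations` and the three keyword sets are assumptions made for this example, chosen with the Draft 2020-12 keywords in mind, and they are not exhaustive. It only recurses into positions known to hold subschemas, so that a property which happens to be named "title" is not mistaken for the keyword.

```python
# Keywords assumed here to carry no validation behaviour (Draft 2020-12
# meta-data keywords plus $comment). This list is an illustrative choice.
ANNOTATION_KEYWORDS = {
    "title", "description", "default", "examples",
    "deprecated", "readOnly", "writeOnly", "$comment",
}
# Keywords whose value is a single subschema.
SCHEMA_VALUED = {
    "not", "if", "then", "else", "items", "contains",
    "additionalProperties", "propertyNames",
}
# Keywords whose value is a list of subschemas.
SCHEMA_LIST = {"allOf", "anyOf", "oneOf", "prefixItems"}
# Keywords whose value maps arbitrary names to subschemas.
SCHEMA_MAP = {"properties", "patternProperties", "$defs", "dependentSchemas"}


def strip_annotations(schema):
    """Return a copy of `schema` without the annotation keywords above."""
    if not isinstance(schema, dict):
        return schema  # boolean schemas (or non-schemas) are left as they are
    result = {}
    for keyword, value in schema.items():
        if keyword in ANNOTATION_KEYWORDS:
            continue
        if keyword in SCHEMA_VALUED:
            result[keyword] = strip_annotations(value)
        elif keyword in SCHEMA_LIST and isinstance(value, list):
            result[keyword] = [strip_annotations(s) for s in value]
        elif keyword in SCHEMA_MAP and isinstance(value, dict):
            # Keys here are property or definition names, not keywords.
            result[keyword] = {name: strip_annotations(s) for name, s in value.items()}
        else:
            result[keyword] = value
    return result
```

For example, `strip_annotations({"title": "My Schema", "required": ["foo"]})` gives `{"required": ["foo"]}`. Because some users do care about annotations, a rule like this is a natural candidate for being configurable.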

We can define two schemas to be "equivalent" if any instance is valid under one if and only if it is valid under the other. Given two equivalent schemas S and S', we might wish to define an algorithm that transforms both into a single form which is "canonical" or "normal", such as the preferred schemas above.
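Equivalence in this sense quantifies over every possible instance, so it cannot be established by testing alone, but a finite sample of instances can at least refute it. The following is a minimal sketch of that idea. The toy `is_valid` function is an assumption of this example: it understands only `const`, `enum`, `oneOf` and `required` and silently ignores every other keyword, so a real check would swap in a complete validator.

```python
def json_equal(a, b):
    """Compare two JSON values: 1 equals 1.0, but true does not equal 1."""
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(json_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(json_equal(a[k], b[k]) for k in a)
    return type(a) is type(b) and a == b


def is_valid(schema, instance):
    """Toy validator: const, enum, oneOf and required only."""
    if isinstance(schema, bool):
        return schema
    if "const" in schema and not json_equal(schema["const"], instance):
        return False
    if "enum" in schema and not any(json_equal(v, instance) for v in schema["enum"]):
        return False
    if "oneOf" in schema:
        # Exactly one branch must accept the instance.
        if sum(is_valid(s, instance) for s in schema["oneOf"]) != 1:
            return False
    if "required" in schema and isinstance(instance, dict):
        if any(name not in instance for name in schema["required"]):
            return False
    return True


def disagreements(left, right, instances):
    """Return the sampled instances on which the two schemas disagree.

    An empty result does not prove equivalence; a non-empty one refutes it.
    """
    return [i for i in instances if is_valid(left, i) != is_valid(right, i)]


SAMPLE = ["foo", "bar", "baz", 1, None, [], {}, {"foo": 1}]
```

On this sample, `disagreements` returns an empty list for both pairs of schemas discussed above, while comparing `{"enum": ["foo"]}` with `{"enum": ["foo", "bar"]}` surfaces `"bar"` as a counterexample.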

There are existing attempts to do this for various use cases, but there is no central place where a self-contained set of normalization rules is written down, and no self-contained tool that performs the procedure. Let's try to write a simple one!

Expected Outcomes

  • Investigate the existing implementations of normalization in the wild. There are at least two known ones, one being here.
  • Define a set of normalization rules, with configurability for cases where there are multiple reasonable canonical forms
  • Define a set of test cases for schemas which are equivalent under these rules, and for the target canonical form for each set of schemas
  • Write a Python library which performs the normalization and emits the normalized schema
  • Empirically test our normalization procedure by running normalized schemas through Bowtie and comparing whether a given implementation returns the same results

Skills Required

  • An existing understanding of JSON Schema's keywords, which can be used to think about areas which might create possible "denormalization" (e.g. keywords which when used together overlap)
  • Familiarity writing Python, and ideally using JSON Schema from Python
  • Experience testing pieces of software by writing test cases, here likely in the form of writing JSON Schema + instance examples
  • Diligence in reading and understanding the existing procedures (in the link above, as well as in a number of JSON Schema journal articles) and the ability to compare these pieces of previous work with one another

Mentors
@Julian

Expected Difficulty
medium

Expected Time Commitment
175 hours
