Skip to content

Conversation

@WanjohiSammy
Copy link
Member

@WanjohiSammy WanjohiSammy commented Oct 13, 2025

Issues

This pull request fixes #xxx.

Description

This PR improves perf of JSON parsing by implementing vectorized whitespace skipping operations using ReadOnlySpan.IndexOfAnyExcept and SearchValues.

Before

image

After

image

Benchmark project

/WanjohiSammy/ODataJsonReaderBenchmarks-3

Input data:
https://github.com/WanjohiSammy/ODataJsonReaderBenchmarks-3/blob/main/ODataJsonReaderBenchmarks/SamplePayloadWithValues.json

Checklist (Uncheck if it is not completed)

  • Test cases added
  • Build and test with one-click build and test script passed

Additional work necessary

If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.

Repository notes

Team members can start a CI build by adding a comment with the text /AzurePipelines run to a PR. A bot may respond indicating that there is no pipeline associated with the pull request. This can be ignored if the build is triggered.

Team members should not trigger a build this way for pull requests coming from forked repositories. They should instead trigger the build manually by setting the "branch" to refs/pull/{prId}/merge where {prId} is the ID of the PR.

@WanjohiSammy
Copy link
Member Author

/AzurePipelines run

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request optimizes JSON whitespace skipping by implementing vectorized operations using SIMD instructions to improve parsing performance. The optimization leverages Vector for cross-platform SIMD support to process multiple characters simultaneously instead of the previous character-by-character approach.

  • Replaces scalar whitespace checking with vectorized operations using System.Numerics.Vector
  • Introduces helper methods for vectorized and scalar whitespace processing
  • Maintains fallback to scalar operations for smaller buffers or when SIMD is not available

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@WanjohiSammy WanjohiSammy marked this pull request as ready for review October 13, 2025 08:03
@WanjohiSammy
Copy link
Member Author

/AzurePipelines run

@WanjohiSammy WanjohiSammy requested a review from Copilot October 14, 2025 18:16
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@WanjohiSammy
Copy link
Member Author

/AzurePipelines run

@WanjohiSammy WanjohiSammy force-pushed the fix/vectorizing-skipwhitespaces branch from b84defa to c1a27cc Compare October 14, 2025 18:34
for (int i = 0; i < vectorSize; i++)
{
// 0 means NOT whitespace
if (isWhitespace[i] == ushort.MinValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the possible values for each isWhiteSpace element? Is each element either all 1 bits or all 0 bits? Or are there scenarios where some bits 0 and some are 1?

Copy link
Member Author

@WanjohiSammy WanjohiSammy Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each position in isWhitespace will be ushort.MaxValue(65535) if the character is whitespace, or 0 otherwise.

Consider the following:

 Vector<ushort> isWhitespace = Vector.Equals(currentVector, spaceCharVector) |
                              Vector.Equals(currentVector, tabCharVector) |
                                  Vector.Equals(currentVector, newlineCharVector) |
                                  Vector.Equals(currentVector, carriageCharVector);

The vectors represent the whitespace characters ('', '\t', '\n', '\r') repeated across all the elements of the vector:

Vector<ushort> spaceCharVector = new Vector<ushort>(' '); // <32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32>
Vector<ushort> tabCharVector = new Vector<ushort>('\t'); // <9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9>
Vector<ushort> newlineCharVector = new Vector<ushort>('\n'); // <10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10>
Vector<ushort> carriageCharVector = new Vector<ushort>('\r'); // <13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13>

To check for whitespaces, we use ushort.MaxValue and 0

  • ushortMaxValueVector with Vector.EqualsAll(...): All elements are ushort.MaxValue (used to check if all positions are whitespace).
  • zeroVector with Vector.EqualsAll(...): All elements are 0 (used to check if no positions are whitespace).
  • ushortMaxValueVector with Vector.EqualsAny(...): Checks if it contains whitespace
  • zeroVector with Vector.EqualsAny(...): Checks if it contains non-whitespace
Vector<ushort> ushortMaxValueVector = new Vector<ushort>(ushort.MaxValue);
Vector<ushort> zeroVector = new Vector<ushort>(0);

Here is the full code

Vector<ushort> ushortMaxValueVector = new Vector<ushort>(ushort.MaxValue);
Vector<ushort> zeroVector = new Vector<ushort>(0);

var inputWithNoWhitespace = new string('a', 100);
var inputWithAllWhitespace = " \t\n\r \n           \t\n\r \t\n\n\n\n    \t\t\t\t\r\r\r\r     \r\r\n\r\r";
var inputWithSomeWhitespace = "Hello, \tWorld!\nThis is a test string.\r\nIt contains some whitespace characters.";

var currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithNoWhitespace.AsSpan(0, Vector<ushort>.Count)));
Vector<ushort> isWhitespaceWithNoWhitespace = GetWhitespaceMask(currentVector); // <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>

Console.WriteLine(Vector.EqualsAll(isWhitespaceWithNoWhitespace, ushortMaxValueVector)); // False because there are no whitespace characters
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithNoWhitespace, zeroVector)); // True because there are no whitespace characters

currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithAllWhitespace.AsSpan(0, Vector<ushort>.Count)));
Vector<ushort> isWhitespaceWithAllWhitespace = GetWhitespaceMask(currentVector); // <65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535>

Console.WriteLine(Vector.EqualsAll(isWhitespaceWithAllWhitespace, ushortMaxValueVector)); // True because all characters are whitespace
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithAllWhitespace, zeroVector)); // False because all characters are whitespace

currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithSomeWhitespace.AsSpan(0, Vector<ushort>.Count)));
Vector<ushort> isWhitespaceWithSomeWhitespace = GetWhitespaceMask(currentVector); // <0, 0, 0, 0, 0, 0, 65535, 65535, 0, 0, 0, 0, 0, 0, 65535, 0>

Console.WriteLine(Vector.EqualsAll(isWhitespaceWithSomeWhitespace, ushortMaxValueVector)); // False because not all characters are whitespace
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithSomeWhitespace, zeroVector)); // False because there are some whitespace characters

static Vector<ushort> GetWhitespaceMask(Vector<ushort> currentVector)
{
    Vector<ushort> spaceCharVector = new Vector<ushort>(' ');
    Vector<ushort> tabCharVector = new Vector<ushort>('\t');
    Vector<ushort> newlineCharVector = new Vector<ushort>('\n');
    Vector<ushort> carriageCharVector = new Vector<ushort>('\r');

    return Vector.Equals(currentVector, spaceCharVector) |
        Vector.Equals(currentVector, tabCharVector) |
        Vector.Equals(currentVector, newlineCharVector) |
        Vector.Equals(currentVector, carriageCharVector);
}

Copy link
Contributor

@gathogojr gathogojr Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WanjohiSammy Is there a reason you went for handwritten SIMD rather than the built-in, vectorized primitives approach? From my reading, IndexOfAny/IndexOfAnyExcept on spans are implemented with SIMD‑aware algorithms and often outperform handwritten loops, especially across CPUs with SSE2/AVX2/ARM NEON, without occasioning the challenge of maintaining this vectorization logic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored the whitespace skipping logic to eliminate manual vectorization and replaced it with the recommended approach using SearchValues and IndexOfAnyExcept. This simplifies the code, improves maintainability, and leverages efficient runtime vectorization.

do
{
for (; this.tokenStartIndex < this.storedCharacterCount; this.tokenStartIndex++)
int remaining = this.storedCharacterCount - this.tokenStartIndex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method only called when there's a whitespace character detected?

Copy link
Member Author

@WanjohiSammy WanjohiSammy Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. The method is just called regardless of whether there is whitespace or not. It returns true if the first non-whitespace is found and false otherwise.

@WanjohiSammy WanjohiSammy requested a review from habbes October 21, 2025 08:13
@WanjohiSammy WanjohiSammy force-pushed the fix/vectorizing-skipwhitespaces branch from ff6610b to 1a462ab Compare October 21, 2025 09:23
@WanjohiSammy
Copy link
Member Author

/AzurePipelines run

@WanjohiSammy WanjohiSammy force-pushed the fix/vectorizing-skipwhitespaces branch from 1a462ab to b89446e Compare October 22, 2025 20:47
@gathogojr
Copy link
Contributor

gathogojr commented Oct 23, 2025

@WanjohiSammy My initial view about this change is that if the vectorized approach you went for materially increases code complexity, the bar for measurable wins need to be high and well-documented. You could benchmark your approach against the suggestion to use IndexOfAnyExcept and determine if that would be a cleaner alternative while still deriving the same benefits

@gathogojr
Copy link
Contributor

Especially if you're going to keep the custom vectorized code, I'd suggest benchmarks against minified, lightly spaced, highly spaced payloads just so we get an idea about the performance across the different categories. Specifically to confirm there's no regressions on minified payloads with little to know whitespace characters.

In addition, determine if any of the custom vectorized code can be put in shareable helper method that can be reused from the different location

Assert.Equal(expected, await reader.GetValueAsync());
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd make sense to add tests that specifically validate vectorization doesn't change semantics for the following edge cases:

  • Buffer boundary: \r at the end of buffer and \n at the beginning of the buffer - CRLF. The number of characters loaded into the buffer can be determined from the relevant const in JsonReader so it should be able to test this
  • All-whitespace buffer: Long run of whitespace to stress the refill path and advance logic
  • Different category of payloads - minified, lightly spaced, heavily spaced (We should do benchmarks around the same and show statistically significant speedups where whitespace is abundant; but show no regressions on minified payloads)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added comprehensive tests using a variety of large string inputs with different combinations of whitespace characters (spaces, tabs, newlines, carriage returns) surrounding and within JSON string values. The test data includes cases such as:

public static TheoryData<string, string> LargeStringsWithWhitespace()
{
    return new TheoryData<string, string>
    {
        { new string(' ', 10000) + "\"HelloWorld\"" + new string(' ', 10000), "HelloWorld" },
        { new string('\t', 5000) + "\"TestString\"" + new string('\n', 5000), "TestString" },
        { new string('\r', 8000) + "\"WhitespaceTest\"" + new string(' ', 2000), "WhitespaceTest" },
        { new string(' ', 2500) + new string('\n', 2500) + "\"MixedWhitespace\"" + new string('\t', 2500) + new string('\r', 2500), "MixedWhitespace" },
        { new string(' ', 5) + new string('\n', 5) + "\"Hello\"" + new string(' ', 5) + new string('\t', 5) + "\"World\"" + new string('\r', 5) + new string(' ', 5) + new string('\n', 5) + "!", "Hello" },
        { new string(' ', 5) + new string('\n', 5) + "\"Hello" + new string(' ', 5) + new string('\t', 5) + "World" + new string('\r', 5) + new string(' ', 5) + new string('\n', 5) + "!\"", "Hello     \t\t\t\t\tWorld\r\r\r\r\r     \n\n\n\n\n!" },
    };
}

{
for (; this.tokenStartIndex < this.storedCharacterCount; this.tokenStartIndex++)
int remaining = this.storedCharacterCount - this.tokenStartIndex;
int vectorSize = Vector<ushort>.Count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WanjohiSammy My suggestion is to switch to ReadOnlySpan<char>.IndexOfAnyExcept with a cached SearchValues<char> of the 4 JSON whitespace chars (' ', '\t', '\r', '\n' - only). This is already vectorized by the runtime and keeps the code tiny. I think it should be at least as fast and much easier to maintain across x64/ARM64.

If custom vectorization logic should remain, add a small-span fast-path (e.g. < 16 or < 32 chars) to avoid paying setup cost on short inputs where SIMD can be slower than scalar

Again, if the custom handwritten logic should remain, factor the span-scan into one helper to avoid duplication and divergence between sync and async flavours - if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have incorporated the recommended solution:
#3381 (comment)

@gathogojr
Copy link
Contributor

@WanjohiSammy Is IsWhitespaceCharacter still being used or can we get rid of it?

Copy link
Contributor

@gathogojr gathogojr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment, otherwise LGTM

@WanjohiSammy
Copy link
Member Author

@WanjohiSammy Is IsWhitespaceCharacter still being used or can we get rid of it?

@gathogojr IsWhitespaceCharacter is much used in other like 9 places.

@WanjohiSammy
Copy link
Member Author

/AzurePipelines run

@gathogojr
Copy link
Contributor

@WanjohiSammy Given our update strategy, it's better to use Update with rebase than Update with merge commit - that is usually selected by default, and what you appear to use often:

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants