-
Notifications
You must be signed in to change notification settings - Fork 357
Optimize JSON Reader Whitespace Skipping with Vectorized Operations #3381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
/AzurePipelines run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request optimizes JSON whitespace skipping by implementing vectorized operations using SIMD instructions to improve parsing performance. The optimization leverages Vector for cross-platform SIMD support to process multiple characters simultaneously instead of the previous character-by-character approach.
- Replaces scalar whitespace checking with vectorized operations using System.Numerics.Vector
- Introduces helper methods for vectorized and scalar whitespace processing
- Maintains fallback to scalar operations for smaller buffers or when SIMD is not available
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Co-authored-by: Copilot <[email protected]>
|
/AzurePipelines run |
…OData/odata.net into fix/vectorizing-skipwhitespaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
/AzurePipelines run |
b84defa to
c1a27cc
Compare
| for (int i = 0; i < vectorSize; i++) | ||
| { | ||
| // 0 means NOT whitespace | ||
| if (isWhitespace[i] == ushort.MinValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the possible values for each isWhiteSpace element? Is each element either all 1 bits or all 0 bits? Or are there scenarios where some bits 0 and some are 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each position in isWhitespace will be ushort.MaxValue(65535) if the character is whitespace, or 0 otherwise.
Consider the following:
Vector<ushort> isWhitespace = Vector.Equals(currentVector, spaceCharVector) |
Vector.Equals(currentVector, tabCharVector) |
Vector.Equals(currentVector, newlineCharVector) |
Vector.Equals(currentVector, carriageCharVector);The vectors represent the whitespace characters ('', '\t', '\n', '\r') repeated across all the elements of the vector:
Vector<ushort> spaceCharVector = new Vector<ushort>(' '); // <32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32>
Vector<ushort> tabCharVector = new Vector<ushort>('\t'); // <9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9>
Vector<ushort> newlineCharVector = new Vector<ushort>('\n'); // <10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10>
Vector<ushort> carriageCharVector = new Vector<ushort>('\r'); // <13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13>To check for whitespaces, we use ushort.MaxValue and 0
ushortMaxValueVectorwithVector.EqualsAll(...): All elements areushort.MaxValue(used to check if all positions are whitespace).zeroVectorwithVector.EqualsAll(...): All elements are0(used to check if no positions are whitespace).ushortMaxValueVectorwithVector.EqualsAny(...): Checks if it contains whitespacezeroVectorwithVector.EqualsAny(...): Checks if it contains non-whitespace
Vector<ushort> ushortMaxValueVector = new Vector<ushort>(ushort.MaxValue);
Vector<ushort> zeroVector = new Vector<ushort>(0);Here is the full code
Vector<ushort> ushortMaxValueVector = new Vector<ushort>(ushort.MaxValue);
Vector<ushort> zeroVector = new Vector<ushort>(0);
var inputWithNoWhitespace = new string('a', 100);
var inputWithAllWhitespace = " \t\n\r \n \t\n\r \t\n\n\n\n \t\t\t\t\r\r\r\r \r\r\n\r\r";
var inputWithSomeWhitespace = "Hello, \tWorld!\nThis is a test string.\r\nIt contains some whitespace characters.";
var currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithNoWhitespace.AsSpan(0, Vector<ushort>.Count)));
Vector<ushort> isWhitespaceWithNoWhitespace = GetWhitespaceMask(currentVector); // <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithNoWhitespace, ushortMaxValueVector)); // False because there are no whitespace characters
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithNoWhitespace, zeroVector)); // True because there are no whitespace characters
currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithAllWhitespace.AsSpan(0, Vector<ushort>.Count)));
Vector<ushort> isWhitespaceWithAllWhitespace = GetWhitespaceMask(currentVector); // <65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535>
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithAllWhitespace, ushortMaxValueVector)); // True because all characters are whitespace
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithAllWhitespace, zeroVector)); // False because all characters are whitespace
currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithSomeWhitespace.AsSpan(0, Vector<ushort>.Count)));
Vector<ushort> isWhitespaceWithSomeWhitespace = GetWhitespaceMask(currentVector); // <0, 0, 0, 0, 0, 0, 65535, 65535, 0, 0, 0, 0, 0, 0, 65535, 0>
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithSomeWhitespace, ushortMaxValueVector)); // False because not all characters are whitespace
Console.WriteLine(Vector.EqualsAll(isWhitespaceWithSomeWhitespace, zeroVector)); // False because there are some whitespace characters
static Vector<ushort> GetWhitespaceMask(Vector<ushort> currentVector)
{
Vector<ushort> spaceCharVector = new Vector<ushort>(' ');
Vector<ushort> tabCharVector = new Vector<ushort>('\t');
Vector<ushort> newlineCharVector = new Vector<ushort>('\n');
Vector<ushort> carriageCharVector = new Vector<ushort>('\r');
return Vector.Equals(currentVector, spaceCharVector) |
Vector.Equals(currentVector, tabCharVector) |
Vector.Equals(currentVector, newlineCharVector) |
Vector.Equals(currentVector, carriageCharVector);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WanjohiSammy Is there a reason you went for handwritten SIMD rather than the built-in, vectorized primitives approach? From my reading, IndexOfAny/IndexOfAnyExcept on spans are implemented with SIMD‑aware algorithms and often outperform handwritten loops, especially across CPUs with SSE2/AVX2/ARM NEON, without occasioning the challenge of maintaining this vectorization logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored the whitespace skipping logic to eliminate manual vectorization and replaced it with the recommended approach using SearchValues and IndexOfAnyExcept. This simplifies the code, improves maintainability, and leverages efficient runtime vectorization.
| do | ||
| { | ||
| for (; this.tokenStartIndex < this.storedCharacterCount; this.tokenStartIndex++) | ||
| int remaining = this.storedCharacterCount - this.tokenStartIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method only called when there's a whitespace character detected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. The method is just called regardless of whether there is whitespace or not. It returns true if the first non-whitespace is found and false otherwise.
ff6610b to
1a462ab
Compare
|
/AzurePipelines run |
Co-authored-by: Copilot <[email protected]>
1a462ab to
b89446e
Compare
|
@WanjohiSammy My initial view about this change is that if the vectorized approach you went for materially increases code complexity, the bar for measurable wins need to be high and well-documented. You could benchmark your approach against the suggestion to use |
|
Especially if you're going to keep the custom vectorized code, I'd suggest benchmarks against minified, lightly spaced, highly spaced payloads just so we get an idea about the performance across the different categories. Specifically to confirm there's no regressions on minified payloads with little to know whitespace characters. In addition, determine if any of the custom vectorized code can be put in shareable helper method that can be reused from the different location |
| Assert.Equal(expected, await reader.GetValueAsync()); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd make sense to add tests that specifically validate vectorization doesn't change semantics for the following edge cases:
- Buffer boundary:
\rat the end of buffer and\nat the beginning of the buffer - CRLF. The number of characters loaded into the buffer can be determined from the relevantconstinJsonReaderso it should be able to test this - All-whitespace buffer: Long run of whitespace to stress the refill path and advance logic
- Different category of payloads - minified, lightly spaced, heavily spaced (We should do benchmarks around the same and show statistically significant speedups where whitespace is abundant; but show no regressions on minified payloads)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added comprehensive tests using a variety of large string inputs with different combinations of whitespace characters (spaces, tabs, newlines, carriage returns) surrounding and within JSON string values. The test data includes cases such as:
public static TheoryData<string, string> LargeStringsWithWhitespace()
{
return new TheoryData<string, string>
{
{ new string(' ', 10000) + "\"HelloWorld\"" + new string(' ', 10000), "HelloWorld" },
{ new string('\t', 5000) + "\"TestString\"" + new string('\n', 5000), "TestString" },
{ new string('\r', 8000) + "\"WhitespaceTest\"" + new string(' ', 2000), "WhitespaceTest" },
{ new string(' ', 2500) + new string('\n', 2500) + "\"MixedWhitespace\"" + new string('\t', 2500) + new string('\r', 2500), "MixedWhitespace" },
{ new string(' ', 5) + new string('\n', 5) + "\"Hello\"" + new string(' ', 5) + new string('\t', 5) + "\"World\"" + new string('\r', 5) + new string(' ', 5) + new string('\n', 5) + "!", "Hello" },
{ new string(' ', 5) + new string('\n', 5) + "\"Hello" + new string(' ', 5) + new string('\t', 5) + "World" + new string('\r', 5) + new string(' ', 5) + new string('\n', 5) + "!\"", "Hello \t\t\t\t\tWorld\r\r\r\r\r \n\n\n\n\n!" },
};
}
| { | ||
| for (; this.tokenStartIndex < this.storedCharacterCount; this.tokenStartIndex++) | ||
| int remaining = this.storedCharacterCount - this.tokenStartIndex; | ||
| int vectorSize = Vector<ushort>.Count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WanjohiSammy My suggestion is to switch to ReadOnlySpan<char>.IndexOfAnyExcept with a cached SearchValues<char> of the 4 JSON whitespace chars (' ', '\t', '\r', '\n' - only). This is already vectorized by the runtime and keeps the code tiny. I think it should be at least as fast and much easier to maintain across x64/ARM64.
If custom vectorization logic should remain, add a small-span fast-path (e.g. < 16 or < 32 chars) to avoid paying setup cost on short inputs where SIMD can be slower than scalar
Again, if the custom handwritten logic should remain, factor the span-scan into one helper to avoid duplication and divergence between sync and async flavours - if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have incorporated the recommended solution:
#3381 (comment)
|
@WanjohiSammy Is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment, otherwise LGTM
@gathogojr |
|
/AzurePipelines run |
|
@WanjohiSammy Given our update strategy, it's better to use Update with rebase than Update with merge commit - that is usually selected by default, and what you appear to use often:
|

Issues
This pull request fixes #xxx.
Description
This PR improves perf of JSON parsing by implementing vectorized whitespace skipping operations using ReadOnlySpan.IndexOfAnyExcept and SearchValues.
Before
After
Benchmark project
/WanjohiSammy/ODataJsonReaderBenchmarks-3
Input data:
https://github.com/WanjohiSammy/ODataJsonReaderBenchmarks-3/blob/main/ODataJsonReaderBenchmarks/SamplePayloadWithValues.json
Checklist (Uncheck if it is not completed)
Additional work necessary
If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.
Repository notes
Team members can start a CI build by adding a comment with the text
/AzurePipelines runto a PR. A bot may respond indicating that there is no pipeline associated with the pull request. This can be ignored if the build is triggered.Team members should not trigger a build this way for pull requests coming from forked repositories. They should instead trigger the build manually by setting the "branch" to
refs/pull/{prId}/mergewhere{prId}is the ID of the PR.