Optimize JSON Reader Whitespace Skipping with Vectorized Operations #3381

WanjohiSammy · 2025-10-13T06:50:49Z

Issues

This pull request fixes #xxx.

Description

This PR improves perf of JSON parsing by implementing vectorized whitespace skipping operations using ReadOnlySpan.IndexOfAnyExcept and SearchValues.

Before

After

Benchmark project

/WanjohiSammy/ODataJsonReaderBenchmarks-3

Input data:
https://github.com/WanjohiSammy/ODataJsonReaderBenchmarks-3/blob/main/ODataJsonReaderBenchmarks/SamplePayloadWithValues.json

Checklist (Uncheck if it is not completed)

Test cases added
Build and test with one-click build and test script passed

Additional work necessary

If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.

Repository notes

Team members can start a CI build by adding a comment with the text /AzurePipelines run to a PR. A bot may respond indicating that there is no pipeline associated with the pull request. This can be ignored if the build is triggered.

Team members should not trigger a build this way for pull requests coming from forked repositories. They should instead trigger the build manually by setting the "branch" to refs/pull/{prId}/merge where {prId} is the ID of the PR.

…paces

WanjohiSammy · 2025-10-13T06:59:27Z

/AzurePipelines run

Copilot

Pull Request Overview

This pull request optimizes JSON whitespace skipping by implementing vectorized operations using SIMD instructions to improve parsing performance. The optimization leverages Vector for cross-platform SIMD support to process multiple characters simultaneously instead of the previous character-by-character approach.

Replaces scalar whitespace checking with vectorized operations using System.Numerics.Vector
Introduces helper methods for vectorized and scalar whitespace processing
Maintains fallback to scalar operations for smaller buffers or when SIMD is not available

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

src/Microsoft.OData.Core/Json/JsonReader.cs

Co-authored-by: Copilot <[email protected]>

WanjohiSammy · 2025-10-14T07:47:48Z

/AzurePipelines run

…OData/odata.net into fix/vectorizing-skipwhitespaces

Copilot

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

src/Microsoft.OData.Core/Json/JsonReader.cs

WanjohiSammy · 2025-10-14T18:22:33Z

/AzurePipelines run

habbes · 2025-10-15T08:17:00Z

src/Microsoft.OData.Core/Json/JsonReader.cs

+                            for (int i = 0; i < vectorSize; i++)
+                            {
+                                // 0 means NOT whitespace
+                                if (isWhitespace[i] == ushort.MinValue)


What are the possible values for each isWhiteSpace element? Is each element either all 1 bits or all 0 bits? Or are there scenarios where some bits 0 and some are 1?

Each position in isWhitespace will be ushort.MaxValue(65535) if the character is whitespace, or 0 otherwise.

Consider the following:

Vector<ushort> isWhitespace = Vector.Equals(currentVector, spaceCharVector) | Vector.Equals(currentVector, tabCharVector) | Vector.Equals(currentVector, newlineCharVector) | Vector.Equals(currentVector, carriageCharVector);

The vectors represent the whitespace characters ('', '\t', '\n', '\r') repeated across all the elements of the vector:

Vector<ushort> spaceCharVector = new Vector<ushort>(' '); // <32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32> Vector<ushort> tabCharVector = new Vector<ushort>('\t'); // <9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9> Vector<ushort> newlineCharVector = new Vector<ushort>('\n'); // <10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10> Vector<ushort> carriageCharVector = new Vector<ushort>('\r'); // <13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13>

To check for whitespaces, we use ushort.MaxValue and 0

ushortMaxValueVector with Vector.EqualsAll(...): All elements are ushort.MaxValue (used to check if all positions are whitespace).

zeroVector with Vector.EqualsAll(...): All elements are 0 (used to check if no positions are whitespace).

ushortMaxValueVector with Vector.EqualsAny(...): Checks if it contains whitespace

zeroVector with Vector.EqualsAny(...): Checks if it contains non-whitespace

Vector<ushort> ushortMaxValueVector = new Vector<ushort>(ushort.MaxValue); Vector<ushort> zeroVector = new Vector<ushort>(0);

Here is the full code

Vector<ushort> ushortMaxValueVector = new Vector<ushort>(ushort.MaxValue); Vector<ushort> zeroVector = new Vector<ushort>(0); var inputWithNoWhitespace = new string('a', 100); var inputWithAllWhitespace = " \t\n\r \n \t\n\r \t\n\n\n\n \t\t\t\t\r\r\r\r \r\r\n\r\r"; var inputWithSomeWhitespace = "Hello, \tWorld!\nThis is a test string.\r\nIt contains some whitespace characters."; var currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithNoWhitespace.AsSpan(0, Vector<ushort>.Count))); Vector<ushort> isWhitespaceWithNoWhitespace = GetWhitespaceMask(currentVector); // <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> Console.WriteLine(Vector.EqualsAll(isWhitespaceWithNoWhitespace, ushortMaxValueVector)); // False because there are no whitespace characters Console.WriteLine(Vector.EqualsAll(isWhitespaceWithNoWhitespace, zeroVector)); // True because there are no whitespace characters currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithAllWhitespace.AsSpan(0, Vector<ushort>.Count))); Vector<ushort> isWhitespaceWithAllWhitespace = GetWhitespaceMask(currentVector); // <65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535> Console.WriteLine(Vector.EqualsAll(isWhitespaceWithAllWhitespace, ushortMaxValueVector)); // True because all characters are whitespace Console.WriteLine(Vector.EqualsAll(isWhitespaceWithAllWhitespace, zeroVector)); // False because all characters are whitespace currentVector = new Vector<ushort>(MemoryMarshal.Cast<char, ushort>(inputWithSomeWhitespace.AsSpan(0, Vector<ushort>.Count))); Vector<ushort> isWhitespaceWithSomeWhitespace = GetWhitespaceMask(currentVector); // <0, 0, 0, 0, 0, 0, 65535, 65535, 0, 0, 0, 0, 0, 0, 65535, 0> Console.WriteLine(Vector.EqualsAll(isWhitespaceWithSomeWhitespace, ushortMaxValueVector)); // False because not all characters are whitespace Console.WriteLine(Vector.EqualsAll(isWhitespaceWithSomeWhitespace, zeroVector)); // False because there are some whitespace characters static Vector<ushort> GetWhitespaceMask(Vector<ushort> currentVector) { Vector<ushort> spaceCharVector = new Vector<ushort>(' '); Vector<ushort> tabCharVector = new Vector<ushort>('\t'); Vector<ushort> newlineCharVector = new Vector<ushort>('\n'); Vector<ushort> carriageCharVector = new Vector<ushort>('\r'); return Vector.Equals(currentVector, spaceCharVector) | Vector.Equals(currentVector, tabCharVector) | Vector.Equals(currentVector, newlineCharVector) | Vector.Equals(currentVector, carriageCharVector); }

@WanjohiSammy Is there a reason you went for handwritten SIMD rather than the built-in, vectorized primitives approach? From my reading, IndexOfAny/IndexOfAnyExcept on spans are implemented with SIMD‑aware algorithms and often outperform handwritten loops, especially across CPUs with SSE2/AVX2/ARM NEON, without occasioning the challenge of maintaining this vectorization logic

I have refactored the whitespace skipping logic to eliminate manual vectorization and replaced it with the recommended approach using SearchValues and IndexOfAnyExcept. This simplifies the code, improves maintainability, and leverages efficient runtime vectorization.

habbes · 2025-10-15T08:17:39Z

src/Microsoft.OData.Core/Json/JsonReader.cs

            do
            {
-                for (; this.tokenStartIndex < this.storedCharacterCount; this.tokenStartIndex++)
+                int remaining = this.storedCharacterCount - this.tokenStartIndex;


Is this method only called when there's a whitespace character detected?

Not really. The method is just called regardless of whether there is whitespace or not. It returns true if the first non-whitespace is found and false otherwise.

src/Microsoft.OData.Core/Json/JsonReader.cs

WanjohiSammy · 2025-10-21T09:24:06Z

/AzurePipelines run

Co-authored-by: Copilot <[email protected]>

gathogojr · 2025-10-23T06:22:20Z

@WanjohiSammy My initial view about this change is that if the vectorized approach you went for materially increases code complexity, the bar for measurable wins need to be high and well-documented. You could benchmark your approach against the suggestion to use IndexOfAnyExcept and determine if that would be a cleaner alternative while still deriving the same benefits

gathogojr · 2025-10-23T06:38:25Z

Especially if you're going to keep the custom vectorized code, I'd suggest benchmarks against minified, lightly spaced, highly spaced payloads just so we get an idea about the performance across the different categories. Specifically to confirm there's no regressions on minified payloads with little to know whitespace characters.

In addition, determine if any of the custom vectorized code can be put in shareable helper method that can be reused from the different location

gathogojr · 2025-10-23T06:47:08Z

test/UnitTests/Microsoft.OData.Core.Tests/Json/JsonReaderTests.cs

+                Assert.Equal(expected, await reader.GetValueAsync());
+            }
+        }
+


It'd make sense to add tests that specifically validate vectorization doesn't change semantics for the following edge cases:

Buffer boundary: \r at the end of buffer and \n at the beginning of the buffer - CRLF. The number of characters loaded into the buffer can be determined from the relevant const in JsonReader so it should be able to test this

All-whitespace buffer: Long run of whitespace to stress the refill path and advance logic

Different category of payloads - minified, lightly spaced, heavily spaced (We should do benchmarks around the same and show statistically significant speedups where whitespace is abundant; but show no regressions on minified payloads)

I have added comprehensive tests using a variety of large string inputs with different combinations of whitespace characters (spaces, tabs, newlines, carriage returns) surrounding and within JSON string values. The test data includes cases such as:

public static TheoryData<string, string> LargeStringsWithWhitespace() { return new TheoryData<string, string> { { new string(' ', 10000) + "\"HelloWorld\"" + new string(' ', 10000), "HelloWorld" }, { new string('\t', 5000) + "\"TestString\"" + new string('\n', 5000), "TestString" }, { new string('\r', 8000) + "\"WhitespaceTest\"" + new string(' ', 2000), "WhitespaceTest" }, { new string(' ', 2500) + new string('\n', 2500) + "\"MixedWhitespace\"" + new string('\t', 2500) + new string('\r', 2500), "MixedWhitespace" }, { new string(' ', 5) + new string('\n', 5) + "\"Hello\"" + new string(' ', 5) + new string('\t', 5) + "\"World\"" + new string('\r', 5) + new string(' ', 5) + new string('\n', 5) + "!", "Hello" }, { new string(' ', 5) + new string('\n', 5) + "\"Hello" + new string(' ', 5) + new string('\t', 5) + "World" + new string('\r', 5) + new string(' ', 5) + new string('\n', 5) + "!\"", "Hello \t\t\t\t\tWorld\r\r\r\r\r \n\n\n\n\n!" }, }; }

gathogojr · 2025-10-23T06:54:55Z

src/Microsoft.OData.Core/Json/JsonReader.cs

            {
-                for (; this.tokenStartIndex < this.storedCharacterCount; this.tokenStartIndex++)
+                int remaining = this.storedCharacterCount - this.tokenStartIndex;
+                int vectorSize = Vector<ushort>.Count;


@WanjohiSammy My suggestion is to switch to ReadOnlySpan<char>.IndexOfAnyExcept with a cached SearchValues<char> of the 4 JSON whitespace chars (' ', '\t', '\r', '\n' - only). This is already vectorized by the runtime and keeps the code tiny. I think it should be at least as fast and much easier to maintain across x64/ARM64.

If custom vectorization logic should remain, add a small-span fast-path (e.g. < 16 or < 32 chars) to avoid paying setup cost on short inputs where SIMD can be slower than scalar

Again, if the custom handwritten logic should remain, factor the span-scan into one helper to avoid duplication and divergence between sync and async flavours - if possible.

I have incorporated the recommended solution:
#3381 (comment)

…paces

gathogojr · 2025-11-05T09:01:15Z

@WanjohiSammy Is IsWhitespaceCharacter still being used or can we get rid of it?

gathogojr

Left a comment, otherwise LGTM

WanjohiSammy · 2025-11-05T09:49:24Z

@WanjohiSammy Is IsWhitespaceCharacter still being used or can we get rid of it?

@gathogojr IsWhitespaceCharacter is much used in other like 9 places.

WanjohiSammy · 2025-11-05T09:49:50Z

/AzurePipelines run

gathogojr · 2025-11-05T12:02:27Z

@WanjohiSammy Given our update strategy, it's better to use Update with rebase than Update with merge commit - that is usually selected by default, and what you appear to use often:

WanjohiSammy added 4 commits October 11, 2025 07:47

Trying vectorizing skip whitespaces

aba832a

Merge remote-tracking branch 'origin' into fix/vectorizing-skipwhites…

93ba346

…paces

replace SkipWhitespaces with vectorized one

d62586b

Also vectorized Asyncronous SkipWhitespace method

a5a5cbb

WanjohiSammy requested review from Copilot, gathogojr, habbes and xuzhg October 13, 2025 08:01

Copilot AI reviewed Oct 13, 2025

View reviewed changes

src/Microsoft.OData.Core/Json/JsonReader.cs Outdated Show resolved Hide resolved

src/Microsoft.OData.Core/Json/JsonReader.cs Outdated Show resolved Hide resolved

WanjohiSammy marked this pull request as ready for review October 13, 2025 08:03

Apply suggestions from code review

37eac0f

Co-authored-by: Copilot <[email protected]>

Add tests

b1c245a

WanjohiSammy requested a review from Copilot October 14, 2025 18:16

Merge branch 'fix/vectorizing-skipwhitespaces' of https://github.com/…

099c0e1

…OData/odata.net into fix/vectorizing-skipwhitespaces

Copilot AI reviewed Oct 14, 2025

View reviewed changes

nit

c1a27cc

WanjohiSammy force-pushed the fix/vectorizing-skipwhitespaces branch from b84defa to c1a27cc Compare October 14, 2025 18:34

habbes reviewed Oct 15, 2025

View reviewed changes

src/Microsoft.OData.Core/Json/JsonReader.cs Outdated Show resolved Hide resolved

nit

ff6610b

WanjohiSammy requested a review from habbes October 21, 2025 08:13

WanjohiSammy force-pushed the fix/vectorizing-skipwhitespaces branch from ff6610b to 1a462ab Compare October 21, 2025 09:23

WanjohiSammy added 3 commits October 22, 2025 23:47

Trying vectorizing skip whitespaces

cb4d233

replace SkipWhitespaces with vectorized one

0e5630c

Also vectorized Asyncronous SkipWhitespace method

4c0fb1d

WanjohiSammy and others added 4 commits October 22, 2025 23:47

Add tests

f2bf6fa

Apply suggestions from code review

2201b7c

Co-authored-by: Copilot <[email protected]>

nit

0c6a503

nit

b89446e

WanjohiSammy force-pushed the fix/vectorizing-skipwhitespaces branch from 1a462ab to b89446e Compare October 22, 2025 20:47

gathogojr reviewed Oct 23, 2025

View reviewed changes

WanjohiSammy added 4 commits October 23, 2025 11:12

Merge remote-tracking branch 'origin' into fix/vectorizing-skipwhites…

6e29690

…paces

using a simpler Vectorized

7d30360

resolve conflicts

38b18a7

add some tests

abf0efd

gathogojr approved these changes Nov 5, 2025

View reviewed changes

Merge branch 'main' into fix/vectorizing-skipwhitespaces

ad0b966

xuzhg approved these changes Nov 5, 2025

View reviewed changes

Optimize JSON Reader Whitespace Skipping with Vectorized Operations #3381

Are you sure you want to change the base?

Optimize JSON Reader Whitespace Skipping with Vectorized Operations #3381

Conversation

WanjohiSammy commented Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issues

Description

Before

After

Benchmark project

Checklist (Uncheck if it is not completed)

Additional work necessary

Repository notes

Uh oh!

WanjohiSammy commented Oct 13, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

WanjohiSammy commented Oct 14, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

WanjohiSammy commented Oct 14, 2025

Uh oh!

habbes Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

WanjohiSammy Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Here is the full code

Uh oh!

gathogojr Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

WanjohiSammy Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

habbes Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

WanjohiSammy Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

WanjohiSammy commented Oct 21, 2025

Uh oh!

gathogojr commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gathogojr commented Oct 23, 2025

Uh oh!

gathogojr Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

WanjohiSammy Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

gathogojr Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

WanjohiSammy Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

gathogojr commented Nov 5, 2025

Uh oh!

gathogojr left a comment

WanjohiSammy commented Oct 13, 2025 •

edited

Loading

WanjohiSammy Oct 15, 2025 •

edited

Loading

gathogojr Oct 23, 2025 •

edited

Loading

WanjohiSammy Oct 15, 2025 •

edited

Loading

gathogojr commented Oct 23, 2025 •

edited

Loading