Reorder statements to improve spatial locality #43

yy214123 · 2025-04-08T05:16:10Z

Since row-wise and column-wise checks are logically equivalent in terms of win probability, evaluating row-wise directions first is more favorable due to better spatial locality.

This reordering does not alter functional behavior but may lead to more efficient execution in practice.
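To make the effect of the ordering concrete, here is a minimal sketch, not the project's actual `check_win`. It assumes a 4×4 board in a row-major 1D array, a win length of 3, and an early return on the first winning segment. The names `find_winner`, `dir_t`, and `step` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N 4   /* board side, matching the 4x4 board discussed in this PR */
#define SEG 3 /* cells in a winning segment (assumed) */

typedef enum { DIR_ROW, DIR_COL, DIR_PRIMARY, DIR_SECONDARY } dir_t;

/* Per-direction step, indexed by dir_t: {row delta, column delta}. */
static const int step[4][2] = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};

/* Scan the directions in the given order and return the winning mark,
 * or ' ' if there is none. *checked receives the number of segments
 * examined before the scan stopped. */
char find_winner(const char board[N * N], const dir_t *order, size_t n_dirs,
                 size_t *checked)
{
    assert(board && order && checked);
    *checked = 0;
    for (size_t k = 0; k < n_dirs; k++) {
        int dr = step[order[k]][0], dc = step[order[k]][1];
        for (int r = 0; r < N; r++) {
            for (int c = 0; c < N; c++) {
                int end_r = r + dr * (SEG - 1), end_c = c + dc * (SEG - 1);
                if (end_r >= N || end_c < 0 || end_c >= N)
                    continue; /* segment would leave the board */
                ++*checked;
                char first = board[r * N + c];
                if (first == ' ')
                    continue; /* an empty cell cannot start a win */
                int same = 1;
                for (int i = 1; i < SEG && same; i++)
                    same = board[(r + dr * i) * N + (c + dc * i)] == first;
                if (same)
                    return first;
            }
        }
    }
    return ' ';
}
```

Under these assumptions, a board whose only win is `XXX` at the start of row 0 is resolved after 1 segment when rows are scanned first, and after 9 segments (8 column segments, then the first row segment) when columns are scanned first.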

visitorckw · 2025-04-08T05:39:08Z

Looks good.

However, I would appreciate more explanation on why prioritizing row-wise checks improves spatial locality.
Also, I'm curious if there's any observable performance gain or data showing how much faster this change might be.

yy214123 · 2025-04-08T05:59:36Z

I have not yet conducted a performance benchmark for this optimization. The improvement was motivated by insights from the following references:

https://hackmd.io/@sysprog/CSAPP-ch6

https://en.wikipedia.org/wiki/Row-_and_column-major_order

After printing the board indices being checked by check_win and check_line_segment_win, we can observe the access patterns:

---- Line type: COL ----
Checking board indices:
(0,0)[0] → (1,0)[4] → (2,0)[8]
(0,1)[1] → (1,1)[5] → (2,1)[9]
(0,2)[2] → (1,2)[6] → (2,2)[10]
(0,3)[3] → (1,3)[7] → (2,3)[11]
(1,0)[4] → (2,0)[8] → (3,0)[12]
(1,1)[5] → (2,1)[9] → (3,1)[13]
(1,2)[6] → (2,2)[10] → (3,2)[14]
(1,3)[7] → (2,3)[11] → (3,3)[15]

---- Line type: ROW ----
Checking board indices:
(0,0)[0] → (0,1)[1] → (0,2)[2]
(0,1)[1] → (0,2)[2] → (0,3)[3]
(1,0)[4] → (1,1)[5] → (1,2)[6]
(1,1)[5] → (1,2)[6] → (1,3)[7]
(2,0)[8] → (2,1)[9] → (2,2)[10]
(2,1)[9] → (2,2)[10] → (2,3)[11]
(3,0)[12] → (3,1)[13] → (3,2)[14]
(3,1)[13] → (3,2)[14] → (3,3)[15]

---- Line type: PRIMARY ----
Checking board indices:
(0,0)[0] → (1,1)[5] → (2,2)[10]
(0,1)[1] → (1,2)[6] → (2,3)[11]
(1,0)[4] → (2,1)[9] → (3,2)[14]
(1,1)[5] → (2,2)[10] → (3,3)[15]

---- Line type: SECONDARY ----
Checking board indices:
(0,2)[2] → (1,1)[5] → (2,0)[8]
(0,3)[3] → (1,2)[6] → (2,1)[9]
(1,2)[6] → (2,1)[9] → (3,0)[12]
(1,3)[7] → (2,2)[10] → (3,1)[13]

Since the chances of winning via ROW and COLUMN are the same, let's consider a case where the current board already contains a winning sequence like ROW XXX or ROW OOO. Even in this situation, the current implementation still performs a complete COLUMN-major traversal of the board before identifying a winner.

Given that the board is only 4×4, the performance loss from accessing non-contiguous memory might not be significant at this scale. However, the access pattern is still worth considering for potential optimization.
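The bracketed indices in the trace above follow from the row-major formula `row * N + col`. The sketch below reproduces the per-direction strides; the helper names `cell_index`, `segment_stride`, and `segment_span` are hypothetical and are not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

#define N 4   /* board side length, as in the 4x4 trace above */
#define SEG 3 /* cells per checked segment, as in the trace */

/* Flat index of cell (row, col) in a row-major 1D board. */
size_t cell_index(size_t row, size_t col)
{
    assert(row < N && col < N);
    return row * N + col;
}

/* Distance, in array elements, between consecutive cells of a segment
 * that advances by (d_row, d_col) per step. */
long segment_stride(int d_row, int d_col)
{
    return (long) d_row * N + d_col;
}

/* Distance, in array elements, between the first and last cell of a
 * segment of SEG cells. */
long segment_span(int d_row, int d_col)
{
    long s = segment_stride(d_row, d_col) * (SEG - 1);
    return s < 0 ? -s : s;
}
```

With N = 4, a ROW segment has stride 1 and touches 3 adjacent elements, while a COL segment has stride 4 and spans 8 elements; PRIMARY and SECONDARY have strides 5 and 3. On a 16-element board all of these fit in very little memory, which is consistent with the caveat above about the 4×4 scale.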

visitorckw · 2025-04-08T06:12:58Z

I have not yet conducted a performance benchmark for this optimization. The improvement was motivated by insights from the following references:

https://hackmd.io/@sysprog/CSAPP-ch6

https://en.wikipedia.org/wiki/Row-_and_column-major_order

Maybe consider including them in the commit message using the Link: tag?

After printing the board indices being checked by check_win and check_line_segment_win, we can observe the access patterns:

---- Line type: COL ----
Checking board indices:
(0,0)[0] → (1,0)[4] → (2,0)[8]
(0,1)[1] → (1,1)[5] → (2,1)[9]
(0,2)[2] → (1,2)[6] → (2,2)[10]
(0,3)[3] → (1,3)[7] → (2,3)[11]
(1,0)[4] → (2,0)[8] → (3,0)[12]
(1,1)[5] → (2,1)[9] → (3,1)[13]
(1,2)[6] → (2,2)[10] → (3,2)[14]
(1,3)[7] → (2,3)[11] → (3,3)[15]

---- Line type: ROW ----
Checking board indices:
(0,0)[0] → (0,1)[1] → (0,2)[2]
(0,1)[1] → (0,2)[2] → (0,3)[3]
(1,0)[4] → (1,1)[5] → (1,2)[6]
(1,1)[5] → (1,2)[6] → (1,3)[7]
(2,0)[8] → (2,1)[9] → (2,2)[10]
(2,1)[9] → (2,2)[10] → (2,3)[11]
(3,0)[12] → (3,1)[13] → (3,2)[14]
(3,1)[13] → (3,2)[14] → (3,3)[15]

---- Line type: PRIMARY ----
Checking board indices:
(0,0)[0] → (1,1)[5] → (2,2)[10]
(0,1)[1] → (1,2)[6] → (2,3)[11]
(1,0)[4] → (2,1)[9] → (3,2)[14]
(1,1)[5] → (2,2)[10] → (3,3)[15]

---- Line type: SECONDARY ----
Checking board indices:
(0,2)[2] → (1,1)[5] → (2,0)[8]
(0,3)[3] → (1,2)[6] → (2,1)[9]
(1,2)[6] → (2,1)[9] → (3,0)[12]
(1,3)[7] → (2,2)[10] → (3,1)[13]

Since the chances of winning via ROW and COLUMN are the same, let's consider a case where the current board already contains a winning sequence like ROW XXX or ROW OOO. Even in this situation, the current implementation still performs a complete COLUMN-major traversal of the board before identifying a winner.

Given that the board is only 4×4, the performance loss from accessing non-contiguous memory might not be significant at this scale. However, the access pattern is still worth considering for potential optimization.

I agree that, in theory, prioritizing row-wise checks can lead to better cache locality. However, the current commit message only states that spatial locality improves, without explaining WHY.

Maybe consider tweaking the commit message to clarify the relationship between row-wise access, memory layout, spatial locality, and potential performance benefits?

Since row-wise and column-wise checks are logically equivalent in terms of win probability, this change prioritizes row-wise checks to better align with row-major memory layout and improve cache locality.

In C, 2D arrays are stored in row-major order, meaning that accessing elements along a row (e.g., [0][0], [0][1], [0][2]) is more cache-friendly than accessing elements down a column (e.g., [0][0], [1][0], [2][0]). Reordering the checks to evaluate row directions first results in more sequential memory access patterns, which may lead to better spatial locality and improved performance.

This optimization does not alter functional behavior, but aligns better with how memory is physically accessed in most systems. While the current board size is only 4×4, the locality benefit of row-wise access patterns becomes more pronounced as the board size increases, making this change more valuable in scalable scenarios.

References supporting this optimization:
Link: https://hackmd.io/@sysprog/CSAPP-ch6
Link: https://en.wikipedia.org/wiki/Row-_and_column-major_order
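The row-major claim in the suggested message can be checked directly. The following is a small sketch using a hypothetical 4×4 `char` array and a hypothetical helper `element_offset`; the project itself stores the board in a 1D array, so this only illustrates the C layout rule.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 4 };

/* Offset, in elements, of board[row][col] from the start of the array.
 * The elements are char, so a byte offset is also an element offset. */
ptrdiff_t element_offset(const char board[ROWS][COLS], int row, int col)
{
    assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
    return (const char *) &board[row][col] - (const char *) board;
}
```

Moving one step along a row changes the offset by 1, while moving one step down a column changes it by `COLS`, which is the distinction the message describes.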

jserv · 2025-04-08T08:02:13Z

Instead of appending the references supporting this proposed change, show experimental evidence.

yy214123 · 2025-04-11T06:17:18Z

Experiment 1: Fairly Generated Test Data

In this experiment, each generated board guarantees exactly one winning condition, evenly covering all possible directions.

To eliminate bias from board position and access order, I applied the Fisher–Yates shuffle to randomly permute the entire board after inserting the winning condition. This ensures the spatial location of the win does not favor any particular access pattern and provides a fair baseline for performance comparison.
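For reference, a generic Fisher–Yates shuffle over a flat board can be sketched as follows. The generator used for the experiment is not shown in this thread, so the function name `shuffle_cells` and the use of `rand()` are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Shuffle cells[0..n-1] in place with the Fisher-Yates algorithm.
 * rand() % i has a slight modulo bias; acceptable for a sketch. */
void shuffle_cells(char *cells, size_t n)
{
    assert(cells != NULL || n == 0);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t) rand() % i; /* 0 <= j < i */
        char tmp = cells[i - 1];
        cells[i - 1] = cells[j];
        cells[j] = tmp;
    }
}
```

The shuffle only permutes cells, so the number of each mark on the board is unchanged; seed with `srand()` beforehand if different permutations are needed across runs.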

On a 10 × 10 board: [figure not preserved]

On a 50 × 50 board: [figure not preserved]

On a 100 × 100 board: [figure not preserved]

Since all potential winning patterns are included during data generation, programs using row-major storage incur additional memory access costs when scanning for column-based wins, and vice versa for column-major storage.

Experiment 2: Biased Sample Distribution (Favoring Rows)

On a 100 × 100 board, the test data was deliberately skewed to increase the proportion of row-based wins.

Under this biased condition, the row-major implementation clearly outperforms column-major, with a much wider performance gap.

Perf analysis summary:

Metric	column-major	row-major
Instructions	1,585,771,268,229	1,426,533,169,134
CPU cycles	427,169,360,881	390,890,530,455
IPC (insn/cycle)	3.71	3.65
Cache references	3,505,765,178	3,411,224,518
Cache misses	2,549,701,006	2,526,058,997
Cache miss rate	72.73%	74.05%
Time elapsed (s)	85.00	77.88

These results demonstrate a clear performance benefit for row-major storage when handling row-aligned winning conditions. Therefore, if the AI algorithm is designed to favor row-based placements, it can reduce the cost of victory checks and improve overall efficiency.

Experiment 3: Simulating Real Gameplay

The previous experiments assume "guaranteed wins," which is unrealistic in actual gameplay, where wins often occur only after many turns. For example:

 O | O | O | O | O 
---+---+---+---+---
 O |   |   | X |   
---+---+---+---+---
 O |   | X |   |   
---+---+---+---+---
 O |   | X |   | X 
---+---+---+---+---
 X |   |   | X |

In this scenario, scanning from the top-left may require multiple invalid checks before identifying a win at the bottom-left, which can be costly for column-major access.

To simulate this, I created two 5 × 5 boards that are transposes of each other:

One favors row-major
The other favors column-major

Both require several invalid checks before locating a winning pattern, representing a more realistic edge-case scenario.

 O | O |   |   |  
---+---+---+---+---
 O | O |   |   | 
---+---+---+---+---
   |   |   |   | O
---+---+---+---+---
 O | O |   | O | O
---+---+---+---+---
 O | O |   | O | O

 O | O |   | O | O
---+---+---+---+---
 O | O |   | O | O
---+---+---+---+---
   |   |   |   | 
---+---+---+---+---
   |   |   | O | O
---+---+---+---+---
   |   | O | O | O
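The transpose relationship between the two boards can be verified mechanically. This is a sketch with hypothetical names (`transpose_board`, `board_a`, `board_b`, `boards_are_transposes`); the cell contents are copied from the two diagrams above, with 'O' for a stone and ' ' for an empty cell.

```c
#include <assert.h>
#include <string.h>

#define N 5

/* Write the transpose of src into dst: dst[c][r] = src[r][c]. */
void transpose_board(const char src[N][N], char dst[N][N])
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            dst[c][r] = src[r][c];
}

/* First board shown above. */
const char board_a[N][N] = {
    {'O', 'O', ' ', ' ', ' '},
    {'O', 'O', ' ', ' ', ' '},
    {' ', ' ', ' ', ' ', 'O'},
    {'O', 'O', ' ', 'O', 'O'},
    {'O', 'O', ' ', 'O', 'O'},
};

/* Second board shown above. */
const char board_b[N][N] = {
    {'O', 'O', ' ', 'O', 'O'},
    {'O', 'O', ' ', 'O', 'O'},
    {' ', ' ', ' ', ' ', ' '},
    {' ', ' ', ' ', 'O', 'O'},
    {' ', ' ', 'O', 'O', 'O'},
};

/* Returns 1 if board_b is exactly the transpose of board_a. */
int boards_are_transposes(void)
{
    char t[N][N];
    transpose_board(board_a, t);
    return memcmp(t, board_b, sizeof t) == 0;
}
```

Because the boards are transposes, every row segment on one corresponds to a column segment on the other, so the two inputs exercise the same amount of logical work in opposite scan directions.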

After running 1,000,000 iterations under perf, the metrics were as follows:

Metric	column_major	row_major
Instructions	20,042,179,354	9,528,171,690
CPU cycles	4,974,042,956	3,064,415,787
IPC (insn/cycle)	4.03	3.11
Cache references	205,075	115,548
Cache misses	28,503	20,176
Cache miss rate	13.90%	17.46%
Time elapsed (s)	0.971	0.597

Even under near-realistic conditions, row-major consistently outperformed column-major across almost all metrics. Although its cache miss rate is higher, this is due to having nearly half the total cache references — the overall execution time is still significantly lower.

Summary & Outlook

Across all three experiments, row-major storage clearly benefits from better spatial locality, especially when the game favors horizontal wins or involves frequent win-checking logic.

As a result, future AI algorithms for board games could benefit from encouraging row-oriented placement strategies. This not only simplifies decision-making but also enhances performance, particularly on large boards or scenarios requiring frequent win evaluations.

visitorckw · 2025-04-11T14:22:48Z

Experiment 1: Fairly Generated Test Data

In this experiment, each generated board guarantees exactly one winning condition, evenly covering all possible directions.

To eliminate bias from board position and access order, I applied the Fisher–Yates shuffle to randomly permute the entire board after inserting the winning condition. This ensures the spatial location of the win does not favor any particular access pattern and provides a fair baseline for performance comparison.

[...]

Since all potential winning patterns are included during data generation, programs using row-major storage incur additional memory access costs when scanning for column-based wins — and vice versa for column-major storage.

I'm not sure if "column-major storage" here refers to actually storing the board in column-major order. If so, I'm unclear about the point of this experiment, since we're currently using a 1D array in row-major order and aren't planning to change that.

Experiment 2: Biased Sample Distribution (Favoring Rows)

On a 100 × 100 board, the test data was deliberately skewed to increase the proportion of row-based wins.

[...]

Under this biased condition, the row-major implementation clearly outperforms column-major, with a much wider performance gap.

Perf analysis summary:

Metric	column-major	row-major
Instructions	1,585,771,268,229	1,426,533,169,134
CPU cycles	427,169,360,881	390,890,530,455
IPC (insn/cycle)	3.71	3.65
Cache references	3,505,765,178	3,411,224,518
Cache misses	2,549,701,006	2,526,058,997
Cache miss rate	72.73%	74.05%
Time elapsed (s)	85.00	77.88

These results demonstrate a clear performance benefit for row-major storage when handling row-aligned winning conditions. Therefore, if the AI algorithm is designed to favor row-based placements, it can reduce the cost of victory checks and improve overall efficiency.

I'm also unsure about the purpose of this experiment. For fairness, shouldn't we also test column-based wins and prioritize scanning columns?

Experiment 3: Simulating Real Gameplay

The previous experiments assume "guaranteed wins," which is unrealistic in actual gameplay, where wins often occur only after many turns. For example:

[...]

After running 1,000K iterations with perf, the metrics were as follows:

Metric	column_major	row_major
Instructions	20,042,179,354	9,528,171,690
CPU cycles	4,974,042,956	3,064,415,787
IPC (insn/cycle)	4.03	3.11
Cache references	205,075	115,548
Cache misses	28,503	20,176
Cache miss rate	13.90%	17.46%
Time elapsed (s)	0.971	0.597

Even under near-realistic conditions, row-major consistently outperformed column-major across almost all metrics. Although its cache miss rate is higher, this is due to having nearly half the total cache references; the overall execution time is still significantly lower.

This experiment seems more reasonable, but I'm a bit confused - I expected spatial locality to help by reducing cache misses, but the results show more cache misses. The faster execution seems to come from fewer instructions instead.

Summary & Outlook

Across all three experiments, row-major storage clearly benefits from better spatial locality, especially when the game favors horizontal wins or involves frequent win-checking logic.

I didn't really see a clear benefit from the first experiment.

As a result, future AI algorithms for board games could benefit from encouraging row-oriented placement strategies. This not only simplifies decision-making but also enhances performance, particularly on large boards or scenarios requiring frequent win evaluations.

yy214123 · 2025-04-12T13:14:33Z

I'm not sure if "column-major storage" here refers to actually storing the board in column-major order. If so, I'm unclear about the point of this experiment, since we're currently using a 1D array in row-major order and aren't planning to change that.

You're right to point that out — I realize my previous wording may have caused some confusion. The data is still stored in a row-major 1D array; I didn’t change the underlying layout. The main goal of Experiment 1 was to generate a complete set of winning patterns on an n × n board, and then evaluate the performance of both row-wise and column-wise scanning strategies using this dataset.

I'm also unsure about the purpose of this experiment. For fairness, shouldn't we also test column-based wins and prioritize scanning columns?

You're right — for fairness, we should definitely include this experiment.
Here are the results for the column-biased case:

Perf analysis summary:

Metric	column-major	row-major
Instructions	1,430,153,246,436	1,585,481,506,424
CPU cycles	390,527,903,361	423,555,794,132
IPC (insn/cycle)	3.66	3.74
Cache references	3,416,572,300	3,496,775,391
Cache misses	2,515,143,524	2,549,693,611
Cache miss rate	73.62%	72.92%
Time elapsed (s)	77.70	84.30

This experiment seems more reasonable, but I'm a bit confused - I expected spatial locality to help by reducing cache misses, but the results show more cache misses. The faster execution seems to come from fewer instructions instead.

I just want to clarify that the cache miss rate is typically calculated as:
Cache Miss Rate = Cache Misses / Cache References
(e.g., 20,176 / 115,548 ≈ 17.46%)

So while the row-major version does have a higher miss rate percentage (17.46% vs. 13.90%), it actually had fewer total cache references and fewer absolute cache misses (20,176 vs. 28,503).
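The derived figures in the Experiment 3 table can be recomputed from the raw counters. The helper names `miss_rate_percent` and `ipc` below are hypothetical; the input numbers in the note that follows are the ones reported above.

```c
#include <assert.h>

/* Cache miss rate in percent: 100 * misses / references. */
double miss_rate_percent(long long misses, long long references)
{
    assert(references > 0);
    return 100.0 * (double) misses / (double) references;
}

/* Instructions retired per CPU cycle. */
double ipc(long long instructions, long long cycles)
{
    assert(cycles > 0);
    return (double) instructions / (double) cycles;
}
```

For example, `miss_rate_percent(20176, 115548)` is about 17.46 and `miss_rate_percent(28503, 205075)` is about 13.90, matching the table. The ratio is higher for row-major even though both its numerator and denominator are smaller.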

I didn't really see a clear benefit from the first experiment.

You're right — I misspoke earlier. The performance improvements are more clearly demonstrated in Experiment 2 (row-biased wins) and Experiment 3, not in Experiment 1.

yy214123 · 2025-04-16T13:06:15Z

@visitorckw — just wanted to follow up on this PR when you have time. Appreciate your thoughts!

yy214123 force-pushed the optimize-spatial-locality branch from ddf7e45 to c77acdb · April 8, 2025 06:31
