Reorder statements to improve spatial locality #43

Open: yy214123 wants to merge 1 commit into main from optimize-spatial-locality

yy214123
Contributor

@yy214123 yy214123 commented Apr 8, 2025

Since row-wise and column-wise checks are logically equivalent in terms of win probability, evaluating row-wise directions first is more favorable due to better spatial locality.

This reordering does not alter functional behavior but may lead to more efficient execution in practice.

@visitorckw
Collaborator

Looks good.

However, I would appreciate more explanation on why prioritizing row-wise checks improves spatial locality.
Also, I'm curious if there's any observable performance gain or data showing how much faster this change might be.

@yy214123
Contributor Author

yy214123 commented Apr 8, 2025

I have not yet conducted a performance benchmark for this optimization. The improvement was motivated by insights from the following references:

After printing the board indices being checked by check_win and check_line_segment_win, we can observe the access patterns:

---- Line type: COL ----
Checking board indices:
(0,0)[0] → (1,0)[4] → (2,0)[8]
(0,1)[1] → (1,1)[5] → (2,1)[9]
(0,2)[2] → (1,2)[6] → (2,2)[10]
(0,3)[3] → (1,3)[7] → (2,3)[11]
(1,0)[4] → (2,0)[8] → (3,0)[12]
(1,1)[5] → (2,1)[9] → (3,1)[13]
(1,2)[6] → (2,2)[10] → (3,2)[14]
(1,3)[7] → (2,3)[11] → (3,3)[15]

---- Line type: ROW ----
Checking board indices:
(0,0)[0] → (0,1)[1] → (0,2)[2]
(0,1)[1] → (0,2)[2] → (0,3)[3]
(1,0)[4] → (1,1)[5] → (1,2)[6]
(1,1)[5] → (1,2)[6] → (1,3)[7]
(2,0)[8] → (2,1)[9] → (2,2)[10]
(2,1)[9] → (2,2)[10] → (2,3)[11]
(3,0)[12] → (3,1)[13] → (3,2)[14]
(3,1)[13] → (3,2)[14] → (3,3)[15]

---- Line type: PRIMARY ----
Checking board indices:
(0,0)[0] → (1,1)[5] → (2,2)[10]
(0,1)[1] → (1,2)[6] → (2,3)[11]
(1,0)[4] → (2,1)[9] → (3,2)[14]
(1,1)[5] → (2,2)[10] → (3,3)[15]

---- Line type: SECONDARY ----
Checking board indices:
(0,2)[2] → (1,1)[5] → (2,0)[8]
(0,3)[3] → (1,2)[6] → (2,1)[9]
(1,2)[6] → (2,1)[9] → (3,0)[12]
(1,3)[7] → (2,2)[10] → (3,1)[13]

Since the chances of winning via ROW and COLUMN are the same, let's consider a case where the current board already contains a winning sequence like ROW XXX or ROW OOO. Even in this situation, the current implementation still performs a complete COLUMN-major traversal of the board before identifying a winner.

Given that the board is only 4×4, the performance loss from accessing non-contiguous memory might not be significant at this scale. However, the access pattern is still worth considering for potential optimization.
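The effect of check order on the number of segments examined before a win is found can be sketched as follows. This is a minimal illustration, not the project's `check_win`: the names `dir_t`, `segment_wins`, `segments_checked`, `COL_FIRST`, and `ROW_FIRST` are hypothetical, and the board is assumed to be a row-major `char` array with `' '` for empty cells, side length 4, and a winning length of 3, matching the printed patterns.

```c
#include <assert.h>
#include <stddef.h>

#define N 4     /* board side length, as in the 4x4 board above */
#define GOAL 3  /* cells in a winning line, as in the printed patterns */

/* Hypothetical direction type: row and column step per cell. */
typedef struct { int di, dj; } dir_t;

/* Two scan orders: COL, ROW, PRIMARY, SECONDARY versus ROW first. */
const dir_t COL_FIRST[4] = {{1, 0}, {0, 1}, {1, 1}, {1, -1}};
const dir_t ROW_FIRST[4] = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};

/* Return 1 if GOAL equal, non-empty cells start at (i, j) and run along d. */
int segment_wins(const char *board, int i, int j, dir_t d)
{
    char first = board[i * N + j];
    if (first == ' ')
        return 0;
    for (int k = 1; k < GOAL; k++)
        if (board[(i + k * d.di) * N + (j + k * d.dj)] != first)
            return 0;
    return 1;
}

/* Scan the directions in the given order and stop at the first win.
 * Returns the number of segments examined, including the winning one,
 * or the total number of segments if there is no win. */
int segments_checked(const char *board, const dir_t *order, size_t n_dirs)
{
    assert(board != NULL && order != NULL);
    int checked = 0;
    for (size_t d = 0; d < n_dirs; d++) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int ei = i + (GOAL - 1) * order[d].di;
                int ej = j + (GOAL - 1) * order[d].dj;
                if (ei < 0 || ei >= N || ej < 0 || ej >= N)
                    continue; /* segment would leave the board */
                checked++;
                if (segment_wins(board, i, j, order[d]))
                    return checked;
            }
        }
    }
    return checked;
}
```

For a board whose only win is `XXX` at the start of the bottom row, `segments_checked(board, COL_FIRST, 4)` examines 15 segments and `segments_checked(board, ROW_FIRST, 4)` examines 7 under these assumptions.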

@visitorckw
Collaborator

I have not yet conducted a performance benchmark for this optimization. The improvement was motivated by insights from the following references:

Maybe consider including them in the commit message using the Link: tag?

After printing the board indices being checked by check_win and check_line_segment_win, we can observe the access patterns:

---- Line type: COL ----
Checking board indices:
(0,0)[0] → (1,0)[4] → (2,0)[8]
(0,1)[1] → (1,1)[5] → (2,1)[9]
(0,2)[2] → (1,2)[6] → (2,2)[10]
(0,3)[3] → (1,3)[7] → (2,3)[11]
(1,0)[4] → (2,0)[8] → (3,0)[12]
(1,1)[5] → (2,1)[9] → (3,1)[13]
(1,2)[6] → (2,2)[10] → (3,2)[14]
(1,3)[7] → (2,3)[11] → (3,3)[15]

---- Line type: ROW ----
Checking board indices:
(0,0)[0] → (0,1)[1] → (0,2)[2]
(0,1)[1] → (0,2)[2] → (0,3)[3]
(1,0)[4] → (1,1)[5] → (1,2)[6]
(1,1)[5] → (1,2)[6] → (1,3)[7]
(2,0)[8] → (2,1)[9] → (2,2)[10]
(2,1)[9] → (2,2)[10] → (2,3)[11]
(3,0)[12] → (3,1)[13] → (3,2)[14]
(3,1)[13] → (3,2)[14] → (3,3)[15]

---- Line type: PRIMARY ----
Checking board indices:
(0,0)[0] → (1,1)[5] → (2,2)[10]
(0,1)[1] → (1,2)[6] → (2,3)[11]
(1,0)[4] → (2,1)[9] → (3,2)[14]
(1,1)[5] → (2,2)[10] → (3,3)[15]

---- Line type: SECONDARY ----
Checking board indices:
(0,2)[2] → (1,1)[5] → (2,0)[8]
(0,3)[3] → (1,2)[6] → (2,1)[9]
(1,2)[6] → (2,1)[9] → (3,0)[12]
(1,3)[7] → (2,2)[10] → (3,1)[13]

Since the chances of winning via ROW and COLUMN are the same, let's consider a case where the current board already contains a winning sequence like ROW XXX or ROW OOO. Even in this situation, the current implementation still performs a complete COLUMN-major traversal of the board before identifying a winner.

Given that the board is only 4×4, the performance loss from accessing non-contiguous memory might not be significant at this scale. However, the access pattern is still worth considering for potential optimization.

I agree that, in theory, prioritizing row-wise checks can lead to better cache locality. However, the current commit message only states that spatial locality improves, without explaining WHY.

Maybe consider tweaking the commit message to clarify the relationship between row-wise access, memory layout, spatial locality, and potential performance benefits?

Since row-wise and column-wise checks are logically equivalent in
terms of win probability, this change prioritizes row-wise checks
to better align with row-major memory layout and improve cache
locality.

In C, 2D arrays are stored in row-major order, so consecutive
elements of a row (e.g., [0][0], [0][1], [0][2]) are adjacent in
memory, while consecutive elements of a column (e.g., [0][0],
[1][0], [2][0]) are a full row apart. Reordering the checks to
evaluate row directions first results in more sequential memory
access patterns, which may lead to better spatial locality and
improved performance.

This optimization does not alter functional behavior, but aligns
better with how memory is physically accessed in most systems.

While the current board size is only 4×4, the locality benefit of
row-wise access patterns becomes more pronounced as the board size
increases, making this change more valuable in scalable scenarios.

References supporting this optimization:
Link: https://hackmd.io/@sysprog/CSAPP-ch6
Link: https://en.wikipedia.org/wiki/Row-_and_column-major_order
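
The row-major layout described in the proposed commit message can be illustrated with a short sketch. The helper names `rm_index` and `line_stride` are hypothetical and not part of the project. The sketch only shows how far apart, in elements, consecutive cells of each line type sit in a row-major 1D array.

```c
#include <assert.h>
#include <stddef.h>

/* Map (row, col) on an n x n board to its offset in a row-major 1D array. */
size_t rm_index(size_t row, size_t col, size_t n)
{
    assert(row < n && col < n);
    return row * n + col;
}

/* Distance, in elements, between consecutive cells of a line that
 * advances by (d_row, d_col) per step on an n x n row-major board. */
ptrdiff_t line_stride(int d_row, int d_col, size_t n)
{
    return (ptrdiff_t)d_row * (ptrdiff_t)n + d_col;
}
```

On a 4×4 board this gives a stride of 1 for rows, 4 for columns, 5 for the primary diagonal, and 3 for the secondary diagonal, which matches the index sequences printed earlier in the thread. On a 100×100 board the column stride grows to 100 elements. Assuming one-byte cells and 64-byte cache lines (both assumptions, not stated in the thread), each column step would then land on a different cache line.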

@yy214123 force-pushed the optimize-spatial-locality branch from ddf7e45 to c77acdb on April 8, 2025 06:31

@jserv
Owner

jserv commented Apr 8, 2025

Instead of appending the references supporting this proposed change, show experimental evidence.

@yy214123
Contributor Author

yy214123 commented Apr 11, 2025

Experiment 1: Fairly Generated Test Data

In this experiment, each generated board guarantees exactly one winning condition, evenly covering all possible directions.

To eliminate bias from board position and access order, I applied the Fisher–Yates shuffle to randomly permute the entire board after inserting the winning condition. This ensures the spatial location of the win does not favor any particular access pattern and provides a fair baseline for performance comparison.
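
For reference, a Fisher–Yates shuffle over an array of cells can be sketched as below. This is not the code used in the experiment: the function name `shuffle_cells` is hypothetical, the description does not say exactly which elements were permuted, and `rand()` with a modulo is used here only for brevity (it has a slight bias that a careful benchmark would want to avoid).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Shuffle n cells in place with the Fisher-Yates algorithm.
 * Walks from the last position down, swapping each position with a
 * randomly chosen position at or before it. */
void shuffle_cells(char *cells, size_t n)
{
    assert(cells != NULL || n == 0);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i; /* 0 <= j < i, slight modulo bias */
        char tmp = cells[i - 1];
        cells[i - 1] = cells[j];
        cells[j] = tmp;
    }
}
```

The shuffle only reorders cells, so the number of `X`, `O`, and empty cells on the board is unchanged. Seeding with `srand()` before the call makes a run reproducible.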

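On a 10 × 10 board:
[Figure: benchmark plot (image not preserved)]

On a 50 × 50 board:
[Figure: benchmark plot (image not preserved)]

On a 100 × 100 board:
[Figure: benchmark plot (image not preserved)]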

Since all potential winning patterns are included during data generation, programs using row-major storage incur additional memory access costs when scanning for column-based wins, and vice versa for column-major storage.

Experiment 2: Biased Sample Distribution (Favoring Rows)

On a 100 × 100 board, the test data was deliberately skewed to increase the proportion of row-based wins.

[Figure: benchmark plot (image not preserved)]

Under this biased condition, the row-major implementation clearly outperforms column-major, with a much wider performance gap.

Perf analysis summary:

| Metric | column-major | row-major |
|---|---|---|
| Instructions | 1,585,771,268,229 | 1,426,533,169,134 |
| CPU cycles | 427,169,360,881 | 390,890,530,455 |
| IPC (insn/cycle) | 3.71 | 3.65 |
| Cache references | 3,505,765,178 | 3,411,224,518 |
| Cache misses | 2,549,701,006 | 2,526,058,997 |
| Cache miss rate | 72.73% | 74.05% |
| Time elapsed (s) | 85.00 | 77.88 |

These results demonstrate a clear performance benefit for row-major storage when handling row-aligned winning conditions. Therefore, if the AI algorithm is designed to favor row-based placements, it can reduce the cost of victory checks and improve overall efficiency.

Experiment 3: Simulating Real Gameplay

The previous experiments assume "guaranteed wins," which is unrealistic in actual gameplay, where wins often occur only after many turns. For example:

 O | O | O | O | O 
---+---+---+---+---
 O |   |   | X |   
---+---+---+---+---
 O |   | X |   |   
---+---+---+---+---
 O |   | X |   | X 
---+---+---+---+---
 X |   |   | X |   

In this scenario, scanning from the top-left may require multiple invalid checks before identifying a win at the bottom-left, which can be costly for column-major access.

To simulate this, I created two 5 × 5 boards that are transposes of each other:

  • One favors row-major
  • The other favors column-major

Both require several invalid checks before locating a winning pattern, representing a more realistic edge-case scenario.

 O | O |   |   |   
---+---+---+---+---
 O | O |   |   |   
---+---+---+---+---
   |   |   |   | O 
---+---+---+---+---
 O | O |   | O | O 
---+---+---+---+---
 O | O |   | O | O 

 O | O |   | O | O 
---+---+---+---+---
 O | O |   | O | O 
---+---+---+---+---
   |   |   |   |   
---+---+---+---+---
   |   |   | O | O 
---+---+---+---+---
   |   | O | O | O 
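
The relationship between the two boards can be checked with a small transpose helper. This is a sketch under the assumption that each board is a row-major `char` array with `' '` for empty cells. The name `transpose_board` is hypothetical and does not come from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Write the transpose of the n x n row-major board src into dst.
 * src and dst must be distinct, non-overlapping buffers of n * n cells. */
void transpose_board(const char *src, char *dst, size_t n)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j]; /* cell (i, j) moves to (j, i) */
}
```

Applied to the first 5 × 5 board above, this produces the second one: every column of the first board becomes the corresponding row of the second, so a vertical line in one is a horizontal line in the other.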

After running 1,000,000 iterations under perf, the metrics were as follows:

| Metric | column_major | row_major |
|---|---|---|
| Instructions | 20,042,179,354 | 9,528,171,690 |
| CPU cycles | 4,974,042,956 | 3,064,415,787 |
| IPC (insn/cycle) | 4.03 | 3.11 |
| Cache references | 205,075 | 115,548 |
| Cache misses | 28,503 | 20,176 |
| Cache miss rate | 13.90% | 17.46% |
| Time elapsed (s) | 0.971 | 0.597 |

Even under near-realistic conditions, row-major consistently outperformed column-major across almost all metrics. Although its cache miss rate is higher, this is due to having nearly half the total cache references — the overall execution time is still significantly lower.

Summary & Outlook

Across all three experiments, row-major storage clearly benefits from better spatial locality, especially when the game favors horizontal wins or involves frequent win-checking logic.

As a result, future AI algorithms for board games could benefit from encouraging row-oriented placement strategies. This not only simplifies decision-making but also enhances performance, particularly on large boards or scenarios requiring frequent win evaluations.

@visitorckw
Collaborator

Experiment 1: Fairly Generated Test Data

In this experiment, each generated board guarantees exactly one winning condition, evenly covering all possible directions.

To eliminate bias from board position and access order, I applied the Fisher–Yates shuffle to randomly permute the entire board after inserting the winning condition. This ensures the spatial location of the win does not favor any particular access pattern and provides a fair baseline for performance comparison.

[...]

Since all potential winning patterns are included during data generation, programs using row-major storage incur additional memory access costs when scanning for column-based wins — and vice versa for column-major storage.

I'm not sure if "column-major storage" here refers to actually storing the board in column-major order. If so, I'm unclear about the point of this experiment, since we're currently using a 1D array in row-major order and aren't planning to change that.

Experiment 2: Biased Sample Distribution (Favoring Rows)

On a 100 × 100 board, the test data was deliberately skewed to increase the proportion of row-based wins.

[...]

Under this biased condition, the row-major implementation clearly outperforms column-major, with a much wider performance gap.

Perf analysis summary:

| Metric | column-major | row-major |
|---|---|---|
| Instructions | 1,585,771,268,229 | 1,426,533,169,134 |
| CPU cycles | 427,169,360,881 | 390,890,530,455 |
| IPC (insn/cycle) | 3.71 | 3.65 |
| Cache references | 3,505,765,178 | 3,411,224,518 |
| Cache misses | 2,549,701,006 | 2,526,058,997 |
| Cache miss rate | 72.73% | 74.05% |
| Time elapsed (s) | 85.00 | 77.88 |

These results demonstrate a clear performance benefit for row-major storage when handling row-aligned winning conditions. Therefore, if the AI algorithm is designed to favor row-based placements, it can reduce the cost of victory checks and improve overall efficiency.

I'm also unsure about the purpose of this experiment. For fairness, shouldn't we also test column-based wins and prioritize scanning columns?

Experiment 3: Simulating Real Gameplay

The previous experiments assume "guaranteed wins," which is unrealistic in actual gameplay, where wins often occur only after many turns. For example:

[...]

After running 1,000,000 iterations under perf, the metrics were as follows:

| Metric | column_major | row_major |
|---|---|---|
| Instructions | 20,042,179,354 | 9,528,171,690 |
| CPU cycles | 4,974,042,956 | 3,064,415,787 |
| IPC (insn/cycle) | 4.03 | 3.11 |
| Cache references | 205,075 | 115,548 |
| Cache misses | 28,503 | 20,176 |
| Cache miss rate | 13.90% | 17.46% |
| Time elapsed (s) | 0.971 | 0.597 |

Even under near-realistic conditions, row-major consistently outperformed column-major across almost all metrics. Although its cache miss rate is higher, this is due to having nearly half the total cache references — the overall execution time is still significantly lower.

This experiment seems more reasonable, but I'm a bit confused - I expected spatial locality to help by reducing cache misses, but the results show more cache misses. The faster execution seems to come from fewer instructions instead.

Summary & Outlook

Across all three experiments, row-major storage clearly benefits from better spatial locality, especially when the game favors horizontal wins or involves frequent win-checking logic.

I didn't really see a clear benefit from the first experiment.

As a result, future AI algorithms for board games could benefit from encouraging row-oriented placement strategies. This not only simplifies decision-making but also enhances performance, particularly on large boards or scenarios requiring frequent win evaluations.

@yy214123
Contributor Author

I'm not sure if "column-major storage" here refers to actually storing the board in column-major order. If so, I'm unclear about the point of this experiment, since we're currently using a 1D array in row-major order and aren't planning to change that.

You're right to point that out — I realize my previous wording may have caused some confusion. The data is still stored in a row-major 1D array; I didn’t change the underlying layout. The main goal of Experiment 1 was to generate a complete set of winning patterns on an n × n board, and then evaluate the performance of both row-wise and column-wise scanning strategies using this dataset.


I'm also unsure about the purpose of this experiment. For fairness, shouldn't we also test column-based wins and prioritize scanning columns?

You're right — for fairness, we should definitely include this experiment.
Here are the results for the column-biased case:
[Figure: benchmark_results (image not preserved)]

Perf analysis summary:

| Metric | column-major | row-major |
|---|---|---|
| Instructions | 1,430,153,246,436 | 1,585,481,506,424 |
| CPU cycles | 390,527,903,361 | 423,555,794,132 |
| IPC (insn/cycle) | 3.66 | 3.74 |
| Cache references | 3,416,572,300 | 3,496,775,391 |
| Cache misses | 2,515,143,524 | 2,549,693,611 |
| Cache miss rate | 73.62% | 72.92% |
| Time elapsed (s) | 77.70 | 84.30 |

This experiment seems more reasonable, but I'm a bit confused - I expected spatial locality to help by reducing cache misses, but the results show more cache misses. The faster execution seems to come from fewer instructions instead.

I just want to clarify that the cache miss rate is typically calculated as:
Cache Miss Rate = Cache Misses / Cache References
(e.g., 20,176 / 115,548 ≈ 17.46%)

So while the row-major version does have a higher miss rate percentage (17.46% vs. 13.90%), it actually had fewer total cache references and fewer absolute cache misses (20,176 vs. 28,503).
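
The derived rows in the perf tables can be recomputed from the raw counters with a sketch like the one below. The helper names `miss_rate_pct` and `ipc` are hypothetical and used only for illustration.

```c
#include <assert.h>

/* Cache miss rate in percent: misses / references * 100. */
double miss_rate_pct(unsigned long long misses, unsigned long long refs)
{
    assert(refs > 0);
    return 100.0 * (double)misses / (double)refs;
}

/* Instructions retired per CPU cycle. */
double ipc(unsigned long long insns, unsigned long long cycles)
{
    assert(cycles > 0);
    return (double)insns / (double)cycles;
}
```

With the Experiment 3 counters, `miss_rate_pct(20176, 115548)` is about 17.46 and `miss_rate_pct(28503, 205075)` is about 13.90, which matches the reported rates. Because the rate is a ratio, it can rise even when both the misses and the references fall.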


I didn't really see a clear benefit from the first experiment.

You're right — I misspoke earlier. The performance improvements are more clearly demonstrated in Experiment 2 (row-biased wins) and Experiment 3, not in Experiment 1.

@yy214123
Contributor Author

@visitorckw — just wanted to follow up on this PR when you have time. Appreciate your thoughts!
