Chapter 3 edits #70

dankamongmen · 2024-09-09T21:14:53Z

Fewer than you might expect for so much text, but I do have a lot of notes:

weird to mention both AVX2 and AVX512 but not SVE and SVE2 imho

the ARM "ISA" topic is complex. i'd consider "AArch64" to be the
ISA, not Armv8-A (which i'd call an architecture). technically
AArch64 is an "execution environment", though. the reason i
bring this up is because SME was an Armv9 deal, whereas you're
talking about Armv8.

from the horse's mouth, though:

"This guide introduces the A64 instruction set, used in the 64-bit Armv8-A architecture, also known as AArch64."

so i guess Armv8-A works there.

i didn't like "the longer the pipeline, the more effective bypassing becomes."
i would change it to "the more stages bypassed...", since if the pipeline becomes
longer in the e.g. decoding area, bypassing isn't going to become any more effective.
but this seemed pedantic, so i left it alone, silently cursing.

in the WAR example, shouldn't the renaming be

R101 = R1
R103 = R0

i'm not following what you're doing here (i mean, i know how register
renaming works; i'm just not following your notation).

the WAW example is also weird. assuming the code is correct as written,
the first value written to R2 is a dead value; a good compiler ought not have
generated this code (it could of course have been handwritten).

i agree with not mentioning branch-delay slots; they would just complicate things
(and are rarely used these days)

poor itanium!

reservation stations were present in the original Tomasulo algorithm!
see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
where they are discussed at length.

SIMD: technically you eliminate more than 3/4 of the instructions, because
you have fewer loop calculations

don't love the emphasis on pipeline width in SMT discussion. you can interleave
two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.

3.6.1.6: in addition to competing for memory bandwidth, prefetched data could
evict demand data from cache

regarding "64-bit" vs "64 bit": the difference is subtle. "64-bit" is correct when
it's used as an appositive. "64 bit" is correct when used as a pure adjective.
(you use "64 bit" correctly later in this passage). i'd just trust me on this.

3.8.3: removed "1.25 or 2MB" as this duplicates exactly a sentence from a
previous section (and isn't really relevant here)

3.8.4: likewise, the "there is a hardware page walker" duplicates an earlier sentence. removed.

You use "frontend" in the exercises, but "front-end" in other places. I have changed
them all to "frontend"/"backend".

chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md

dendibakh · 2024-09-13T20:44:52Z

weird to mention both AVX2 and AVX512 but not SVE and SVE2 IMHO

Which exact place are you talking about?

in the WAR example, shouldn't the renaming be
R101 = R1
R103 = R0

i'm not following what you're doing here (i mean, i know how register renaming works; i'm just not following your notation).

I will improve the explanation. I just gave the R0, R1, and R2 registers new names (R100, R101, etc.).

the WAW example is also weird. assuming the code is correct as written,
the first value written to R2 is a dead value; a good compiler ought not have
generated this code (it could of course have been handwritten).

Yes, I was thinking about it. And I guess I need to change it to the following:

R1 = R0 ADD 1
R2 = R1 ADD R3
R1 = R0 MUL 3   ; WAW and WAR

I was trying to make a minimal example, but the example I came up with is not a meaningful one.
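
To make the renaming notation concrete, here is a minimal Python sketch that renames the three-instruction example above. The `rename` helper, the tuple encoding of instructions, and the choice to number physical registers from R100 are assumptions made for illustration; they are not the book's notation or any real hardware scheme.

```python
from itertools import count

def rename(program, first_phys=100):
    """Give every register write a fresh physical register.

    `program` is a list of (dest, op, src1, src2) tuples; sources may be
    register names ("R0") or integer immediates.
    """
    fresh = count(first_phys)   # hypothetical pool of free physical registers
    mapping = {}                # architectural name -> current physical name

    def current(src):
        if not isinstance(src, str):
            return src          # immediates are left untouched
        # A register that is read before any write gets bound on first use.
        if src not in mapping:
            mapping[src] = f"R{next(fresh)}"
        return mapping[src]

    renamed = []
    for dest, op, a, b in program:
        # Sources are looked up BEFORE the destination is remapped.
        a, b = current(a), current(b)
        mapping[dest] = f"R{next(fresh)}"
        renamed.append((mapping[dest], op, a, b))
    return renamed

program = [
    ("R1", "ADD", "R0", 1),
    ("R2", "ADD", "R1", "R3"),
    ("R1", "MUL", "R0", 3),     # WAW with instruction 1, WAR with instruction 2
]
renamed = rename(program)
for dest, op, a, b in renamed:
    print(f"{dest} = {a} {op} {b}")
```

In this sketch the first and third instructions end up writing different physical registers, and the second instruction reads the register written by the first, which the third no longer overwrites. Only the true (RAW) dependency between the first two instructions remains.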

reservation stations were present in the original Tomasulo algorithm!
see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
where they are discussed at length.

Ok, good. I don't think there is anything I should change in the text, is there?

SIMD: technically you eliminate more than 3/4 of the instructions, because
you have fewer loop calculations

Which exact place are you talking about?

don't love the emphasis on pipeline width in SMT discussion. you can interleave
two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.

Please elaborate.

3.6.1.6: in addition to competing for memory bandwidth, prefetched data could
evict demand data from cache

Please elaborate. I talk more about prefetching in Section 8.5 (I have a reference as well).

You use "frontend" in the exercises, but "front-end" in other places. I have changed
them all to "frontend"/"backend".

Yes, thanks.

dendibakh · 2024-09-13T20:49:47Z

Approved and ready to merge, but @dankamongmen, please reply to my comments above.

dankamongmen · 2024-09-13T21:15:54Z

weird to mention both AVX2 and AVX512 but not SVE and SVE2 imho

this was in reference to 3.1's "These include enhanced vector processing instructions (e.g., Intel
AVX2, AVX512, ARM SVE, RISC-V “V” vector extension) and matrix/tensor instructions (Intel AMX, ARM
SME)." just seems odd to mention AVX2 and AVX512, but not SVE and SVE2 in this sentence. [shrug]

the ARM "ISA" topic is complex. i'd consider "AArch64" to be the
ISA, not Armv8-A (which i'd call an architecture). technically
AArch64 is an "execution environment", though. the reason i
bring this up is because SME was an Armv9 deal, whereas you're
talking about Armv8.

Refers to 3.1's "Intel x86-64, ARM v8 and RISC-V are examples of..."

from the horse's mouth, though:
"This guide introduces the A64 instruction set, used in the 64-bit Armv8-A architecture, also known as AArch64."

this is taken from the ARM Reference Manual

i didn't like "the longer the pipeline, the more effective bypassing becomes."
i would change it to "the more stages bypassed...", since if the pipeline becomes
longer in the e.g. decoding area, bypassing isn't going to become any more effective.
but this seemed pedantic, so i left it alone, silently cursing.

3.3 p30 "Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing
becomes."
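
The "more stages bypassed" wording can be illustrated with a small Python model. This is a sketch under simplifying assumptions: single-cycle operations, a consumer that immediately follows its producer, and a register file that cannot be read in the same cycle it is written. The stage names and the `stall_cycles` and `cycles_saved` helpers are hypothetical.

```python
def stall_cycles(read_stage, writeback_stage, bypass):
    """Stall cycles between a producer and the instruction right behind it.

    Stages are 0-based indices into the pipeline. Without bypassing, the
    consumer may read the register file only in the cycle after the
    producer's writeback. With bypassing, the result is forwarded straight
    to the consumer's execute stage, so no stall is needed in this model.
    """
    if bypass:
        return 0
    return writeback_stage - read_stage

def cycles_saved(stages):
    read = stages.index("RD")   # register read
    wb = stages.index("WB")     # writeback
    return stall_cycles(read, wb, bypass=False) - stall_cycles(read, wb, bypass=True)

classic = ["IF", "RD", "EX", "MEM", "WB"]
# Extra decode stages sit in front of register read.
longer_decode = ["IF", "D1", "D2", "D3", "RD", "EX", "MEM", "WB"]
# Extra stages sit between register read and writeback.
longer_backend = ["IF", "RD", "EX", "M1", "M2", "M3", "WB"]

for name, stages in [("classic", classic),
                     ("longer decode", longer_decode),
                     ("longer backend", longer_backend)]:
    print(f"{name}: {len(stages)} stages, bypassing saves {cycles_saved(stages)} cycles")
```

In this model the saving depends only on the distance between register read and writeback. Lengthening the decode portion leaves it unchanged, while lengthening the portion that bypassing skips over increases it.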

in the WAR example, shouldn't the renaming be
R101 = R1
R103 = R0

same page

the WAW example is also weird. assuming the code is correct as written,
the first value written to R2 is a dead value; a good compiler ought not have
generated this code (it could of course have been handwritten).

same page

reservation stations were present in the original Tomasulo algorithm!
see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
where they are discussed at length.

3.3 p32 "Modern processors implement dynamic scheduling techniques that are derivatives of Tomasulo’s original algorithm and include the Reorder Buffer (ROB) and the Reservation Station (RS)." the way it's written makes it sound IMHO like ROB/RS came after Tomasulo, but RS was very much in his design

SIMD: technically you eliminate more than 3/4 of the instructions, because you have fewer loop calculations

3.4 p35 "This leads to issuing 4x fewer instructions and can potentially gain a 4x speedup over four scalar computations." since you say "issue", i assume you mean dynamic and not static instructions, in which case we probably get a small win from less loop overhead with SIMD.

don't love the emphasis on pipeline width in SMT discussion. you can interleave two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.

3.5.2 p37 "An example of execution on a non-SMT and a 2-way SMT (SMT2) processor is shown in Figure 3.6. In both cases, the width of the processor pipeline is four, and each slot represents an opportunity to issue a new instruction."

first off, it's weird to say "pipeline width". i always hear length, with width referring to degree of superscalar. either way, i don't think pipeline properties are that fundamental to SMT. simply hiding latency due to loads -- essentially, exposing a second stream of instructions to the OOO engine, one unlikely to have dependencies on the first -- can be a big win.

3.6.1.6: in addition to competing for memory bandwidth, prefetched data could evict demand data from cache

3.6.1.6 "Prefetch techniques need to balance between demand and prefetch requests to guard against prefetch traffic slowing down demand traffic."

i'm just pointing out that it can have negative cache effects, too
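
The cache effect mentioned here can be shown with a toy simulation. The sketch below assumes a tiny fully associative LRU cache and a prefetcher that only ever fetches lines the program never uses. Both are deliberate exaggerations, and the `LRUCache` class and `demand_misses` helper are hypothetical names used for illustration.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def _insert(self, addr):
        self.lines[addr] = True
        self.lines.move_to_end(addr)            # mark as most recently used
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)      # evict least recently used line

    def access(self, addr):
        """Demand access; returns True on a hit."""
        hit = addr in self.lines
        self._insert(addr)
        return hit

    def prefetch(self, addr):
        """Bring a line in without a demand request."""
        self._insert(addr)

def demand_misses(trace, capacity, useless_prefetches_per_access):
    cache = LRUCache(capacity)
    misses = 0
    junk = 1_000_000   # hypothetical addresses the program never touches
    for addr in trace:
        if not cache.access(addr):
            misses += 1
        for _ in range(useless_prefetches_per_access):
            cache.prefetch(junk)
            junk += 1
    return misses

trace = [0, 1, 2, 3] * 10    # working set of 4 lines, reused 10 times
clean = demand_misses(trace, capacity=4, useless_prefetches_per_access=0)
polluted = demand_misses(trace, capacity=4, useless_prefetches_per_access=1)
print(f"demand misses without prefetching: {clean}")
print(f"demand misses with useless prefetches: {polluted}")
```

In this toy setup the working set fits exactly in the cache, so the clean run misses only on the four cold accesses. With one useless prefetch per access, half the cache is always occupied by lines nobody asked for, and every demand access misses.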

regarding "64-bit" vs "64 bit": the difference is subtle. "64-bit" is correct when it's used as an appositive. "64 bit" is correct when used as a pure adjective. (you use "64 bit" correctly later in this passage). i'd just trust me on this.

i don't think either of us wants to think about hyphens anymore. no one knows this rule, and it's not like rules truly exist, so who cares where your hyphens are? i will stop sending this kind of correction and i suspect we'll both be happier.

but if you care, go look up "hyphen appositive"

but you probably don't

dankamongmen · 2024-09-13T21:16:10Z

@dendibakh, as requested, context for notes ^^^^^

dankamongmen · 2024-09-13T21:26:40Z

3.4 p35 "This leads to issuing 4x fewer instructions and can potentially gain a 4x speedup over four scalar computations."

since you say "issue", i assume you mean dynamic and not static instructions, in which case we probably get a small win from less loop overhead with SIMD.

ignore this. your 4x didn't just apply to FLOPs, it applied to all dynamic instructions, and thus correctly includes the reductions in loop overhead. carry on!
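
A quick count shows why the 4x figure already covers loop overhead. The sketch below is a simplified model, assuming a fixed number of work and bookkeeping instructions per iteration and an element count that is a multiple of the vector width. The `dynamic_instructions` helper and its default values are made up for illustration.

```python
def dynamic_instructions(n, lanes, body_ops=3, loop_overhead=2):
    """Count dynamic instructions for a loop over n elements.

    body_ops: instructions doing the work per iteration (e.g. load, add, store).
    loop_overhead: bookkeeping per iteration (e.g. index increment, branch).
    Assumes n is a multiple of lanes, so there is no scalar remainder loop.
    """
    iterations = n // lanes
    return iterations * (body_ops + loop_overhead)

n = 1024
scalar = dynamic_instructions(n, lanes=1)
simd4 = dynamic_instructions(n, lanes=4)
print(f"scalar: {scalar}, 4-wide SIMD: {simd4}, ratio: {scalar / simd4}")
```

Because the vector loop runs a quarter as many iterations, both the work instructions and the loop bookkeeping shrink by the same factor in this model, so the total dynamic instruction count drops by 4x.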

dendibakh · 2024-09-13T21:39:14Z

weird to mention both AVX2 and AVX512 but not SVE and SVE2 IMHO

Well, I mention SVE right in that sentence. SVE2 is not mentioned, yes.

Tomasulo and RS

Ok, I will reword it.

SMT and pipeline width.

Yes, we use the term "pipeline width". We actually refer to allocation width, as that is usually where the pipeline is narrowest. Pipeline width has a very big impact on SMT. Imagine you have two threads, each with an IPC of 2, so each can execute 2 instructions every cycle. If your pipeline width is 2, you have no bandwidth for the second thread, since the first already maxes out the throughput of the pipeline. If you double the width, you can fit two such threads in a single core and they will run as if they were on separate cores. Unfortunately, this ideal scenario never happens in practice. But you could achieve ~70% on a carefully engineered benchmark and ~50% on production software if you're lucky. Don't quote me on those numbers though. :)
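
The arithmetic in this reply, together with the earlier latency-hiding point, can be captured in an idealized Python model. It assumes allocation width is the only shared limit and that threads never contend for caches or execution ports, which is exactly the ideal scenario described above. The function names are hypothetical.

```python
def smt_throughput(width, thread_ipcs):
    """Idealized combined IPC of threads sharing one core.

    Assumes the only limit is allocation width: threads never conflict over
    caches, execution ports, or other shared resources.
    """
    return min(width, sum(thread_ipcs))

def per_thread_share(width, thread_ipcs):
    """Fraction of its standalone IPC each thread keeps, split proportionally."""
    demand = sum(thread_ipcs)
    return min(1.0, width / demand) if demand else 1.0

# Two threads that could each sustain an IPC of 2 on their own.
print(smt_throughput(2, [2, 2]), per_thread_share(2, [2, 2]))   # width 2
print(smt_throughput(4, [2, 2]), per_thread_share(4, [2, 2]))   # width 4

# Two threads that mostly wait on memory (IPC 0.5 each) on a narrow pipeline.
print(smt_throughput(2, [0.5, 0.5]), per_thread_share(2, [0.5, 0.5]))
```

With two high-IPC threads, a width of 2 leaves each thread half of its standalone throughput, while a width of 4 lets both run at full speed. With two memory-bound threads, even the narrow pipeline has room for both, which is the latency-hiding case raised earlier in the thread.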

i don't think either of us wants to think about hyphens anymore. no one knows this rule, and it's not like rules truly exist, so who cares where your hyphens are? i will stop sending this kind of correction and i suspect we'll both be happier.

I'm not super picky about those small things. If you (as a native speaker) don't roll your eyes seeing this type of mistake, then I'm fine.

dankamongmen added 10 commits September 9, 2024 14:38

3-1: number agreement, ARM v8 becomes Armv8-A

b421ed5

3-2: markdown syntax, indefinite article

6ddd453

ch3: Model 91

7c4f3ab

chapter4: target, not just direction

67649ef

ch3-4: number agreement

7694b64

3-6: simplify to 'per transfer'

465b3de

ch3: small things

0fa62c1

3.8: kill duplicated sentences, number issues

a19a5ce

normalize front-end to frontend

6e4d654

3: as is -> as-is

22e9133

dankamongmen commented Sep 9, 2024

chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md

3-9: monotype 'perf'

568fb19

dendibakh merged commit 49cda28 into dendibakh:main Sep 14, 2024
1 check failed

dankamongmen deleted the dankamongmen/ch3edits branch September 15, 2024 09:22
