Chapter 3 edits #70

Merged: 11 commits, Sep 14, 2024

Conversation

dankamongmen
Contributor

Fewer than you might expect for so much text, but I do have a lot of notes:

  • weird to mention both AVX2 and AVX512 but not SVE and SVE2 imho

  • the ARM "ISA" topic is complex. i'd consider "AArch64" to be the
    ISA, not Armv8-A (which i'd call an architecture). technically
    AArch64 is an "execution environment", though. the reason i
    bring this up is because SME was an Armv9 deal, whereas you're
    talking about Armv8.

from the horse's mouth, though:

"This guide introduces the A64 instruction set, used in the 64-bit Armv8-A architecture, also known as AArch64."

so i guess Armv8-A works there.

  • i didn't like "the longer the pipeline, the more effective bypassing becomes."
    i would change it to "the more stages bypassed...", since if the pipeline becomes
    longer in the e.g. decoding area, bypassing isn't going to become any more effective.
    but this seemed pedantic, so i left it alone, silently cursing.

  • in the WAR example, shouldn't the renaming be

    R101 = R1
    R103 = R0

i'm not following what you're doing here (i mean, i know how register
renaming works; i'm just not following your notation).

  • the WAW example is also weird. assuming the code is correct as written,
    the first value written to R2 is a dead value; a good compiler ought not have
    generated this code (it could of course have been handwritten).

  • i agree with not mentioning branch-delay slots; they would just complicate things
    (and are rarely used these days)

  • poor itanium!

  • reservation stations were present in the original Tomasulo algorithm!
    see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
    where they are discussed at length.

  • SIMD: technically you eliminate more than 3/4 of the instructions, because
    you have fewer loop calculations

  • don't love the emphasis on pipeline width in SMT discussion. you can interleave
    two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.

  • 3.6.1.6: in addition to competing for memory bandwidth, prefetched data could
    evict demand data from cache

  • regarding "64-bit" vs "64 bit": the difference is subtle. "64-bit" is correct when
    it's used as an appositive. "64 bit" is correct when used as a pure adjective.
    (you use "64 bit" correctly later in this passage). i'd just trust me on this.

  • 3.8.3: removed "1.25 or 2MB" as this duplicates exactly a sentence from a
    previous section (and isn't really relevant here)

  • 3.8.4: likewise, the "there is a hardware page walker" duplicates an earlier sentence. removed.

  • You use "frontend" in the exercises, but "front-end" in other places. I have changed
    them all to "frontend"/"backend".

@dendibakh
Owner

  • weird to mention both AVX2 and AVX512 but not SVE and SVE2 IMHO

Which exact place are you talking about?

  • in the WAR example, shouldn't the renaming be
    R101 = R1
    R103 = R0

i'm not following what you're doing here (i mean, i know how register renaming works; i'm just not following your notation).

I will improve the explanation. I just gave the R0, R1, and R2 registers new names (R100, R101, etc.).

  • the WAW example is also weird. assuming the code is correct as written,
    the first value written to R2 is a dead value; a good compiler ought not have
    generated this code (it could of course have been handwritten).

Yes, I was thinking about it. And I guess I need to change it to the following:

R1 = R0 ADD 1
R2 = R1 ADD R3
R1 = R0 MUL 3   ; WAW and WAR

I was trying to make a minimal example, but the example I came up with is not a meaningful one.
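
To make the renaming notation concrete, here is a minimal sketch of a renamer applied to the three-instruction example above. The tuple format, the `rename` helper, and the choice to start physical registers at R100 are assumptions made for illustration; registers that are only read (R0, R3) keep their architectural names here for brevity.

```python
# Toy register renamer: every write to an architectural register gets a
# fresh physical register, so only true (RAW) dependencies remain.
# The (dest, op, src1, src2) tuple format is invented for this sketch.

def rename(program, first_phys=100):
    mapping = {}            # architectural name -> current physical name
    next_phys = first_phys
    renamed = []
    for dest, op, *srcs in program:
        # Sources read the latest mapping; unmapped registers and
        # literals are passed through unchanged.
        new_srcs = [mapping.get(s, s) for s in srcs]
        # The destination always receives a new physical register.
        phys = f"R{next_phys}"
        next_phys += 1
        mapping[dest] = phys
        renamed.append((phys, op, *new_srcs))
    return renamed

program = [
    ("R1", "ADD", "R0", "1"),
    ("R2", "ADD", "R1", "R3"),
    ("R1", "MUL", "R0", "3"),   # WAW with the 1st, WAR with the 2nd
]

for dest, op, a, b in rename(program):
    print(f"{dest} = {a} {op} {b}")
```

After renaming, the third instruction writes a register that no earlier instruction reads or writes, so the WAW and WAR hazards are gone; only the RAW dependency of the second instruction on the first remains.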

  • reservation stations were present in the original Tomasulo algorithm!
    see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
    where they are discussed at length.

Ok, good. I don't think there is anything I should change in the text, is there?

  • SIMD: technically you eliminate more than 3/4 of the instructions, because
    you have fewer loop calculations

Which exact place are you talking about?

  • don't love the emphasis on pipeline width in SMT discussion. you can interleave
    two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.

Please elaborate.

  • 3.6.1.6: in addition to competing for memory bandwidth, prefetched data could
    evict demand data from cache

Please elaborate. I talk more about prefetching in Section 8.5 (I have a reference as well).

  • You use "frontend" in the exercises, but "front-end" in other places. I have changed
    them all to "frontend"/"backend".

Yes, thanks.

@dendibakh
Owner

Approved and ready to merge, but @dankamongmen, please reply to my comments above.

@dankamongmen
Contributor Author

dankamongmen commented Sep 13, 2024

  • weird to mention both AVX2 and AVX512 but not SVE and SVE2 imho

this was in reference to 3.1's "These include enhanced vector processing instructions (e.g., Intel
AVX2, AVX512, ARM SVE, RISC-V “V” vector extension) and matrix/tensor instructions (Intel AMX, ARM
SME)." just seems odd to mention AVX2 and AVX512, but not SVE and SVE2 in this sentence. [shrug]

  • the ARM "ISA" topic is complex. i'd consider "AArch64" to be the
    ISA, not Armv8-A (which i'd call an architecture). technically
    AArch64 is an "execution environment", though. the reason i
    bring this up is because SME was an Armv9 deal, whereas you're
    talking about Armv8.

Refers to 3.1's "Intel x86-64, ARM v8 and RISC-V are examples of..."

from the horse's mouth, though:
"This guide introduces the A64 instruction set, used in the 64-bit Armv8-A architecture, also known as AArch64."

this is taken from the ARM Reference Manual

  • i didn't like "the longer the pipeline, the more effective bypassing becomes."
    i would change it to "the more stages bypassed...", since if the pipeline becomes
    longer in the e.g. decoding area, bypassing isn't going to become any more effective.
    but this seemed pedantic, so i left it alone, silently cursing.

3.3 p30 "Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing
becomes."

  • in the WAR example, shouldn't the renaming be
    R101 = R1
    R103 = R0

same page

  • the WAW example is also weird. assuming the code is correct as written,
    the first value written to R2 is a dead value; a good compiler ought not have
    generated this code (it could of course have been handwritten).

same page

  • reservation stations were present in the original Tomasulo algorithm!
    see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
    where they are discussed at length.

3.3 p32 "Modern processors implement dynamic scheduling techniques that are derivatives of Tomasulo’s original algorithm and include the Reorder Buffer (ROB) and the Reservation Station (RS)." the way it's written makes it sound IMHO like ROB/RS came after Tomasulo, but RS was very much in his design

  • SIMD: technically you eliminate more than 3/4 of the instructions, because you have fewer loop calculations

3.4 p35 "This leads to issuing 4x fewer instructions and can potentially gain a 4x speedup over four scalar computations." since you say "issue", i assume you mean dynamic and not static instructions, in which case we probably get a small win from less loop overhead with SIMD.

  • don't love the emphasis on pipeline width in SMT discussion. you can interleave two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.

3.5.2 p37 "An example of execution on a non-SMT and a 2-way SMT (SMT2) processor is shown in Figure 3.6. In both cases, the width of the processor pipeline is four, and each slot represents an opportunity to issue a new instruction."

first off, it's weird to say "pipeline width". i always hear length, with width referring to degree of superscalar. either way, i don't think pipeline properties are that fundamental to SMT. simply hiding latency due to loads -- essentially, exposing a second stream of instructions to the OOO engine, one unlikely to have dependencies on the first -- can be a big win.

  • 3.6.1.6: in addition to competing for memory bandwidth, prefetched data could evict demand data from cache

3.6.1.6 "Prefetch techniques need to balance between demand and prefetch requests to guard against prefetch traffic slowing down demand traffic."

i'm just pointing out that it can have negative cache effects, too
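
A rough sketch of that cache effect, assuming a tiny fully associative LRU cache and a prefetcher that only fetches lines nobody ever demands. The `LRUCache` class, the capacity of 4, and the address pattern are all invented for illustration and do not model any real hardware prefetcher.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny fully associative cache with LRU replacement (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, insert addr, evicting the LRU line if full."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)          # mark as most recently used
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)    # evict least recently used
            self.lines[addr] = True
        return hit

def demand_misses(prefetch, rounds=10):
    cache = LRUCache(capacity=4)
    hot = [0, 1, 2, 3]                            # working set that exactly fits
    misses = 0
    for r in range(rounds):
        for addr in hot:
            if not cache.access(addr):
                misses += 1
        if prefetch:
            # Hypothetical useless prefetches: lines that are never demanded.
            for addr in (100 + 2 * r, 101 + 2 * r):
                cache.access(addr)
    return misses

print("demand misses without prefetch:", demand_misses(False))
print("demand misses with useless prefetch:", demand_misses(True))
```

Without prefetching, the working set misses only on the first pass. With the useless prefetches, demand lines keep getting evicted and missing again, even though no memory-bandwidth contention is modeled at all.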

  • regarding "64-bit" vs "64 bit": the difference is subtle. "64-bit" is correct when it's used as an appositive. "64 bit" is correct when used as a pure adjective. (you use "64 bit" correctly later in this passage). i'd just trust me on this.

i don't think either of us wants to think about hyphens anymore. no one knows this rule, and it's not like rules truly exist, so who cares where your hyphens are? i will stop sending this kind of correction and i suspect we'll both be happier.

but if you care, go look up "hyphen appositive"

but you probably don't

@dankamongmen
Contributor Author

@dendibakh, as requested, context for notes ^^^^^

@dankamongmen
Contributor Author

3.4 p35 "This leads to issuing 4x fewer instructions and can potentially gain a 4x speedup over four scalar computations."

since you say "issue", i assume you mean dynamic and not static instructions, in which case we probably get a small win from less loop overhead with SIMD.

ignore this. your 4x didn't just apply to FLOPs, it applied to all dynamic instructions, and thus correctly includes the reductions in loop overhead. carry on!
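
A back-of-the-envelope count illustrates why the 4x covers loop overhead too. The per-iteration numbers below (3 body instructions, 2 overhead instructions) are assumptions for a simple element-wise loop, not measurements of any real compiler output.

```python
def dynamic_instructions(n, lanes, body=3, overhead=2):
    """Count dynamic instructions for a loop over n elements.

    body:     instructions doing the work per iteration (assumed: load, add, store)
    overhead: loop bookkeeping per iteration (assumed: index update, compare-and-branch)
    lanes:    elements processed per iteration (1 = scalar, 4 = 4-wide SIMD)
    """
    if n % lanes != 0:
        raise ValueError("sketch assumes n is a multiple of lanes")
    iterations = n // lanes
    return iterations * (body + overhead)

scalar = dynamic_instructions(1024, lanes=1)
simd = dynamic_instructions(1024, lanes=4)
print(scalar, simd, scalar / simd)
```

Because the vectorized loop runs a quarter of the iterations, both the body and the bookkeeping instructions shrink by the same factor in this idealized model.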

@dendibakh
Owner

weird to mention both AVX2 and AVX512 but not SVE and SVE2 IMHO

Well, I mention SVE right in that sentence. SVE2 is not mentioned, yes.

Tomasulo and RS

Ok, I will reword it.

SMT and pipeline width.

Yes, we use the term "pipeline width". We actually refer to allocation width, as that is usually where the pipeline is narrowest. Pipeline width has a very big impact on SMT. Imagine you have two threads, each with an IPC of 2, so each can execute 2 instructions every cycle. If your pipeline width is 2, then you have no bandwidth for the second thread, since the first one already maxes out the throughput of your pipeline. If you double the width, you can fit two such threads in a single core and they will run as if they were on separate cores. Unfortunately, this ideal scenario never happens in practice. But you could achieve ~70% on a carefully engineered benchmark and ~50% on production software if you're lucky. Don't quote me on those numbers though. :)
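
The idealized arithmetic in that scenario can be sketched as follows. The `smt_throughput` helper is hypothetical and deliberately ignores everything that makes real SMT fall short of the ideal (shared caches, shared execution ports, stalls).

```python
def smt_throughput(thread_ipcs, width):
    """Idealized combined IPC: each thread sustains its own IPC
    until the shared pipeline width is exhausted."""
    return min(sum(thread_ipcs), width)

one_thread = smt_throughput([2], width=2)        # a single thread fills the pipeline
smt2_narrow = smt_throughput([2, 2], width=2)    # second thread has no slots left
smt2_wide = smt_throughput([2, 2], width=4)      # both threads fit side by side

print(one_thread, smt2_narrow, smt2_wide)
```

In this model, adding a second thread to the width-2 pipeline yields nothing, while the width-4 pipeline doubles combined throughput. The model does not capture the latency-hiding benefit raised earlier in the thread, where a second instruction stream fills slots left idle by stalled loads.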

i don't think either of us wants to think about hyphens anymore. no one knows this rule, and it's not like rules truly exist, so who cares where your hyphens are? i will stop sending this kind of correction and i suspect we'll both be happier.

I'm not super picky about those small things. If you (as a native speaker) don't roll your eyes seeing this type of mistake, then I'm fine.

@dendibakh dendibakh merged commit 49cda28 into dendibakh:main Sep 14, 2024
1 check failed
@dankamongmen dankamongmen deleted the dankamongmen/ch3edits branch September 15, 2024 09:22