Chapter 3 edits #70
Which exact place are you talking about?
I will improve the explanation. I just gave the R0, R1, and R2 registers new names (R100, R101, etc.).
Yes, I was thinking about it. And I guess I need to change it to the following:
I was trying to make a minimal example, but the example I came up with is not a meaningful one.
Ok, good. I don't think there is anything I should change in the text, is there?
Which exact place are you talking about?
Please elaborate.
Please elaborate. I talk more about prefetching in Section 8.5 (I have a reference as well).
Yes, thanks.
Approved and ready to merge, but @dankamongmen, please reply to my comments above.
this was in reference to 3.1's "These include enhanced vector processing instructions (e.g., Intel
Refers to 3.1's "Intel x86-64, ARM v8 and RISC-V are examples of..."
this is taken from the ARM Reference Manual
3.3 p30 "Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing
same page
same page
3.3 p32 "Modern processors implement dynamic scheduling techniques that are derivatives of Tomasulo’s original algorithm and include the Reorder Buffer (ROB) and the Reservation Station (RS)." the way it's written makes it sound IMHO like ROB/RS came after Tomasulo, but RS was very much in his design
3.5.2 p37 "An example of execution on a non-SMT and a 2-way SMT (SMT2) processor is shown in Figure 3.6. In both cases, the width of the processor pipeline is four, and each slot represents an opportunity to issue a new instruction." first off, it's weird to say "pipeline width". i always hear length, with width referring to degree of superscalar. either way, i don't think pipeline properties are that fundamental to SMT. simply hiding latency due to loads (essentially, exposing a second stream of instructions to the OOO engine, one unlikely to have dependencies on the first) can be a big win.
3.6.1.6 "Prefetch techniques need to balance between demand and prefetch requests to guard against prefetch traffic slowing down demand traffic." i'm just pointing out that it can have negative cache effects, too
i don't think either of us wants to think about hyphens anymore. no one knows this rule, and it's not like rules truly exist, so who cares where your hyphens are? i will stop sending this kind of correction and i suspect we'll both be happier. but if you care, go look up "hyphen appositive" but you probably don't
@dendibakh, as requested, context for notes ^^^^^
ignore this. your 4x didn't just apply to FLOPs, it applied to all dynamic instructions, and thus correctly includes the reductions in loop overhead. carry on!
Well, I mention SVE right in that sentence. SVE2 is not mentioned, yes.
Ok, I will reword it.
Yes, we use the term "pipeline width". We actually refer to the allocation width, as that is usually where the pipeline is narrowest. Pipeline width has a very big impact on SMT. Imagine you have two threads, each with an IPC of 2, so each can execute 2 instructions every cycle. If your pipeline width is 2, then you have no bandwidth for the second thread since the first one already maxes out the throughput of your pipeline. If you double the width, you can fit two such threads in a single core and they will run as if they were running on separate cores. Unfortunately, this ideal scenario never happens in practice. But you could achieve ~70% on a carefully engineered benchmark and ~50% on production software if you're lucky. Don't quote me on those numbers though. :)
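To make the width argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not from the book: the function name `smt_ideal_ipc` and the fair-share split between threads are assumptions made for illustration, and the model ignores stalls, dependencies, and resource conflicts entirely.

```python
def smt_ideal_ipc(width, thread_ipcs):
    """Idealized per-thread IPC when threads share `width` issue slots.

    Assumption (hypothetical model): each thread could sustain its own
    IPC on a sufficiently wide core, and the available slots are split
    proportionally to demand when the pipeline is too narrow.
    """
    demand = sum(thread_ipcs)
    total = min(width, demand)          # the pipeline width caps total throughput
    scale = total / demand if demand else 0.0
    return [ipc * scale for ipc in thread_ipcs]


# Two threads that could each sustain IPC = 2 on their own.
narrow = smt_ideal_ipc(2, [2, 2])   # width 2: combined throughput stays at 2
wide = smt_ideal_ipc(4, [2, 2])     # width 4: both threads run at full speed
print(narrow, sum(narrow))
print(wide, sum(wide))
```

With width 2 the combined throughput is capped at 2 no matter how the slots are divided, so the second thread adds nothing; with width 4 both threads keep their standalone IPC in this idealized model.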
I'm not super picky about those small things. If you (as a native speaker) don't roll your eyes seeing this type of mistake, then I'm fine.
Fewer than you might expect for so much text, but I do have a lot of notes:
weird to mention both AVX2 and AVX512 but not SVE and SVE2 imho
the ARM "ISA" topic is complex. i'd consider "AArch64" to be the
ISA, not Armv8-A (which i'd call an architecture). technically
AArch64 is an "execution environment", though. the reason i
bring this up is because SME was an Armv9 deal, whereas you're
talking about Armv8.
from the horse's mouth, though:
"This guide introduces the A64 instruction set, used in the 64-bit Armv8-A architecture, also known as AArch64."
so i guess Armv8-A works there.
i didn't like "the longer the pipeline, the more effective bypassing becomes."
i would change it to "the more stages bypassed...", since if the pipeline becomes
longer in the e.g. decoding area, bypassing isn't going to become any more effective.
but this seemed pedantic, so i left it alone, silently cursing.
in the WAR example, shouldn't the renaming be
R101 = R1
R103 = R0
i'm not following what you're doing here (i mean, i know how register
renaming works; i'm just not following your notation).
the WAW example is also weird. assuming the code is correct as written,
the first value written to R2 is a dead value; a good compiler ought not have
generated this code (it could of course have been handwritten).
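For reference while reading the two notes above, here is a minimal sketch of register renaming in Python. It is not the book's example or notation: the `rename` function, the tuple encoding of instructions, and the choice to number physical registers from R100 are assumptions for illustration (it also assumes fewer than 100 architectural registers so the names cannot collide).

```python
def rename(instrs, first_phys=100):
    """Rename destination registers so WAR and WAW hazards disappear.

    instrs: list of (dest, srcs) tuples using architectural names like "R1".
    Every write gets a fresh physical register; every read uses the most
    recent mapping, or the architectural name if not yet written here.
    """
    mapping = {}            # architectural name -> current physical name
    next_id = first_phys
    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # keep true (RAW) deps
        new_dest = f"R{next_id}"                           # fresh register per write
        next_id += 1
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


# WAR: the first instruction reads R0, the second one overwrites R0.
war = [("R1", ("R0", "R2")), ("R0", ("R3", "R4"))]
print(rename(war))

# WAW: both instructions write R2.
waw = [("R2", ("R0", "R1")), ("R2", ("R3", "R4"))]
print(rename(waw))

# RAW is preserved: the second instruction still reads the renamed R1.
raw = [("R1", ("R0", "R2")), ("R3", ("R1", "R4"))]
print(rename(raw))
```

After renaming, the second write in the WAR and WAW cases targets a different physical register than the one the earlier instruction reads or writes, so the two instructions can be reordered; only the true read-after-write dependency survives.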
i agree with not mentioning branch-delay slots; they would just complicate things
(and are rarely used these days)
poor itanium!
reservation stations were present in the original Tomasulo algorithm!
see Tomasulo 1967, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
where they are discussed at length.
SIMD: technically you eliminate more than 3/4 of the instructions, because
you have fewer loop calculations
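A rough count shows where the loop overhead enters. This is a sketch under made-up numbers, not a measurement: the per-iteration split of 4 body instructions and 3 loop-overhead instructions (index update, compare, branch) is a hypothetical assumption, as is the function name.

```python
def dynamic_instructions(n_elems, lanes, body=4, overhead=3):
    """Return (body_count, overhead_count) for a loop over n_elems elements.

    Assumptions (hypothetical): n_elems is divisible by lanes, and each
    iteration costs `body` useful instructions plus `overhead` loop
    bookkeeping instructions, regardless of vector width.
    """
    iters = n_elems // lanes
    return iters * body, iters * overhead


scalar_body, scalar_ovh = dynamic_instructions(1024, lanes=1)
simd_body, simd_ovh = dynamic_instructions(1024, lanes=4)

saved_body = scalar_body - simd_body       # 3/4 of the body instructions
saved_ovh = scalar_ovh - simd_ovh          # 3/4 of the loop overhead as well
ratio = (scalar_body + scalar_ovh) / (simd_body + simd_ovh)
print(saved_body, saved_ovh, ratio)
```

In this model, 4-wide vectorization removes three quarters of the body instructions and also three quarters of the loop overhead, so the total dynamic instruction count drops by a factor of four.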
don't love the emphasis on pipeline width in SMT discussion. you can interleave
two processes on uniscalar SMT2, hiding memory delays etc. think of a barrel processor.
3.6.1.6: in addition to competing for memory bandwidth, prefetched data could
evict demand data from cache
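The eviction effect can be sketched with a toy LRU cache in Python. Everything here is an assumption for illustration: a tiny fully associative cache, a prefetcher that always guesses wrong, and the names `LRUCache` and `demand_misses`. Real prefetchers and replacement policies are far more sophisticated.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, addr):
        """Insert or refresh addr; return True if it was already cached."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict the least recently used line
            self.lines[addr] = True
        return hit


def demand_misses(capacity, working_set, rounds, prefetches_per_access):
    """Count demand misses while a (hypothetical) useless prefetcher runs."""
    cache = LRUCache(capacity)
    misses = 0
    junk = 10_000   # addresses the prefetcher fetches but the program never uses
    for _ in range(rounds):
        for addr in working_set:
            if not cache.touch(addr):
                misses += 1
            for _ in range(prefetches_per_access):
                cache.touch(junk)    # prefetched line takes a slot in the cache
                junk += 1
    return misses


quiet = demand_misses(4, [0, 1, 2, 3], rounds=10, prefetches_per_access=0)
noisy = demand_misses(4, [0, 1, 2, 3], rounds=10, prefetches_per_access=1)
print(quiet, noisy)
```

Without prefetching, the working set fits in the cache and only the cold misses remain; with one useless prefetch per access, the prefetched lines keep pushing demand lines out before they are reused, so the demand miss count goes up even though memory bandwidth is not modeled at all.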
regarding "64-bit" vs "64 bit": the difference is subtle. "64-bit" is correct when
it's used as an appositive. "64 bit" is correct when used as a pure adjective.
(you use "64 bit" correctly later in this passage). i'd just trust me on this.
3.8.3: removed "1.25 or 2MB" as this duplicates exactly a sentence from a
previous section (and isn't really relevant here)
3.8.4: likewise, the "there is a hardware page walker" duplicates an earlier sentence. removed.
You use "frontend" in the exercises, but "front-end" in other places. I have changed
them all to "frontend"/"backend".