, Answers to Selected Exercises
Chapter I
1. Use, for example, the Wikipedia site. Yes, the Chart will follow Moore’s law even more closely
than Figure I.3 since the number of transistors for the Pentium Series is smaller than for the
Itanium.
2. Most influenced by:
Technology: cycle time
Compiler: number of instructions
Computer architect: CPI
3. CPI = 1.03 + 0.12c where c is the cost of a branch misprediction. When:
c = 3 then CPI = 1.39
c = 10 then CPI = 2.23
c = 20 then CPI = 3.43
Clearly a sophisticated branch predictor is necessary for deep pipelines.
4. CPI = 1.39 + 0.05 l where l is the cost of a cache miss. When
l = 20 then CPI =2.39
l = 100 then CPI =6.39
l = 500 then CPI =26.39
Of course a cache hierarchy where the latency to the second level cache will be of the order of 10
cycles or less is mandatory.
5. CPI = 1.26 and IPC = 0.79
6. MIPS = IPC/(c.106). If c is halved, MIPS is doubled without change to IPC. However, with
smaller c there is a good chance that the pipelines will be deeper and the relative penalty for a
cache miss will be greater thus requiring adjustments to keep the same IPC.
7. (a) 2 billion (2.109) (b) 0.13µsec (insignificant)
8. (a) 1.6 GHz (b) M1 (c) 1 P1 and 4 P2.
9. (a) M-7 (b) M-5 (c) M-7
10. Both computers have same cycle time and same CPI hence same MIPS. However one executes
fewer instructions than the other.
11. For s = 0.01, speedup = 7.47
For s = 0.1, speedup = 4.71
For s = 0.25, speedup = 2.91
,12. (a) 6.67 (b) 23 (c) 278
13. (a) and (b) weighted arithmetic mean (c) arithmetic mean (d) harmonic mean
14. (a), (b), (d) Direct applications of formulas.
(c) WAM weighed with cycles; WHM weighed with instruction counts
(e) WAM weighed with B’s; WHM weighed with A’s
, Chapter II
1. The basic idea is to stall the pipeline once a “branch taken” has been recognized. The only ALU
is in Stage EX. A possible solution is as follows. A new control line “branch in progress” is set at
decode time when a branch is recognized and will be passed to ID/EX and EX/MEM. The
address computation does not take place in stage ID but the signed displacement is transferred
into ID/EX and EX/MEM. If the after stage EX the “branch in progress” is set in EX/MEM and
“branch taken” is set as the output of the ALU, the result of that AND operation is sent to the
control unit. If it is “1”, fetching of a new instruction is disallowed and the instructions in
progress in IF/ID and ID/EX are no-oped. The displacement and the PC found in EX/MEM are
“forwarded” back to the inputs of the ALU giving, in the next cycle, the new PC. Then we
proceed as in the original pipeline for a branch taken. Notice that we can still apply a “branch not
taken” prediction strategy.
2. (a) 3 cycles (b) 0 cycles if the branch is not taken and 3 cycles if it is taken.
3. (a) Forwarding to inputs of the ALU comparator in stage ID must be set from both the ID/EX and
EX/MEM pipeline registers. Stalling must occur if one of the operand registers used for the
comparisons was generated by a load in one of the two previous instructions (in fact a stall of 2
cycles if the previous instruction was a load). (b) 1 cycle in the case of “branch taken”, 0 if the
branch is not taken assuming a “branch not taken” strategy.
4. One possible solution is to “no-op” the instruction generating an exception and the instructions
following it in the pipeline. The PC of the saved state should point to the instruction with the
exception.
5. We need a new ALU on the MEM stage. At decode time, we need a control line of 2 bits
indicating whether this extra ALU should be activated and setting also the control of a
multiplexer for one input of the ALU yielding the values of c (we assume only 3 possible values
of c; if more make the control line 3 bits wide). Stalling and forwarding must be set for Rj as it is
done for a RAW dependency after a load. Finally, the data path to the register file must be
doubled in width and two registers must be able to be stored in the same cycle.
6. 1 right and 23 left.
7. 0 right and 24 left
8. (a) always true (b) not necessarily (c) true in general
,9. (a) 256 (b) Lines 0 to 63 will contain A[0] to A[127]; lines 64 to 255 will contain B[128] to
B[1023] (c) 0 read misses and 512 write misses; no difference for 2-way set-associative (d) 768
bytes were written back
10. Same answers as in Ex. 9 for (a) (b) and (c). (d) 2048 bytes
11. (a) The CPIs are respectively: 1.245, 1.238, and 1.227. Therefore choose cache C (b) the average
instruction times are respectively 24.9, 27.2, and 29.4 therefore choose cache A (c) For cache A
no extra bit; one extra bit for caches B and C.
12. (a) Extract_TLB_fields (in vaddress; out tagtlb,indextlb);
tagtlb = vaddress shifted right by 16;
indextlb = (vaddress shifted left by 16) shifted right by 28);
end;
(b) TLB is defined as:
2-dimensional array TLBtag(0:3,0:15) where each element is a 16-bit tag
2-dimensional array PTE(0:3,0:15) where each element is a 20 bit physical frame number
Check_TLB(in tagtlb, indextlb; out TLBhit, physical_frame_number)
TLBhit = false;
For (i = 0; i<4; i++)
If TLB[i,indextlb] = tagtlb then
begin
TLBhit = true;
physical_frame_number = PTE[i,indextlb];
exit;
end
end;
In the worst case, there are 4 comparisons. The theoretical minimum is 0 and the
theoretical maximum is 220 -1. In the case where the first 16KB are reserved for the O.S., the
minimum is 4. If there are 256 MB of main memory, the maximum is 218- 1.
(c) Get_physical_address (in vaddress, physical_frame_number; out physical_address);
offset = (vaddress left shift 20) right shift 20;
physical_address = concatenate(physical_frame_number,offset);
end;
(d) Cache is defined as:
2-dimensional array Cachetag(0:1,0:127) where each element is a 20-bit tag
2-dimensional array Cachedata(0:1,0:127) where each element is a 32-byte line
, Extract_cache_fields (in physical_address; out tagcache, indexcache)
tagcache= physical_ address right shift 12;
indexcache = (physical_address left shift 20) right shift 25;
end;
(e) 3.965
13. (a) false (b) false
14. (a) Data_TLB hit followed by Data_Cache miss is what happens generally on a cache miss when
the page has already been referenced. (b)Data_TLB miss followed by Data_Cache miss could
happen after a page fault. It could also happen if the page is sill resident in memory and has not
been accessed for a long time. (c) Data_TLB miss followed by a Data_Cache hit is possible but
unlikely. It could happen if the page has not been referenced for a long time, is still resident in
main memory, and the cache line has not been evicted.
15. (a) 7.445 (b) 5.745
, Chapter III
1. (a) A floating-point add followed by an unrelated integer add (b) A load followed by an
unrelated integer add
2. Instructions leave the front-end in program order only when their operands (registers) are ready or
will become ready through forwarding. A subsequent instruction cannot therefore overwrite a
register operand before its contents have been delivered to the functional unit for which it is an
operand.
3. This could happen if the two writes were not separated by an instruction using the result of the
first write. The first write is therefore not useful. However since it might complete after the
second one, a check is needed.
4. The scoreboard is used to: (a) Allow instructions to pass front the front-end to the back-end by
checking RAW and WAW dependencies (b) Schedule forwarding (c) Recover from branch
misprediction (d) Replay L1 D-cache misses (e) Allow precise exceptions for integer and
memory units.
The first two uses are slight extensions of what the forwarding and stalling units do in simple
pipelines. Because the functional units are pipelined and because instructions are issued in strict
program order, this part of the scoreboard is simpler than that of the CDC 6600 although
forwarding implies some added complexity. The last 3 uses require that the scoreboard monitor
all in flight instructions but this was true also of the CDC.
5. No it is not possible (unless much complexity is added) because a mispredicted branch might be
in the back-end and the instruction in S3 might be on the wrong path.
6. PAL code accesses hardware registers that are not part of process state. Saving and recovering
these registers is not part of the hardware definition.
7. A Boolean shift register is used to model the occupancy of the resulting bus. Assume we are at
time t. The leftmost bit of the shift register shows the occupancy of the bus at time t+1 (true if the
bus will be occupied, false otherwise). Similalrly, bits2 (from the left), 3,..., n show the
occupancies at time t+2, t+3, …, t+n. The shift register will shift left at every cycle. When at
time t an instruction is ready to be issued, the occupancy of the bus at time t +lat (where lat is the
latency of the operation) is checked. If the resulting bit in the shift register is false, it is set to true
and the instruction can issue. If not, the operation is repeated on the next cycle. (Such a schme
was present in the Cray-1 supercomputer).
8. Dispatch Start exec. Finish exec. Write back
i1 t0 + 1 t0 + 2 t0 + 22 t0 + 23
i2 t0 + 2 t0 + 23 t0 + 27 t0 + 28
, i3 t0 + 3 t0 + 4 t0 + 14 t0 + 24
9. No change for i1 and i2
i3 t0 + 3 t0 + 4 t0 + 14 t0 + 15
10. The issue step now has to check for whether buffers associated with the functional units are free.
The dispatch and execution steps are similar. The write-result is more complex since a functional
unit might have several instructions occupying it. To that effect, the Pi field must be of the form
Ua-n where n is the latency of the unit (number of stages) and an instruction carries that tag n
during its execution. Note that WAW are still forbidden (no register renaming).
11. From [SBP00] in Chapter 5:
12. They grow linearly with m.
13. Instructions must be processed in order in the dispatch step since we access the tail of the ROB.
14. No major difference. Example sequence must use a renaming scheme avoiding WAW, for
example:
i1: R2 R3 / R5
i2: R1 R2 + R4
i3: R2 R6 * R6
i4: R7 R2 * R6
15. In the decode-rename scheme, we must implement the check and priority encoding of Exercise 12
and of course check for the availability of 2 free entries in the ROB. We must double the number
, of ports in the reservation stations and have two common buses for results etc. We should also be
able to commit 2 instructions per cycle to balance the design.
16. The primordial function of the ROB is to assure commit of instructions in program order. This
requirement was not need in the floating-point unit of the IBM System 360/91.
17. One advantage would be to save some space on individual (integer) entries in the ROB. This is
overwhelmingly defeated by the drawbacks: (i) splitting a shared resource into two private ones is
rarely a good allocation strategy (ii) branches would have to be entered in both ROBs (iii)
precise exceptions and interrupts would be harder to handle.
18. Saving of an ROB entry. More decoding bandwidth.
19. Well it’s not RISC but see answer to previous exercise.
20. (a) pred = a< 0;
b.pred =1;
c.pred =2;
d.(not pred) =3;
e.(not pred) =4
(b) 7 (c) 6 (d) 4.6 and 4.4 (e) 6 in both cases. Yes but only if the sequence of code is short.
, Chapter IV
1. (a) 0% for 1-bit and 50% for 2-bit (b) 11001100…1100; no evident programming pattern
taken
^ not taken
strong taken weak taken
TT TN
taken
not taken
taken
not taken strong not taken
weak not taken
NT NN
taken
^ not taken
Generally, this is the initial
state
(c) 101010... programming pattern as in (b).
2. It costs more in storage (50%) and the logic is about the same cost. There is no difference in
performance.
NT
T
NT NT NT
TTT TTN TNN NNN
NT T
T T NT
T NT T
NTT TNT NTN NNT
T NT
T
3. (a) 8 (b) Answers in the form (gpr)(pred) (new gpr) (act) (PHT)
(000)(0)(000)(1)(01,01,01,01,01,01,01,11); misprediction restore GPR to (001)