Keep the most often-used data in a small, fast SRAM (often local to CPU chip). The reason this strategy works: LOCALITY.

Locality of reference: Access to address X at time t implies that access to address X + ΔX at time t + Δt becomes more probable as ΔX and Δt approach zero.

AMAT = HitTime + MissRatio * MissPenalty

Example: 8-location DM cache (W=3)

Example: 2-way set-associative cache, 8 sets, 4-word block size, write-back

Replacement strategy choices: least-recently used (LRU); first in, first out (FIFO); random
Write-policy choices: write-through, write-behind, write-back
Problem 1.

(A) The timing for a particular cache is as follows: checking the cache takes 1 cycle. If there’s a hit the data is returned to the CPU at the end of the first cycle. If there’s a miss, it takes 10 additional cycles to retrieve the word from main memory, store it in the cache, and return it to the CPU. If we want an average memory access time of 1.4 cycles, what is the minimum possible value for the cache’s hit ratio?

\[ 1.4 = 1 + (1 - \alpha) \cdot 10 \]

\[ \therefore 0.4 = 1 - \alpha \]

Minimum possible value of hit ratio: \( 0.96 \)

(B) If the cache block size, i.e., words/cache line, is doubled but the total number of data words in the cache is unchanged, how will the following cache parameters change? Please circle the best answer.

\# of offset bits: UNCHANGED \( +1 \) \( -1 \) \( 2x \) \( 0.5x \) \( \text{CAN'T TELL} \)

\# of tag bits: UNCHANGED \( +1 \) \( -1 \) \( 2x \) \( 0.5x \) \( \text{CAN'T TELL} \)

\# of cache lines: UNCHANGED \( +1 \) \( -1 \) \( 2x \) \( 0.5x \) \( \text{CAN'T TELL} \)

Consider a direct-mapped cache with 64 total data words with 1 word/cache line, which uses a LRU replacement strategy and a write-back write strategy. This cache architecture is used for parts (C) through (F).

(C) If cache line number 5 is valid and its tag field has the value 0x1234, what is the address in main memory of the data word currently residing in cache line 5?

Main memory address of data word in cache line 5: 0x\textcircled{0x123414}

The program shown on the right repeatedly executes an inner loop that sums the 16 elements of an array that is stored starting in location 0x310.

The program is executed for many iterations, then a measurement of the cache statistics is made during one iteration through all the code, i.e., starting with the execution of the instruction labeled outer_loop: until just before the next time that instruction is executed.

inner_loop:

\begin{align*}
. &= \emptyset \\
\text{outer_loop:} &
\begin{align*}
\text{CMOVE(16,R0)} &\quad // initialize loop index \text{ J} \\
\text{CMOVE(0,R1)} &
\end{align*}
\begin{align*}
\text{loop:} &\quad // add up elements in array \\
\text{SUBC(R0,1,R0)} &\quad // decrement index \\
\text{MULC(R0,4,R2)} &\quad // convert to byte offset \\
\text{LD(R2,0x310,R3)} &// load value from A[J] \\
\text{ADD(R3,R1,R1)} &\quad // add to sum \\
\text{BNE(R0,loop)} &\quad // loop until all words are summed \\
\text{BR(outer_loop)} &\quad // perform test again!
\end{align*}
\end{align*}
(D) In total, how many instruction fetches occur during one complete iteration of the outer loop? How many data reads?

\[ i_{\text{fetch}} = 2 + (5 \times 16) + 1 \]
\[ d_{\text{read}} = 1 \times 16 \]

Number of instruction fetches: 83

Number of data reads: 16

(E) How many instruction fetch misses occur during one complete iteration of the outer loop? How many data read misses? Hint: remember that the array starts at address 0x310.

4 locations in array contract w/ 4 instructions
\Rightarrow\text{other instrs. in cache}
\Rightarrow\text{other data in cache}

Number of instruction fetch misses: 4

Number of data read misses: 4

(F) What is the hit ratio measured after one complete iteration of the outer loop?

\[ \text{total accesses} = 83 + 16 = 99 \]
\[ \text{total hits} = 99 - 8 \]

Hit ratio: \( \frac{99 - 8}{99} = \frac{91}{99} \)

Problem 2.

The Beta Engineering Team is working on the design of a cache. They’ve decided that the cache will have a total of \( 2^{10} = 1024 \) data words, but are still thinking about the other aspects of the cache architecture.

First assume the team chooses to build a direct-mapped write-back cache with a block size of 4 words.

\Rightarrow\text{4 bit offset}

(A) Please answer the following questions:

Number of lines in the cache: 256

Number of bits in the tag field for each cache entry: 20

(B) This cache takes 2 clock cycles to determine if a memory access is a hit or a miss and, if it’s a hit, return data to the Beta. If the access is a miss, the cache takes 20 additional clock cycles to fill the cache line and return the requested word to the Beta. If the hit rate is 90%, what is the Beta’s average memory access time in clock cycles?

Average memory access time assuming 90% hit rate (clock cycles): 4

\[ \Delta \text{MAT} = 2 + (1 - 0.9)(20) = 2 + 2 = 4 \]

6.004 Worksheet - 3 of 11 - L13/L14 - Memory Hierarchy & Caches
Now assume the team chooses to build a 2-way set-associative write-back cache with a block size of 4 words. The total number of data words in the entire cache is still 1024. The cache uses a LRU replacement strategy.

\[ \Rightarrow 128 \text{ lines in each way (7 bits of index)} \]

(C) Please answer the following questions:

Address bits used as offset (including byte offset): \[ A[\_\_\_\_ : \_\_\_\_] \]

Address bits used as cache line index: \[ A[\_\_\_\_ : \_\_\_\_] \]

Address bits used for tag comparison: \[ A[\_\_\_\_ : \_\_\_\_] \]

(D) To implement the LRU replacement strategy this cache requires some additional state for each set. How many state bits are required for each set?

Number of state bits needed for each set for LRU: \[ \_\_\_\_ \]

To test this set-associative cache, the team runs the benchmark code shown on the right. The code sums the elements of a 16-element array. The first instruction of the code is at location 0x0 and the first element of the array is at location 0x10000. Assume that the cache is empty when execution starts and remember the cache has a block size of 4 words.

(E) How many instruction misses will occur when running the benchmark?

Number of instruction misses when running the benchmark: \[ \_\_\_\_ \]

(F) How many data misses (i.e., misses caused by the memory access from the LD instruction) will occur when running the benchmark?

Number of data misses when running the benchmark: \[ \_\_\_\_ \]

16-element array \[ \Rightarrow 4 \text{ cache lines} \]

(g) What’s the exact hit rate when the complete benchmark is executed?

Benchmark hit rate:

\[ \frac{93-6}{99} \times 100\% \approx 94\% \]

\[ \text{hd inst. fetches} = 2 + 16 + 5 + 1 = 24 \] loop iterations

\[ \text{hd data fetches} = 16 \Rightarrow \text{total fetches} = 99 \]
Problem 3.

The program from the Cache performance lab is shown at the right. Assume the program is being run on a Beta with a cache with the following parameters:

- **2-way set-associative**
- **block size of 2**, i.e., 2 data words are stored in each cache line
- total number of data words in the cache is 32
- **LRU** replacement strategy

(A) The cache will divide the 32-bit address supplied by the Beta into three fields: \( B \) bits of block offset (including byte offset bits), \( L \) bits of cache line index, and \( T \) bits of tag field. Based on the cache parameters given above, what are the appropriate values for \( B \), \( L \), and \( T \)?

- \( B \) bytes/line: __________
- \( B \) lines/way: __________
- \( 32-L-B \) value for \( T \): __________

(B) If the MULC instruction is resident in a cache line, what will be its cache line index? The value of the tag field for the cache?

[Cache line index for MULC when resident in cache: ________]

[Tag field for MULC when resident in cache: 0x___]

(C) With the values of \( I \), \( A \), and \( N \) as shown, list all the values \( j \) (\( 0 \leq j < N \)) where the location holding the value \( A[j] \) will map to the same cache line index as the MULC instruction in the program.

List all \( j \) where \( A[j] \) have the same cache line index as MULC: ________

\( A[0] @ 0x240 \Rightarrow \) cache line 4. 2 words/line \( \Rightarrow A[8], A[9] \) are in cache line 8.

(D) If the outer loop is run many times, give the steady-state hit ratio for the cache, i.e., assume that the number of compulsory misses as the cache is first filled are insignificant compared to the number of hits and misses during execution.

Steady-state ratio (%): ________

Everything fits in the cache!

6.004 Worksheet - 5 of 11 - L13/L14 – Memory Hierarchy & Caches
Problem 4.

Consider a 2-way set-associative cache where each way has 4 cache lines with a block size of 2 words. Each cache line includes a valid bit (V) and a dirty bit (D), which is used to implement a write-back strategy. The replacement policy is least-recently-used (LRU). The cache is used for both instruction fetch and data (LD,ST) accesses. Please use this cache when answering questions (A) through (D).

(A) Using this cache, a particular benchmark program experiences an average memory access time (AMAT) of 1.3 cycles. The access time on a cache hit is 1 cycle; the miss penalty (i.e., additional access time) is 10 cycles. What is the hit ratio when running the benchmark program? You can express your answer as a formula if you wish:

\[ \text{Hit ratio for benchmark program: } 1.3 = 1 + (1 - \text{hit ratio})(10) \]

(B) The circuitry for this cache uses various address bits as the block offset, cache line index and tag field. Please indicate which address bits A[31:0] are used for each purpose by placing a "B" in each address bit used for the block offset, "L" in each address bit used for the cache line index, and "T" in each address bit used for the tag field.

![Address bits layout](image)

Fill in each box with "B", "L", or "T"

(C) This cache needs room to store new data and based on the LRU replacement policy has chosen the cache line whose information is shown to the right for replacement. Since the current contents of that line are marked as dirty (D = 1), the cache must write some information back to main memory. What is the address of each memory location to be written? Please give each address in hex.

Addresses of each location to be written (in hex): \( 0x2478, 0x247C \)

\begin{align*}
\text{Way: 0} \\
\text{Cache line index: 3} \\
\text{Valid bit (V): 1} \\
\text{Dirty bit (D): 1} \\
\text{Tag field: 0x123}
\end{align*}

6.004 Worksheet - 6 of 11 - 1.13/1.14 – Memory Hierarchy & Caches
(D) This cache is used to run the following benchmark program. The code starts at memory address 0; the array referenced by the code has its first element at memory address 0x2000. First determine the number of memory accesses (both instruction and data) made during each iteration through the loop. Then estimate the steady-state average hit ratio for the program, i.e., the average hit ratio after many iterations through the loop.

```
. = 0
CMOVE(0,R0) // byte index into array
CMOVE(0,R1) // initialize checksum accumulator
loop:
   LD(R0,array,R2) // load next element of array
   SHLC(R1,1,R1) // shift checksum
   ADDC(1,R1,R1) // increment checksum
   ADD(R2,R1,R1) // include data value in checksum
   ADDC(R0,4,R0) // byte index of next array element
   CMP3TC(R0,1000,R2) // process 250 entries
   BT(R2,loop)
HALT()
```

- 7 fetches, 1 data access
- Instructions occupy 4 cache lines (2 words/line)
- 2-way => 100% hit on fetch
- Data accesses: new word every cycle
- 2 words/line => 50% hit

Number of memory accesses made during each iteration of the loop: 8

Estimated steady-state average hit ratio: \( \frac{515}{16} = 32.19% \)

Problem 5.

Consider the diagram to the right for a 2-way set associative cache to be used with our Beta design. Each cache line holds a single 32-bit word of data along with its associated tag and valid bit (0 when the cache line is invalid, 1 when the cache line is valid).

(A) The Beta produces 32-bit byte addresses, A[31:0]. To ensure the best cache performance, which address bits should be used for the cache index? For the tag field?

- 8 cache lines
- Address bits used for cache index: A[4:2]
- Address bits used for tag field: A[31:5] (remaining bits are tag)
(B) Suppose the Beta does a read of location 0x5678. Identify which cache location(s) would be checked to see if that location is in the cache. For each location specify the cache section (A or B) and row number (0 through 7). E.g., 3A for row 3, section A. If there is a cache hit on this access what would be the contents of the tag data for the cache line that holds the data for this location?

Cache location(s) checked on access to 0x5678: 6A, 6B

Cache tag data on hit for location 0x5678 (hex): 0x2BB

(C) Assume that checking the cache on each read takes 1 cycle and that refilling the cache on a miss takes an additional 8 cycles. If we wanted the average access time over many reads to be 1.1 cycles, what is the minimum hit ratio the cache must achieve during that period of time? You needn’t simplify your answer.

\[ 1.1 = 1 + (1 - HR) \cdot (8) \]

Minimum hit ratio for 1.1 cycle average access time: \( \frac{7.9}{8} \)

(D) Estimate the approximate cache hit ratio for the following program. Assume the cache is empty before execution begins (all the valid bits are 0) and that an LRU replacement strategy is used. Remember the cache is used for both instruction and data (LD) accesses.

```
LINE ADDR
0 0
0 4
1 8
2 12
3 16
4 20
5 24

= 0
Cmove(source, R0)
Cmove(0, R1)
Cmove(0x1000, R2)
Loop: LD(R0, 0, R3)
ADD(R0, 4, R0)
ADD(R3, R1, R1)
subc(R2, 1, R2)
Bne(R2, loop)
St(R1, source)
Halt()
```

Each iteration:

- 5 inst. fetches = 100% hit
- 1 data fetch = 0% hit
- no data word fetched twice

Total: 6

Source:

\. = . + 0x4000 // Set source to 0x100, reserve 1000 words

Approximate hit ratio: \( \frac{5}{6} \)

(E) After the program of part (D) has finished execution what information is stored in row 4 of the cache? Give the addresses for the two locations that are cached (one in each of the sections) or briefly explain why that information can’t be determined.

Addresses whose data is cached in “Row 4”: 0x10 and 0x408F

6.004 Worksheet - 8 of 11 - L13/L14 – Memory Hierarchy & Caches
Problem 6.

A standard unpipelined Beta is connected to a 2-way set-associative cache containing 8 sets, with a block size of 4 32-bit words. The cache uses a LRU replacement strategy. At a particular point during execution, a snapshot is taken of the cache contents, which are shown below. All values are in hex; assume that any hex digits not shown are 0.

**Way #1**

```
<table>
<thead>
<tr>
<th>Line</th>
<th>V</th>
<th>D</th>
<th>TAG</th>
<th>A3</th>
<th>A2</th>
<th>A1</th>
<th>A0</th>
</tr>
</thead>
<tbody>
<tr>
<td>7</td>
<td>1</td>
<td>1</td>
<td>f29</td>
<td>a5218800</td>
<td>77f70002</td>
<td>01f00003</td>
<td>8359c80d</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>0</td>
<td>027</td>
<td>c01f0037</td>
<td>73fffeaf7</td>
<td>77ee0002</td>
<td>d33ac000</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
<td>0</td>
<td>000</td>
<td>00000000</td>
<td>00000000</td>
<td>00000000</td>
<td>00000000</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>0</td>
<td>33a</td>
<td>8358c000</td>
<td>73fa0002</td>
<td>c1f00003</td>
<td>73fffeaf0</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0</td>
<td>801</td>
<td>641f0600</td>
<td>60f0f0600</td>
<td>c01f0003</td>
<td>80200000</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0</td>
<td>002</td>
<td>90072800</td>
<td>73fffeaf0</td>
<td>80a4f800</td>
<td>d0052ca0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>5c8</td>
<td>80400000</td>
<td>8358c000</td>
<td>73fffeaf0</td>
<td>80200000</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>a11</td>
<td>c01f0003</td>
<td>d1c7d0f0</td>
<td>77f0f091a</td>
<td>77e00002</td>
</tr>
</tbody>
</table>
```

**Way #2**

```
<table>
<thead>
<tr>
<th>Line</th>
<th>V</th>
<th>D</th>
<th>TAG</th>
<th>B3</th>
<th>B2</th>
<th>B1</th>
<th>B0</th>
</tr>
</thead>
<tbody>
<tr>
<td>7</td>
<td>1</td>
<td>0</td>
<td>77e</td>
<td>d01e0001</td>
<td>61040008</td>
<td>c01f0004</td>
<td>80a38000</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>0</td>
<td>1f0</td>
<td>c23f0394</td>
<td>77e00002</td>
<td>80400000</td>
<td>c01f0046</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>0</td>
<td>d1c</td>
<td>f79c0001</td>
<td>e809fff</td>
<td>aaaaaaa</td>
<td>55555555</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>0</td>
<td>00a</td>
<td>d37a0000</td>
<td>d0000000</td>
<td>73ffef93</td>
<td>80600000</td>
</tr>
<tr>
<td>3</td>
<td>1</td>
<td>0</td>
<td>01b</td>
<td>c01f0003</td>
<td>d01e0003</td>
<td>80a4f800</td>
<td>73ffefab</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>0</td>
<td>02b</td>
<td>c6510001</td>
<td>77f90002</td>
<td>90082000</td>
<td>c01f0041</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>f99</td>
<td>77ee0002</td>
<td>f99b001f</td>
<td>64040000</td>
<td>fc000000</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>7af</td>
<td>a9ab6000</td>
<td>c01f0003b</td>
<td>77e00002</td>
<td>fb5e0011</td>
</tr>
</tbody>
</table>
```

(A) The cache uses bits from the 32-bit byte address produced by the Beta to select the appropriate set (L), as input to the tag comparisons (T) and to select the appropriate word from the data block (B). For correct and optimal performance what are the appropriate portions of the address to use for L, T and B? Express your answer in the form “A[N:M]” for N and M in the range 0 to 31, or write “CAN’T TELL”.

Address bits to use for L: **A[6:4]**
Address bits to use for T: **A[3:7]**
Address bits to use for B: **A[3:2]**

(B) For the following addresses, if the contents of the specified location appear in the cache, give the location’s 32-bits contents in hex (determined by using the appropriate value from the cache). If the contents of the specified location are NOT in the cache, write “MISS”.

**R2, L=0, T=1422** Contents of location 0xA1100 (in hex) or “MISS”: 0x**miss**

**R2, L=4, T=0xA** Contents of location 0x548 (in hex) or “MISS”: 0x**00000000**

(C) Ignoring the current contents of the cache, is it possible for the contents of locations 0x0 and 0x1000 to both be present in the cache simultaneously?

Locations 0x0 and 0x1000 present simultaneously (circle one): **YES**. **NO**

both map to cache line 4, but it's a 2-way cache
(D) Give a one-sentence explanation of how the D bit got set to 1 for Line #7 of Way #1.

One sentence explanation

ST changed a value of a word on that cache line but it hasn't yet been written back to memory.

(E) The following code snippet sums the elements of the 32-element integer array X. Assume this code is executing on a Beta with a cache architecture as described above and that, initially, the cache is empty, i.e., all the V bits have been set to 0. Compute the hit ratio as this program runs until it executes the HALT() instruction, a total of \( \frac{2 + (6 \times 32) + 1}{223} = 95.29\% \)

\[
\begin{align*}
. &= 0 \\
+ 2 &\quad \text{CMOVE}(0, R0) \quad \text{// loop counter} \\
+ 2 &\quad \text{CMOVE}(0, R1) \quad \text{// accumulated sum}
\end{align*}
\]

\[
\begin{align*}
\text{loop:} & \\
\text{6 fetch} & \\
\text{1 fetch} & \\
\text{32 iterations} & \\
\text{b: fetch} & \\
\text{bt(R2, loop)} & \quad \text{// nope, keep going}
\end{align*}
\]

\[
\begin{align*}
\text{HALT()} & \quad \text{// all done, sum in R1}
\end{align*}
\]

\[
\begin{align*}
\text{X:} & \quad \text{LONG}(1) \quad \text{// the 32-element integer array X} \\
& \quad \text{LONG}(2) \\
& \quad \ldots \\
& \quad \text{LONG}(32)
\end{align*}
\]

2-way cache: 100% hit rate on ifetch

block size = 4:

- 3 misses to load instructions
- 32 data accesses, but only 25% miss (each miss loads 4 words)

\( \Rightarrow 8 \) data misses total

6.004 Worksheet - 10 of 11 - L13/A.114 – Memory Hierarchy & Caches
Problem 7.

After his gig hit single *I Hit the Line*, renegade singer Johnny Cache has decided he'd better actually learn how a cache works. He bought three Beta processors, identical except for their cache architectures:

- **Beta1** has a 64-line direct-mapped cache
- **Beta2** has a 2-way set associative cache, LRU, with a total of 64 lines
- **Beta3** has a 4-way set associative cache, LRU, with a total of 64 lines

Note that each cache has the same total capacity: 64 lines, each holding a single 32-bit word of data or instruction. All three machines use the same cache for data and instructions fetched from main memory.

Johnny has written a simple test program:

```assembly
// Try a little cache benchmark
I = 0x1000  // where program lives
A = 0x2000  // data region 1
B = 0x3000  // data region 2
N = 16      // size of data regions (BYTES!)

.i = I       // start program here
P:  CMOVE(1000, R6)  // outer loop count
Q:  CMOVE(N, R0)    // Loop index I (array offset)
R:  SUBC(R0, 4, R0)  // I = I - 1
LD(R0, A, R1)      // read A[I]
LD(R0, B, R2)      // read B[I]
BNE(R0, R)
SUBC(R6,1, R6)     // repeat many times
BNE(R6, Q)
HALT()
```

Johnny runs his program on each Beta, and finds that one Beta model outperforms the other two.

(A) (2 points) Which Beta gets the highest hit ratio on the above benchmark?

3 regions: 0x1000, 0x2000, 0x3000  
Circle one: Beta1  Beta2  **Beta3**

(B) (2 points) Johnny changes the value of B in his program to 0x2000 (same as A), and finds a substantial improvement in the hit rate attained by one of the Beta models (approaching 100%). Which model shows this marked improvement?

now just 2 regions: 0x1000, 0x2000  
Circle one: Beta1  **Beta2**  Beta3

(C) (3 points) Finally, Johnny sets I, A, and B each to 0x0, and sets N to 64. What is the TOTAL number of cache misses that will occur executing this version of the program on each of the Beta models?

TOTAL cache misses running on Beta1: __________; Beta2: __________; Beta3: __________

only compulsory misses for all three caches