Function and Operation of the System Cache



This section discusses the principles behind the design of cache 
memory, and explains how the secondary (level 2) cache works in 
detail. This will give you a much better understanding of how the 
cache works and what the issues are in its design--at least I 
hope it will, because that was my primary goal in writing this. I 
was frustrated as I put the site together with my inability to 
find anything on the 'net that really explained how the cache 
worked.

This section is focused on the secondary cache, but in fact, the 
function of the primary (level 1) cache built into modern 
processors is in many ways identical: in terms of how 
associativity works, how the cache is organized, how the system 
checks for hits, etc. However, many of the implementation 
details are different.

 Note: This is an advanced section with some potentially confusing 
concepts. I make use of examples in order to hopefully make sure 
the explanations make sense. You will find this section most 
helpful if you read all the subsections it contains in order. You 
may also find reading the section explaining system memory 
operation and timing instructive. This page also makes extensive 
reference to memory addresses and locations, and binary numbers. 
If you are not familiar with binary mathematics, you may want to 
read this introductory page on the subject .


Why Caching Works

Cache is in some ways a really amazing technology. A 512 KB level 
2 cache, caching 64 MB of system memory, can supply the 
information that the processor requests 90-95% of the time. Think 
about the ratios here: the level 2 cache is less than 1% of the 
size of the memory it is caching, but it is able to register a 
"hit" on over 90% of requests. That's pretty efficient, and is 
the reason why caching is so important.

The reason that this happens is due to a computer science 
principle called locality of reference. It states basically that 
even within very large programs with several megabytes of 
instructions, only small portions of this code generally get used 
at once. Programs tend to spend large periods of time working in 
one small area of the code, often performing the same work many 
times over and over with slightly different data, and then move 
to another area. This occurs because of "loops", which are what 
programs use to do work many times in rapid succession.

Just as one example (there are many), let's suppose you start up 
your word processor and open your favorite document. The word 
processor program at some point must read the file and then print 
on the screen the text it finds. This is done (in very simplified 
terms) using code similar to this: 

      - Open document file. 
      - Open screen window. 
      - For each character in the document: 
	      - Read the character. 
	      - Store the character into working memory. 
	      - Write the character to the window if the 
		  character is part of the first page. 
      - Close the document file. 

The loop is of course the three instructions that are done "for 
each character in the document". These instructions will be 
repeated many thousands of times, and there are hundreds or 
thousands of loops like these in the software you use. Every time 
you hit "page down" on your keyboard, the word processor must 
clear the screen, figure out which characters to display next, 
and then run a similar loop to copy them from memory to the 
screen. Several loops are used when you tell it to save the file 
to the hard disk.

This example shows how caching improves performance when dealing 
with program code, but what about your data? Not surprisingly, 
access to data (your work files, etc.) is similarly repetitive. 
When you are using your word processor, how many times do you 
scroll up and down looking at the same text over and over, as 
you edit it? The system cache holds much of this information so 
that it can be loaded more quickly the second, third, and next 
times that it is needed.


How Caching Works

In the example in the previous section a loop was used to read 
characters from a file, store them in working memory, and then 
write them to the screen. The first time each of these 
instructions (read, store, write) is executed, it must be loaded 
from relatively slow system memory (assuming it is in memory, 
otherwise it must be read from the hard disk which is much, much 
slower even than the memory).

The cache is programmed (in hardware) to hold recently-accessed 
memory locations in case they are needed again. So each of these 
instructions will be saved in the cache after being loaded from 
memory the first time. The next time the processor wants to use 
the same instruction, it will check the cache first, see that the 
instruction it needs is there, and load it from cache instead of 
going to the slower system RAM. The number of instructions that 
can be buffered this way is a function of the size and design of 
the cache.

Let's suppose that our loop is going to process 1,000 characters 
and the cache is able to hold all three instructions in the loop 
(which sounds obvious, but isn't always, due to cache mapping 
techniques). This means that 999 of the 1,000 times these 
instructions are executed, they will be loaded from the cache, 
or 99.9% of the time. This is why caching is able to satisfy such 
a large percentage of requests for memory even though it has a 
capacity that is often less than 1% the size of the system RAM.


Parts of the Level 2 Cache

The level 2 cache is comprised of two main components. These are 
not usually physically located in the same chips, but represent 
logically how the cache works. These parts of the cache are: 

      - The Data Store: This is where the cached information is 
	actually kept. When reference is made to "storing 
	something in the cache" or "retrieving something from 
	the cache", this is where the actual data goes to or 
	comes from. When someone says that the cache is 256 KB 
	or 512 KB, they are referring to the size of the data 
	store. The larger the store, the more information that 
	can be cached and the more likelihood of the cache being 
	able to satisfy a request, all else being equal. 

      - The Tag RAM: This is a small area of memory used by the 
	cache to keep track of where in memory the entries in 
	the data store belong. The size of the tag RAM--and not 
	the size of the data store--controls how much of main 
	memory can be cached. 

In addition to these memory areas are of course the cache 
controller circuitry. Most of the work of controlling the level 
2 cache on a modern PC is performed by the system chipset .


Structure of the Data Store

Many people think of the cache as being organized as a large 
sequence of bytes (8 bits each). In fact, on a modern 
fifth-generation or later PC, the level 2 cache is organized as 
a set of long cache lines, each containing 32 bytes (256 bits). 
This means that each time the cache is written to or read from, 
a transfer of 32 bytes takes place; there is no way to read or 
write just 1 byte. This is done mainly for performance reasons. 
At the very least, you can't have less than 64 bits per line of 
cache, because the data bus on a Pentium or later PC is 64 bits 
wide . The data store is 256 bits wide because memory is accessed 
in four-read bursts, and 4 times 64 is 256.

Let's take the case of a 512 KB cache (data store). If we wanted 
to mentally envision how this memory is structured, instead of 
seeing a single long column with 524,288 (512 K) individual rows, 
we should instead see 32 columns and 16,384 (16 K) rows. Each 
access to the data store is a line (row), and the cache has 
16,384 different addresses.


Cache Mapping and Associativity

A very important factor in determining the effectiveness of the 
level 2 cache relates to how the cache is mapped to the system 
memory. What this means in brief is that there are many different 
ways to allocate the storage in our cache to the memory addresses 
it serves. Let's take as an example a system with 512 KB of L2 
cache and 64 MB of main memory. The burning question is: how do 
we decide how to divvy up the 16,384 address lines in our cache 
amongst the "huge" 64 MB of memory?

There are three different ways that this mapping can generally 
be done. The choice of mapping technique is so critical to the 
design that the cache is often named after this choice: 

      - Direct Mapped Cache: The simplest way to allocate the 
	cache to the system memory is to determine how many 
	cache lines there are (16,384 in our example) and just 
	chop the system memory into the same number of chunks. 
	Then each chunk gets the use of one cache line. This is 
	called direct mapping. So if we have 64 MB of main 
	memory addresses, each cache line would be shared by 
	4,096 memory addresses (64 M divided by 16 K). 

      - Fully Associative Cache: Instead of hard-allocating cache 
	lines to particular memory locations, it is possible to 
	design the cache so that any line can store the contents 
	of any memory location. This is called fully 
	associative mapping. 

      - N-Way Set Associative Cache: "N" here is a number, 
	typically 2, 4, 8 etc. This is a compromise between the 
	direct mapped and fully associative designs. In this 
	case the cache is broken into sets where each set 
	contains "N" cache lines, let's say 4. Then, each memory 
	address is assigned a set, and can be cached in any one 
	of those 4 locations within the set that it is assigned 
	to. In other words, within each set the cache is 
	associative, and thus the name.

	This design means that there are "N" possible places 
	that a given memory location may be in the cache. The 
	tradeoff is that there are "N" times as many memory 
	locations competing for the same "N" lines in the set. 
	Let's suppose in our example that we are using a 4-way 
	set associative cache. So instead of a single block of 
	16,384 lines, we have 4,096 sets with 4 lines in each. 
	Each of these sets is shared by 16,384 memory addresses 
	(64 M divided by 4 K) instead of 4,096 addresses as in 
	the case of the direct mapped cache. So there is more 
	to share (4 lines instead of 1) but more addresses 
	sharing it (16,384 instead of 4,096). 

Conceptually, the direct mapped and fully associative caches are 
just "special cases" of the N-way set associative cache. You can 
set "N" to 1 to make a "1-way" set associative cache. If you do 
this, then there is only one line per set, which is the same as 
a direct mapped cache because each memory address is back to 
pointing to only one possible cache location. On the other hand, 
suppose you make "N" really large; say, you set "N" to be equal 
to the number of lines in the cache (16,384 in our example). If 
you do this, then you only have one set, containing all of the 
cache lines, and every memory location points to that huge set. 
This means that any memory address can be in any line, and you 
are back to a fully associative cache.


Comparison of Cache Mapping Techniques

There is a critical tradeoff in cache performance that has led to 
the creation of the various cache mapping techniques described in 
the previous section. In order for the cache to have good 
performance you want to maximize both of the following: 

      - Hit Ratio: You want to increase as much as possible the 
	likelihood of the cache containing the memory addresses 
	that the processor wants. Otherwise, you lose much of 
	the benefit of caching because there will be too 
	many misses. 

      - Search Speed: You want to be able to determine as quickly 
	as possible if you have scored a hit in the cache. 
	Otherwise, you lose a small amount of time on every 
	access, hit or miss, while you search the cache. 


Now let's look at the three cache types and see how they fare: 

      - Direct Mapped Cache: The direct mapped cache is the 
	simplest form of cache and the easiest to check for a 
	hit. Since there is only one possible place that any 
	memory location can be cached, there is nothing to 
	search; the line either contains the memory information 
	we are looking for, or it doesn't.

	Unfortunately, the direct mapped cache also has the worst 
	performance, because again there is only one place that 
	any address can be stored. Let's look again at our 512 
	KB level 2 cache and 64 MB of system memory. As you 
	recall this cache has 16,384 lines (assuming 32-byte 
	cache lines) and so each one is shared by 4,096 memory 
	addresses. In the absolute worst case, imagine that the 
	processor needs 2 different addresses (call them X and Y) 
	that both map to the same cache line, in alternating 
	sequence (X, Y, X, Y). This could happen in a small loop 
	if you were unlucky. The processor will load X from 
	memory and store it in cache. Then it will look in the 
	cache for Y, but Y uses the same cache line as X, so it 
	won't be there. So Y is loaded from memory, and stored in 
	the cache for future use. But then the processor requests 
	X, and looks in the cache only to find Y. This conflict 
	repeats over and over. The net result is that the hit 
	ratio here is 0%. This is a worst case scenario, but in 
	general the performance is worst for this type of mapping. 

      - Fully Associative Cache: The fully associative cache has 
	the best hit ratio because any line in the cache can hold 
	any address that needs to be cached. This means the 
	problem seen in the direct mapped cache disappears, 
	because there is no dedicated single line that an 
	address must use.

	However (you knew it was coming), this cache suffers 
	from problems involving searching the cache. If a 
	given address can be stored in any of 16,384 lines, 
	how do you know where it is? Even with specialized 
	hardware to do the searching, a performance penalty is 
	incurred. And this penalty occurs for all accesses to 
	memory, whether a cache hit occurs or not, because it is 
	part of searching the cache to determine a hit. In 
	addition, more logic must be added to determine which of 
	the various lines to use when a new entry must be added 
	(usually some form of a "least recently used" algorithm 
	is employed to decide which cache line to use next). All 
	this overhead adds cost, complexity and execution time.

      - N-Way Set Associative Cache: The set associative cache is 
	a good compromise between the direct mapped and set 
	associative caches. Let's consider the 4-way set 
	associative cache. Here, each address can be cached in 
	any of 4 places. This means that in the example 
	described in the direct mapped cache description above, 
	where we accessed alternately two addresses that map to 
	the same cache line, they would now map to the same cache 
	set instead. This set has 4 lines in it, so one could 
	hold X and another could hold Y. This raises the hit 
	ratio from 0% to near 100%! Again an extreme example, of 
	course. As for searching, since the set only has 4 lines 
	to examine this is not very complicated to deal with, 
	although it does have to do this small search, and it 
	also requires additional circuitry to decide which cache 
	line to use when saving a fresh read from memory. Again, 
	some form of LRU (least recently used) algorithm is 
	typically used. 

Here's a summary table of the different cache mapping techniques 
and their relative performance:

Cache Type		Hit Ratio		Search Speed
---------------------   ---------------------   ---------------------
Direct Mapped		Good			Best

Fully Associative	Best			Moderate

N-Way Set Associative,	Very Good, 		Good, 
 N>1			Better as N Increases	Worse as N Increases

In the "real world", the direct mapped and set associative caches 
are by far the most common. Direct mapping is used more for level 
2 caches on motherboards, while the higher-performance 
set-associative cache is found more commonly on the smaller 
primary caches contained within processors.


Tag Storage

Since each cache line (or set) in the data store is shared by a 
large number of memory addresses that map to it, we need to keep 
track of which one is using each cache line at a given time. This 
is what the tag RAM is used for.

Let's take a look at the same example again: a system with 64 MB 
of main memory, a 512 KB cache, and 32-byte cache lines. There 
are 16,384 cache lines, and therefore 4,096 different memory 
locations that share each line. However, recall that each line 
contains 32 bytes; that means 32 different bytes can be placed in 
each line without interfering with each other. So really, there 
are 128 (4,096 divided by 32) different 32-byte lines of memory 
that must share a cache spot.

Okay, now to address 64 MB of memory you need 26 address lines 
(because 2^26 is 64 M) which are numbered from A0 to A25. 512 KB 
only requires 19 lines, A0 to A18. The difference between these 
is 7 lines; not surprisingly, since 128 is 2^7. These 7 address 
lines are what tell you which of the 128 different address lines 
that can use a given cache line, are actually using it at the 
moment. That's what the tag RAM is for. There will be as many 
entries in the tag RAM as there are in the data store, so we will 
have 16,384 tag RAM lines, although of course these entries are 
only a few bits wide, not 32 bytes wide like the data store.

Notice that the tag RAM is used early in the process of 
determining whether or not we have a cache hit. This means that 
no matter how fast the cache data store is, the tag RAM must be 
slightly faster.


How the Memory Address Is Used

The memory address provided by the processor represents which byte 
of information the processor is looking for at a given time. This 
is looked at in three sections by the cache controller as it does 
its work of checking for hits. This example is the same as before 
(64 MB memory, 512 KB cache, direct mapping to keep things simple) 
so we again have 26 address bits, A0 through A25: 

	A0 to A4: The lowest-order 5 bits represent the 32 
	different bytes within the data store (2^5 = 32). Recall 
	that the cache we are looking at has 32 byte lines, all 
	of which are moved around together. Therefore, the 
	address bits A0 to A4 are ignored by the cache 
	controller; the processor will use them later to 
	determine which to use of the 32 bytes it receives from 
	the cache. 

	A5 to A18: These 14 bits represent the cache line that 
	this address maps to. 2^14 is 16,384, which is the total 
	number of cache lines in our example, as you recall. This 
	cache line address is used for looking up both the tag 
	address in the tag RAM, and later the actual data in the 
	data store if there is a hit. 

	A19 to A25: These 7 bits represent the tag address, which 
	tells the system which of the possible memory locations 
	that share the cache line (indicated by address lines A5 
	to A18) is currently using it. 

If the numbers in the example change, so do these ranges. If 
instead we have 32 MB of memory, 128 KB of cache, and 16 byte 
cache lines, then A0 to A3 are ignored, A4 to A16 represent the 
cache line address, and A17 to A24 are the tag address.


Cache Write Policy and the Dirty Bit

In addition to caching reads from memory, the system is capable 
of caching writes to memory. The handling of the address bits and 
the cache lines, etc. is pretty similar to how this is done when 
the cache is read. However, there are two different ways that the 
cache can handle writes, and this is referred to as the "write 
policy" of the cache. 

	Write-Back Cache: Also called "copy back" cache, this 
	policy is "full" write caching of the system memory. When 
	a write is made to system memory at a location that is 
	currently cached, the new data is only written to the 
	cache, not actually written to the system memory. Later, 
	if another memory location needs to use the cache line 
	where this data is stored, it is saved ("written back") 
	to the system memory and then the line can be used by the 
	new address. 

	Write-Through Cache: With this method, every time the 
	processor writes to a cached memory location, both the 
	cache and the underlying memory location are updated. 
	This is really sort of like "half caching" of writes; the 
	data just written is in the cache in case it is needed to 
	be read by the processor soon, but the write itself isn't 
	actually cached because we still have to initiate a 
	memory write operation each time. 

Many caches that are capable of write-back operation can also be 
set to operate as write-through (not all however), but not 
generally the other way around.

Comparing the two policies, in general terms write-back provides 
better performance, but at the slight risk of memory integrity. 
Write-back caching saves the system from performing many 
unnecessary write cycles to the system RAM, which can lead to 
noticeably faster execution. However, when write-back caching is 
used, writes to cached memory locations are only placed in cache, 
and the RAM itself isn't actually updated until the cache line is 
booted out to make room for another address to use it.

As a result, at any given time, there can be a mismatch between 
many of the lines in the cache and the memory addresses that 
they correspond to. When this happens, the data in the memory is 
said to be "stale", since it doesn't have the fresh information 
yet that was only written to the cache. Memory used with a 
write-through cache can never be "stale" because the system 
memory is written at the same time that the cache is. 

Normally, stale memory isn't a problem, because the cache 
controller keeps track of which locations in the cache have been 
changed and therefore which memory locations may be stale. This 
is done by using an extra single bit of memory, one per cache 
line, called the "dirty bit". Whenever a write is cached, this 
bit is set (made a 1) to tell the cache controller "when you 
decide to re-use this cache line for a different address, you 
need to write the current contents back to memory". This dirty 
bit is normally implemented by adding one extra bit to the tag 
RAM, instead of using a separate memory chip (to save cost). 

However, the use of a write-back cache does entail the small 
possibility of data corruption if something were to happen before 
the "dirty" cache lines could be saved to memory. There aren't 
too many cases where this could happen, because both the memory 
and the cache are volatile (cleared when the machine is 
powered off).

On the other hand, consider a disk cache, where system memory is 
used to cache writes to the disk. Here, the memory is volatile 
but the disk is not. If a write-back cache is used here, you 
could have stale data on your disk compared to what is in memory. 
Then, if the power goes out, you lose everything that hadn't yet 
been written back to the disk, leading to possible corruption. 
For this reason, most disk caches allow programs to over-rule the 
write-back policy to ensure consistency between the cache (in 
memory) and disk. Disk utilities, for example, don't like 
write-back caching very much!

It is also possible with many caches to tell the controller 
"please write out to system memory all dirty cache lines, right 
now". This is done when it is necessary to make sure that the 
cache is in sync with the memory, and there is no stale data. 
This is sometimes called "flushing" the cache, and is especially 
common with disk caches, for the reason outlined in the previous 
paragraph.


Summary: The Cache Read/Write Process

Having looked at all the parts and design factors that make up a 
cache, in this section the actual process is described that is 
followed when the processor reads or writes from the system 
memory. This example is the same as in the other sections on this 
page: 64 MB memory, 512 KB cache, 32 byte cache lines. I will 
assume a direct mapped cache, since that is the simplest to 
explain (and is in fact most common for level 2 cache): 

1.   The processor begins a read/write from/to the system memory. 

2.   Simultaneously, the cache controller begins to check if the 
information requested is in the cache, and the memory controller 
begins the process of either reading or writing from the system 
RAM. This is done so that we don't lose any time at all in the 
event of a cache miss; if we have a cache hit, the system will 
cancel the partially-completed request from RAM, if appropriate. 
If we are doing a write on a write-through cache, the write to 
memory always proceeds. 

3.   The cache controller checks for a hit by looking at the 
address sent by the processor. The lowest five bits (A0 to A4) 
are ignored, because these differentiate between the 32 
different bytes in the cache line. We aren't concerned with that 
because the cache will always return the whole 32 bytes and let 
the processor decide which one it wants. The next 14 lines (A5 
to A18) represent the line in the cache that we need to check 
(notice that 2^14 is 16,384). 

4.   The cache controller reads the tag RAM at the address 
indicated by the 14 address lines A5 to A18. So if those 14 bits 
say address 13,714, the controller will examine the contents of 
tag RAM entry #13,714. It compares the 7 bits that it reads from 
the tag RAM at this location to the 7 address bits A19 to A25 
that it gets from the processor. If they are identical, then the 
controller knows that the entry in the cache at that line address 
is the one the processor wanted; we have a hit. If the tag RAM 
doesn't match, then we have a miss. 

5.   If we do have a hit, then for a read, the cache controller 
reads the 32-byte contents of the cache data store at the same 
line address indicated by bits A5 to A18 (13,714), and sends them 
to the processor. The read that was started to the system RAM is 
canceled. The process is complete. For a write, the cache 
controller writes 32 bytes to the data store at that same cache 
line location referenced by bits A5 to A18. Then, if we are using 
a write-through cache the write to memory proceeds; if we are 
using a write-back cache, the write to memory is canceled, and 
the dirty bit for this cache line is set to 1 to indicate that 
the cache was updated but the memory was not. 

6.   If we have a miss and we were doing a read, the read of 
system RAM that we started earlier carries on, with 32 bytes 
being read from memory at the location specified by bits A5 to 
A25. These bytes are fed to the processor, which uses the lowest 
five bits (A0 to A4) to decide which of the 32 bytes it wanted. 
While this is happening the cache also must perform the work of 
storing these bytes that were just read from memory into the 
cache so they will be there for the next time this location is 
wanted. If we are using a write-through cache, the 32 bytes are 
just placed into the data store at the address indicated by bits 
A5 to A18. The contents of bits A19 to A25 are saved in the tag 
RAM at the same 14-bit address, A5 to A18. The entry is now ready 
for any future request by the processor. If we are using a 
write-back cache, then before overwriting the old contents of the 
cache line, we must check the line's dirty bit. If it is set (1) 
then we must first write back the contents of the cache line to 
memory, and then clear the dirty bit. If it is clear (0) then the 
memory isn't stale and we continue without the write cycle. 

7.   If we have a cache miss and we were doing a write, 
interestingly, the cache doesn't do much at all, because most 
caches don't update the cache line on a write miss. They just 
leave the entry that was there alone, and write to memory, 
bypassing the cache entirely. There are some caches that put all 
writes into the appropriate cache line whenever a write is done. 
They make the general assumption that anything the processor has 
just written, it is likely to read back again at some point in 
the near future. Therefore, they treat every write as a hit, by 
definition. This means there is no check for a hit on a write; in 
essence, the cache line that is used by the address just written 
is always replaced by the data that was just put out by the 
processor. It also means that on a write miss the cache 
controller must update the cache, including checking the dirty 
bit on the entry that was there before the write, exactly the 
same as what happens for a read miss. 

As complex as it already is :^) this example would of course be 
even more complex if we used a set associative or fully 
associative cache. Then we would have a search to do when 
checking for a hit, and we would also have the matter of deciding 
which cache line to update on a cache miss.
