Cache Characteristics



This section discusses the different features of the level 2 
cache. These are the characteristics you will normally need to 
understand when selecting a motherboard or upgrading the cache 
in your existing system. Several of the topics here are explained 
in much more detail in Function and Operation of the System Cache; 
the focus of this page is on the higher-level performance aspects 
of the various cache features.


Cache Speed

No single number completely determines the "speed" of the system 
cache. Instead, we must consider both the raw speed of the 
components used and how the circuitry uses them. This is exactly 
the same situation as with the system RAM itself; saying "my RAM 
is 60 ns" tells only a small part of the story.

The "raw" speed of the cache is the speed of the RAM chips used 
to make it. Caches are normally made from static RAM (SRAM) chips, 
unlike main system memory, which is made from dynamic RAM (DRAM). 
The short version of the difference between the two is that 
static RAM is faster but also more expensive. SRAMs typically 
have access times of 7 to 20 ns; DRAMs, on the other hand, are 
usually 50 to 70 ns.

The speed of the SRAM chips sets the upper bound on performance; 
it is up to the motherboard and chipset designer to make full use 
of it. Consider a Pentium motherboard whose memory bus runs at 
66 MHz, which means 66.66 million cycles per second. Taking the 
reciprocal gives the cycle time: 1 divided by 66.66 million is 
about 15 nanoseconds. For the motherboard to read from the cache 
in one cycle at this speed, the SRAM must be faster than 15 ns 
(there is some overhead time as well, so exactly 15 ns won't 
work). SRAM that is faster than the cycle requires brings no 
additional benefit; SRAM that is slower causes timing problems, 
which usually manifest themselves as memory errors and system 
lockups.
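The cycle-time arithmetic above can be sketched in Python. The text only says that some overhead exists on top of the SRAM access time, so the overhead_ns parameter and the 2 ns value used below are hypothetical assumptions for illustration, not a chipset specification.

```python
def cycle_time_ns(bus_mhz: float) -> float:
    """Clock period in nanoseconds: the reciprocal of the frequency."""
    return 1000.0 / bus_mhz  # 1 / (MHz * 1e6) seconds, expressed in ns


def sram_fits_one_cycle(sram_ns: float, bus_mhz: float, overhead_ns: float) -> bool:
    """True if an SRAM access plus a fixed overhead fits in one bus cycle.

    overhead_ns is a hypothetical lumped delay (address decode, buffers,
    setup time); real values depend on the chipset and board design.
    """
    return sram_ns + overhead_ns <= cycle_time_ns(bus_mhz)


period = cycle_time_ns(66.66)
print(f"66.66 MHz bus: {period:.1f} ns per cycle")
for sram in (12, 15):
    print(sram, "ns SRAM fits:", sram_fits_one_cycle(sram, 66.66, overhead_ns=2.0))
```

With the assumed 2 ns of overhead, 12 ns SRAM fits within the 15 ns cycle and 15 ns SRAM does not, which is the point the text makes about exactly 15 ns not being enough.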

The tag RAM used as part of the cache must normally be faster than 
the actual cache data store. This is because the tag RAM must be 
read first to check for a cache hit: on a hit, we want to be able 
to check the tag and still have enough time to read the data store 
within a single clock cycle. So, for example, you may find that 
your system's main cache chips are 15 ns, while the tag may be 
12 ns.

The more complicated the cache mapping technique, the more 
important the difference in speed between the tag and the data 
store. Simple techniques like direct mapping generally don't 
require much of a difference at all, and your system may use the 
same speed for all of the cache memory. For example, if the system 
needs 15 ns for the tag and 16 ns for the data store, the 
motherboard may just specify 15 ns for everything, since this is 
simpler. In any event, if your motherboard doesn't already come 
with level 2 cache installed, buy whatever speed the motherboard 
manual or your dealer specifies.

The true speed of any cache, in terms of how quickly it actually 
transfers information to and from the processor (and so speeds up 
your applications), depends on the cache controller and other 
chipset circuits. The capabilities of the chipset determine what 
kind of transfer technologies your cache can use. This in turn 
determines your cache's optimal system timing: the number of clock 
cycles required to move data in and out of the cache. This is 
discussed in detail in a separate section.

The performance of the cache also depends greatly on the speed at 
which the cache subsystem runs. In a typical Pentium machine this 
is the speed of the memory bus, 66 MHz. The Pentium Pro processor, 
however, has an integrated level 2 cache that runs at full 
processor speed, normally 180 or 200 MHz. Obviously, this yields 
superior performance! The Intel Pentium II instead uses a 
daughterboard with level 2 cache running at half the processor 
speed; with a 233 or 266 MHz chip, that still means much better 
performance than running the cache at 66 MHz.
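As a quick Python sketch, the three arrangements just described can be reduced to an L2 clock rate and a cycle time. This only converts clock rates into cycle times; it does not model wait states or transfer rates, which also depend on the chipset. The l2_clock_mhz helper is a hypothetical name used for illustration.

```python
def l2_clock_mhz(cpu_mhz: float, divisor: float) -> float:
    """L2 clock when the cache is driven at cpu_mhz / divisor."""
    return cpu_mhz / divisor


# The three arrangements mentioned in the text, reduced to an L2 clock rate.
configs = {
    "Pentium (L2 on 66 MHz memory bus)": 66.66,
    "Pentium Pro 200 (full-speed L2)": l2_clock_mhz(200, 1),
    "Pentium II 266 (half-speed L2)": l2_clock_mhz(266, 2),
}

for name, mhz in configs.items():
    # The cycle time is the reciprocal of the clock rate, in nanoseconds.
    print(f"{name}: {mhz:.1f} MHz, {1000 / mhz:.1f} ns per cycle")
```

This prints cycle times of about 15 ns, 5 ns and 7.5 ns respectively, which is why even a half-speed cache is a large step up from one running on a 66 MHz bus.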


Cache Size

The size of the cache normally refers to the size of the data 
store, where the cached memory contents are actually held. A 
typical PC level 2 cache is either 256 KB or 512 KB, but it can be 
as small as 64 KB on older machines or as large as 1 MB or even 
2 MB. Level 1 caches within processors usually range in size from 
8 KB to 64 KB.

The more cache the system has, the more likely it is to register 
a hit on a memory access, because fewer memory locations are 
forced to share the same cache line. Let's use an example to 
illustrate (the same one we used when we discussed cache 
operation in detail). We have a system with 64 MB of memory and 
512 KB of direct-mapped cache, arranged into 32-byte cache lines. 
This means that we have 16,384 cache lines (512 KB divided by 32 
bytes). Each line is shared by 4,096 memory addresses (64 MB 
divided by 16,384), that is, by 128 different 32-byte blocks of 
memory. Now if we increase the amount of cache to 1 MB, we will 
have 32,768 cache lines, and each will be shared by only 2,048 
addresses. Conversely, if we leave the cache at 512 KB but 
increase the system memory to 256 MB, each of the 16,384 cache 
lines will be shared by 16,384 addresses.
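The sharing arithmetic in this example can be checked with a short Python sketch. The direct_mapped_sharing function is a hypothetical helper; it counts byte addresses per line, as the text does, and also reports how many distinct 32-byte memory blocks compete for each line.

```python
KB = 1024
MB = 1024 * KB


def direct_mapped_sharing(memory_bytes: int, cache_bytes: int, line_bytes: int):
    """Return (cache lines, byte addresses per line, memory blocks per line)
    for a direct-mapped cache."""
    lines = cache_bytes // line_bytes
    addresses_per_line = memory_bytes // lines
    blocks_per_line = addresses_per_line // line_bytes
    return lines, addresses_per_line, blocks_per_line


examples = [
    (64 * MB, 512 * KB),   # the standard example
    (64 * MB, 1 * MB),     # cache doubled
    (256 * MB, 512 * KB),  # memory quadrupled
]
for memory, cache in examples:
    print(direct_mapped_sharing(memory, cache, line_bytes=32))
```

The three results are (16384, 4096, 128), (32768, 2048, 64) and (16384, 16384, 512), matching the figures above.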

There are many areas in the computer world where Pareto's Law, or 
more generally the law of diminishing returns, applies, and cache 
size is definitely one of them. If you have a 256 KB cache on a 
system using 32 MB, increasing the cache by 100% to 512 KB will 
probably increase the hit ratio by less than 10%. Doubling it 
again will likely yield an increase of less than 5%. In the real 
world, this difference is not noticeable to most people. However, 
if you greatly increase the amount of system memory you use, you 
will probably want to increase your cache as well to prevent a 
degradation in performance. Just be sure to watch the system RAM 
cacheability issue closely.


System RAM Cacheability

This is one of the most misunderstood aspects of the caching 
equation. The amount of RAM that the system can cache is very 
important if you are going to be using a lot of system memory. 
Almost all modern fifth-generation systems can cache 64 MB of 
system memory; however, many systems, even newer ones, cannot 
cache more than that. Intel's popular 430FX ("Triton I"), 430VX 
(one of the "Triton II"s, also called "Triton III") and 430TX 
chipsets do not cache more than 64 MB of main memory, and there 
are millions and millions of these PCs on the market.

If you put more memory in a system than can be cached, the result 
is a performance decrease. The speed differential between the 
cache and main memory is significant; that's why we use a cache in 
the first place. :^) Every access to memory that is not cached 
must go all the way to main memory, which is much slower. In 
addition, when using a multitasking operating system (pretty much 
anything other than DOS these days), you can't really control 
what ends up in cached memory and what ends up in non-cached 
memory, unless you really know what you are doing.

The keys to how much memory your system can cache are, first, the 
design of the chipset and, second, the width of the tag RAM. The 
more memory you have, the more address lines you need to specify 
an address. This means there are more address bits to store in 
the tag RAM for checking whether an access is a cache hit. Of 
course, if the chipset isn't designed to cache more than 64 MB, an 
extra-wide tag RAM won't help anyway.

Let's take our standard example again: 64 MB of memory, 512 KB 
cache, 32-byte cache lines. As described in the detailed 
discussion of cache operation, 64 MB means 26 address lines (A0 
to A25). A0 to A4 specify the byte within the cache line, A5 to 
A18 specify the cache line, and A19 to A25 go into the tag RAM to 
record which memory address is currently using the cache line. 
That's 7 bits; let's say our tag RAM is 8 bits wide, and we are 
reserving one bit for the "dirty bit" to allow write-back 
operation of the cache. So we're fine: we have enough tag memory 
in the cache. Now, suppose we add another 32 MB of memory. To 
address 96 MB you need another address line, A26, whose value 
must also be held in the tag RAM. Hmm, we have a problem, because 
now we need 9 bits in our tag RAM and it only has 8.
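The address split in this example can be reproduced with a small Python sketch. Both functions are hypothetical helpers; address_fields assumes the cache and line sizes are powers of two (as in the example), and tag_ram_fits reserves one bit for the dirty bit by default, following the text's 8-bit tag RAM scenario.

```python
KB = 1024
MB = 1024 * KB


def address_fields(memory_bytes: int, cache_bytes: int, line_bytes: int):
    """Split an address into (offset, index, tag) bit counts for a
    direct-mapped cache. Cache and line sizes are assumed to be powers of two."""
    address_bits = (memory_bytes - 1).bit_length()  # lines needed for the top address
    offset_bits = (line_bytes - 1).bit_length()     # byte within the line
    index_bits = (cache_bytes // line_bytes - 1).bit_length()  # which cache line
    tag_bits = address_bits - offset_bits - index_bits  # stored in the tag RAM
    return offset_bits, index_bits, tag_bits


def tag_ram_fits(tag_bits: int, tag_ram_width: int, dirty_bit: bool = True) -> bool:
    """True if the tag (plus an optional dirty bit) fits in the tag RAM."""
    return tag_bits + (1 if dirty_bit else 0) <= tag_ram_width


for memory in (64 * MB, 96 * MB):
    offset, index, tag = address_fields(memory, 512 * KB, 32)
    print(memory // MB, "MB:", offset, index, tag, "fits:", tag_ram_fits(tag, 8))
```

For 64 MB this gives 5 offset bits (A0 to A4), 14 index bits (A5 to A18) and a 7-bit tag that fits; for 96 MB the tag grows to 8 bits, and with the dirty bit it no longer fits in an 8-bit tag RAM.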

The only mainstream Pentium chipset that supports caching over 
64 MB is Intel's 430HX "Triton II". In fact, caching over 64 MB on 
this chipset is "optional": the motherboard manufacturer has to 
use an 11-bit tag RAM instead of the default 8-bit one. The extra 
3 bits increase cacheability from 64 MB to 512 MB (2^3 = 8, and 
64*8 = 512).
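The 430HX arithmetic can be written as a tiny Python function. Its defaults are the figures from the text (8-bit baseline tag RAM covering 64 MB, with a 512 MB ceiling); the chipset_max_mb parameter is a hypothetical knob added to reflect the earlier point that a wider tag RAM cannot get past what the chipset itself was designed to cache.

```python
def cacheable_limit_mb(tag_ram_bits: int, baseline_bits: int = 8,
                       baseline_mb: int = 64, chipset_max_mb: int = 512) -> int:
    """Memory (MB) a cache can cover, given the tag RAM width.

    Each tag bit beyond the baseline doubles the cacheable range, but the
    result can never exceed what the chipset itself was designed to cache.
    """
    extra_bits = tag_ram_bits - baseline_bits
    if extra_bits < 0:
        raise ValueError("tag RAM narrower than the baseline is not modelled")
    return min(baseline_mb * 2 ** extra_bits, chipset_max_mb)


print(cacheable_limit_mb(8))                      # default 8-bit tag RAM: 64
print(cacheable_limit_mb(11))                     # optional 11-bit tag RAM: 512
print(cacheable_limit_mb(11, chipset_max_mb=64))  # chipset capped at 64 MB: 64
```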

Many people confuse system RAM size with system RAM cacheability. 
The common belief is that adding more cache will let you cache 
more RAM, but as you can see, it is really the tag RAM and chipset 
that control this. Further complicating the matter, some companies 
put extra tag RAM on their COASt modules. So a user will insert a 
256 KB COASt module and think that the larger cache is what let 
him cache more system memory, when really it was the extra tag RAM 
that did it.

Pentium Pro PCs use an integrated level 2 cache that contains the 
tag RAM within it, so none of this is really a concern for these 
machines. The Pentium Pro will cache up to 4 GB of main memory, 
basically anything you can throw at it. The Pentium II uses an SEC 
daughtercard with the same general architecture as the Pentium 
Pro, but due to a design limitation it will "only" cache up to 
512 MB. This isn't nearly as much of an issue as a 64 MB barrier, 
but since the PII is used in many high-end applications, it might 
be a concern for some people.

One question that people ask a lot is: "How much will the system 
slow down if I have more RAM in it than can be cached?" There is 
no easy answer, because it depends both on the system and on what 
you are doing with it. Somewhere between 5% and 25% is most 
likely, but bear something else in mind: adding real physical 
memory is one way to avoid the extreme slowdown that occurs when 
the system runs out of real memory and must use virtual memory. If 
you are doing heavy multitasking and notice that the system is 
thrashing, you will always be better off having more memory, even 
uncached, than having the system swap heavily to disk. Of course, 
having all the memory cached is still preferable.


Integrated vs. Separate Data and Instruction Caches

Most (all?) level 2 caches hold both data and processor 
instructions (code, programs). They don't differentiate between 
the two because they see both as just memory addresses. However, 
many processors use a split design for their level 1 cache. For 
example, the Intel "Classic" Pentium (P54C) processor uses an 8 KB 
cache for data and a separate 8 KB cache for program instructions. 
This is more efficient given the way the processor is designed, 
and doesn't really affect performance very much compared to a 
single 16 KB cache, though it might lead to a very slightly lower 
hit ratio. Each of these caches can have different 
characteristics; for example, they can use different mapping 
techniques (as they do on the Pentium Pro).


Mapping Technique

The cache mapping technique is another factor that determines how 
effective the cache is, that is, what its hit ratio and speed 
will be. It is discussed in detail in the section on cache 
operation, but briefly, the three types are:

	Direct Mapped Cache: Each memory location is mapped to a 
	single cache line that it shares with many others; only 
	one of the many addresses that share this line can use it 
	at a given time. This is the simplest technique both in 
	concept and in implementation. Using this cache means the 
	circuitry to check for hits is fast and easy to design, 
	but the hit ratio is relatively poor compared to the 
	other designs because of its inflexibility. 
	Motherboard-based system caches are typically 
	direct mapped. 

	Fully Associative Cache: Any memory location can be 
	cached in any cache line. This is the most complex 
	technique and requires sophisticated search algorithms 
	when checking for a hit. It can lead to the whole cache 
	being slowed down because of this, but it offers the best 
	theoretical hit ratio since there are so many options for 
	caching any memory address. 

	N-Way Set Associative Cache: "N" is typically 2, 4, 8, etc. 
	A compromise between the two previous designs, the cache is 
	broken into sets of "N" lines each, and any memory address 
	can be cached in any of the "N" lines of the set it maps to. 
	This improves hit ratios over the direct mapped cache, but 
	without incurring a severe search penalty (since "N" is 
	kept small). The 2-way or 4-way set associative cache is 
	common in processor level 1 caches.
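The three mapping techniques can be seen as points on one scale, which the following toy Python model sketches: with ways=1 it behaves as a direct mapped cache, and with ways equal to the number of lines it behaves as a fully associative cache. The class name and the LRU replacement policy within a set are assumptions made for this sketch; it tracks tags only and does not model timing or real hardware replacement logic.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set-associative cache that tracks tags only (no data).

    ways=1 gives a direct-mapped cache; ways=num_lines gives a fully
    associative one. LRU replacement within a set is an assumption made
    for this sketch; real caches may use other policies.
    """

    def __init__(self, num_lines: int, line_bytes: int, ways: int):
        if num_lines % ways:
            raise ValueError("num_lines must be a multiple of ways")
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = num_lines // ways
        # One ordered dict per set: keys are tags, ordered oldest -> newest use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        block = address // self.line_bytes
        index = block % self.num_sets   # which set the block maps to
        tag = block // self.num_sets    # identifies the block within that set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = None
        return False


# Two addresses that collide in a 4-line direct-mapped cache, accessed alternately.
trace = [0, 128] * 3
for ways in (1, 2, 4):
    cache = SetAssociativeCache(num_lines=4, line_bytes=32, ways=ways)
    for address in trace:
        cache.access(address)
    print(f"{ways}-way: {cache.hits} hits, {cache.misses} misses")
```

Addresses 0 and 128 map to the same line when the cache is direct mapped, so every access evicts the other and all six accesses miss. With 2 or more ways, both blocks can live in the same set, leaving only the two initial misses.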


Write Policy

The cache's write policy determines how it handles writes to 
memory locations that are currently held in the cache. The two 
policy types, described in more detail in Function and Operation 
of the System Cache, are:

	Write-Back Cache: When the system writes to a memory 
	location that is currently held in cache, it only writes 
	the new information to the appropriate cache line. When 
	the cache line is eventually needed for some other memory 
	address, the changed data is "written back" to system 
	memory. This type of cache provides better performance 
	than a write-through cache, because it saves on 
	(time-consuming) write cycles to memory. 

	Write-Through Cache: When the system writes to a memory 
	location that is currently held in cache, it writes the 
	new information both to the appropriate cache line and 
	the memory location itself at the same time. This type 
	of caching provides worse performance than write-back, 
	but is simpler to implement and has the advantage of 
	internal consistency, because the cache is never out of 
	sync with the memory the way it is with a 
	write-back cache. 

Both write-back and write-through caches are used extensively, 
with write-back designs more prevalent in newer machines.
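The performance difference between the two policies comes down to how many writes reach main memory, which the following Python sketch counts for a small direct-mapped cache. It rests on simplifying assumptions that the text does not specify: write misses allocate a line, each write-through write and each write-back of a dirty line counts as one memory operation (even though a write-back moves a whole line), and dirty lines left at the end are flushed. The memory_writes function is a hypothetical helper.

```python
def memory_writes(trace, policy: str, num_lines: int = 4, line_bytes: int = 32) -> int:
    """Count write operations that reach main memory for a direct-mapped cache.

    trace is a list of ("r" or "w", address) pairs. Simplifying assumptions:
    write misses allocate a line, each write-through write and each
    write-back of a dirty line counts as one memory write, and dirty lines
    still in the cache at the end are flushed.
    """
    if policy not in ("write-back", "write-through"):
        raise ValueError(f"unknown policy: {policy}")
    tags = [None] * num_lines     # tag currently held by each cache line
    dirty = [False] * num_lines   # the "dirty bit" stored alongside each tag
    writes = 0
    for op, address in trace:
        block = address // line_bytes
        index, tag = block % num_lines, block // num_lines
        if tags[index] != tag:                     # miss: the line is replaced
            if dirty[index]:
                writes += 1                        # write the old data back first
            tags[index], dirty[index] = tag, False
        if op == "w":
            if policy == "write-through":
                writes += 1                        # memory updated on every write
            else:
                dirty[index] = True                # memory update deferred
    return writes + sum(dirty)                     # flush remaining dirty lines


loop = [("w", 0)] * 10   # the same location updated ten times in a row
print(memory_writes(loop, "write-through"), memory_writes(loop, "write-back"))
```

For ten writes to the same location, the write-through cache sends all ten to memory while the write-back cache sends one, when the line is finally written back.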


Transactional or Non-Blocking Cache

Most caches can only handle one outstanding request at a time. 
If a request is made to the cache and there is a miss, the cache 
must wait for the memory to supply the value that was needed, and 
until then it is "blocked". A non-blocking cache has the ability 
to work on other requests while waiting for memory to supply 
any misses.

The Intel Pentium Pro and Pentium II processors use this 
technology for their level 2 caches, which can manage up to four 
simultaneous requests. This is done by using a transaction-based 
architecture, and a dedicated "backside" bus for the cache that 
is independent of the main memory bus. Intel calls this "dual 
independent bus" (DIB) architecture.
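The benefit of a non-blocking cache can be illustrated with a simple timing model in Python. This is a sketch under hypothetical assumptions, not a model of the Pentium Pro's actual bus protocol: a fixed 1-cycle hit and 10-cycle miss latency, one request issued per cycle in order, and hits allowed to proceed while misses are pending. The default of four outstanding misses mirrors the four simultaneous requests mentioned above.

```python
import heapq


def blocking_finish(requests, hit_cycles: int, miss_cycles: int) -> int:
    """A blocking cache services one request at a time, so latencies add up."""
    return sum(hit_cycles if hit else miss_cycles for hit in requests)


def non_blocking_finish(requests, hit_cycles: int, miss_cycles: int,
                        max_outstanding: int = 4) -> int:
    """Cycle at which the last request completes when misses may overlap.

    One request is issued per cycle, in order. Hits are served while misses
    are pending; a new miss must wait if max_outstanding misses are in flight.
    """
    in_flight = []   # min-heap of completion cycles of outstanding misses
    issue = 0        # cycle at which the current request is issued
    finish = 0
    for hit in requests:
        if hit:
            done = issue + hit_cycles
        else:
            while in_flight and in_flight[0] <= issue:
                heapq.heappop(in_flight)           # retire completed misses
            if len(in_flight) >= max_outstanding:
                issue = heapq.heappop(in_flight)   # stall until a slot frees up
            done = issue + miss_cycles
            heapq.heappush(in_flight, done)
        finish = max(finish, done)
        issue += 1
    return finish


burst = [False, False, False, False]   # four misses back to back (True = hit)
print(blocking_finish(burst, 1, 10), non_blocking_finish(burst, 1, 10))
```

Under these assumptions, four back-to-back misses take 40 cycles on a blocking cache but finish by cycle 13 when they can overlap; with max_outstanding=1 and only misses, the model degrades to the blocking case.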
