NUMA + Catalina = WIN!
To fully appreciate the memory architecture in the Cisco UCS blade servers, we must first discuss the memory architecture of the Nehalem Xeon processors. Prior to the release of Nehalem, all other x86 and x64 processors from Intel were based on a uniform memory access (UMA) architecture. AMD had introduced a non-uniform memory access (NUMA) architecture with their Opteron platform, and it has proven to be a very effective memory architecture; so effective that Intel decided to implement it as well.
So what is UMA exactly, anyway?
Let's look at an oversimplified block diagram of a standard UMA architecture:
The UMA architecture is basically multiple CPUs accessing the same banks of memory through a shared bus and a shared memory controller. The upside of this architecture is that either CPU, accessing any bank of memory, gets the same performance, since both share the same access mechanism to the RAM. The downside is that the front-side bus can become a bottleneck. This can be alleviated to some degree by adding a crossbar switching fabric, multiple front-side buses, and multiple memory controllers, but the added complexity drives up cost without a large enough increase in performance.
OK, simple enough. Now what's NUMA all about?
Let's begin with another simple block diagram of Intel's implementation of NUMA on the Nehalem processors:

- Each CPU has three memory channels (0, 1, and 2) on its integrated memory controller, with each channel handling up to three DIMMs. Between the CPUs and the I/O controller hub is a point-to-point "QuickPath Interconnect" (QPI).
- Each CPU gets blazing-fast access to its local memory, since it isn't going over a shared bus (approximately 61 - 62 ns response time).
- If a CPU needs to access its neighbor CPU's memory, it can do so via the QPI connection. Since the memory access is being "proxied" by the neighboring processor, it takes longer (approximately 112 - 115 ns).
- The architecture gets its name because each CPU gets differing performance when accessing memory in different locations. In other words, performance is non-uniform.
- The upside of this architecture is the fast memory access.
- The downside of this architecture is that, in order to keep everything optimized, your OS or hypervisor needs to be NUMA-aware, so that a process and the memory it uses are allocated on the same CPU. Fortunately, most modern operating systems and hypervisors are NUMA-aware.
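To get a feel for why NUMA awareness matters, here is a rough sketch that blends the local and remote latencies quoted above. It assumes a simple weighted average and uses the midpoints of the quoted ranges; the function name and the `remote_fraction` parameter are made up for illustration, and real workloads (caches, prefetching, contention) will behave differently.

```python
# Midpoints of the latency ranges quoted in the list above.
LOCAL_NS = 61.5    # local memory access (approximately 61 - 62 ns)
REMOTE_NS = 113.5  # access via QPI to the neighbor's memory (approximately 112 - 115 ns)


def average_latency_ns(remote_fraction: float) -> float:
    """Weighted-average latency when `remote_fraction` of accesses go over QPI.

    This is an illustrative model only, not a measurement.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS


if __name__ == "__main__":
    for remote_fraction in (0.0, 0.25, 0.5, 1.0):
        latency = average_latency_ns(remote_fraction)
        print(f"{remote_fraction:>4.0%} remote -> {latency:.1f} ns average")
```

Under this simple model, a NUMA-unaware scheduler that ends up with half of a process's memory on the other CPU pays roughly 40% more per access than one that keeps everything local.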
Wow, what a difference. The "tri-channel" memory approach is effective, allowing for more memory bandwidth, and you get a plethora of DIMM slots (18 in total). So you could get some really nice memory densities using 8GB DIMMs, right? Well, let's take a closer look at what happens. Each memory channel can access RAM at speeds of up to 1333MHz. Impressive. But as you add a second DIMM, you add more noise and capacitance to the memory bus, making 1333MHz impossible, so the bus speed drops to 1066MHz. Pretty good still, but when you add a third DIMM, the signal quality takes another hit, and the bus speed drops to 800MHz. Ouch! This is where the fine line between memory density and memory bandwidth becomes apparent. Now, there are ways to get around some of the noise issues, for example using registered DIMMs instead of unbuffered DIMMs. The details of registered vs. unbuffered memory are a topic we can delve into in another article, but suffice it to say, registered DIMMs increase memory cost as well as memory latency.
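The density-versus-bandwidth trade-off can be tabulated with a few lines of arithmetic. This is only a sketch: it uses the bus speeds from the paragraph above (with 1066MHz for two DIMMs per channel, matching the figure used later in the article), treats the "MHz" figure as millions of transfers per second, and assumes a 64-bit (8-byte) DDR3 channel to estimate a theoretical peak. Sustained bandwidth will be lower, and the function name is hypothetical.

```python
CHANNELS_PER_CPU = 3
DIMM_GB = 8                # the 8GB DIMMs discussed in the text
BYTES_PER_TRANSFER = 8     # assumption: 64-bit wide DDR3 channel

# Bus speed (million transfers per second) by DIMMs populated per channel.
SPEED_BY_DIMMS_PER_CHANNEL = {1: 1333, 2: 1066, 3: 800}


def per_cpu_tradeoff(dimms_per_channel: int) -> tuple[int, int, float]:
    """Return (capacity in GB, bus speed, theoretical peak GB/s) for one CPU."""
    speed = SPEED_BY_DIMMS_PER_CHANNEL[dimms_per_channel]
    capacity_gb = CHANNELS_PER_CPU * dimms_per_channel * DIMM_GB
    peak_gb_per_s = CHANNELS_PER_CPU * speed * BYTES_PER_TRANSFER / 1000
    return capacity_gb, speed, peak_gb_per_s


if __name__ == "__main__":
    print("DIMMs/channel  capacity  speed  peak bandwidth")
    for dimms in sorted(SPEED_BY_DIMMS_PER_CHANNEL):
        capacity_gb, speed, peak = per_cpu_tradeoff(dimms)
        print(f"{dimms:>13}  {capacity_gb:>5} GB  {speed:>5}  {peak:>9.1f} GB/s")
```

Going from one to three DIMMs per channel triples the capacity per CPU (24GB to 72GB) while the theoretical peak bandwidth falls by about 40%.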
Well, that's quite a pickle. How can you get large memory densities without sacrificing memory bandwidth?
Enter Cisco's Catalina chip. Let's take a look at another block diagram of a single Intel Nehalem CPU utilizing Cisco's Catalina chip.
The Catalina chip sits directly on the memory bus of each memory channel. It effectively acts as a quick circuit switch, and flips over to each bank of two DIMMs "on-the-fly". Actually, it takes approximately 5ns to switch over, and this only affects the first word (8 bytes) retrieved from that memory bank; once it is switched over, all subsequent memory accesses to that bank happen at wire speed. Speaking of wire speed: electrically, each memory bank is only two DIMMs, so all of the memory runs at 1066MHz. So we get 24 DIMM slots per processor, all at 1066MHz. Without the Catalina, in order to get 1066MHz, you would be limited to 6 DIMM slots per processor (two per memory channel). The Catalina chip is only used on the "expanded memory" B250M1 server blade, which gives you 48 DIMM slots. With 8GB DIMMs, you could get a 384GB maximum memory capacity, all at 1066MHz.
The Catalina chip allows us to get excellent memory bandwidth with unheard-of memory density, all for a 5ns switching delay on the access of the first 8 bytes of memory from a particular bank, and no delay on all subsequent accesses (until it needs to switch to another bank). That's a small price to pay for having your cake and eating it too.
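The numbers above are easy to sanity-check. The sketch below reproduces the capacity figures for a two-socket blade and estimates how the 5ns switch penalty shrinks when it is spread over several accesses to the same bank. The even-spreading model and the `accesses_per_switch` parameter are assumptions for illustration, not something Cisco documents.

```python
DIMM_GB = 8                        # 8GB DIMMs, as in the text
CPUS_PER_BLADE = 2                 # 48 slots on the blade / 24 slots per processor
SLOTS_PER_CPU_WITH_CATALINA = 24
SLOTS_PER_CPU_WITHOUT = 6          # two DIMMs on each of three channels at 1066MHz
SWITCH_PENALTY_NS = 5.0            # approximate cost of switching banks


def blade_capacity_gb(slots_per_cpu: int) -> int:
    """Maximum blade capacity in GB with every slot holding one DIMM."""
    return slots_per_cpu * CPUS_PER_BLADE * DIMM_GB


def amortized_switch_ns(accesses_per_switch: int) -> float:
    """Average added delay per access if one bank switch is followed by
    `accesses_per_switch` accesses to the same bank (hypothetical model)."""
    if accesses_per_switch < 1:
        raise ValueError("accesses_per_switch must be at least 1")
    return SWITCH_PENALTY_NS / accesses_per_switch


if __name__ == "__main__":
    print("With Catalina:   ", blade_capacity_gb(SLOTS_PER_CPU_WITH_CATALINA), "GB at 1066MHz")
    print("Without Catalina:", blade_capacity_gb(SLOTS_PER_CPU_WITHOUT), "GB at 1066MHz")
    for accesses in (1, 8, 64):
        print(f"{accesses:>3} accesses per switch -> "
              f"{amortized_switch_ns(accesses):.3f} ns extra per access")
```

At the same 1066MHz bus speed, that is 384GB against 96GB, and the switch penalty quickly becomes negligible when consecutive accesses stay within one bank.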
That's the first taste of Cisco's "secret sauce", and I tell you, I think it tastes good. In the next article, we are going to look into the hypervisor bypass technologies implemented in Cisco's Palo CNA adapter.
Last Updated on Sunday, 30 January 2011 17:44.




