Friday, July 20, 2012

Notes on the Sandy Bridge Uncore (part 1)

For those of you who have never heard of something called the Uncore on a processor: on recent chips, hardware performance monitoring (HPM) is different for the cores and for the part of the chip shared by multiple cores (called the Uncore). Before the Intel Nehalem processor there was no Uncore, and hardware performance monitoring was limited to the cores. Even then the question came up how the shared portions of the processor could be measured. Nehalem introduced the Uncore, which comprises the parts of the microarchitecture shared by multiple cores: the shared last level cache, the main memory controller, or the QPI bus connecting multiple sockets. The Uncore had its own hardware performance monitoring unit with eight counters per Uncore (i.e. per socket). For many tools which use sampling to relate hardware performance counter measurements to code lines this caused great headaches, as Uncore measurements cannot be related to code executed on a specific core anymore.

The LIKWID tool suite has no problem with this, as it restricts itself to simple end-to-end measurements of hardware performance counter data. The relation between the measurement and your code is established by pinning the execution of the code to dedicated cores (which the tool can also do for you). In case you think the Nehalem Uncore was already a bad idea: with the EX type processors Intel introduced a completely new Uncore, which is now a system on a chip (Uncore HPM manual: Intel document reference number 323535). In its first implementation this was very complex to program, with tons of MSR registers that need to be set up and a lot of dependencies and restrictions. The new mainstream server/desktop SandyBridge microarchitecture also uses this system-on-a-chip type Uncore design, but the implementation of the hardware performance monitoring changed again.

First I have to warn you: Intel is not very strict about consistency in naming. E.g. the naming of the MSR registers in the SDM manuals can differ from the naming used for the same MSR registers in documents written in other parts of the company (e.g. the Uncore manuals). This is unfortunately also true for the naming of the entities in the Uncore. The Uncore does not have one HPM unit anymore but a bunch of them. On NehalemEX and WestmereEX the different parts of the Uncore were called boxes: there were MBOXes (main memory controllers) and CBOXes (last level cache segments) and a bunch of others. The same types of boxes still exist in SNB, but they are named differently now, e.g. the MBOXes are now called iMC and the CBOXes are called CBos. Still, in LIKWID I stick with the old naming, since I want to build on the stuff I already implemented for the EX type processors.

The mapping is as follows:

  • Caching agent: CBo (SNB), CBOX (EX)
  • Home agent: HA (SNB), BBOX (EX)
  • Memory controller: iMC (SNB), MBOX (EX)
  • Power control: PCU (SNB), WBOX (EX)
  • QPI: QPI (SNB), SBOX/RBOX (EX)

The complexity comes from the large number of places where you can measure something. You have one Uncore per socket, and each socket has one or more performance monitoring units of the different types. E.g. there are four iMC units, one per memory channel, and each of those has four counters. So on a two socket system you already have 2*4*4 = 32 different memory related counters to measure with.


Previously, hardware performance monitoring was controlled by reading and writing MSR registers (model specific registers). This was still true on the EX type processors. Starting with SNB, the Uncore is now partly programmed through PCI bus address space. Some parts are still programmed using MSR registers, e.g. the CBo boxes, but most of the units are now programmed via PCI config space registers.
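
To make the MSR-based part concrete, here is a minimal sketch (not LIKWID's actual code) of reading a single MSR on Linux through the standard msr driver, which exposes one file per core under /dev/cpu. The helper read_msr and the register address 0x10 are only used for illustration; the real Uncore MSR addresses have to be looked up in the Intel documentation, and you need the msr kernel module loaded plus root privileges.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read one MSR on a given core through /dev/cpu/<core>/msr.
 * The file offset passed to pread() selects the MSR address. */
static int read_msr(int core, uint32_t reg, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", core);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t ret = pread(fd, value, sizeof(*value), reg);
    close(fd);
    return (ret == sizeof(*value)) ? 0 : -1;
}

int main(void)
{
    uint64_t val;
    /* 0x10 (the time stamp counter) is only a placeholder register. */
    if (read_msr(0, 0x10, &val) == 0)
        printf("MSR 0x10 on core 0: 0x%llx\n", (unsigned long long)val);
    return 0;
}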


I am no specialist on PCI buses, but for the practical part it is enough to know that the operating system maps the PCI configuration space. For PCI this is 256 bytes per device, usually accessed with 32-bit reads. The configuration space is organized by BUS / DEVICE / FUNCTION. The BUS is the socket in the HPM sense, or the other way round: there is one additional BUS per socket in the system. The DEVICE is the HPM unit type (e.g. the main memory box) and the FUNCTION is then the concrete HPM unit (e.g. one specific memory channel).

On a two socket SandyBridge-EP system there are the following devices (this is taken from the LIKWID source):

typedef enum {
    PCI_R3QPI_DEVICE_LINK_0 = 0,    /* R3QPI: ring-to-QPI interface, link 0 */
    PCI_R3QPI_DEVICE_LINK_1,        /* R3QPI: ring-to-QPI interface, link 1 */
    PCI_R2PCIE_DEVICE,              /* R2PCIe: ring-to-PCIe interface */
    PCI_IMC_DEVICE_CH_0,            /* memory controller, channel 0 */
    PCI_IMC_DEVICE_CH_1,            /* memory controller, channel 1 */
    PCI_IMC_DEVICE_CH_2,            /* memory controller, channel 2 */
    PCI_IMC_DEVICE_CH_3,            /* memory controller, channel 3 */
    PCI_HA_DEVICE,                  /* home agent */
    PCI_QPI_DEVICE_PORT_0,          /* QPI port 0 */
    PCI_QPI_DEVICE_PORT_1,          /* QPI port 1 */
    PCI_QPI_MASK_DEVICE_PORT_0,
    PCI_QPI_MASK_DEVICE_PORT_1,
    PCI_QPI_MISC_DEVICE_PORT_0,
    PCI_QPI_MISC_DEVICE_PORT_1,
    MAX_NUM_DEVICES
} PciDeviceIndex;

static char* pci_DevicePath[MAX_NUM_DEVICES] = {
 /* DEVICE.FUNCTION part of the PCI path, indexed by PciDeviceIndex above */
 "13.5", "13.6", "13.1", "10.0", "10.1", "10.4",
 "10.5", "0e.1", "08.2", "09.2", "08.6", "09.6",
 "08.0", "09.0" };

So e.g. memory channel 1 (PCI_IMC_DEVICE_CH_1, path string "10.1") on socket 0 is: BUS 0x7f, DEVICE 0x10, FUNCTION 0x1.
The Linux OS maps this memory at different locations in the /sys and /proc filesystems. In LIKWID I use the /proc filesystem, so the device above is accessible via the path /proc/bus/pci/7f/10.1. Unfortunately, if you hexdump such a file as an ordinary user you only get the header part (the first 30-40 bytes); the rest is only visible to root. For LIKWID this means you have to run the tool as root if you want direct access, or you have to set up the daemon mode proxy to access these files. In my next post I will explain how the SNB Uncore is implemented in likwid-perfctr and what performance groups are available.
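
To close with a concrete illustration of this raw access path (a minimal sketch only, not LIKWID's actual implementation): the following program reads a 32-bit register from the config space of the iMC channel 1 device on socket 0. The offset 0x0 used here simply returns the PCI vendor/device ID word and is readable without root; the actual PMON counter register offsets lie beyond the header and have to be taken from the Intel Uncore manual, so for real measurements you need root or the daemon proxy as described above.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* BUS 0x7f, DEVICE 0x10, FUNCTION 0x1: iMC channel 1 on socket 0 */
    const char *path = "/proc/bus/pci/7f/10.1";
    /* Placeholder offset: 0x0 holds the PCI vendor/device ID.
     * Real PMON register offsets come from the Intel Uncore manual. */
    uint32_t reg = 0x0;
    uint32_t data = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* The file offset is the byte offset within the 256 byte config space. */
    if (pread(fd, &data, sizeof(data), reg) != sizeof(data)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("%s offset 0x%x: 0x%08x\n", path, reg, data);
    close(fd);
    return 0;
}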

