NVIDIA’s Grace CPU Breaks Cover, Features 72 Arm v9.0 Cores Per Chip, 117 MB L3 Cache, 68 Gen 5 Lanes, All on TSMC 4N Process Node

NVIDIA first announced its Grace CPU and the respective Superchip designs at GTC 2022. The Grace CPU is NVIDIA’s first processor based on a custom Arm architecture and is aimed at the server/HPC segment. The CPU comes in two Superchip configurations: a Grace Superchip module with two Grace CPUs, and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU. Some of the main highlights of Grace include:

- High-performance CPU for HPC and cloud computing
- Superchip design with up to 144 Arm v9 CPU cores
- World’s first LPDDR5X with ECC memory, 1 TB/s total bandwidth
- SPECrate2017_int_base over 740 (estimated)
- 900 GB/s coherent interface, 7X faster than PCIe Gen 5
- 2X the packaging density of DIMM-based solutions
- 2X the performance per watt of today’s leading CPU
- Runs all NVIDIA software stacks and platforms, including RTX, HPC, AI, and Omniverse

Being NVIDIA’s first server CPU, Grace features 72 Arm v9.0 cores with support for SVE2 and various virtualization extensions such as Nested Virtualization and S-EL2. The CPU is fabricated on TSMC’s 4N process node, an optimized version of the 5nm process made exclusively for NVIDIA. Grace is designed to be paired with another chip, and as such, one of the most crucial aspects of the design is its C2C (Chip-To-Chip) interconnect. Grace achieves this with NVLINK, which is used to build the Superchips and removes the bottlenecks associated with a typical cross-socket configuration. The C2C NVLINK interconnect provides 900 GB/s of raw bi-directional bandwidth (the same bandwidth as a GPU-to-GPU NVLINK switch on Hopper) while consuming just 1.3 pJ/bit, making it roughly five times more power-efficient than the PCIe protocol.

The NVIDIA Grace CPU features a scalable coherency fabric with a distributed cache design. The chip has up to 3.225 TB/s of bi-section bandwidth, is scalable beyond 72 cores (144 on the Superchip), integrates 117 MB of L3 cache, and supports Arm memory partitioning and monitoring (MPAM). Grace also allows for a unified memory architecture with shared page tables. Two NVIDIA Grace+Hopper Superchips can be interconnected through an NVSwitch, and a Grace CPU on one Superchip can directly communicate with the GPU on the other, or even access its VRAM, at native NVLINK speeds.

Taking a closer look at the memory design, Grace utilizes up to 512 GB of LPDDR5X across 32 channels, delivering up to 546 GB/s of memory bandwidth. NVIDIA states that LPDDR5X provides the best value when balancing overall bandwidth, cost, and power requirements. For I/O, you get 68 PCIe Gen 5.0 lanes, which can be configured as four x16 links at 128 GB/s each, with the remaining lanes used for miscellaneous I/O. There are also 12 coherent NVLINK lanes that are shared with two of the PCIe Gen 5 x16 links.
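The headline bandwidth and efficiency figures can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes LPDDR5X running at 8533 MT/s on 16-bit (2-byte) channels; the per-channel data rate is an assumption on our part, since NVIDIA only quotes the aggregate figure:

```python
# Back-of-envelope checks for the quoted Grace bandwidth figures.
# Assumption: LPDDR5X-8533 on 16-bit (2-byte) channels, 32 channels total.
channel_bw_gbs = 8533e6 * 2 / 1e9           # ~17.07 GB/s per channel
total_mem_bw_gbs = channel_bw_gbs * 32      # aggregate across 32 channels
print(round(total_mem_bw_gbs))              # 546, matching the quoted figure

# NVLink C2C: 900 GB/s bidirectional at 1.3 pJ/bit implies only a few
# watts spent on the link itself.
link_power_w = 900e9 * 8 * 1.3e-12          # bytes/s * bits/byte * J/bit
print(round(link_power_w, 1))               # ~9.4 W
```

At roughly 9.4 W for 900 GB/s, the energy cost of the C2C link is small relative to the 500W package budget, which is the point of quoting pJ/bit efficiency.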
As for TDP, the NVIDIA Grace (CPU-only) Superchip is optimized for single-core performance and offers up to 1 TB/s of memory bandwidth at a TDP of 500W for the 144-core dual-chip configuration. We have already put the numbers into perspective in a previous article. The spec figures alone don't show a dramatic gap, but what we would really like to see are actual performance metrics. The Grace Superchip is rated at around 500W, while each AMD EPYC 7763 chip has a TDP of 280W, so two of them come to around 560W before accounting for additional system power, whereas NVIDIA's 500W figure covers the whole Grace Superchip package. NVIDIA states that Grace is a highly specialized processor targeting workloads such as training next-generation NLP models with more than 1 trillion parameters. When tightly coupled with NVIDIA GPUs, a Grace CPU-based system is expected to deliver 10x the performance of today’s state-of-the-art NVIDIA DGX-based systems, which run on x86 CPUs. It will definitely be interesting to see how the Grace CPUs stack up against x86 chips, but by the time they release, they will be competing against AMD’s Genoa and Intel’s Sapphire Rapids CPUs. The NVIDIA Grace CPUs are planned to be used in the ATOS supercomputer, as reported here.
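The wattage comparison above reduces to simple arithmetic, using the TDP figures quoted in this article (and excluding platform power on both sides):

```python
# Package TDP comparison using the figures quoted in the article.
grace_superchip_tdp_w = 500     # whole 144-core dual-chip Grace Superchip
epyc_7763_tdp_w = 280           # per AMD EPYC 7763 chip
dual_epyc_tdp_w = 2 * epyc_7763_tdp_w

print(dual_epyc_tdp_w)                            # 560
print(dual_epyc_tdp_w - grace_superchip_tdp_w)    # 60 W in Grace's favor
```

Note this is a rated-TDP comparison only; it says nothing about delivered performance per watt, which is what the eventual benchmarks will have to settle.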
