Micron HMC Controller
What HMC specification does the Pico HMC controller implement?
The controller implements the full Hybrid Memory Cube Consortium Specification 1.1, which corresponds to the second-generation HMC.
What are the currently supported FPGA devices?
The HMC controller currently works with Altera Stratix V and Xilinx Kintex UltraScale devices. For more information, see here.
What kind of interface does the HMC controller have?
The controller has either a raw 640-bit interface or an AXI4 multi-port interface.
What are the benefits of using one interface over the other?
The multi-port interface makes it much easier to generate packets: it takes care of creating well-formed packets according to the protocol rather than leaving you to assemble raw packets yourself. This feature can be turned on or off depending on your application's needs.
If you use the raw interface, you will have to handle most of this work yourself: building up the flits, using correct values for all the header fields, keeping tags from overlapping, etc.
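To make concrete what "build up the flit" means on the raw interface, here is a minimal sketch of packing a 64-bit request header. The field order (CMD in the low bits, then length, tag, and address) follows the spirit of the HMC 1.1 request layout, but the widths and offsets used here are simplified placeholders of our own; consult the specification for the authoritative bit positions.

```python
# Hypothetical sketch of packing a raw HMC request header.
# Field offsets below are illustrative placeholders, not the
# authoritative HMC 1.1 layout; check the spec before using.

def pack_request_header(cmd, addr, tag, lng):
    """Pack a 64-bit request header from its fields (assumed layout)."""
    assert 0 <= cmd < (1 << 6)    # command code
    assert 0 <= lng < (1 << 4)    # packet length in flits
    assert 0 <= tag < (1 << 9)    # transaction tag
    assert 0 <= addr < (1 << 34)  # byte address
    header = cmd                  # CMD in the low bits
    header |= lng << 7            # packet length
    header |= tag << 15           # tag, so responses can be matched up
    header |= addr << 24          # target address
    return header

# Example: a 9-flit request (illustrative command code and address).
hdr = pack_request_header(cmd=0x09, addr=0x100, tag=3, lng=9)
assert hdr & 0x3F == 0x09
```

On the raw interface, every such field value, plus non-overlapping tags, is the user's responsibility; the multi-port interface does this packing for you.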
What are the bandwidth numbers using the HMC?
Bandwidth numbers can be viewed here. However, the controller itself is designed so that it never has to throttle the link at any point. Anything you see in the chip's performance model will therefore be available to you through our controller; we do not introduce any extra bottlenecks.
What clock speed is the controller running at?
The clock speed is 375 MHz at 15 Gbps and 250 MHz at 10 Gbps. Since Gen3 will have 30 Gbps transceivers, either the 640-bit bus will have to widen or the clock speed will have to increase.
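A quick arithmetic check (the function names here are our own, just for illustration) shows why these clock rates pair with a 640-bit bus: at each rate, the bus moves exactly the aggregate line rate of 16 transceiver lanes per cycle.

```python
# Sanity check: a 640-bit bus at the quoted clock rates matches the
# aggregate line rate of 16 transceiver lanes, which is how the
# transceiver gearbox arrives at 640 bits per cycle.

def bus_gbps(width_bits, clock_mhz):
    """Parallel-bus throughput in Gb/s."""
    return width_bits * clock_mhz / 1000.0

def lanes_gbps(num_lanes, lane_rate_gbps):
    """Aggregate serial-lane throughput in Gb/s."""
    return num_lanes * lane_rate_gbps

assert bus_gbps(640, 375) == lanes_gbps(16, 15.0) == 240.0  # 15 Gbps mode
assert bus_gbps(640, 250) == lanes_gbps(16, 10.0) == 160.0  # 10 Gbps mode
```

The same arithmetic shows the Gen3 problem: 16 lanes at 30 Gbps is 480 Gb/s, which a 640-bit bus only reaches at 750 MHz, hence the choice between a wider bus and a faster clock.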
What is the latency internal to the Controller?
The total combined latency of the HMC controller varies from roughly 100 ns to 300 ns round trip (covering both the RX and TX sides for a whole round-trip transaction). The range depends on how the controller is configured and which features your application uses.

For example, if you use the multi-port interface, the Pico controller takes care of creating well-formed packets according to the HMC protocol rather than leaving you to assemble raw packets; this feature can be turned off to reduce latency. Another big piece of the picture is the link-retry feature and the link CRC, which requires the controller to run complete CRC checks on all incoming data before it is actually delivered. This is one of the main features that pushes the latency toward ~300 ns. Without it, the controller sits at ~140 ns, or even down to ~100 ns if you skip the multi-port features and use just the raw interface. There are a few reasons why our customers might turn off the CRC checks on incoming data prior to delivery:
- Customers have an application architecture (sitting on top of the controller) that allows the error to be squashed downstream. In other words, the controller can run its CRC checks in parallel with data delivery, throwing an error flag that is then handled within the application architecture itself. This way, the controller does not have to gate data until it is absolutely certain the data was received correctly.
- Customers have hardware designed with enough margin that they can turn their retry features off, or keep only the part that throws an error flag without actually retraining the link.
NOTE: In the rare event of a retry, a long tail is added to the ~300 ns latency.
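The "squash downstream" pattern from the first bullet can be sketched in a few lines. This is an illustrative model, not the controller's implementation; `zlib.crc32` stands in for the link CRC defined by the HMC spec, and `deliver` is a hypothetical downstream consumer.

```python
# Sketch of checking CRC in parallel with delivery: data goes out
# immediately (no gating, lowest latency) and a late error flag lets
# the application squash the bad transaction afterwards.
import zlib

delivered = []

def deliver(payload):
    """Hypothetical downstream consumer; just records what it saw."""
    delivered.append(payload)

def deliver_with_parallel_crc(payload, received_crc):
    """Deliver the payload right away; return the CRC result afterwards."""
    deliver(payload)                            # no gating on delivery
    return zlib.crc32(payload) == received_crc  # app squashes on False

data = b"example flit data"
assert deliver_with_parallel_crc(data, zlib.crc32(data))   # clean case
assert not deliver_with_parallel_crc(data, 0)              # flagged bad
assert delivered == [data, data]   # both were delivered without gating
```

The gated alternative, by contrast, would hold the payload until the CRC comparison completes, which is exactly the extra latency the bullet points describe avoiding.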
Why is the interface 640 bits wide rather than a binary multiple like 256/512/1024?
This came about as a natural compromise among several constraints. The Xilinx and Altera transceivers have slightly different gearboxes for taking 16 streams of data and turning them into 640 bits, and that has to be balanced against clock speed. Narrower is better, and 512 would have been a nice binary multiple, but it would have required running at almost 450 MHz, which is pushing things a bit too far. We are trying to get as narrow as possible without running the clock too fast. We could go all the way up to 1024 bits, which is what OpenSilicon was doing for a while, but that is too wide and too slow and causes more problems than it solves. Also, while 512 sounds nice, it does not actually work out well with the packet sizes: the biggest packet, with a 128-byte payload, is 8 flits of data plus a header and tail, for 9 flits total, and 9 flits do not divide into 512-bit words well.
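The 9-flit arithmetic can be checked numerically. This toy model assumes each packet starts on a fresh bus cycle; a real controller may pack packets back-to-back, but the utilization argument is the same.

```python
# Worked check of the flit math above: HMC flits are 128 bits, so a
# 640-bit bus moves 5 flits per cycle while a 512-bit bus would move 4.
FLIT_BITS = 128

def packet_cycles(payload_bytes, bus_bits):
    """Cycles to move one packet, assuming it starts on a cycle boundary."""
    data_flits = payload_bytes * 8 // FLIT_BITS
    total_flits = data_flits + 1               # header + tail add one flit
    flits_per_cycle = bus_bits // FLIT_BITS
    return -(-total_flits // flits_per_cycle)  # ceiling division

# Largest packet: 128-byte payload -> 8 data flits + 1 = 9 flits total.
assert packet_cycles(128, 640) == 2   # 10 flit slots used, 1 wasted
assert packet_cycles(128, 512) == 3   # 12 flit slots used, 3 wasted
```

On the 640-bit bus the largest packet wastes one flit slot in ten; on a 512-bit bus it would waste three in twelve, which is part of why 512 "doesn't work out well with the packet sizes."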
Is there a User Guide on how to best optimize the HMC?
We do not yet have a User Guide on how to best optimize the HMC. We are working on these numbers. However, we do have a User Guide on the HMC controller itself.
Is there actual command scheduling in the controller or is it all in order from the perspective of the user interface?
The memory cube itself may reschedule requests. Because it has the performance to do multiple things simultaneously, it will let requests pass each other, which can mean responses come back to the controller out of order. We do have the ability to configure additional logic in the controller that reorders the data if your application requires it. At that point, it is a matter of determining your requirements: lowest latency versus in-order transactions.
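The kind of reordering logic described above can be sketched as a small tag-based reorder buffer. This is an illustrative model of the idea, not the controller's actual implementation: responses arrive in any order, and the buffer releases them in the order their requests were issued.

```python
# Minimal sketch of tag-based response reordering: the cube may return
# responses out of order, and a reorder buffer holds early arrivals
# until all older requests have completed.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.issue_order = deque()   # tags in the order requests were sent
        self.pending = {}            # tag -> response data, held out of order

    def issue(self, tag):
        self.issue_order.append(tag)

    def receive(self, tag, data):
        """Accept an out-of-order response; return all in-order releases."""
        self.pending[tag] = data
        released = []
        while self.issue_order and self.issue_order[0] in self.pending:
            released.append(self.pending.pop(self.issue_order.popleft()))
        return released

rob = ReorderBuffer()
for tag in (0, 1, 2):
    rob.issue(tag)
assert rob.receive(1, "B") == []           # held: tag 0 still outstanding
assert rob.receive(0, "A") == ["A", "B"]   # in-order burst released
assert rob.receive(2, "C") == ["C"]
```

The trade-off the answer mentions is visible here: response "B" was ready early but had to wait for "A", which is exactly the latency cost of demanding in-order delivery.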
How much of the FPGA does the HMC controller use in terms of LUT, Block ram, interconnect, etc?
This has about as much variability as the latency numbers, but with all the debugging features and ports enabled, the controller uses about 40K ALMs in Altera FPGAs. This can drop to ~20K ALMs if only the raw interface is used.
What kind of loss do you get when utilizing the HMC?
The specification allows for 17-20dB of loss. The loss on the Pico boards has not been quantified.
Are there any design examples using the controller?
GUPS has been implemented on the SB-800 and the AC-510 (see ‘HMC Hardware and Systems’ for more information). This design will be included with your purchase of the board.
There are broad application examples listed here. More information and design examples are in the works.
What does your controller do to maximize throughput?
The HMC controller is a fully pipelined block designed to maximize throughput. While both read and write operations require multiple clock cycles to complete, the controller allows users to issue many read and/or write requests to the controller before the first response is returned by the HMC. This pipelining of read and write requests greatly improves the throughput of the memory for user applications.
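The throughput benefit of pipelining can be illustrated with a toy timing model (our own simplification, not a simulation of the controller): with a fixed round-trip latency, completed requests per window scale with how many requests are allowed in flight at once.

```python
# Toy model of why pipelining matters: with round-trip latency L cycles
# and at most one new request issued per cycle, throughput is limited by
# the number of outstanding requests the interface allows.

def requests_completed(total_cycles, latency, max_outstanding):
    """Requests finished in a window, issuing one per cycle when allowed."""
    completed, in_flight = 0, []     # in_flight holds completion times
    for cycle in range(total_cycles):
        while in_flight and in_flight[0] <= cycle:
            in_flight.pop(0)         # oldest request has returned
            completed += 1
        if len(in_flight) < max_outstanding:
            in_flight.append(cycle + latency)   # issue a new request
    return completed

# Serial (one outstanding request) vs. deeply pipelined, 100-cycle latency:
serial = requests_completed(10_000, latency=100, max_outstanding=1)
pipelined = requests_completed(10_000, latency=100, max_outstanding=128)
assert pipelined > 50 * serial   # pipelining dominates when latency is long
```

The serial case completes roughly one request per 100 cycles, while the pipelined case approaches one per cycle, which is the behavior the answer describes: many requests are issued to the controller before the first response returns.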
Is ECC done within the HMC or within the Controller?
Find more information here.