Performance Considerations with the HMC
What are things I need to consider when designing around the HMC?
- User-defined address mapping can improve performance. For example, accessing a vault from a link in the same (local) quadrant has lower latency than accessing that vault from a link outside the quadrant. Small adjustments to exploit these efficiencies can yield measurable performance gains.
- The main point is that the basic mechanics of memory use stay the same, but each vault now contains multiple banks, and the HMC lets you arrange the address bits however you want (e.g. interleave across all the banks first and then the vaults, or across all the vaults first and then the banks). This gives you considerable flexibility in ordering the address bits, so you can keep your current access patterns while tuning the mapping.
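The bit-ordering flexibility described above can be sketched as follows. This is an illustrative model, not an actual HMC configuration: the field widths (32 vaults, 8 banks per vault, 32-byte blocks) and the function names are assumptions chosen for the example.

```python
# Illustrative sketch of two address-bit orderings. Field widths are
# assumed (32 vaults, 8 banks per vault, 32B blocks), not taken from a
# specific HMC configuration.

BLOCK_BITS = 5   # 32-byte block offset (assumed)
VAULT_BITS = 5   # 32 vaults (assumed)
BANK_BITS  = 3   # 8 banks per vault (assumed)

def map_vaults_first(addr):
    """Low address bits select the vault, then the bank: sequential
    blocks spread across vaults before reusing a vault."""
    blk = addr >> BLOCK_BITS
    vault = blk & ((1 << VAULT_BITS) - 1)
    bank = (blk >> VAULT_BITS) & ((1 << BANK_BITS) - 1)
    return vault, bank

def map_banks_first(addr):
    """Low address bits select the bank within a vault first."""
    blk = addr >> BLOCK_BITS
    bank = blk & ((1 << BANK_BITS) - 1)
    vault = (blk >> BANK_BITS) & ((1 << VAULT_BITS) - 1)
    return vault, bank

# A sequential stream of 32B blocks:
stream = [i * 32 for i in range(8)]
print([map_vaults_first(a) for a in stream])  # eight different vaults
print([map_banks_first(a) for a in stream])   # one vault, eight banks
```

With the first ordering a sequential stream fans out across vaults; with the second it stays in one vault and walks its banks. Which ordering wins depends on the workload's access pattern.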
- Because a read places a load on both the TX and RX sides of the link, a traffic mix with more reads than writes makes better use of link bandwidth. Based on simulation, the chart below was produced at the University of Maryland.1
Size of requests
- If you are using large packets (e.g. 128-byte or 64-byte packets), you should see close to the theoretical HMC bandwidth numbers. This is especially true if you are not using all four links on the part. If you switch to small scatter-gather traffic patterns, bandwidth may drop when all four links are in use; but if you are only using one or two links, you may still see maximum bandwidth even with smaller packet sizes.
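One reason larger packets approach the theoretical numbers is fixed per-packet overhead on the link. The sketch below assumes the HMC packet format of 16-byte FLITs with one FLIT's worth of header-plus-tail overhead per packet; treat the exact overhead as an assumption for illustration.

```python
# Link efficiency vs. payload size, assuming 16-byte FLITs and one
# FLIT of header+tail overhead per packet (illustrative model).

FLIT_BYTES = 16      # one FLIT = 128 bits (assumed)
OVERHEAD_FLITS = 1   # header + tail together occupy one FLIT (assumed)

def link_efficiency(payload_bytes):
    """Fraction of link FLITs that carry payload data."""
    data_flits = payload_bytes // FLIT_BYTES
    return data_flits / (data_flits + OVERHEAD_FLITS)

for size in (16, 32, 64, 128):
    print(size, round(link_efficiency(size), 3))
```

Under this model a 128-byte packet spends roughly 89% of its FLITs on data, while a 16-byte packet spends only 50%, which is consistent with small packets falling short of peak bandwidth.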
- Because the DRAM data bus within each vault has a 32-byte granularity, the bus is used inefficiently when a request starts or ends on a 16-byte boundary rather than a 32-byte boundary. An extreme example is a host issuing a series of 16-byte read requests: each request fetches 32 bytes from the DRAM but returns only 16 bytes in the response packet, discarding the other 16 bytes. For bandwidth optimization it is advantageous to issue requests with 32-byte granularity. See page 118 of the attached specification document.
Size of requests versus size of max block
- The address mode is selected so that sequential requests are generally mapped first to different vaults, and then to different banks within a vault, thus avoiding bank conflicts. This works as intended when the configured maximum block size matches the requested block size: a request stream with a generally incremental addressing pattern then spreads naturally across vaults and banks. But when the maximum block size is larger than the largest packet size the host uses, multiple packets can address the same MAX block, and therefore the same bank within a vault. For example, if the host issues multiple 32B requests, its request stream increments address bits 5 and above, which may try to access four 32B blocks within the same 128B MAX block. The bank-conflict management within each vault controller can re-order a certain number of requests to avoid bank conflicts, but if the request stream floods a vault controller with many requests targeting the same bank, delivered bandwidth will drop. It is therefore important for the system architect to configure the maximum block size based on the most frequently requested data size. See pages 49 and 118 of the attached specification document.
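The 32B-requests-into-one-128B-MAX-block example above can be made concrete. This is a simplified interleaving model with assumed vault and bank counts, not the HMC's exact mapping.

```python
# Sketch: with a 128B MAX block, four sequential 32B requests all land
# in the same (vault, bank). Vault/bank counts are assumed.

MAX_BLOCK = 128   # configured maximum block size in bytes
NUM_VAULTS = 32   # assumed
NUM_BANKS = 8     # assumed banks per vault

def vault_bank(addr):
    """Simplified mapping: the MAX-block index picks the vault, then
    the bank, so one whole MAX block maps to one bank in one vault."""
    blk = addr // MAX_BLOCK
    vault = blk % NUM_VAULTS
    bank = (blk // NUM_VAULTS) % NUM_BANKS
    return vault, bank

# Four sequential 32B requests inside one 128B MAX block:
reqs = [0, 32, 64, 96]
print({vault_bank(a) for a in reqs})  # a single (vault, bank) pair
```

All four requests collapse onto one bank; had MAX_BLOCK been 32 to match the request size, the same stream would have spread across four vaults.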
- The HMC controller is a fully pipelined block designed to maximize throughput. While both read and write operations require multiple clock cycles to complete, the controller allows users to issue many read and/or write requests to the controller before the first response is returned by the HMC. This pipelining of read and write requests greatly improves the throughput of the memory for user applications.
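The benefit of pipelining many outstanding requests can be shown with a back-of-the-envelope cycle count. The latency value and one-request-per-cycle issue rate are assumptions for illustration.

```python
# Toy model of serial vs. pipelined request issue. The latency and
# issue rate are assumed, not measured HMC figures.

LATENCY = 10  # assumed cycles from request to response

def serial_cycles(n):
    """Wait for each response before issuing the next request."""
    return n * LATENCY

def pipelined_cycles(n):
    """Issue one new request per cycle; responses overlap in flight."""
    return LATENCY + n - 1

print(serial_cycles(100), pipelined_cycles(100))  # 1000 vs 109
```

Under this model, 100 pipelined requests finish in roughly a ninth of the serial time, which is the throughput win the bullet above describes.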
What kind of addressing scheme should I use for my workload?
To Be Determined
“A suboptimal address mapping scheme can cause memory requests to continuously cause conflicts to share resources and degrade performance. On the other hand, an optimal address mapping scheme can spread requests evenly and utilize the memory system’s full parallelism. This makes the address mapping scheme a vital consideration for performance. However, the bad news is that address mapping will always be workload specific: an address mapping that is ideal for one workload might cause performance degradation for another workload due to a difference in access pattern over time…Based on simulation, we expect that the vault:bank:partition address mapping scheme results in the best performance and lowest execution time among these workloads. If we consider a vault to be equivalent to a DDRx Channel and a partition to be equivalent to a DDRx rank, then we see that this mapping scheme corresponds exactly to the optimal closed page mapping scheme … the reasoning is fairly straight forward: since a closed page policy assumes a stream with little or no locality (spatial or temporal) then it is best to distance adjacent cache lines by spreading them among the most independent elements (i.e. vaults). Putting the partition bits higher than the bank bits allows the controllers to minimize bus switching time (i.e. reads to different partitions that incur a one-cycle turnaround penalty). Furthermore, placing the bank bits lower in the address reduces the chances of the bank conflict within a given partition.”1
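The vault:bank:partition mapping the quoted study favours can be sketched as a field decode: vault bits lowest (adjacent lines spread across the most independent elements), then bank, with partition bits highest. The field widths below are illustrative assumptions, not the study's configuration.

```python
# Sketch of a vault:bank:partition address decode, low bits to high:
# offset -> vault -> bank -> partition. Field widths are assumed.

OFFSET_BITS = 5      # 32B cache-line offset (assumed)
VAULT_BITS = 5       # 32 vaults (assumed)
BANK_BITS = 3        # 8 banks per vault (assumed)
PARTITION_BITS = 1   # 2 partitions (assumed)

def decode(addr):
    """Split an address into (vault, bank, partition) fields."""
    line = addr >> OFFSET_BITS
    vault = line & ((1 << VAULT_BITS) - 1)
    line >>= VAULT_BITS
    bank = line & ((1 << BANK_BITS) - 1)
    line >>= BANK_BITS
    partition = line & ((1 << PARTITION_BITS) - 1)
    return vault, bank, partition

# Adjacent 32B cache lines land in consecutive vaults:
print([decode(a) for a in (0, 32, 64)])
```

As the quote argues, this placement spreads a low-locality stream across vaults first, while keeping bank bits low and partition bits high within each vault's address slice.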
If I want to use HMC Chaining, what are the most optimal configurations in terms of performance and latency?
To Be Determined