In IRAM systems [1], the interconnection scheme between the DRAM memory and the processor is a critical system component for several reasons. Such a scheme has to meet the processor's requirements for high memory bandwidth, low memory latency and the ability to serve multiple independent data (address) streams in parallel, in order to achieve high system performance. At the same time, it has to be area efficient, in order to reduce cost and increase the area available for on-chip memory, and it has to have low power consumption. An additional constraint is that the interconnection scheme should not affect the normal operation of the memory blocks.
The potential floorplan of an IRAM system using a vector unit as its main processor is presented in Figure 1. Each slice of the vector processor, called a lane, contains one or more load/store units that need to be interconnected with the memory subsystem. Multiple load/store units would also exist on-chip if some other processor architecture, such as multiscalar, VLIW or multiprocessor-on-a-chip, were adopted. It is desirable, for performance reasons, that each load/store unit be able to issue one memory operation on every clock cycle. In order to enable multiple memory operations to be initiated simultaneously and served in parallel, the memory subsystem is divided into a number of memory sections that operate in parallel. A memory section is defined as the minimum memory partition that can be addressed independently and, under specific conditions, can sustain a throughput of one memory operation per clock cycle. Load/store units are connected to the memory sections through a crossbar switch. No other interconnection topology can be used, due to the requirements for full connectivity between load/store units and memory sections and for parallelism of accesses, both necessary for high performance.
In this work, we address the memory-logic interconnection by examining the architecture and circuits of its two basic components: the crossbar switch and the memory section.
Memory sections and load/store units are connected through a crossbar switch. There are two basic issues in the crossbar design: the specific architecture of the switch and the type of bus used to implement it. An ideal solution should provide low power consumption and high speed with conservative area requirements. Noise immunity is another desirable feature.
It is important to understand the required functionality of the crossbar. On every clock cycle, the crossbar must be able to transfer a load or store request from every load/store unit to a memory section (assuming no conflicts). The width of each path established should be equal to the sum of the word length (64 bits), the address length (<32 bits) and the number of control signals. At the same time, it must be able to transfer a word from every memory section to a load/store unit. These words can be replies to previously issued read requests. The width of this path must be equal to that of a word plus a few control signals. This bidirectional functionality, with wide paths, is necessary for achieving the high parallelism and pipelining of memory accesses that lead to low latency and high bandwidth. It is clear that, even for a small number of load/store units and memory sections, such a crossbar can lead to considerable wiring overhead. Moreover, its area complexity grows quadratically with the number of units/sections and the width of the word.
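As a rough illustration of this overhead (and not part of our evaluation methodology), the following Python sketch counts the wires of a full crossbar under the path widths quoted above; the unit/section counts and the number of control signals are illustrative assumptions.

```python
# Back-of-the-envelope model of crossbar wiring. Only the 64-bit word and
# <32-bit address come from the text; everything else is an assumption.

WORD_BITS = 64     # data word
ADDR_BITS = 32     # upper bound on address width
CTRL_BITS = 4      # assumed handful of control signals per path

def crossbar_wires(n_units, n_sections):
    """Total wires for a full crossbar: every load/store unit gets a request
    path to every section, and every section gets a reply path back."""
    request_path = WORD_BITS + ADDR_BITS + CTRL_BITS   # store data + address + control
    reply_path   = WORD_BITS + CTRL_BITS               # load data + control
    return n_units * n_sections * (request_path + reply_path)

for n in (2, 4, 8):
    print(n, "x", n, "->", crossbar_wires(n, n), "wires")
# Doubling both dimensions quadruples the wire count: the quadratic growth
# mentioned in the text.
```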
Our first goal is to evaluate a number of crossbar switch architectures and structures, like those in [2], [3]. The issues to examine are the ability to efficiently lay out each architecture in an IRAM system, taking advantage of the block placement in the system floorplan and the availability of multiple metal layers, the speed of these architectures and their scaling behavior. We also want to examine both parallel and serial implementations of such structures. A serial approach has the advantage of reduced area requirements, at the cost of additional delay for serial-to-parallel and parallel-to-serial conversions. Yet the bandwidth requirements for the crossbar are such that a single-wire serial approach may not be feasible. For this reason, we will also examine potential hybrid solutions.
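A quick feasibility check makes the concern about a single-wire serial path concrete. In the sketch below, the path width comes from the text, while the 200MHz system clock and the per-wire serial signalling rate are illustrative assumptions only.

```python
# Rough feasibility check for a single-wire serial crossbar path.

CLOCK_HZ  = 200e6          # assumed processor/crossbar clock
PATH_BITS = 64 + 32 + 4    # word + address + assumed control signals
SERIAL_HZ = 800e6          # assumed per-wire serial data rate

required_rate = PATH_BITS * CLOCK_HZ          # bits/s one path must carry
wires_needed  = required_rate / SERIAL_HZ     # serial wires per path at that rate

print(f"required per-path rate: {required_rate/1e9:.1f} Gb/s")
print(f"serial wires needed at {SERIAL_HZ/1e6:.0f} Mb/s each: {wires_needed:.0f}")
# If many wires are needed anyway, a hybrid (partly parallel) scheme is the
# natural compromise mentioned in the text.
```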
The next goal of this work is to evaluate bus circuits for the crossbar implementation. The type of busses used will affect the speed, the area and the power of the switch. In addition, it will determine the amount of noise induced on the DRAM banks by the switch. Placing a high-frequency processor and a wide, fast crossbar switch on the same die with DRAM memory could couple noise into the storage cell capacitors and affect read/write operations. For these reasons, we want to evaluate full-swing, reduced-swing and differential busses with respect to their maximum data rate, power consumption and area requirements. Bus length will be an important parameter in this case, and the results may indicate the need to use some pipelined bus structure in order to achieve high clock speeds.
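The power side of the full-swing versus reduced-swing comparison can be previewed with the first-order dynamic power expression P = a * C * Vdd * Vswing * f. The sketch below uses placeholder numbers (capacitance, supply, swing, clock) that are assumptions, not data from the TI process.

```python
# First-order dynamic power of one crossbar bus wire. All numbers are
# illustrative assumptions, not process data.

def bus_power(c_wire, vdd, vswing, freq_hz, activity=0.5):
    """Dynamic power of a single wire toggling with the given activity factor."""
    return activity * c_wire * vdd * vswing * freq_hz

C_WIRE = 2e-12      # assumed 2 pF wire capacitance
VDD    = 3.3        # assumed supply for an older 0.5 um process
FREQ   = 200e6      # assumed clock rate

full_swing    = bus_power(C_WIRE, VDD, VDD, FREQ)     # rail-to-rail bus
reduced_swing = bus_power(C_WIRE, VDD, 0.5, FREQ)     # assumed 0.5 V swing

print(f"full swing:    {full_swing*1e3:.2f} mW per wire")
print(f"reduced swing: {reduced_swing*1e3:.2f} mW per wire")
# Power scales with the swing; the smaller voltage excursions of a
# reduced-swing bus should also couple less noise into the DRAM arrays.
```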
A memory section in an IRAM system is defined as the memory subsystem that can perform one memory operation per clock cycle (under certain conditions). It consists of a section controller and a number of memory banks. The section controller receives word read/write operations at a maximum rate of one per clock cycle (assuming no misses for the corresponding DRAM pages) and generates the proper commands (such as precharge, RAS, CAS, etc.) for the corresponding memory bank. Memory banks are connected to the section controller through some bus structure. Other than the proper decoders, multiplexers, sense amplifiers and latches, the banks contain no additional "intelligence". In other words, they simply perform the operation the controller instructs.
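The division of labor between the section controller and the "dumb" banks can be summarized by a minimal behavioural sketch. It assumes a single open page per bank; the command names follow the text (precharge, RAS, CAS), but all timing details and the class interface are illustrative.

```python
# Minimal sketch of the section controller's command generation, assuming
# one open page per bank. Timing and interface details are placeholders.

class SectionController:
    def __init__(self, n_banks):
        self.open_row = [None] * n_banks   # currently open DRAM page per bank

    def access(self, bank, row, col, write=False):
        """Return the command sequence sent to the (dumb) memory bank."""
        cmds = []
        if self.open_row[bank] != row:          # page miss
            if self.open_row[bank] is not None:
                cmds.append(("PRECHARGE", bank))
            cmds.append(("RAS", bank, row))     # open the new page
            self.open_row[bank] = row
        cmds.append(("CAS_W" if write else "CAS_R", bank, col))
        return cmds

ctrl = SectionController(n_banks=4)
print(ctrl.access(0, row=17, col=3))            # miss: RAS + CAS
print(ctrl.access(0, row=17, col=4))            # hit: CAS only
```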
The first issue here is the potential section floorplan. Traditional DRAMs have been structured with the assumption that only a few bits (1-4) need to be transferred from each bank to the "external world". In an IRAM section, we would like to be able to transfer a whole word (64 bits) from a bank to the section controller. Interleaving the bits of a word across multiple banks is not desirable for power consumption and performance reasons. Hence, we need to revisit the placement of memory blocks, decoders, sense amplifiers and busses in order to achieve such functionality.
The next issue to examine is the interface (protocol) between the section controller and the memory banks. The section must be able to deliver one word per clock cycle to the processor. Since sending a read request and receiving a reply within a single clock cycle at frequencies above 200MHz seems infeasible even without page misses, a pipelined scheme has to be adopted. We intend to propose a flexible (both for performance and for power savings) pipelined interface based on techniques used in SDRAM [5], prefetched SDRAM [6] and high-bandwidth embedded DRAM [7] or SRAM [8].
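The point of the pipelined interface is that a multi-cycle access latency need not limit throughput. The toy model below makes this explicit; the 4-cycle request-to-data latency is an assumed number used only for illustration.

```python
# Toy model of the pipelined section interface: a read issued in cycle t
# returns its data in cycle t + LATENCY, so back-to-back reads to open pages
# still deliver one word per cycle once the pipeline fills.

LATENCY = 4   # assumed request-to-data latency in clock cycles

def schedule_reads(n_reads):
    """Return (issue_cycle, data_cycle) pairs for n back-to-back page hits."""
    return [(t, t + LATENCY) for t in range(n_reads)]

for issue, data in schedule_reads(6):
    print(f"read issued cycle {issue:2d} -> data cycle {data:2d}")
# Data arrives in cycles 4, 5, 6, ... : one word per cycle throughput
# despite a latency of several cycles.
```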
The factor that will determine the clock speed of the pipelined interface is the speed of the bus connecting the memory banks to the section controller. Traditionally, one would use a full-swing precharged bus. Yet such a scheme can suffer in terms of both speed and power consumption. Another concern is the noise induced on the memory banks by a bus switching at full swing at >200MHz. For these reasons, we will evaluate the speed, power consumption and area requirements in a DRAM process of both full-swing [9] and low-swing or differential busses [9][10][11][12]. Depending on the results, we may also want to examine a hierarchical bus structure (in the case of very large bus loading). A great deal of the work on this issue will be common with the corresponding evaluation of bus structures for the crossbar switch. Yet some evaluation parameters (like the bus length) and criteria will differ in the two cases.
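To see why bus length matters and when a hierarchical or pipelined structure becomes attractive, a first-order Elmore estimate of a distributed RC line is useful. The per-mm resistance and capacitance and the driver strength below are rough guesses, not numbers from the TI process.

```python
# First-order delay of the section bus versus length, using the Elmore
# approximation for a distributed RC line driven by a finite driver:
#   t ~ 0.69 * Rdrv * Cwire + 0.38 * Rwire * Cwire
# All electrical values are illustrative assumptions.

R_PER_MM = 75.0        # ohm/mm, assumed metal resistance
C_PER_MM = 0.2e-12     # F/mm, assumed wire capacitance
R_DRIVER = 500.0       # ohm, assumed driver output resistance

def bus_delay(length_mm):
    r_wire = R_PER_MM * length_mm
    c_wire = C_PER_MM * length_mm
    return 0.69 * R_DRIVER * c_wire + 0.38 * r_wire * c_wire

for length in (2, 5, 10, 20):
    print(f"{length:2d} mm bus: {bus_delay(length)*1e9:.2f} ns")
# The quadratic wire term eventually dominates; when the delay approaches
# the ~5 ns cycle at 200 MHz, splitting or pipelining the bus is indicated.
```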
A couple more issues that we can address are the following. First, knowing the bus structure and the necessary circuits per memory bank, one can examine the area overhead as a function of the bank size. From a performance standpoint, one would prefer many small banks, since this would provide many simultaneously open DRAM pages. Yet there is an area penalty for the peripheral circuits of each bank. It is of great interest to quantify the tradeoff between area and number of banks. A second issue to examine is the area/speed penalty of adding multiple page buffers per bank (extending the scheme in [7]). Multiple page buffers can serve as a cache of open pages, minimizing how often the latency cost of a RAS access has to be paid. A follow-up research project can use access traces from proper benchmarks to measure the benefit of multiple banks and multiple page buffers per bank, and use our results to decide their optimal number, both for increased performance and for minimum area penalty.
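The shape of the banks-versus-area tradeoff can be previewed with a simple model in which the cell area is fixed by the section capacity and each bank adds a fixed peripheral cost. The area figures below are placeholders, not measurements from the TI process; quantifying them is exactly the work proposed above.

```python
# Toy area model: splitting a fixed section capacity into more banks gives
# more simultaneously open pages, but each bank pays a fixed peripheral
# cost (decoders, sense amps, latches). Numbers are assumptions.

CELL_AREA_PER_MBIT = 4.0     # mm^2 per Mbit of cells (assumed)
PERIPHERY_PER_BANK = 1.5     # mm^2 of fixed overhead per bank (assumed)
SECTION_MBITS      = 16      # total capacity of one section (assumed)

def section_area(n_banks):
    cells = SECTION_MBITS * CELL_AREA_PER_MBIT      # independent of banking
    periphery = n_banks * PERIPHERY_PER_BANK        # grows with bank count
    total = cells + periphery
    return total, periphery / total

for n in (1, 2, 4, 8, 16, 32):
    area, frac = section_area(n)
    print(f"{n:2d} banks: {area:6.1f} mm^2 total, {frac:5.1%} overhead")
# Open pages grow linearly with the bank count, but so does the overhead;
# access traces would tell us where the performance benefit saturates.
```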
Our main tool for these evaluations will be the information available to us (under NDA) for a 0.5um, 16-Mbit DRAM process from Texas Instruments. While such a process is rather poor for an IRAM system, we expect to reach reasonable conclusions, especially for initial IRAM implementations, which will not use a significantly more advanced DRAM process anyway. This process will be used for area calculations and comparisons, while its transistor and interconnect models will be used for SPICE simulations of bus structures and circuits.
We have not been able to conduct any experiments so far, due to a delay in obtaining a special license for using the TI models with the HSPICE circuit simulator. This license is expected to be available to us soon, so that actual HSPICE simulations can begin. Until then, we are focusing mainly on architecture and area related issues.
There are certain interesting issues that we may not address in this work. First of all, we will not examine the structure/functionality of the section controller. The design of memory controllers is a well-understood problem, so once the desired functionality has been defined, the design should be straightforward. A second issue is the effect on the storage cells and on array read/write operations of having a bus running on top of or beside the array. Evaluating this would require access to the design details of a DRAM core, which are not available to us. Since we are not sure that we can accurately model this effect without that information, we may not work on it. Still, one should keep in mind that we will examine low-swing bus structures that could either eliminate or significantly reduce this problem.