ISCA'97 Mixed Logic-DRAM Workshop Notes

Workshop Notes
Workshop on Mixing Logic and DRAM:
Chips that Compute and Remember

Sunday, June 1st, 1997, 8:30am-5:30pm
Denver, Colorado

Organized as part of the 24th Annual
International Symposium on Computer Architecture

The following notes were recorded during the workshop by Ben Gribstad.
Please send corrections or comments to gribstad@cory.eecs.berkeley.edu

1. Beyond the Desktop

1.1 - Evaluation of existing architectures in IRAM systems

Application performance is what we are really looking for from IRAM systems.
evolutionary approach - know how to implement, software compatible
IRAM evolutionary approach only results in moderate performance gains
access latency as low as 21ns, no L2 cache, up to 1.5x logic slowdown for initial implementations
24MB of on-chip DRAM is not enough for high-end workstations, but it is still enough for portables / low-end PCs
Execution time analysis of Alpha 21064 & PPro
All benchmarks fit into the 24MB of onchip DRAM
memory bound applications can be sped up by 2x, but processor bound apps have minimal speedup
Also used SimOS to verify execution time analysis, used Spec95Int and Spec95Fp apps
Simulation captures possible unanticipated interactions between events that the analytical models do not capture; it also shows the results of increasing the width of the memory bus. User/OS activity captured.
results - depending on configuration 333 or 500 MHz, access time, etc. some apps speed up, some slow down
Slowdown possible even for equal clk speeds if on-chip memory is very slow, because in that case we replace L2 caches with much slower on-chip DRAM.
IRAM needs a microarchitecture that can take advantage of BW to achieve significant performance benefits.
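
The speedup and slowdown cases above can be illustrated with an Amdahl-style estimate. This is a minimal sketch, not the analytical model used in the talk: the function name, the memory-stall fractions (0.8 and 0.1) and the 3x memory speedup are assumptions chosen for illustration; only the 1.5x logic slowdown comes from the notes.

```python
def iram_speedup(mem_fraction, mem_speedup, logic_slowdown=1.0):
    """Amdahl-style estimate of whole-program speedup on an IRAM.

    mem_fraction   -- fraction of baseline run time spent waiting on memory
    mem_speedup    -- factor by which the memory-wait portion shrinks
    logic_slowdown -- factor by which CPU-bound time grows (1.5 = slower logic)
    """
    cpu_fraction = 1.0 - mem_fraction
    new_time = cpu_fraction * logic_slowdown + mem_fraction / mem_speedup
    return 1.0 / new_time

# Assumed workloads: one memory bound, one processor bound.
memory_bound = iram_speedup(mem_fraction=0.8, mem_speedup=3.0)
cpu_bound = iram_speedup(mem_fraction=0.1, mem_speedup=3.0)
# Same processor-bound workload on logic that is 1.5x slower.
cpu_bound_slow_logic = iram_speedup(0.1, 3.0, logic_slowdown=1.5)
```

Under these assumed inputs the memory-bound case lands a little above 2x, the processor-bound case barely moves, and adding the logic slowdown turns it into a net slowdown.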

Questions

Q: What about bus utilization, prefetching, bank interleaving?
A: didn't try to model any changes - just evolution. On-chip DRAM was separated into two banks. No prefetching. If bus utilization is low, could do prefetching.
Q: spec95 benchmarks - but assuming DRAM not even out yet - what happens when you need to go off chip? Do they fit on-chip?
A: The working-set of the benchmarks used is less than 16MB. So assuming 30% increase in SW size per year, they will still fit on-chip at the time the 256Mbit generation will be commercially available. IRAM performance for such systems will get even worse if you need to go off chip
Q: latency of SRAM was what? DRAM and SRAM latency - You claim DRAM should be faster??? DRAM should be slower - SRAM will speed up in future generations as well
A: The 21ns for the DRAM is RAS+CAS. There are also a few cycles for the bus (arbitration etc). It may be optimistic, but at ISSCC 1996 there were DRAMs presented with 30ns RAS+CAS.
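
The working-set answer above (under 16MB today, growing 30% per year) can be checked with compound-growth arithmetic. A minimal sketch; the function name is hypothetical, 24 MB is the on-chip figure quoted in the talk, and 32 MB is simply 256 Mbit expressed in bytes.

```python
import math

def years_until_exceeds(working_set_mb, capacity_mb, annual_growth=0.30):
    """Years of compound growth before the working set outgrows capacity."""
    return math.log(capacity_mb / working_set_mb) / math.log(1.0 + annual_growth)

years_24mb = years_until_exceeds(16, 24)   # against the 24 MB on-chip figure
years_32mb = years_until_exceeds(16, 32)   # against a full 256 Mbit (32 MB)
```

With a 16 MB starting point the margin is roughly one and a half years against 24 MB and a bit over two and a half years against 32 MB, so the claim depends on when the 256 Mbit generation actually ships.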

1.2 - The Smart Access Memory: An Intelligent RAM for Nearest Neighbor Database Searching

used for pattern matching, fuzzy problems w/ large, multi-dimensional databases
K Nearest neighbor - find the closest database items to the query and perform a decision/interpolation on the results - can use tree based searches or brute force (which has parallelism)
Memory search, distance calc., sorting
IRAM has high bandwidth, low BW for output and Input, large mem for DB
16mmx16mm, commodity parts from Hyundai, 64 Processors, 64 256K DRAM blocks, .35um DRAM, 1um logic
L1 Norm processor, bit-serial operation
Sorting is bit serial systolic operation, MSB first, queue is 64x64 routers
DRAM is 256 rows by 1024 columns
little global control/data signals
single phase global clock, each processor generates 2 phase clock, for local communication only
209 ms from HP, 2.56 ms from SAM, 160us with 16 SAMs
full custom layout, relaxed design for periphery
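
The brute-force K-nearest-neighbor search with an L1 norm described above can be sketched in software. This is a sequential illustration only: on the SAM the distance calculation is bit serial and spread over 64 processors, and the sort is systolic. The function names and the toy database are assumptions.

```python
import heapq

def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def k_nearest(database, query, k):
    """Return the k (distance, index) pairs closest to query, nearest first."""
    scored = ((l1_distance(item, query), i) for i, item in enumerate(database))
    return heapq.nsmallest(k, scored)

# Toy two-dimensional database; real workloads are large and multi-dimensional.
database = [(0, 0), (5, 5), (1, 2), (9, 1), (2, 2)]
nearest = k_nearest(database, query=(1, 1), k=2)
```

Every database item is scored independently of the others, which is the parallelism the brute-force approach exposes.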

Questions

Q: What was area impact of DRAM circuitry additions?
A: CAS circuitry altered for gate outputs, was easiest approach for both parties - area impact was small - had to add a lot of redundancy on column outputs, about 5% area penalty total
Comment: little noise during RAS since all logic values are established during access
Q: how did you test memory?
A: logic design using different tools, difficult to test together - using PCI bus to test - don't have silicon yet, triple well process, prototype, need to test (all four corners should be tested before being put in a PC). The simple serial/FIFO-type interface made testing easier since it is simple.
Q: Is the SAM limited by CAS or PE?
A: Performance is limited by PE.

1.3 - IRAM and Smart SIMM: Overcoming the I/O Bus Bottleneck

Current systems are IO limited, too many pins
serial lines can provide fast IO with smaller number of pins, chip can decide what to do with IO
number of IO lines determines BW
basic result - serial lines provide high IO BW
expand system with more IRAMs, get memory + processing - "SmartSIMM"
Case study - external sorting - how much can you sort in a minute
we are using a 2 pass sort
with 8 serial lines - 9.0 GB can be sorted with 24 MB IRAM
with PCI, 2.2 GB can be sorted with ~80 pins
108 GB sorted in 1999 for $450,000
need to vectorize sort apps (well understood) to get processing for free
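
A two-pass external sort of the kind used in the case study can be sketched as follows. This is a generic illustration, not the authors' implementation: in-memory lists stand in for disk runs, and `memory_capacity` (a hypothetical parameter) stands in for the 24 MB of IRAM.

```python
import heapq

def two_pass_sort(records, memory_capacity):
    """Sort records using runs no larger than memory_capacity items.

    Pass 1: read memory-sized chunks, sort each, write it out as a run.
    Pass 2: stream all runs back in and merge them into one sorted output.
    """
    runs = []
    for start in range(0, len(records), memory_capacity):
        chunk = records[start:start + memory_capacity]
        runs.append(sorted(chunk))       # stands in for writing a run to disk
    return list(heapq.merge(*runs))      # stands in for the merge pass

data = [7, 3, 9, 1, 8, 2, 6, 4, 5, 0]
result = two_pass_sort(data, memory_capacity=4)
```

Each record is read and written once per pass, so it crosses the I/O interface four times; that is why I/O bandwidth, rather than processing, bounds how much can be sorted in a minute.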

Questions

Q: IO to disk or to network? how to get serial IO BW ?
A: IO to either disk or network, 100 MB/sec on FCAL
Q: global clock for IO? what about crossbar? can it sustain the speed you are talking? How to do resynchronize across multiple crossbars?
A: serial crossbar will be OK since we are transferring large blocks of data for sort, PLL lock time amortized
Q: do you need all the bandwidth? are you using everything that is there? what about solid state disks?
A: solid state disks not going to happen -- In 1997 Price per MB of DRAM is 50 times price per megabyte of magnetic disk, and price per MB of DRAM improves on average at 25% per year while price per MB of disk recently is improving at 60% per year
Q: comparing against 64 bit PCI - what about moving IO devices and graphics - if PCI is larger BW, would it be a comparable system?
A: pin requirements would be much higher, and PCI doesn't scale
Q: there is no standard for disks yet, FCAL is a start - is it realistic to say serial lines will be accessible to the world, rather than just to IO buses/controller?
A: Fibre Channel is a standard, we speculate that serial line communication can be a standard
Comment: reconfigurable logic for IO protocol won't run at 2 GB/s
A: we don't need to reconfigure at that rate, we just configure once at system boot time
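
The solid-state-disk answer above rests on the DRAM/disk price gap widening over time. A minimal sketch of that arithmetic, assuming "improves at X% per year" means the price per MB falls by that fraction each year (the notes do not say which reading is intended); the function and parameter names are hypothetical.

```python
def price_ratio(years, initial_ratio=50.0, dram_decline=0.25, disk_decline=0.60):
    """DRAM-to-disk price-per-MB ratio after a number of years.

    Assumes each price falls by a fixed fraction per year, starting from
    the 50x gap quoted for 1997.
    """
    dram_factor = (1.0 - dram_decline) ** years
    disk_factor = (1.0 - disk_decline) ** years
    return initial_ratio * dram_factor / disk_factor

ratio_now = price_ratio(0)
ratio_one_year = price_ratio(1)
ratio_three_years = price_ratio(3)
```

Under this reading the gap nearly doubles every year. Under the alternative reading (MB per dollar grows by 25% versus 60% per year) it still widens, by a factor of 1.28 per year.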

1. Roundtable

Summary of lessons learned in building chips with mem and proc for SAM paper
- Building test chip - full custom design
- Design issues from SAM - interface to DRAM (difficult to change) put logic near DRAM, design rule difference, 4 poly, 2 metal for DRAM vs. 4 metal, 2 poly for logic, via contacts in logic almost as large as DRAM cell size, clearly room for improvement
- Architecture of DRAM w/ processor - separate DRAM into multiple banks
- Pitchmatching - number of bits that can be pulled out from SA? 8 for SAM, but error correction is a big concern - how manufacturable it is is very important - the more you take out opens possibilities for error from alpha particles and cosmic rays
Q: How can IRAM be used for general purpose computing?
A: Can be done, but should take advantage of parallelism for higher performance or be a low power application that can fit on-chip
Comment: Can't fully utilize internal memory bandwidth since IO is still a bottleneck, but it is OK to NOT use all the new bandwidth available!
A: Traditional chips have trouble with accesses, certain algorithms map very well to IRAM because of memory access patterns
Q: Real world applications that will take advantage of speedup?
A: Image data base, OCR, robot kinematics with large datasets
Q: What about testing?
A: First test logic, then if you have a vector unit, test memory with it, generate test patterns on the fly to cut down test time
Q: SAM interface simulations?
A: parallel/ serial buffer to interface DRAM with bit serial processing
Q: Noise issues with logic and DRAM on the same chip?
A: for SAM during RAS there is nothing to do, normally noise may cause problems and may need to stall parts of chip - will need separate wells and VDD, GND lines
Q: limitations with redundancy - what is ratio of BW from on-chip RAM with redundancy limits compared with RAMBUS?
A: Could pay penalty for smaller DRAM blocks and get more bits at a time - tradeoff with power - also, as DRAM gets larger, you get more blocks. For 1Gb generation, you get 1 bit per block. With 1K blocks, can get 1Kb per access -> power will be limitation.
Q: Crossbar question: how do you get all the data from the blocks to the processor?
A: This is an interesting problem - need to deliver enough bandwidth for CPU and IO while trying to minimize area and power. Many factors involved.
Q: How do you replace DRAM blocks? prefetching ?
A: cluster algorithms, NUMA-like, can you dual port the DRAM and do replacement through that(?) OS can treat on-chip DRAM as page cache or fast portion of the main memory. Prefetching can be done at a page or smaller granularity. May be crucial in order to reduce the off-chip access penalty.
Q: How is DRAM low power?
A: BW vs energy. RAS is very expensive - multiple bank activation causes power surge - may have problems when activating many banks at the same time
Comment: Will the apps we are targeting now still be appropriate in 5 years? We have a moving target.
Comment: We still have a latency problem, we are just pushing it back a few years by putting memory on chip. This doesn't scale with processor, so it will come up again.
Q: Can you reduce VDD when combining logic and DRAM when you have to worry about noise?
A: Good question, simulations?
Comment: Serial crossbar has to do synchronization, need lots of pipeline stages, lots of buffers, crossbar has long setup time - will there be contention problems? what if it is fully connected?

2. SIMD and Vector Approaches

2.1 - Computational RAM: The Case for SIMD Computing

Options for integration:
- microprocessor + DRAM
- Vector processor
- MIMD
- SIMD - Highest bandwidth utilization, low overhead, unfamiliar programming model
- Hybrid
NOW: as data is multiplexed, bandwidth is lost (at SA, 2.9 TB/s, col 49 GB/s, pins 6.2GB/s, bus 190 MB/s, cache,CPU 1.6 GB/s)
processor + DRAM: at SA 48 GB/s, L2 - 750 MB/s, L1 - 94MB/s, Cache, CPU 4 GB/s
internal mem BW grows with square root of mem density
assumes only 1 outstanding mem access - could be pipelined and could add more banks
for low power, need more bits per row decode op
Why SIMD?: bad: some PE's sit idle, limits on apps that can be run, use long words, common row addr -> no MIMD
CRAM: = RAM + SIMD
Commercial spinoff: graphics accelerator by Accelerix
PEs at sense amps, 64 PE in 8Kb, 512 in 240 Kb
DRAM design at 4 and 16 Mb gen
PE ALU is a mux (always in use), each PE sees its column as local memory, bit serial arithmetic, 75 xtors!
for several programs, 1000x speedup by doing computation within the memory
highest BW is at SA, must do processing there
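
The bit-serial, column-per-PE arithmetic described above can be sketched in software. This is an illustration of SIMD bit-serial addition in general, not the CRAM mux-based ALU itself: the function name is hypothetical and the inner loop stands in for PEs that would all operate in parallel.

```python
def bit_serial_add(a_columns, b_columns, width):
    """Add two operands per PE, one bit position per step, LSB first.

    a_columns[i] and b_columns[i] are the integers held in PE i's column.
    Every PE executes the same step on the same bit position (SIMD) and
    keeps only a one-bit carry as local state.
    """
    n = len(a_columns)
    carries = [0] * n
    sums = [0] * n
    for bit in range(width):              # one broadcast instruction per bit
        for pe in range(n):               # conceptually all PEs in parallel
            a = (a_columns[pe] >> bit) & 1
            b = (b_columns[pe] >> bit) & 1
            total = a + b + carries[pe]
            sums[pe] |= (total & 1) << bit
            carries[pe] = total >> 1
    return sums

result = bit_serial_add([3, 10, 255], [4, 5, 1], width=8)
```

An N-bit add takes N steps regardless of how many PEs there are, so throughput comes from the number of columns processed at once rather than from the speed of any single PE. The last element wraps to 0 because the carry out of bit 7 is dropped.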

Questions

Q: 1 bus - how?
A: simplistic approach beats NlogN, N^2 area networks - move away from general networks
Q: layout area penalty for this implementation?
A: 3% in SRAM, 9% in this, 18% max in DRAM
Q: pitch size problem ?
A: periphery design rules are terrible - 1/4 design rules, SA is 4 Metal widths wide, PE is 7 M lines and pitch matched with 4 SA
Q: is SPEC or winmark just as fast?
A: only certain apps speed up here, could be used in graphics, MMX, yet for 3d graphics you need floating point

2.2 - Distributed Vector Architecture: Beyond a Single Vector-IRAM

Vector processing on IRAM - LARGE scientific apps will cause thrashing
Solution: statically distribute dataset, parallelize vector instructions. Bottleneck: external traffic. Place computation dynamically, based on where the data initially resides, to minimize external traffic
DIVA: commodity V-IRAM (software DIVA) OR modified V-IRAM (hardware DIVA). Transparent parallelization: all nodes see the same instructions and execute them (scalar inst. will be redundant); must spread memory vectors across nodes
dynamically map: create element mappings before each computation slice
mapping cost amortized over the slice, single mapping for all registers of a slice, need mapping vector registers
used trace driven simulation, 6 NAS benchmarks, compare DIVA with a cache-based system
compared 4 node DIVA with 1/16 cache on chip, wins in most cases, but fails when more sophisticated mapping is required

Questions

Q: How sensitive are your results to compiler technology?
A: Results show with simplistic mapping, we have a pretty good win, better techniques should provide better performance
Q: How do you cover cost of SETMV with chaining since in scatter-gather ops you don't know how much time it will take ?
A: SETMV executed once per instructions group so cost is amortized. In addition, cost of mapping is much lower than transferring data, and can chain vector load off of first mapping
Q: Sounds non IRAM specific, have heard this idea before with FORTRAN 90, this is a hard problem? (internode traffic is not easy)
A: Trying dynamic solution, doing this at compile time is very difficult, implicit parallelism would be nice

2.3 - Using MML to Simulate Multiple Dual-Ported SRAMs: Parallel Routing Lookups in an ATM Switch Controller

Harvard credit switch: 10 Gb/sec aggregate throughput, 46 million cells/sec
ATM switch gets 16 inputs, needs to decide which output each input cell goes to via virtual circuit map, worst case is looking up and mapping 16 input cells to 1 output port
lists are hardware tables, and accesses are in the form of RMW, table access time must be < 20.4 ns, and each table size is 20 Mb, 272 SRAMs (dual ported) used to accomplish this
Integration: can't use SRAM, too big, DRAM alone is too slow, needs refresh
each RMW op is broken to 3 stages, and processing is pipelined ( 33ns each)
duplicated decoders and latches at outputs and at addresses to accomplish reads and writes within 33ns

Questions

Q: How do you speed up the row addressing?
A: bypassing
Q: Huge queues normally used, multiple accesses required, can a single DRAM do this?
A: would need to speed up DRAM to accomplish this
Q: Is a floorplan available?
A: We haven't thought about it yet
Q: Can you really cut the array like that?
A: It is possible, pitch matching may be a problem and may need to change. Can shrink cycle time from 33ns to 10ns by optimizing array for speed rather than density. 2x-3x cost for IRAM vs. DRAM may be conservative.
Q: Have you thought about how your idea can be used in the case that the memory for queues is shared among all ports (much smaller mem requirements)?
A: (I didn't get it)

2. Roundtable

Q: (FOR ATM switch paper) Can we pipeline all DRAMs to get better BW instead of only for RMW?
A: can still only do 1 read/cycle and 1 write/cycle (you always get a write with a read for us, we are just using it differently)
Q: Rather than splitting cycle, can you read and write in each 1/2 cycle since you know what you are writing before hand?
A: we don't know what to write always
Q: How much memory do you really need to run useful apps, if it is too much how to deal with it?
A: no Cray has had this before, limitations depend on the applications, can add additional IRAMs, but adding plain DRAM could be much cheaper if you only need the memory, also depends on parallelism, low end systems on a chip are demonstrated, high end will take some time to integrate and may take multiple chips, also this question is very dependent on BW of each chip
Q: Knee on curve of BW vs area and interconnect penalty?
A: 256 or 512 K is what is done now with factor of 4, needs to be explored
Q: Decode bits are latched, problems with pitch matching for ATM switch with word line?
A: Matched with 4 word lines and have decoder after the latch
Q: Estimate of minimum cycle time?
A: 10ns minimum with shrinking array and pushing hard (with pipelining)
Q: What performance advantage is needed for market acceptance?
A: If you don't change programming model, 10x (jump a generation), if you do, 2-3 generations ahead since more investment involved, could target apps where code is only written once, like graphics device drivers, Chromatic is changing user's model with hardware and software, Java is also causing many people to adapt software for compatibility, DSP people have been doing this for a long time, Mitsubishi is claiming to be ~20% of DRAM cost, video RAM was more costly but is coming down, testing cost may increase but may be able to decrease testing cost with IRAM

3. Multiprocessors

3.1 - How Processor- Memory Integration Affects the Design of DSMs

adding processors + mem means we need to redesign the nodes and reorganize the whole machine: memories as caches
How do you benefit from this? - need to shoot for good geographical locality
each block has address + state, so you don't have to check directory
memory is treated as a cache, data replication + migration. Not COMA
putting directory controller separately
- On chip directories are useless on uniprocessors
- allows varying number of directories depending on app.
- little is lost by placing directory off chip since DRAM cache has few dir accesses and most dir accesses are due to coherency -> off chip
Result: processing nodes using memory as cache, and directory nodes to keep distributed directory and execute coherency protocol; memory is used as backup to cover overflow from P nodes
D nodes are only heavily used during startup, otherwise just coherency maintenance
D nodes don't need much memory since they only handle overflow of P nodes, so you have multiple P nodes per D node; directory + memory state is cached in the caches of the D node
eliminates redundant RAS with row buffers
uses only off the shelf components (?)

Questions

Q: Results were fast - what benchmarks / what speedup/ scaling?
A: We used bad benchmarks
Q: What about large datasets?
A: Need more memory

3.2 - A Single Chip Multiprocessor Integrated with DRAM

Single chip multiprocessor required to take advantage of DRAM bandwidth
used SimOS for simulations of 4x2 way superscalar and single superscalar with 32 MB DRAM, 2 and 4 MB SRAM
used 5 benchmarks, 4 of which were parallelized
compress and eqntott slow down with DRAM - integer apps with small working sets which originally fit inside SRAM caches will slow down with DRAM on-chip
With uniprocessor, even FP apps have little speedup with big working sets
for MP, FP apps have performance improvement
possible solution for paging: software protocol. This area needs work, as bus delays for pages negate the benefit of on-chip DRAM
Conclusions: FP apps with MP can be much faster, as long as dataset fits on-chip

Questions

Q: Is floorplan in paper?
A: yes, that is what we are thinking, in 0.25 um technology
Q: It is 24mm on a side, isn't that big?
A: Yes, but in my experience this isn't unreasonable
Q: Off chip mem - how much traffic was due to coherence? no problem with 4 proc, on chip?
A: Applications used were too small.
Q: What will be the impact of off-chip accesses with larger programs - how big is BW a limit? are your numbers conservative?
A: bank conflicts can't handle extra BW, 30 ns access time to DRAM picked and not examined(???)
Q: speed vs bit density tradeoffs - can you reduce memory density and speed it up?
A: there are lots of tradeoffs, you lose area for speed then you have to go off chip more often with page faults

3.3 - A Multiprocessor Memory Processor for Efficient Sharing and Access Coordination

problems - misses can be 100s of cycles, big impact on some programs - synchronization overhead, locks, etc.
result is some programs slow down, some unusable
memory modules have DRAM banks, exoprocessor and packet buffer, performs arithmetic and logical operations
exoprocessor loads and stores only to its memory bank
exopacket contains PID, control state, operands, mem addr of operands
for execution, CPU sends exopacket to proper mem unit, which execs packets and may send another packet to another mem
need to know when they will finish these ops and to not send too many -> counters
with this scheme, don't wait for barriers but hard limits, so more useful work can be done with less waiting
used fragments of code to test to demonstrate various aspects of the system
16 procs, 4x4, 21 cycle exoop exec latency, 42 cycle mem latency
some test progs have significant speedup
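
The exopacket flow described above (packet fields, execution at the memory unit, and counters to bound outstanding operations) can be sketched as a small model. Everything here is hypothetical: the class names, the two opcodes, and the explicit acknowledge step are assumptions made for illustration and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ExoPacket:
    pid: int       # issuing processor
    op: str        # control state, reduced here to an opcode name
    operand: int   # operand carried in the packet
    address: int   # memory address of the in-memory operand

class MemoryUnit:
    """One memory module: DRAM bank, packet buffer, and exoprocessor."""
    def __init__(self, size):
        self.bank = [0] * size
        self.buffer = deque()

    def step(self):
        """Execute one buffered packet against this unit's own bank only."""
        packet = self.buffer.popleft()
        if packet.op == "add":
            self.bank[packet.address] += packet.operand
        elif packet.op == "store":
            self.bank[packet.address] = packet.operand
        else:
            raise ValueError(f"unknown op {packet.op!r}")
        return packet.pid

class Processor:
    """Issues exopackets and counts how many are still outstanding."""
    def __init__(self, pid, limit):
        self.pid = pid
        self.limit = limit
        self.outstanding = 0

    def send(self, unit, op, operand, address):
        if self.outstanding >= self.limit:
            return False                  # hard limit reached, do not send
        unit.buffer.append(ExoPacket(self.pid, op, operand, address))
        self.outstanding += 1
        return True

    def acknowledge(self):
        self.outstanding -= 1

cpu = Processor(pid=0, limit=2)
mem = MemoryUnit(size=8)
sent = [cpu.send(mem, "add", 5, 3) for _ in range(3)]
while mem.buffer:
    mem.step()
    cpu.acknowledge()
```

The third send is refused because two packets are already in flight; once the memory unit drains its buffer the counter returns to zero and the processor can issue again without having waited at a barrier.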

Questions

Q: Why is exo better to handle mem latency?
A: out of order has significant delays which are attempted to be hidden by moving ahead anyways. With many threads, this is ok, but with fewer threads it is more of a problem
Q: Tags give migrated synchronization, what fraction of performance benefits are given by loss of barriers vs. mem latency avoided ?
A: About 1/2 , and tags in memory avoids flurry of coherency activity in a processor

3. Roundtable

Q: Going beyond 4 processors on a chip?
A: Not really because general purpose apps can't use it, however multimedia or spec95FP apps can do well. Memory capacity and processing power need to be balanced to get best utilization
Q: Any remarks on using the new memory bandwidth that is now available?
A: 40-50% on spec95 for 4 MP, cycle time is key point, adding extra banks increases BW but cycle time is also important, less conflicts with multithreads on one chip - some can stall and others can still use the BW
Q: Exoprocessing is intriguing idea- any breakdown between data and synchronization?
A: Both are equally important, wasn't as much of a need for this before but now we need to avoid coherency. By doing things in memory you avoid invalidations right after proc is finished, envisioned for computation and synchronization both
Q: Is there enough spatial locality to use BW effectively?
A: Need to transfer more data at a time but there is less spatial locality
Q: What is the unit of memory? a page? a cache line? a word?
A: Cache lines for directory entries
can't do parallel DRAM and directory access - will need to redesign interface

4. Limits and Future

4.1 - The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance

memory latency wall, bandwidth wall, branch predictability wall (wider issue, deeper pipelines)
Branch wall is most significant future problem (with future compilers and hardware)
16k instruction window with aggressive processor
ideally would like to use 2010 benchmarks - instead use spec92 with small caches and scale results up
need > 100 bytes per CPU cycle in the future, or 800 IO pins. Future trends say this is OK.
mem BW can be solved by good eng., latencies < 100 cycles can be tolerated with pipelining and other techniques
impact based branch prediction would help
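
The bandwidth and latency figures above translate into pin counts and in-flight data with simple arithmetic. A minimal sketch, assuming one bit per pin per CPU cycle and applying Little's law (data in flight = bandwidth x latency); both function names are hypothetical.

```python
def pins_needed(bytes_per_cycle, bits_per_pin_per_cycle=1):
    """I/O pins required to sustain a given number of bytes per CPU cycle."""
    return bytes_per_cycle * 8 // bits_per_pin_per_cycle

def bytes_in_flight(bytes_per_cycle, latency_cycles):
    """Little's law: data that must be outstanding to hide the latency."""
    return bytes_per_cycle * latency_cycles

pins = pins_needed(100)
in_flight = bytes_in_flight(bytes_per_cycle=100, latency_cycles=100)
```

At 100 bytes per cycle the 800-pin figure follows directly, and tolerating a 100-cycle latency at that rate means keeping on the order of 10,000 bytes of requests outstanding, which is what the pipelining has to provide.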

Questions

Comment: Perfect branch prediction is not a valid assumption to hold. The result that one can make from your work is that "since we cannot achieve the perfect branch prediction, we should work on the memory bw/latency requirements"
Comment: There is no bandwidth wall - with enough money you can get the bandwidth
Q: what happens when you increase clock speed?
A: we assume clock speed increases the same (???)

4.2 - Processing in Memory: Chips to Petaflops

Barriers are physics based, programming based (need million+ parallelism)
gap between CPU and number of mem chips for petaflops is narrowing - need to integrate
"Shamrock" - memory is rotated to allow access from CPU and silicon macro is tiled so that memory is physically and logically shared with its 4 nearest neighbors
concurrent access to up to 256 bits in each 128Kb to 1 Mb stacked array
Multithreading CPUs will run at 4 degrees K, CryoSRAM will be at 77K, (100+ GHz)
optical switch interface (Princeton) and 3D optical RAM farm (at Caltech)
proactively and preemptively manage memory hierarchy
tiled PIM, integration with HTMT need overall programming model

Questions

Q: What factors are most limiting?
A: how to program this thing is going to be a challenge
Q: will VLIW be an option?
Q: Can you reuse some of the computation after a branch miss?
A: Depends

4. Roundtable

PIM: Q: Movement of data under software control?
A: We are looking into options
PIM: Q: how do you move data from one corner to another?
A: Looked at bouncing along memory banks, or may use memory bus on top of CPUs
Q: What will mem hierarchy look like in 2010? processing on chip with low traffic or -?
A: PIM: lots of things to look at, 3D IO, supercooled mem, more conventional case? Lots of hierarchies examined in the paper not presented, lots of cases and depends on applications
A: Is processing centralized or processing in the memory? When problem is distributed procs are hard to keep busy and app specific
A: Application specific, may want to throw more hardware to keep things busy
Q: Who funds petaflops?? Good idea to bank on all these new technologies?
A: NSF,
MIMD paper: Q: one processor can utilize the mem BW on a chip, multiprocs would fight about it. how much is to be believed about these roadmaps? what about fab costs and other factors?
A: We reevaluate the roadmap every two years
Q: MIMD with one app is ok but more is not?
A: 4 working sets in the same space can result in BW requirements if no overlap is present

Open Mike session - 5 minute presentations

Corinna Lee, Toronto
Computational Power of Vector processing

vector processor may be good idea
preview of vector simulations - T0 vector vs. 6-way R10000 Superscalar using PGP decryption
vector processor significantly outperforms aggressive superscalar
20 mm² vs 255 mm² in 0.25 um process
relying on compiler is the way to go, rather than throwing hardware at it

Mateo Valero, Barcelona
Vector + IRAM

superscalar vs vector
vectors can deal with the latency of memory
R10000 + out of order + multithreading much slower than vector processor

Yunho Choi, Samsung
Technology

Process difference of High Performance Logic and High Density DRAM is getting wider and wider
A major process change for high density DRAM for the small portion of silicon area to be integrated (for logic) as CPU will not be cost effective
0.35um Logic + 16Mbit DRAM Merged Product has been recently achieved and proved at the system level with reasonable cost overhead
Up to 0.25um Logic + 64Mbit DRAM integration may be achievable with reasonable cost overhead
But, beyond 0.18um Logic + 256Mbit DRAM the cost overhead for small amount of logic + large density of DRAM will be extremely high
So, Architecture Research Based on Loose Performance of Logic such as DRAM should be encouraged

Steve Przybylski, Principal Consultant, Verdande Group, sp@verdande.com
Money and Memory

Commodity DRAM misconceptions

estimating cost based on future price of memory is decoupled from cost of production of memory
In most MDL, logic yield < DRAM yield
Cost of MDL = CL*((Al+Ad)/Al)*alpha (Al = area of logic, Ad = area of DRAM, CL is the cost of manufacturing the logic die on a logic process, alpha is a factor to account for the increased cost (if any) of manufacturing on a blended DRAM/logic process)
0.18 um is 1 Gb tech, in theory yes, in practice, no
1Gb will be 0.13 or 0.10 tech, 3-5 years away for commodity prices
transition beyond 0.18 is end of line for optical lithography, need to change to improve technology beyond this
useful tool: fill frequency: ratio of BW in/out over capacity is the frequency at which you can hit every bit (fill freq, or FF)
today commodity DRAMs about 10-60 Hz
crucial observations
- markets have requirements for fill freq., minimum must be met to be successful
- if memory system's FF < market required FF - product will fail
- therefore if DRAM FF is too low, product it is used in will fail
- FF for subsections are 50-200 Hz, at SA FF is 8-32 KHz
- thus if the app needs > 100-200 FF then need to redesign core to expose SA (OUCH!!, don't)
- observation - very few such apps, need to be aware of this
- reference: "New DRAM Technologies: A Comprehensive Analysis of the New Architectures", MicroDesign Resources, 1996. www.mdronline.com
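
The two quantities in this talk, fill frequency and the merged-die cost formula, can be worked through numerically. A minimal sketch: the function names and every input value (a 2 MB part at 100 MB/s, a $20 logic die, the areas, alpha = 1.2) are hypothetical and chosen only to exercise the formulas as stated in the notes.

```python
def fill_frequency_hz(bandwidth_bytes_per_s, capacity_bytes):
    """Fill frequency: how many times per second every bit can be touched."""
    return bandwidth_bytes_per_s / capacity_bytes

def merged_die_cost(logic_die_cost, logic_area, dram_area, alpha=1.0):
    """Cost = CL * ((Al + Ad) / Al) * alpha, as given in the notes."""
    return logic_die_cost * ((logic_area + dram_area) / logic_area) * alpha

# Hypothetical commodity part: 2 MB (16 Mbit) delivering 100 MB/s at the pins.
ff = fill_frequency_hz(100e6, 2e6)

# Hypothetical merged die: $20 logic die, DRAM area 3x the logic area,
# and a 20% premium for the blended process.
cost = merged_die_cost(20.0, logic_area=50.0, dram_area=150.0, alpha=1.2)
```

The example part sits at 50 Hz, inside the 10-60 Hz range quoted for commodity DRAMs. The cost example shows how quickly the area ratio dominates: quadrupling the die area and adding the process premium takes the assumed $20 logic die to roughly $96.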

Kevin Kissel, SGI
IRAM: The "Right" Way and The "Wrong" Way

CPU-centric approach is the wrong way to look at the problem, IRAM won't be a SPEC engine
Compromises on core size and design rules will militate against absolute performance of IRAM versus logic-process CPUs.
DRAM bandwidth not necessarily easy to use: MP and vectors are being proposed for IRAMs to deal with this "embarrassment of riches".
Memory-centric approach is the more promising path.
Relatively small amounts of logic can be made to do interesting things in conjunction with DRAM arrays.
Not general purpose, but interesting and cost-effective solutions are already being found.
This is a Good Thing for research, as there are more degrees of freedom away from the SPEC world of mature ISAs.
DRAM and Microprocessors will be combined, but it will be on DRAM's terms.

Chip Weems, Umass/Amherst
Fallacies and pitfalls for SIMD-IRAM

SYNCHRONIZATION:
- clock distribution -- tight skew tolerance across many processing elements can severely limit clock rate,
- synch communication -- requires dedicated communication lines that support either peak BW or sit idle
- supply-induced noise -- huge transients from parallel memory accesses, arithmetic ops, etc.
DRAM:
- matching pitch of processors to memory -- typically have just the equivalent width of a few metal lines in which to pack an entire ALU bit slice
- RMW timing constrains logic -- DRAM people don't want to adjust their timing, so the data path must be timed to match the DRAM. If the DRAM timing changes in the next generation, then the datapath may have to be completely redesigned.
IRAM:
- Let them eat IRAM (assuming they will just buy more IRAM for more memory or more processing) -- people don't want to accept this as their only alternative
- "The DRAM substitute" - Like trying to sell sugar as nutrasweet substitute for a higher price. You can eat up all their SIMM slots, have to deal with multiple SIMM/DIMM form factors, and may end up making things unexpandable or unportable. Saying that cost is only 2 to 3 times DRAM doesn't help -- the CAM community has been saying this for 30 years, with very limited success.
- Off chip BW limits -- no SIMD system has ever been big enough -- there always has to be provision for off-chip bandwidth (both for memory and I/O)
SIMD PE:
- Utilization -- dual of branch mispredict, very hard to keep a significant portion of processors busy, and it's hard to predict which ones will be idle and then make use of them in some other way.
- Partitioning -- one monolithic array almost guarantees low utilization of processing elements
- Data alignment -- the data is almost never where you need it. The bandwidth for moving it can exceed the bandwidth between the RAM and the processors.
- Indexing -- a useful feature in many SIMD programs, but the addresses originate from a source that is orthogonal to the normal address path.
Bit serial:
- Corner turning -- orthogonal access is needed to provide bit parallel access by external devices
- Variable-length operands -- almost impossible to make work because there isn't any language support, and even if it existed, it would require both new compiler technology and a willing user community.
- N-squared factor -- multiply, divide, shift, etc. require N-squared area-time product. This results in unacceptable performance on common operations, especially FP.
System level:
- Global reduction -- costly operation for large arrays, but necessary for branch control
- Global control -- separate, costly chunk of external logic. Suffers same problem as clock in terms of distribution.
Software:
- The library approach is an admission that the system is only useful in a niche market (the one the library addresses).
- Need good compiler to go beyond niche markets, and probably a new programming model (if not a new language) which faces an uphill battle for acceptance.

Barry Fagin
Relevance of "Whack-a-Mole" to IRAM (an observer's perspective)

How to solve important problems in cost effective way should be main goal
Technology will drive architecture for once?
What is whack-a-mole? lots of other things to keep your attention on besides a fancy 100x speedup

This page is located at
http://iram.cs.berkeley.edu/isca97-workshop/notes.html
