Workshop on Mixing Logic and DRAM:
Chips that Compute and Remember
Sunday, June 1st, 1997, 8:30am-5:30pm
The following notes were recorded during the workshop by Ben Gribstad.
Please send corrections or comments.
1. Beyond the Desktop
1.1 - Evaluation of existing architectures in IRAM systems
- Application performance is what we are really looking for from IRAM systems.
- evolutionary approach - know how to implement, software compatible
- IRAM evolutionary approach only results in moderate performance gains
- access latency as low as 21ns, no L2 cache, up to 1.5x logic slowdown for initial implementations
- 24MB of on-chip DRAM not enough for high-end workstations. It is still enough for portables
- Execution time analysis of Alpha 21064 & PPro
- All benchmarks fit into the 24MB of on-chip DRAM
- memory bound applications can be sped up by 2x, but processor bound apps have minimal speedup
- Also used SimOS to verify execution time analysis, used Spec95Int and Spec95Fp apps
- Simulation allows capturing possible unanticipated interactions between events that the
analytical models do not capture; also see the results from increasing the width of the memory
bus. User/OS activity captured.
- results - depending on configuration 333 or 500 MHz, access time, etc. some apps speed up,
some slow down
- Slowdown possible even for equal clk speeds if on-chip memory very slow. This is because in
that case we replace L2 caches with much slower on-chip DRAM.
- IRAM needs a microarchitecture that can take advantage of BW to achieve significant performance gains
- Q: What about bus utilization, prefetching, bank interleaving?
A: didn't try to model any changes - just evolution. On-chip DRAM was separated in two banks.
No prefetching. If bus utilization is low, could do prefetching.
- Q: spec95 benchmarks - but assuming DRAM not even out yet - what happens when you need to go
off chip? Do they fit on-chip?
A: The working-set of the benchmarks used is less than 16MB. So assuming 30% increase in SW
size per year, they will still fit on-chip at the time the 256Mbit generation will be
commercially available. IRAM performance for such systems will get even worse if you need
to go off chip
- Q: latency of SRAM was what? DRAM and SRAM latency - You claim DRAM should be faster???
DRAM should be slower - SRAM will speed up in future generations as well
A: The 21ns for the DRAM are RAS+CAS. There are also a few cycles for the bus (arbitration etc).
It may be optimistic but in 1996 ISSCC there were DRAMs presented with 30ns RAS+CAS.
1.2 - The Smart Access Memory: An Intelligent RAM for Nearest Neighbor Database Searching
- used for pattern matching, fuzzy problems w/ large, multi-dimensional databases
- K Nearest neighbor - find the closest database items to the query and perform a decision/interpolation on the results - can use tree based searches or brute force (which has parallelism)
- Memory search, distance calc., sorting
- IRAM has high internal bandwidth, low BW for input and output, large mem for DB
- 16mmx16mm, commodity parts from Hyundai, 64 Processors, 64 256K DRAM blocks, .35um DRAM, 1um logic
- L1 Norm processor, bit-serial operation
- Sorting is bit serial systolic operation, MSB first, queue is 64x64 routers
- DRAM is 256 rows by 1024 columns
- few global control/data signals
- single phase global clock, each processor generates 2 phase clock, for local communication only
- 209 ms from HP, 2.56 ms from SAM, 160us with 16 SAMs
- full custom layout, relaxed design for periphery
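The brute-force k-nearest-neighbor search described above can be sketched in Python. This is only an illustration under invented data and dimensions; the SAM computes the L1 distances with bit-serial hardware, one PE per DRAM block, rather than in software:

```python
def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_brute_force(database, query, k):
    """Return the k database items closest to the query under the L1 norm.

    Every distance computation is independent, which is the parallelism
    the brute-force approach exposes for the SAM's 64 processors.
    """
    return sorted(database, key=lambda item: l1_distance(item, query))[:k]

db = [(0, 0), (5, 5), (1, 2), (9, 1)]
print(knn_brute_force(db, query=(1, 1), k=2))  # -> [(1, 2), (0, 0)]
```

The sort here plays the role of the SAM's MSB-first systolic sorting network; in software the distance scan dominates, which is exactly the part the chip parallelizes.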
- Q: What was area impact of DRAM circuitry additions?
A: CAS circuitry altered for gate outputs, was easiest approach for both parties - area
impact was small - had to add a lot of redundancy on column outputs, about 5% area penalty
- Comment: little noise during RAS since all logic values are established during access
- Q: how did you test memory?
A: logic design using different tools, difficult to test together - using PCI bus to
test - don't have silicon yet, triple well process, prototype, need to test (all four
corners should be tested before being put in a PC). The simple serial/FIFO-type
interface also made testing easier.
- Q: Is the SAM limited by CAS or PE?
A: Performance is limited by PE.
1.3 - IRAM and Smart SIMM: Overcoming the I/O Bus Bottleneck
- Current systems are IO limited, too many pins
- serial lines can provide fast IO with smaller number of pins, chip can decide what to do
- number of IO lines determines BW
- basic result - serial lines provide high IO BW
- expand system with more IRAMs, get memory + processing - "SmartSIMM"
- Case study - external sorting - how much can you sort in a minute
- we are using a 2 pass sort
- with 8 serial lines - 9.0 GB can be sorted with 24 MB IRAM
- with PCI, 2.2 GB can be sorted with ~80 pins
- 108 GB sorted in 1999 for $450,000
- need to vectorize sort apps (well understood) to get processing for free
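A minimal sketch of the two-pass external sort used in the case study: pass 1 sorts memory-sized runs, pass 2 does a k-way merge. In-memory lists stand in for disk I/O, and the run size is illustrative:

```python
import heapq

def two_pass_sort(records, mem_capacity):
    """Two-pass external sort.

    Pass 1: read chunks that fit in memory, sort each into a run.
    Pass 2: k-way merge of all runs (one buffer per run fits in memory).
    Lists stand in for disk files in this sketch.
    """
    runs = [sorted(records[i:i + mem_capacity])
            for i in range(0, len(records), mem_capacity)]
    return list(heapq.merge(*runs))

data = [9, 3, 7, 1, 8, 2, 6, 4]
print(two_pass_sort(data, mem_capacity=3))  # -> [1, 2, 3, 4, 6, 7, 8, 9]
```

With M bytes of memory a two-pass sort can handle roughly M runs of size M, which is why the on-chip 24 MB plus fast serial I/O stretches to gigabytes of sorted data.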
- Q: IO to disk or to network? how to get serial IO BW ?
A: IO to either disk or network, 100 MB/sec on FCAL
- Q: global clock for IO? what about crossbar? can it sustain the speed you are talking?
How to do resynchronize across multiple crossbars?
A: serial crossbar will be OK since we are transferring large blocks of data for sort,
PLL lock time amortized
- Q: do you need all the bandwidth? are you using everything that is there? what about solid state disks?
A: solid state disks are not going to happen --
In 1997 the price per MB of DRAM is 50 times the price per MB of magnetic disk,
and the price per MB of DRAM improves on average at 25% per year while
the price per MB of disk has recently been improving at 60% per year
- Q: comparing against 64 bit PCI - what about moving IO devices and graphics - if PCI is
larger BW, would it be a comparable system?
A: pin requirements would be much higher, and PCI doesn't scale
- Q: there is no standard for disks yet, FCAL is a start - is it realistic to say serial
lines will be accessible to the world, rather than just to IO buses/controller?
A: Fibre Channel is a standard, we speculate that serial line communication can be a standard
- Comment: reconfigurable logic for IO protocol won't run at 2 GB/s
- A: we don't need to reconfigure at that rate, we just configure once
at system boot time
- Summary of lessons learned in building chips with mem and proc for SAM paper
- Building test chip - full custom design
- Design issues from SAM - interface to DRAM (difficult to change) put logic near DRAM,
design rule difference, 4 poly, 2 metal for DRAM vs. 4 metal, 2 poly for logic, via contacts in logic almost as
large as DRAM cell size, clearly room for improvement
- Architecture of DRAM w/ processor - separate DRAM into multiple banks
- Pitchmatching - number of bits that can be pulled out from SA?
8 for SAM, but error correction is a big concern - how manufacturable
it is is very important - the more bits you take out, the greater the exposure to errors from
alpha particles and cosmic rays
- Q: How can IRAM be used for general purpose computing?
A: Can be done, but should take advantage of parallelism for higher performance
or be a low power application that can fit on-chip
- Comment: Can't fully utilize internal memory bandwidth since IO is still a bottleneck, but
it is OK to NOT use all the new bandwidth available!
- A: Traditional chips have trouble with accesses, certain algorithms map very well to IRAM
because of memory access patterns
- Q: Real world applications that will take advantage of speedup?
A: Image data base, OCR, robot kinematics with large datasets
- Q: What about testing?
A: First test logic, then if you have a vector unit, test memory with it, generate
test patterns on the fly to cut down test time
- Q: SAM interface simulations?
A: parallel/serial buffer to interface DRAM with bit serial processing
- Q: Noise issues with logic and DRAM on the same chip?
A: for SAM during RAS there is nothing to do, normally noise may cause problems and
may need to stall parts of chip - will need separate wells and VDD, GND lines
- Q: limitations with redundancy - what is ratio of BW from on-chip RAM with redundancy
limits compared with RAMBUS?
A: Could pay penalty for smaller DRAM blocks and get more bits at a time - tradeoff with power -
also, as DRAM gets larger, you get more blocks. For 1Gb generation, you get 1 bit per block.
With 1K blocks, can get 1Kb per access -> power will be limitation.
- Q: Crossbar question: how do you get all the data from the blocks to the processor?
A: This is an interesting problem - need to deliver enough bandwidth for CPU and IO while
trying to minimize area and power. Many factors involved.
- Q: How do you replace DRAM blocks? prefetching ?
A: cluster algorithms, NUMA-like, can you dual port the DRAM and do replacement
through that(?) OS can treat on-chip DRAM as page cache or fast portion
of the main memory. Prefetching can be done at a page or smaller
granularity. Maybe crucial in order to reduce the off-chip access penalty.
- Q: How is DRAM low power?
A: BW vs energy. RAS is very expensive - multiple bank activation causes power surge -
may have problems when activating many banks at the same time
- Comment: Will the apps we are targeting now still be appropriate in 5 years? We have a moving target.
- Comment: We still have a latency problem, we are just pushing it back a few years by putting
memory on chip. This doesn't scale with processor, so it will come up again.
- Q: Can you reduce VDD when combining logic and DRAM when you have to worry about noise?
A: Good question, simulations?
- Comment: Serial crossbar has to do synchronization, need lots of pipeline stages, lots of
buffers, crossbar has long setup time - will there be contention problems? what if it is fully
utilized?
2. SIMD and Vector Approaches
2.1 - Computational RAM: The Case for SIMD Computing
- Options for integration:
- microprocessor + DRAM
- Vector processor
- SIMD - Highest bandwidth utilization, low overhead, unfamiliar programming model
- NOW: as data is multiplexed, bandwidth is lost (at SA, 2.9 TB/s, col 49 GB/s, pins
6.2GB/s, bus 190 MB/s, cache,CPU 1.6 GB/s)
- processor + DRAM: at SA 48 GB/s, L2 - 750 MB/s, L1 - 94MB/s, Cache, CPU 4 GB/s
- internal mem BW grows with square root of mem density
- assumes only 1 outstanding mem access - could be pipelined and could add more banks
- for low power, need more bits per row decode op
- Why SIMD?: bad: some PE's sit idle, limits on apps that can be run, use long words, common row addr -> no MIMD
- CRAM = RAM + SIMD
- Commercial spinoff: graphics accelerator by
- PEs at sense amps, 64 PE in 8Kb, 512 in 240 Kb
- DRAM design at 4 and 16 Mb gen
- PE ALU is a mux (always in use), each PE sees its column as local memory, bit serial arithmetic, 75 transistors!
- for several programs, 1000x speedup by doing computation within the memory
- highest BW is at SA, must do processing there
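A sketch of bit-serial addition, the style of arithmetic the CRAM PEs use. Each loop iteration models one clock; the real PE implements this with a mux-based ALU at the sense amps rather than software:

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-width numbers presented LSB-first, one bit per cycle.

    Each iteration is one clock: a full adder emits one sum bit and keeps
    a single carry bit of state, which is why the PE fits in so few
    transistors. An N-bit add takes N cycles.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return out

# 5 + 3 = 8, as 4-bit words LSB first: [1,0,1,0] + [1,1,0,0]
print(bit_serial_add([1, 0, 1, 0], [1, 1, 0, 0]))  # -> [0, 0, 0, 1]
```

The trade-off noted later in the session follows directly: shift-and-add multiplication repeats this N times, giving the N-squared area-time cost that hurts bit-serial floating point.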
- Q: 1 bus - how?
A: simplistic approach beats NlogN, N^2 area networks - move away from general-purpose networks
- Q: layout area penalty for this implementation?
A: 3% in SRAM, 9% in this, 18% max in DRAM
- Q: pitch size problem ?
A: periphery design rules are terrible - 1/4 design rules, SA is 4 Metal widths
wide, PE is 7 M lines and pitch matched with 4 SA
- Q: is SPEC or winmark just as fast?
A: only certain apps speed up here, could be used in graphics, MMX, yet for 3d
graphics you need floating point
2.2 - Distributed Vector Architecture: Beyond a Single Vector-IRAM
- Vector processing on IRAM - LARGE scientific apps will cause thrashing
Solution: statically distribute dataset, parallelize vector instructions, bottleneck: external traffic
dynamic computation by knowing where data is initially to minimize external traffic
- DIVA: Commodity V-IRAM, software DIVA OR modified V-IRAM (hardware DIVA)
transparent parallelization, and all nodes see the same instructions and execute them (scalar inst. will be redundant)
must spread memory vectors across nodes
- dynamically map: create element mappings before each computation slice
- mapping cost amortized over the slice, single mapping for all registers of a slice, need mapping vector registers
- used trace driven simulation, 6 NAS benchmarks, compare DIVA with a cache-based system
- compared 4 node DIVA with 1/16 cache on chip, wins in most cases, but fails when more sophisticated mapping is required
- Q: How sensitive are your results to compiler technology?
A: Results show with simplistic mapping, we have a pretty good win, better techniques should provide better performance
- Q: How do you cover cost of SETMV with chaining since in scatter-gather ops you don't know how much time it will take ?
A: SETMV executed once per instruction group so cost is amortized. In addition, cost of mapping
is much lower than transferring data, and can chain vector load off of first mapping
- Q: Sounds non IRAM specific, have heard this idea before with FORTRAN 90, this is a
hard problem? (internode traffic is not easy)
A: Trying dynamic solution, doing this at compile time is very difficult, implicit
parallelism would be nice
2.3 - Using MML to Simulate Multiple Dual-Ported SRAMs: Parallel Routing Lookups in an ATM Switch Controller
- Harvard credit switch: 10 Gb/s aggregate throughput, 46 million cells/sec
- ATM switch gets 16 inputs, needs to decide which output each input cell goes to via
virtual circuit map, worst case is looking up and mapping 16 input cells to 1 output port
- lists are hardware tables, and accesses are in the form of RMW, table access time must
be < 20.4 ns, and each table size is 20 Mb, 272 SRAMs (dual ported) used to accomplish this
- Integration: can't use SRAM, too big, DRAM alone is too slow, needs refresh
- each RMW op is broken to 3 stages, and processing is pipelined ( 33ns each)
- duplicated decoders and latches at outputs and at addresses to accomplish reads and writes within 33ns
- Q: How do you speed up the row addressing?
- Q: Huge queues normally used, multiple accesses required, can a single DRAM do this?
A: would need to speed up DRAM to accomplish this
- Q: Is a floorplan available?
A: We haven't thought about it yet
- Q: Can you really cut the array like that?
A: It is possible, pitch matching may be a problem and may need to change. Can shrink cycle time
from 33ns to 10ns by optimizing array for speed rather than density. 2x-3x cost for IRAM
vs. DRAM may be conservative.
- Q: Have you thought about how your idea can be used in the case that the memory for queues is
shared among all ports (much smaller mem requirements)?
A: (I didn't get it)
- Q: (FOR ATM switch paper) Can we pipeline all DRAMs to get better BW instead of only for RMW?
A: can still only do 1 read/cycle and 1 write/cycle (you always get a write with a read for
us, we are just using it differently)
- Q: Rather than splitting cycle, can you read and write in each 1/2 cycle since you know what
you are writing before hand?
A: we don't know what to write always
- Q: How much memory do you really need to run useful apps, if it is too much how to deal with it?
A: no Cray has had this before, limitations depend on the applications, can add additional
IRAMs, but adding plain DRAM could be much cheaper if you only need the memory, also depends
on parallelism, low end systems on a chip are demonstrated, high end will take some time
to integrate and may take multiple chips, also this question is very dependent on BW of
- Q: Knee on curve of BW vs area and interconnect penalty?
A: 256 or 512 K is what is done now with factor of 4, needs to be explored
- Q: Decode bits are latched, problems with pitch matching for ATM switch with word line?
A: Matched with 4 word lines and have decoder after the latch
- Q: Estimate of minimum cycle time?
A: 10ns minimum with shrinking array and pushing hard (with pipelining)
- Q: What performance advantage is needed for market acceptance?
A: If you don't change programming model, 10x (jump a generation), if you do,
2-3 generations ahead since more investment involved, could target apps where
code is only written once, like graphics device drivers, Chromatic is changing
user's model with hardware and software, Java is also causing many people to adapt
software for compatibility, DSP people have been doing this for a long time,
Mitsubishi is claiming to be ~20% of DRAM cost, video RAM was more costly but is
coming down, testing cost may increase but may be able to decrease testing cost with IRAM
3.1 - How Processor-Memory Integration Affects the Design of DSMs
- adding processors + mem means we need to redesign the nodes and reorganize the whole machine
memories as caches
- How do you benefit from this? - need to shoot for good geographical locality
- each block has address + state, so you don't have to check directory
- memory is treated as a cache, data replication + migration. Not COMA
- putting directory controller separately
- On chip directories are useless on uniprocessors
- allows varying number of directories depending on app.
- little is lost by placing directory off chip since DRAM cache has
few dir accesses and most dir accesses are due to coherency -> off chip
- Result: processing nodes using memory as cache, and directory nodes to keep distributed directory and execute coherency protocol
memory is used as backup to cover overflow from P nodes
- D nodes are only heavily used during startup, otherwise just coherency maintenance
- D nodes don't need much memory since they only handle overflow of P nodes, so you have multiple P nodes per D node
directory + memory state is cached in the caches of the D node
- eliminates redundant RAS with row buffers
- uses only off the shelf components (?)
- Q: Results were fast - what benchmarks / what speedup/ scaling?
A: We used bad benchmarks
- Q: What about large datasets?
A: Need more memory
3.2 - A Single Chip Multiprocessor Integrated with DRAM
- Single chip multiprocessor required to take advantage of DRAM bandwidth
- used SimOS for simulations of 4x2 way superscalar and single superscalar with 32 MB DRAM, 2 and 4 MB SRAM
- used 5 benchmarks, 4 of which were parallelized
- compress and eqntott slow down with DRAM - integers apps with small working sets which originally fit inside SRAM caches will slow down with DRAM on-chip
- With uniprocessor, even FP apps have little speedup with big working sets
- for MP, FP apps have performance improvement
- possible solution for paging: software protocol. This area needs work, since microsecond delays for paging negate the benefit of on-chip DRAM
- Conclusions: FP apps with MP can be much faster, as long as dataset fits on-chip
- Q: Is floorplan in paper?
A: yes, that is what we are thinking, in 0.25 um technology
- Q: It is 24mm on a side, isn't that big?
A: Yes, but in my experience this isn't unreasonable
- Q: Off chip mem - how much traffic was due to coherence? no problem with 4 proc, on chip?
A: Applications used were too small.
- Q: What will be the impact of off-chip accesses with larger programs - how big a limit is BW?
are your numbers conservative?
A: bank conflicts can't handle extra BW, 30 ns access time to DRAM picked and not examined(???)
- Q: speed vs bit density tradeoffs - can you reduce memory density and speed it up?
A: there are lots of tradeoffs, you lose area for speed then you have to go off chip more often with page faults
3.3 - A Multiprocessor Memory Processor for Efficient Sharing and Access Coordination
- problems - misses can be 100s of cycles, big impact on some programs - synchronization overhead, locks, etc.
- result is some programs slow down, some unusable
- memory modules have DRAM banks, exoprocessor and packet buffer, performs arithmetic and logical operations
- exoprocessor loads and stores only to its memory bank
- exopacket contains PID, control state, operands, mem addr of operands
- for execution, CPU sends exopacket to proper mem unit, which execs packets and may send another packet to another mem
- need to know when they will finish these ops and to not send too many -> counters
- with this scheme, don't wait for barriers but hard limits, so more useful work can be done with less waiting
- used fragments of code to test to demonstrate various aspects of the system
- 16 procs, 4x4, 21 cycle exoop exec latency, 42 cycle mem latency
- some test progs have significant speedup
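A toy sketch of the exoprocessing model; the class name and packet layout are invented for illustration. The point is that the operation executes at the memory module that owns the data, so operands never travel to the CPU and back:

```python
class MemoryModule:
    """A DRAM bank with a small exoprocessor that executes ops locally.

    The CPU ships an "exopacket" (opcode, destination, operands) to the
    module; the exoprocessor loads and stores only within its own bank.
    """
    def __init__(self):
        self.data = {}

    def execute(self, packet):
        op, dst, src_a, src_b = packet
        if op == "store":          # src_a is an immediate value
            self.data[dst] = src_a
        elif op == "add":          # both operands live in this bank
            self.data[dst] = self.data[src_a] + self.data[src_b]
        return self.data.get(dst)

m = MemoryModule()
m.execute(("store", "x", 3, None))
m.execute(("store", "y", 4, None))
print(m.execute(("add", "z", "x", "y")))  # -> 7
```

Because the result stays in memory, no cache line is invalidated on another processor right after the update, which is the coherency saving claimed in the answers below.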
- Q: Why is exo better to handle mem latency?
A: out of order execution has significant delays, which it tries to hide by speculating ahead anyway.
With many threads, this is OK, but with fewer threads it is more of a problem
- Q: Tags give migrated synchronization, what fraction of performance benefits are given by
loss of barriers vs. mem latency avoided ?
A: About 1/2 , and tags in memory avoids flurry of coherency activity in a processor
- Q: Going beyond 4 processors on a chip?
A: Not really because general purpose apps can't use it, however multimedia or spec95FP
apps can do well. Memory capacity and processing power need to be balanced to get best utilization
- Q: Any remarks on using the new memory bandwidth that is now available?
A: 40-50% on spec95 for 4 MP, cycle time is key point, adding extra banks increases BW but
cycle time is also important, less conflicts with multithreads on one chip - some can stall
and others can still use the BW
- Q: Exoprocessing is intriguing idea- any breakdown between data and synchronization?
A: Both are equally important, wasn't as much of a need for this before but now we need to
avoid coherency. By doing things in memory you avoid invalidations right after proc is
finished, envisioned for computation and synchronization both
- Q: Is there enough spatial locality to use BW effectively?
A: Need to transfer more data at a time but there is less spatial locality
- Q: What is the unit of memory? a page? a cache line? a word?
A: Cache lines for directory entries
- can't do parallel DRAM and directory access - will need to redesign interface
4. Limits and Future
4.1 - The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance
- memory latency wall, bandwidth wall, branch predictability wall (wider issue, deeper pipelines)
- Branch wall is most significant future problem (with future compilers and hardware)
- 16k instruction window with aggressive processor
- ideally would like to use 2010 benchmarks - instead use spec92 with small caches and scale results up
- need > 100 bytes per CPU cycle in the future, or 800 IO pins. Future trends say this is OK.
- mem BW can be solved by good eng., latencies < 100 cycles can be tolerated with pipelining and other techniques
- impact based branch prediction would help
- Comment: Perfect branch prediction is not a valid assumption to hold. The conclusion
one can draw from your work
is that "since we cannot achieve perfect branch prediction, we should work on the
memory bw/latency requirements"
- Comment: There is no bandwidth wall - with enough money you can get the bandwidth
- Q: what happens when you increase clock speed?
A: we assume clock speed increases the same (???)
4.2 - Processing in Memory: Chips to Petaflops
- Barriers are physics based, programming based (need million+ parallelism)
- gap between CPU and number of mem chips for petaflops is narrowing - need to integrate
- "Shamrock" - memory is rotated to alow access from CPU and silicon macro is tiled so that memory if physically and logically shared with its 4 nearest neighbors
- concurrent access to up 256 bits in each 128Kb to 1 Mb staced array
- Multhreading CPUs will run at 4 degrees K, CryoSRAM will be at 77K, (100+ GHz)
- optical switch interface (Princeton) and 3D optical RAM farm (at Caltech)
- proactively and preemptively manage memory hierarchy
- tiled PIM, integration with HTMT need overall programming model
- Q: What factors are most limiting?
A: how to program this thing is going to be a challenge
- Q: will VLIW be an option?
- Q: Can you reuse some of the computation after a branch miss?
- PIM: Q: Movement of data under software control?
A: We are looking into options
- PIM: Q: how do you move data from one corner to another?
A: Looked at bouncing along memory banks, or may use memory bus on top of CPUs
- Q: What will mem hierarchy look like in 2010? processing on chip with low traffic or -?
A: PIM: lots of things to look at, 3D IO, supercooled mem, more conventional case?
Lots of hierarchies examined in the paper not presented, lots of cases and depends on
A: Is processing centralized or processing in the memory?
When problem is distributed procs are hard to keep busy and app specific
A: Application specific, may want to throw more hardware to keep things busy
- Q: Who funds petaflops?? Good idea to bank on all these new technologies?
- MIMD paper: Q: one processor can utilize the mem BW on a chip, multiprocs would fight about it.
how much is to be believed about these roadmaps? what about fab costs and other factors?
A: We reevaluate the roadmap every two years
- Q: MIMD with one app is ok but more is not?
A: 4 working sets in the same space can result in higher BW requirements if no overlap is present
Open Mike session - 5 minute presentations
- Corinna Lee, Toronto
Computational Power of Vector processing
- vector processor may be good idea
- preview of vector simulations - T0 vector vs. 6-way R10000 Superscalar using PGP decryption
- vector processor significantly outperforms aggressive superscalar
- 20 mm2 vs 255 mm2 in 0.25 um process
- relying on compiler is the way to go, rather than throwing hardware at it
- Mateo Valero, Barcelona
Vector + IRAM
- superscalar vs vector
- vectors can deal with the latency of memory
- R10000 + out of order + multithreading much slower than vector processor
- Yunho Choi, Samsung
- Process difference of High Performance Logic and High Density DRAM is
getting wider and wider
- A major process change for high density DRAM for the small portion of
silicon area to be integrated (for logic) as CPU will not be cost effective
- 0.35um Logic + 16Mbit DRAM Merged Product has been recently achieved
and proved at the system
level with reasonable cost overhead
- Up to 0.25um Logic + 64Mbit DRAM integration may be achievable with
reasonable cost overhead
- But, beyond 0.18um Logic + 256Mbit DRAM the cost
overhead for small amount of logic + large density of DRAM will be extremely high
- So, architecture research based on the looser performance of logic in a
DRAM process should be encouraged
- Steve Przybylski, Principal Consultant, Verdande Group,
Money and Memory
Commodity DRAM misconceptions
- estimating cost based on the future price of memory is misleading, since price
is decoupled from the cost of producing the memory
- In most MLL designs, logic yield < DRAM yield
- Cost of MLL = CL*((Al+Ad)/Al)*alpha (Al = area of logic, Ad = area of
DRAM, CL is the cost of manufacturing the logic die on a logic process,
alpha is a factor to account for the increased cost (if any) of
manufacturing on a blended DRAM/logic process)
- 0.18 um is 1 Gb tech,
in theory yes, in practice, no
- 1Gb will be 0.13 or 0.10 tech, 3-5 years away for commodity prices
- transition beyond 0.18 is end of line for optical lithography, need to change to
improve technology beyond this
- useful tool: fill frequency: the ratio of BW in/out over capacity is the
frequency at which you can hit every bit (fill freq, or FF)
- today's commodity DRAMs are about 10-60 Hz
- crucial observations
- markets have requirements for fill freq., minimum must be met to be successful
- if memory system's FF < market required FF - product will fail
- therefore if DRAM FF is too low, product it is used in will fail
- FF for subsections are 50-200 Hz, at SA FF is 8-32 KHz
- thus if the app needs FF > 100-200 Hz then need to redesign core to expose SA (OUCH!!, don't)
- observation - very few such apps, need to be aware of this
- reference: "New DRAM Technologies: A Comprehensive Analysis of the
New Architectures", MicroDesign Resources, 1996. www.mdronline.com
- Kevin Kissel, SGI
IRAM: The "Right" Way and The "Wrong" Way
- CPU-centric approach is the wrong way to look at the problem,
IRAM won't be a SPEC engine
- Compromises on core size and design rules will work against
absolute performance of IRAM versus logic-process CPUs.
- DRAM bandwidth not necessarily easy to use: MP and vectors are
being proposed for IRAMs to deal with this "embarrassment of riches".
- Memory-centric approach is the more promising path.
- Relatively small amounts of logic can be made to do interesting
things in conjunction with DRAM arrays.
- Not general purpose, but interesting and cost-effective solutions
are already being found.
- This is a Good Thing for research, as there are more degrees of
freedom away from the SPEC world of mature ISAs.
- DRAM and Microprocessors will be combined, but it will be on DRAM's terms.
- Chip Weems, Umass/Amherst
Fallacies and pitfalls for SIMD-IRAM
- clock distribution -- tight skew tolerance across many
processing elements can severely limit clock rate,
- synch communication -- requires dedicated communication
lines that support either peak BW or sit idle
- supply-induced noise -- huge transients from parallel
memory accesses, arithmetic ops, etc.
- matching pitch of processors to memory -- typically have
just the equivalent width of a few metal lines in which to
pack an entire ALU bit slice
- RMW timing constrains logic -- DRAM people don't want to
adjust their timing, so the data path must be timed
to match the DRAM. If the DRAM timing changes in the next
generation, then the datapath may have to be completely redesigned.
- Let them eat IRAM (assuming they will just buy more
IRAM for more memory or more processing) -- people don't
want to accept this as their only alternative
- "The DRAM substitute" - Like trying to sell sugar as
nutrasweet substitute for a higher price. You can
eat up all their SIMM slots, have to deal with multiple
SIMM/DIMM form factors, and may end up making
things unexpandable or unportable. Saying that cost
is only 2 to 3 times DRAM doesn't help -- the CAM
community has been saying this for 30 years, with
very limited success.
- Off chip BW limits -- no SIMD system has ever been big
enough -- there always has to be provision for off-chip
bandwidth (both for memory and I/O)
- SIMD PE:
- Utilization -- dual of branch mispredict, very hard to keep a
significant portion of processors busy, and it's hard to
predict which ones will be idle and then make use of them
in some other way.
- Partitioning -- one monolithic array almost guarantees low
utilization of processing elements
- Data alignment -- the data is almost never where you need it.
The bandwidth for moving it can exceed the bandwidth
between the RAM and the processors.
- Indexing -- a useful feature in many SIMD programs, but the
addresses originate from a source that is orthogonal to
the normal address path.
- Bit serial:
- Corner turning -- orthogonal access is needed to provide
bit parallel access by external devices
- Variable-length operands -- almost impossible to make work
because there isn't any language support, and even if
it existed, it would require both new compiler technology
and a willing user community.
- N-squared factor -- multiply, divide, shift, etc. require
N-squared area-time product. This results in unacceptable
performance on common operations, especially FP.
- System level:
- Global reduction -- costly operation for large arrays, but
necessary for branch control
- Global control -- separate, costly chunk of external logic.
Suffers same problem as clock in terms of distribution.
- The library approach is an admission that the system is only
useful in a niche market (the one the library addresses).
- Need good compiler to go beyond niche markets, and
probably a new programming model (if not a new language)
which faces an uphill battle for acceptance.
- Barry Fagin
Relevance of "Whack-a-Mole" to IRAM (an observer's perspective)
- How to solve important problems in cost effective way should be main goal
- Technology will drive architecture for once?
- What is whack-a-mole? lots of other things to keep your attention on besides a
fancy 100x speedup
workshop web page