Winter 2000 IRAM Retreat Feedback
=================================

Harvey Stiegler (TI)
  - collect presentations and post them on the web
    - in PowerPoint, hopefully also PDF and HTML
    - [Tetzlaff: keep it up between retreats]
  - AME
    - benchmarks define the field == what gets measured gets done
    - benchmarks will define what you look at
    - don't do marketing in your own building
    - go talk to outside customers, find out what they want you to measure; it will determine what you base your research on later
  - Do the projects overlap?
    - more synergy here than people think
    - draw a diagram of how IRAM, ISTORE, OceanStore relate (for outsiders)
  - IRAM apps
    - PDA, multi-node, SmartSIMM
    - glad Jim is proposing to build all of them
    - probably design is not optimized for all of those
    - think about what would change if optimized for each
  - Current IRAM power estimate?
  - IRAM Testing
    - really important to do coverage analysis
    - have you exercised all design features, gates
    - want close to 100% with your test sequences
  - FFT
    - chart shows architecture of VIRAM is approx.
      comparable to modern DSPs; you're really comparing the architectures, but it doesn't take into account the advantage of having the big memory on-board
    - published FFT numbers from company assume all data stored in mem, but they have less mem than you
    - if you start making FFTs bigger or have apps that want to do many of them, or want to do 2D FFT, when does that break a conventional DSP but IRAM continues OK
    - corollary: on what apps is that impt
    - advantage of big on-chip mem is not showing up here

Konrad Lai (Intel)
  - last time here 2 years ago, a lot is new
  - IRAM
    - happy to see settled down on some features, near tape-out
    - glad you dropped a lot of features
    - for next 6 months, testing is important
    - also full-chip models
    - need to test using JTAG; make sure JTAG, other debugging support works
    - think about packaging
    - need to demonstrate advantage of on-chip RAM
      - could do same using DSP + integrated mem
      - perhaps compare using clock cycles instead of absolute time
  - ISTORE
    - getting interesting
    - need to look at how other people run same projects
    - building hw took a lot of (too much) time
      - except for DP, you're building a $400 PC
      - ability to inject faults is important, though
      - walk through qualitatively; thinking process more important than building it; can find software mechanism to do equivalent
      - you may have forgotten some stuff, like battery sensor
    - s/w management
      - supercomputer people all have emergency management port on Linux clusters
      - major issue is lack of s/w for how to use it
      - VA-Linux has a project in that area
        - 256 or 512-node cluster sold to Argonne National Lab
      - cheaper way to get into problem earlier
      - Dave P: We're using ISTORE-0
    - focus on maintenance is interesting
      - people trying to do OLTP on PCs (e.g. CMU)
      - more interesting to do mgmt research
      - Net-PC stuff is related
        - e.g. control booting through ethernet card
        - LanDesk, Tivoli, etc.
  - OceanStore
    - very interesting
    - sounds like lot of progress in only 6 months

Bill Bolosky (Microsoft)
  - IRAM
    - Speech recognition - if you'd build it, I'd buy it
  - ISTORE
    - sounded like solution looking for a problem
    - sounds like you've iterated over apps and designs last couple of years
    - what is it -- need an answer to sell the thing, explain it to people
    - design principles are right on
    - history is evolution: HW provides more perf, SW people build more features (spend perf to produce features)
    - features we need to spend perf on now are to make the system *work*
      - working correctly is a feature
      - tradeoff perf, functionality, ship date
      - spend development time, money, cycles on making things easy to use and maintain
  - AME benchmarks
    - measuring response of system to events is good, but what is frequency of the events
    - concrete suggestion: why do computer systems fail, redo that paper (Jim Gray paper)
      - today: web server, big DBs, workstations
      - figure out why stuff stops today
      - also is essential input to your simulations
    - I think you'll find people mismanaging computers is main problem
      - mismanagement => lack of availability
      - go from AME to MAE - make manageability go first
    - find friendly people who run these systems
      - e.g. talk to Brewster
      - Hotmail, Amazon, ...
      - Dave: why ISPs stop (e.g. Inktomi)
      - Konrad: difficult to generalize since lots of custom stuff at ISPs
      - Bill: real world is heterogeneous
      - have neutral party coordinate (auditing/accounting firm)
  - OceanStore
    - giant vision of what you're going to do, impossible to solve in total
    - need to narrow your focus - what pieces are you going to do
    - security and denial of service fascinating, getting reliability out of unreliable parts
    - claiming to run on a truly impressive scale
      - everyone on planet has 10k files => 10^14 files
      - right now mean file size at MS is 32-64K
      - push to 100K you get 10M TB before replicate
    - Kubi: what's most important?
      - you have to get them all right
      - geographic scale stuff, hetero network probably most impt
      - must have introspection work right in this env
    - Bloom Filter work was way off-scale
    - extraordinary claims require extraordinary proof
    - need to weaken file consistency model (like web)
    - they have Sigmetrics paper on file size, etc. distribution

Brian Hold (Micron)
  - IRAM
    - The chip size might be a problem when we put it on a wafer (Response from Steve: IBM has said they'll fab this.)
    - Cost: about $1M to fab (if we were to pay for it). Could be up in the $10M range depending on size. Retooling and engineering time are expensive: piggybacking on an existing project would be a big savings.
    - The trick to piggybacking is getting the chip in some normal size. (Dave responded: this is more like an ASIC fab, which is why they are willing to do this specialized kind of run.)
    - Brian is working on an "Active Memory" project at Micron that is very similar to IRAM.
  - ISTORE
    - Look at MS specs for Wired for Management (WfM)
      - for ideas on how to make the data useful (flight recorder data)
      - WfM is tool that makes it useful
      - sits in BIOS of PC, so net manager can monitor PC
      - not on scale you're talking about, but is corporation-wide type of thing
      - to give you ideas on how and what to do with info you're collecting
      - generate red flags for apps; you're managing a network
  - OceanStore
    - I agree with Bill on OceanStore; should be more like LakeStore
      - address at corporation level first
    - Can you convince people to put their confidential data there?
      - corporate intellectual property?
      - someone else handling the hardware spooks Fortune-500
    - Ric Wheeler: mix of AOL servers and laptops is interesting model too [Great Lakes Store?]

Bill Tetzlaff (IBM)
  - IRAM hardware
    - things progressed slowly since last summer
    - never mentioned IBM and no IBM people here
    - should be more IBM people involved
  - ISTORE hardware
    - wish I heard more from IBM on that (disks coming from IBM)
    - Diagnostic Processor a good idea
      - we've done it with mainframe
      - dual-processor with hot standby and journaling state to other node
      - I'll get open literature references to this stuff; send me email
    - Aaron's stuff interesting; like idea of finding out how often these things really happen
    - Run a workshop, invite people to come talk about their problems
  - Objectives and principles
    - less dismayed than Bill
    - looks like a rethink from broad/global
    - not far enough along to know what doing next, but good to be going through the exercise
  - OceanStore
    - I like the enormous vision
    - biggest challenge is to cope with the scale
  - disappointed with IBM participation; probably my fault

Ric Wheeler (EMC)
  - great few days, learned a lot
  - great to see something vertical HW -> OS -> new apps
    - like what real systems companies do - e.g. EMC
    - good experience for grad students
  - enjoyed the AME Benchmarking talk; great way to look at systems
  - we have lots of failure data, can't tell you, but some of our customers might be willing to share it with you
  - I'd like to give a talk/write a paper on how enterprise storage differs from the kind of research academics do; you come out here too
  - can get data from big customers of system vendors (e.g.
    AOL)
  - our experience has been you have to use every kind of redundant hardware you can, because when you need it it won't work; so use everything all the time, make sure you can get by with less of it
    - everyone doing work, but you can get by with half
  - modeling things like 2x or 3x mirroring good; also steep cost factor
  - reiterate importance of configure, administer big boxes
    - Brewster's NFS mount points hellish to use
    - need lots of tools
    - Beowulf cluster people have such tools
    - VA-Linux 24 quad Xeons; headless BIOS boot important; working on it but not there yet; think about how to install, update, rev software

Jim Kohn (SGI)
  - IRAM
    - interested in s/w, perf issues
  - IRAM hw
    - pleased you're still planning to tape out this year
    - dropping things is fine; seems you've made reasonable tradeoffs
    - dismayed that scalar processor has changed at this late date
      - change towards more simple, classic processor
      - scalar unit raises yellow flags if target is high-perf vector processor
    - mem consistency model
      - could dramatically impact vector perf, particularly high-level code with conventions
      - but you have direct control of your app space, so should be able to impact choices of mem consistency model, etc.
  - IRAM SW
    - good you have a working processor
    - wouldn't rush into tuning right away; might be opportunity to provide flexibility in the what-if area
    - at Cray we're trying to provide a little flexibility in s/w conventions in case
    - recommend collaboration with us (we're at finalization stage)
    - focus on memory consistency software model (already well defined in hw)
    - need sw model that delivers high perf
    - capture key kernels as libraries to use in many apps
  - expectations
    - as you tune perf, do some eval. of key IRAM applications to see how perf breaks out -- what was root cause
    - collaborate on SW conventions, start up a library
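
Note on the OceanStore scale estimate in Bill Bolosky's feedback (10k files per person => 10^14 files; 100K mean file size => 10M TB before replication): the arithmetic can be checked with a short calculation. This is only a sketch under stated assumptions that are not in the notes: a population rounded to 10^10 (implied by the 10^14 figure), K taken as 10^3 bytes, TB taken as 10^12 bytes, and a replication factor of 3 that is purely hypothetical.

```python
# Back-of-envelope check of the OceanStore scale estimate from the notes.
# All constants marked "assumption" are illustrative, not from the retreat.

PEOPLE = 10**10                 # assumption: population rounded so 10k files each gives 10^14
FILES_PER_PERSON = 10**4        # "everyone on planet has 10k files"
MEAN_FILE_BYTES = 100 * 10**3   # "push to 100K", with K = 10^3 bytes (assumption)
BYTES_PER_TB = 10**12           # decimal terabyte (assumption)


def total_files(people=PEOPLE, files_per_person=FILES_PER_PERSON):
    """Total number of files across all users."""
    return people * files_per_person


def raw_capacity_tb(n_files, mean_file_bytes=MEAN_FILE_BYTES):
    """Unreplicated capacity in whole TB (integer division)."""
    return n_files * mean_file_bytes // BYTES_PER_TB


def replicated_capacity_tb(n_files, replicas, mean_file_bytes=MEAN_FILE_BYTES):
    """Capacity in TB once every file is stored `replicas` times."""
    return raw_capacity_tb(n_files, mean_file_bytes) * replicas


if __name__ == "__main__":
    n = total_files()
    print("files:", n)
    print("raw capacity (TB):", raw_capacity_tb(n))
    # Replication factor of 3 is a hypothetical value for illustration.
    print("with 3 replicas (TB):", replicated_capacity_tb(n, 3))
```

Under these assumptions the raw figure comes out to 10^7 TB, matching the "10M TB before replicate" in the notes; passing mean_file_bytes=32_000 (the low end of the quoted 32-64K) gives 3.2M TB instead.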