White Paper for Galaxy (Version.04.17.97)

Yuefan Deng and James Glimm
Center for Scientific Computing
Department of Applied Mathematics and Statistics
SUNY, Stony Brook, NY 11794-3600
http://ams.sunysb.edu/{~deng, ~glimm}

I. Introduction

We propose to gather computing expertise at Stony Brook (USB) to build a sub-Teraflop parallel computer for scientific and engineering applications in fluid dynamics, materials modeling, molecular biology, and medical simulation, as well as fundamental physics. The approach drives down the cost of computation by using off-the-shelf yet state-of-the-art computer components, and it is both scalable and cost-effective. The project will benefit USB in three ways: (1) it will boost USB computing power by a factor of 100, which should enable many projects in applied mathematics, physics, molecular biology, medicine, and engineering; (2) it will give USB significant visibility in this new approach to doing large-scale science at lower cost; (3) it may foster a stronger partnership between USB and industry.

II. The Goal

  1. Cost-performance: > 50 Mflops/$K
  2. Nodes: 32
  3. Per-Node speed: 1 Gflops. Candidates are:
    CPUs              Expected speed   Estimated cost   Available
    2 600-MHz Alpha   1 Gflops++       $25,000          6/97
    4 500-MHz Alpha   1 Gflops++       $14,000          9/96
    8 Pentium II      1 Gflops+        $8,000           9/97
  4. Total speed: 32 Gflops (12x Paragon-128)
  5. Total memory: 1024 MB x 32 = 32 GB
  6. Total disk space: 9 GB x 32 = 288 GB
  7. Total cost (ROUGH estimates) (see below):
    Price range   Configuration     Nodes (CPU+Mem)   Network   Misc (Disk+Power Rack+Tape Drive)   Total
    Lower bound   8x32 Pentium II   32 x $8,000       $50,000   32 x $1,000 + $10,000               $348,000
    Mid level     4x32 Alpha-500    32 x $14,000      $50,000   32 x $1,000 + $10,000               $540,000
    Upper bound   2x32 Alpha-600    32 x $25,000      $50,000   32 x $1,000 + $10,000               $892,000
  8. Network:
    1. Board-level bus: 500 MB/s
    2. ATM cluster of 32 nodes (max) with bandwidth 100-1000 Mbps
      Notes:
      (1) Latency depends on Node_OS (Linux) + Parallel_OS (MPI) + USBWrap.
      (2) Expected latency is O(10) microseconds.
      (3) Adding further nodes may require yet another level of ATM.
      (4) Adding further nodes may require a larger ATM switch with more than 32 ports.
      (5) Latency and bandwidth depend on configuration.
      (6) Alternative: Ultra Wide Ethernet, 300 MB/s.
    3. On-board ATM
  9. Distributed-memory
  10. MIMD
  11. Node OS: Linux (Alternative: Windows NT)
  12. Message Passing: MPI
  13. Auxiliary Parallel OS: Stony Brook Wrapper
  14. Machine to model: Paragon
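
The cost totals and the cost-performance goal in items 1 and 7 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the nominal 1 Gflops per node as the speed of every candidate (the candidate table lists "1 Gflops+" or better, so real figures may be higher), and the variable and function names are ours, not part of the proposal.

```python
# Figures copied from the cost table in Section II; all amounts in US dollars.
NODES = 32
NODE_GFLOPS = 1.0        # nominal per-node speed (assumption: same for all candidates)
NETWORK_COST = 50_000
MISC_PER_NODE = 1_000    # per-node part of the "Misc" column
MISC_FIXED = 10_000      # fixed part of the "Misc" column
TARGET_MFLOPS_PER_K = 50.0

NODE_COST = {
    "8x Pentium II": 8_000,
    "4x Alpha-500": 14_000,
    "2x Alpha-600": 25_000,
}

def total_cost(node_cost):
    """Total system cost: nodes plus misc per node, plus network and fixed misc."""
    return NODES * (node_cost + MISC_PER_NODE) + NETWORK_COST + MISC_FIXED

def mflops_per_kilodollar(node_cost):
    """Cost-performance in Mflops per thousand dollars at the nominal speed."""
    total_mflops = NODES * NODE_GFLOPS * 1000.0
    return total_mflops / (total_cost(node_cost) / 1000.0)

for name, cost in NODE_COST.items():
    ratio = mflops_per_kilodollar(cost)
    verdict = "meets" if ratio > TARGET_MFLOPS_PER_K else "misses"
    print(f"{name:14s} ${total_cost(cost):>8,d}  {ratio:5.1f} Mflops/$K  ({verdict} goal)")
```

At the nominal speed, the Pentium II and Alpha-500 configurations come out above 50 Mflops/$K, while the Alpha-600 configuration would need more than 1 Gflops per node to reach the goal.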

III. Comparative Analysis Of Various Systems

System              Pros                       Cons                          More Remarks
Node CPU
  Alpha-600         Fast, min network load     Expensive, untested           Better Future
  Alpha-500         Fast, low network load     Expensive                     --
  Pentium II        Cheap, popular             High network load             Better Now
Network
  Stand-alone ATM   Fast, convenient           Expensive                     Better Now
  On-board ATM      Fast, modular              Expensive, inconvenient       Better Future
  Ultra Ethernet    Cheap, popular             Slow, replaced soon(?)        --
Node OS
  NT                Convenient, popular        Costly, unnecessarily large   --
  Linux             Source available, $0       More work                     Better Now

IV. Semi-Final Choices

  1. Fast Ethernet-Networked Pentium II system.
  2. ATM-Networked Alpha system.
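
One rough way to compare the two network options is the usual latency-plus-bandwidth model: the time to send a message is a fixed startup latency plus the message size divided by the link bandwidth. The sketch below uses the ATM range (100-1000 Mbps) and the O(10) microsecond latency from Section II. The 100 Mbit/s figure for Fast Ethernet is the nominal rate of that standard and is not stated in this paper, and applying the same latency to every link is a simplifying assumption that is unlikely to hold in practice. Function and variable names are illustrative only.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mbit_s):
    """Estimated one-way message time in microseconds.

    Model: startup latency + size / bandwidth.
    1 Mbit/s equals 1 bit per microsecond, so bits / (Mbit/s) is in microseconds.
    """
    bits = message_bytes * 8
    return latency_us + bits / bandwidth_mbit_s

LATENCY_US = 10.0  # order-of-magnitude expectation from Section II (assumed for all links)

LINKS_MBIT_S = {
    "Fast Ethernet (nominal)": 100.0,   # assumption: standard rate, not from this paper
    "ATM, low end": 100.0,              # from Section II
    "ATM, high end": 1000.0,            # from Section II
}

for size in (100, 10_000, 1_000_000):   # hypothetical message sizes in bytes
    print(f"message size: {size:>9,d} bytes")
    for name, mbit_s in LINKS_MBIT_S.items():
        t = transfer_time_us(size, LATENCY_US, mbit_s)
        print(f"  {name:24s} {t:12.1f} us")
```

Under this model, short messages are dominated by latency, so the choice of link matters little for them, while long messages are dominated by bandwidth, where the high end of the ATM range is about ten times faster.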

V. Plan Leading To 32 Gflops And Beyond

Dates          Expectation                                                                                          Remarks
04/97--06/97   Fully test 1-node (4-processor/node Pentium Pro) shared-memory on OS+MPI+USB-Wrap                    On 2DFT, MD
06/97--09/97   Fully test 2-node (8-processor/node Pentium II) distributed shared-memory on OS+MPI+USB-Wrap         On 3DFT; 1X Paragon
09/97--12/97   Fully test 8-node (8-processor/node Pentium II) distributed shared-memory on OS+MPI+USB-Wrap         4X Paragon
12/97--05/98   Fully test 32-node (8-processor/node Pentium II) distributed shared-memory on OS+MPI+USB-Wrap        12X Paragon
05/98--12/98   Fully test 4X 32-node (8-processor/node Pentium II) distributed shared-memory on OS+MPI+USB-Wrap     50X Paragon; $1M (Speculative)

VI. Current Paragon in AMS

  1. 128-Node, 2D mesh, MIMD, distributed-memory
  2. Memory: 32 MB x 128 = 4.0GB
  3. Total disk: 20GB
  4. Network bandwidth: 200 MB/s (peak); 30-75 MB/s (observed)
  5. Network latency: below 10 microseconds (observed)
  6. Peak speed: 75 Mflops x 128 = 9.6 Gflops
  7. Typical speed: (75/3) Mflops x 128 = 25 Mflops x 128 = 3.2 Gflops
  8. Cost: $1M (1991 Price)
  9. History: 5/91, first 56 nodes; 8/93, 8 nodes; 4/94, 64 nodes
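
The Paragon figures above give a baseline for the proposed machine. The sketch below recomputes the peak and typical aggregate speeds from the per-node numbers and compares them with the nominal 32 Gflops target of Section II. The one-third ratio of typical to peak speed is taken from item 7. The names are illustrative only.

```python
PARAGON_NODES = 128
PEAK_MFLOPS_PER_NODE = 75.0
TYPICAL_DIVISOR = 3.0        # typical speed is taken as one third of peak (item 7)
GALAXY_GFLOPS = 32.0         # nominal target: 32 nodes x 1 Gflops (Section II)

def aggregate_gflops(nodes, mflops_per_node):
    """Aggregate speed in Gflops for a machine of identical nodes."""
    return nodes * mflops_per_node / 1000.0

paragon_peak = aggregate_gflops(PARAGON_NODES, PEAK_MFLOPS_PER_NODE)
paragon_typical = aggregate_gflops(PARAGON_NODES, PEAK_MFLOPS_PER_NODE / TYPICAL_DIVISOR)

print(f"Paragon peak:     {paragon_peak:.1f} Gflops")
print(f"Paragon typical:  {paragon_typical:.1f} Gflops")
print(f"Galaxy / Paragon typical: {GALAXY_GFLOPS / paragon_typical:.1f}x")
print(f"Galaxy / Paragon peak:    {GALAXY_GFLOPS / paragon_peak:.1f}x")
```

With the nominal 32 Gflops, the ratio to the typical Paragon speed comes out at about 10x. The 12x quoted in Section II would correspond to a per-node speed somewhat above 1 Gflops, which is consistent with the "1 Gflops+" entries in the candidate table but is not specified further here.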