100mW 1TFLOPS Exascale Processor

2017-06-09

◈ Title : 100mW 1TFLOPS Exascale Processor
◈ Speaker : Sung Bae Park, Vice President (SAMSUNG Electronics)
◈ Date & Time : Friday, March 16, 2012 (14:00 ~ 15:30)
◈ Place : LG Research Building, Room #101
◈ Host : Prof. Hong Jun Park (Tel. 2234)
            BK21 Educational Institute of Future Information Technology
◈ Abstract : 1 YottaFLOPS (10^24 FLOPS) is needed for a 1ms protein folding simulation of 20K atoms. The quest for high-performance processors has been limited by power and price. As of 2011, the known performance per 100mW was 0.05GF for CPU, 0.2GF for GPGPU, 1GF for DSP and 10GF for dedicated HW (all normalized to an H.264 codec workload in the 32nm technology node). For a 100mW 1TFLOPS Exascale Processor in 2020, a 1/1000 power reduction is needed, as follows:

1) 1/10 by architecture innovation: Only 1% of the nodes in a CPU and 10% in dedicated HW are switching even under heavy workload, so there is plenty of room to leverage extensive clock gating and/or to implement the best-fit design for a given silicon budget. A processor architecture that borrows dedicated HW's complex operations and an implicitly addressable distributed stack queue, and that is micro-threaded with extreme WideIO main memory and massive simultaneous thread handling, will greatly reduce the number of execution cycles, while most of the complexity can be handled by the compiler and libraries to keep a simple, regular, deep ULIW-arrayed pipeline that enables the highest clock frequency. The talk will provide inspiration on how to unify the computing architecture from very fine-grain threaded applications such as CS (Communicative Sequential) to massively threaded ones such as 3D graphics, medical and atom-level simulations, leveraging various levels of SW and HW libraries.

2) 1/10 by technology migration: Migration from 45nm to 8nm may result in a larger silicon budget than the scaled one, to compensate for the margin needed against overly sensitive dynamic PVT variations. LAGS (Local Asynchronous Global Synchronous) design will help provide the tolerance needed to make the best use of extreme scaling. Fast and simple MUX circuits are becoming more and more important for speed, area and power. For example, for a 256:1 forwarding path, 256-bit SIMD shuffling paths, and a 128-bit x 128-bit crossbar with a 128-layer, up to 8-bit shuffling network, all we need is a super-MUX, and this talk will address how to handle it.

3) 1/10 by ELV (Extreme Low Voltage) engineering: Reduce VDD to the lowest possible level, such as 0.1V, while maintaining a competitive Ion of 1mA/um for maximum speed and an Ioff of 1nA/um for minimum static power, leveraging the sweet-spot ABB: Vbackbias for bulk, and Vbackgate and Vchannel for SOI.
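
The arithmetic behind the abstract's figures can be checked with a short sketch. The Python below is only an illustration and is not part of the talk: the names `required_gain`, `combined_reduction` and `exaflops_power_kw` are hypothetical, and the exaFLOPS estimate assumes ideal scaling with no power spent on memory, interconnect or cooling.

```python
import math

# GFLOPS delivered per 100 mW in 2011, as quoted in the abstract.
GFLOPS_PER_100MW_2011 = {
    "CPU": 0.05,
    "GPGPU": 0.2,
    "DSP": 1.0,
    "dedicated HW": 10.0,
}

# 2020 target from the title: 1 TFLOPS (1000 GFLOPS) in 100 mW.
TARGET_GFLOPS_PER_100MW = 1000.0

# The three 1/10 steps listed in the abstract, written as divisors.
# The key names are illustrative labels, not terms from the talk.
REDUCTION_STEPS = {"architecture": 10, "technology": 10, "low_voltage": 10}


def required_gain(baseline_gflops: float) -> float:
    """Efficiency gain needed to reach the target from a 2011 baseline."""
    return TARGET_GFLOPS_PER_100MW / baseline_gflops


def combined_reduction(steps: dict) -> int:
    """Overall power divisor obtained by multiplying the individual steps."""
    return math.prod(steps.values())


def exaflops_power_kw(chip_flops: int = 10**12, chip_mw: int = 100) -> float:
    """Compute power (kW) of enough chips to reach 10^18 FLOPS (assumed ideal scaling)."""
    chips = 10**18 // chip_flops
    return chips * chip_mw / 1_000_000  # mW -> kW


if __name__ == "__main__":
    for name, gflops in GFLOPS_PER_100MW_2011.items():
        print(f"{name}: needs about {required_gain(gflops):,.0f}x")
    print(f"Combined reduction: 1/{combined_reduction(REDUCTION_STEPS)}")
    print(f"ExaFLOPS compute power: {exaflops_power_kw():.0f} kW")
```

Under these figures, the three 1/10 steps multiply to the stated 1/1000, which corresponds to the gain needed from the 1GF DSP baseline; reading the DSP as the intended reference point is an inference, since the abstract does not say so. One million such 100mW chips would reach 10^18 FLOPS at about 100 kW of compute power.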

 
