HIGH PERFORMANCE COMPUTING
architecture. It is essentially a deployment of FPGAs in the cloud, interconnected with CPUs to provide a configurable compute layer of programmable silicon. The project started in 2010, when a small team led by Doug Burger and Derek Chiou began exploring alternative architectures and specialised hardware such as GPUs, FPGAs and ASICs. The team developed an FPGA-based system for cloud computing that offered better energy efficiency than CPU- or GPU-based systems for the same task, without the level of risk associated with developing a custom ASIC. Project Catapult's board-
level architecture is designed to be highly flexible. The FPGA can act as a local compute accelerator, an inline processor, or a remote accelerator for distributed computing. In this design, the FPGA sits between the datacentre's top-of-rack (ToR) network switches and the server's network interface chip. Network traffic is routed through the FPGA, which can perform line-rate computation even on high-bandwidth network flows. Using this acceleration
fabric, Microsoft can deploy distributed hardware microservices (HWMS) with the flexibility to harness a scalable number of FPGAs, from one to thousands. In turn, cloud-scale applications can leverage a scalable number of these microservices with no knowledge of the underlying hardware. By coupling this approach with nearly a million Intel FPGAs deployed in its datacentres, Microsoft has built a supercomputer-like infrastructure that can run specific machine learning and deep learning algorithms with impressive performance and energy efficiency.
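To make the microservice model concrete, the sketch below shows how an application might submit work to a scalable pool of FPGA endpoints without knowing which device serves it. This is a minimal illustrative sketch: the class, the endpoint names and the round-robin dispatch are assumptions, not Microsoft's actual interface.

# Hypothetical sketch of the hardware-microservice idea: an application
# submits work to a pool of FPGA-backed workers without knowing how many
# devices sit behind it. All names here are illustrative.
from concurrent.futures import ThreadPoolExecutor

class HardwareMicroservicePool:
    """Dispatches requests across a scalable set of FPGA endpoints."""
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)  # e.g. network addresses of FPGAs
        self._pool = ThreadPoolExecutor(len(self.endpoints))
        self._next = 0

    def _call(self, endpoint, payload):
        # Placeholder for a real RPC to the FPGA over the network fabric.
        return f"{endpoint} processed {payload!r}"

    def submit(self, payload):
        # Simple round-robin: the caller never sees which FPGA does the work.
        endpoint = self.endpoints[self._next % len(self.endpoints)]
        self._next += 1
        return self._pool.submit(self._call, endpoint, payload)

# One to thousands of FPGAs can sit behind the same interface.
pool = HardwareMicroservicePool([f"fpga-{i}" for i in range(8)])
futures = [pool.submit(x) for x in ["img0", "img1"]]
results = [f.result() for f in futures]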
CAN FPGAS BEAT GPUS IN ACCELERATING NEXT-GENERATION DEEP NEURAL NETWORKS?
In today's big data era, businesses and consumers are inundated by large volumes of data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. This data comes in a number of formats, from structured, numerical data in traditional databases to unstructured text documents, email, video, audio and financial transactions. Effective analysis of this data is key to generating insights and driving better decision making, and machine learning (ML) algorithms are extensively used in modern data analytics. Deep neural networks (DNNs), a specific type of ML algorithm, are becoming widely adopted for image classification, as they excel at recognising objects in images, offering state-of-the-art accuracies. Current-generation DNNs, such as AlexNet and VGG, rely on dense floating-point matrix multiplication (GEMM), which maps well to the capabilities of GPUs, with their regular parallelism and high Tflops. While FPGAs are much more energy efficient than GPUs (important in today's IoT market), their performance on DNNs has not matched that of GPUs.
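To ground the GEMM workload just described, here is a minimal NumPy sketch of the dense FP32 matrix multiplication at the heart of a fully connected DNN layer; the shapes and data are illustrative only.

# Minimal sketch of the dense FP32 GEMM underlying DNN layers
# (a fully connected layer computes Y = X @ W + b). Shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4096), dtype=np.float32)    # batch of activations
W = rng.standard_normal((4096, 1000), dtype=np.float32)  # layer weights
b = np.zeros(1000, dtype=np.float32)

Y = X @ W + b  # dense GEMM: 64 x 4096 x 1000 multiply-accumulates

# This regular, dense arithmetic is exactly what a GPU's many FP32 units
# and high memory bandwidth are built for.
print(Y.shape)  # (64, 1000)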
A series of tests conducted by Intel evaluated the performance of two latest-generation FPGAs (Intel's Arria 10 and Stratix 10) against the latest high-performance GPU, the Titan X Pascal, on DNN computation.
GPU vs FPGA trends
GPUs have traditionally been used for DNNs due to their data-parallel computations, which exhibit regular parallelism and require high floating-point computation throughput. Each generation of GPU has incorporated more floating-point units, more on-chip RAM and higher memory bandwidth, in order to offer increased flops. However, computations
exhibiting irregular parallelism can challenge GPUs, due to issues such as divergence. Also, since GPUs support only a fixed set of native data types, custom-defined data types may not be handled efficiently, contributing to underutilisation of hardware resources and unsatisfactory performance. Unlike GPUs, FPGA
architecture was conceived to be highly customisable and, in recent years, five key trends have led to significant advances in FPGAs, bringing their performance closer to that of state-of-the-art GPUs. Firstly, next-generation FPGAs incorporate much more on-chip RAM. Secondly, technologies such as HyperFlex enable dramatic improvements in frequency. Third, many more hard DSPs are available. Fourth, the integration of HBM memory technologies leads to an increase in off-chip bandwidth. Finally, next-generation FPGAs use more advanced process technology, such as 14nm CMOS. The Intel Stratix 10 FPGA has more than 5,000 hardened floating-point units (DSPs), over 28MB of on-chip RAM (M20Ks), integration with high-bandwidth memories (up to 4x250GB/s per stack, or 1TB/s), and improved frequency from the new HyperFlex technology, leading to a peak of 9.2 Tflops of FP32 throughput.
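As a rough sanity check on that peak figure, the arithmetic below assumes 5,760 DSP blocks, each completing one fused multiply-add (two flops) per cycle at an 800MHz clock; both the DSP count and the clock are assumptions for illustration, not figures from the article.

# Back-of-envelope check on the quoted 9.2 Tflops peak.
dsps = 5760          # hardened FP32 DSP blocks (assumed)
flops_per_cycle = 2  # one fused multiply-add = 2 flops
clock_hz = 800e6     # assumed achievable clock with HyperFlex

peak_tflops = dsps * flops_per_cycle * clock_hz / 1e12
print(f"{peak_tflops:.1f} Tflops")  # 9.2 Tflops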
FPGA development environments and toolsets are also evolving, enabling programming at a higher level of abstraction. This makes FPGA programming more accessible to developers who are not hardware experts, speeding up the adoption of FPGAs into mainstream systems. Recent work by Intel studied
various GEMM operations for next-generation DNNs. A DNN hardware accelerator template for FPGAs was developed, offering first-class hardware support for exploiting sparse computation and custom data types. The template supports various next-generation DNNs and can be customised to produce an optimised hardware instance for a user-given DNN variant.
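To illustrate the kind of custom data type such a template can exploit: in a binarised DNN, weights and activations are constrained to +1 or -1, so a dot product collapses to an XNOR followed by a popcount, operations that are extremely cheap in FPGA logic. Below is a minimal NumPy sketch of the idea; the 0/1 bit encoding is an illustrative assumption, not the paper's exact scheme.

# Sketch of a binarised dot product: with values in {-1, +1}, encode -1 as 0
# and +1 as 1; then dot(a, w) = 2 * popcount(XNOR(a, w)) - n, i.e. the number
# of matching bits minus the number of mismatches.
import numpy as np

def binarised_dot(a_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """a_bits, w_bits: uint8 arrays of 0/1 encoding -1/+1 values."""
    n = a_bits.size
    matches = np.count_nonzero(a_bits == w_bits)  # popcount of XNOR(a, w)
    return 2 * matches - n                        # matches minus mismatches

rng = np.random.default_rng(1)
a = rng.integers(0, 2, 4096, dtype=np.uint8)
w = rng.integers(0, 2, 4096, dtype=np.uint8)

# Check against the equivalent +/-1 floating-point dot product.
assert binarised_dot(a, w) == int((2.0 * a - 1) @ (2.0 * w - 1))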
This template was then used to run and evaluate key matrix multiplication operations for next-generation DNNs on the current and next generations of FPGA (Arria 10 and Stratix 10), as well as on the latest high-performance Titan X Pascal GPU. This work found that the Stratix 10 FPGA was 10 per cent, 50 per cent, and 5.4 times better in performance (TOP/sec) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarised DNNs, respectively. These tests also showed
that both the Arria 10 and Stratix 10 FPGAs offered compelling energy efficiency (TOP/sec/watt), with both devices delivering between three and 10 times better energy efficiency than the Titan X GPU.
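For clarity, the energy-efficiency metric quoted here is simply throughput divided by power draw; the numbers in this one-line sketch are hypothetical, purely to show the calculation.

# TOP/sec/watt = sustained throughput / power draw (hypothetical figures).
def tops_per_watt(tera_ops_per_sec: float, watts: float) -> float:
    return tera_ops_per_sec / watts

print(tops_per_watt(10.0, 100.0))  # 0.1 TOP/sec/watt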
Although GPUs have traditionally been the undisputed choice for supporting DNNs, recent performance comparisons across two generations of Intel FPGAs (Arria 10 and Stratix 10) and the latest Titan X GPU show that current trends in DNN algorithms may favour FPGAs, and that FPGAs may even offer superior performance. The paper concludes that:
‘With results showing that the Stratix 10 out-performs the Titan X Pascal, while using less power, FPGAs may be about to become the platform of choice for accelerating DNNs’.