SCW October/November 2018

HIGH PERFORMANCE COMPUTING

existing approaches over the past five years. There’s similarly rapid progress in other machine-learning fields, such as autonomous driving, and the semantic analysis of text to enable services such as online chatbots.

The FPGA advantage

A lot of machine-learning work to date has been done using CPUs, or GPUs whose architectures happen to match the computational requirements of algorithms such as CNNs, so where do FPGAs fit? The answer is that FPGAs have a unique set of attributes that give them particular advantages for the implementation of machine-learning algorithms.

The first of these is the flexibility that comes with the fact that FPGAs are simply reconfigurable hardware. This means that as algorithms change, so can the hardware. It also means that FPGA users can control the whole datapath of their application, from input through processing to output. This sounds obvious, but many comparisons of machine-learning performance look at how fast a CPU runs an algorithm, rather than how fast the algorithm works in its systemic context. An FPGA implementation, however, can be programmed to achieve the greatest systemic performance, and will deliver it in a highly deterministic way, unlike a CPU-based approach subject to interrupts.

The structure of FPGAs is also a good match for many machine-learning algorithms, with highly distributed logic resources, rich interconnect schemes and plenty of distributed local memory. This is important because neural network implementations usually involve many distributed computations to mimic the neurons, lots of local storage to hold intermediate values, and rich interconnections to pass the outputs of one layer of neurons to the inputs of the next, or among neurons in the same layer. Keeping these values and connections on-chip improves performance and reduces power consumption by cutting the number of off-chip memory accesses needed to implement machine-learning algorithms.
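The computational pattern described above can be sketched in a few lines of Python. This is only an illustrative software model, not an FPGA implementation and not any vendor's API: the function names, the toy weights and the choice of tanh as the transfer function are all assumptions made for the example. It shows the three ingredients mentioned in the text: many independent per-neuron computations, intermediate values held between layers, and the outputs of one layer feeding the inputs of the next.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: a multiply-accumulate over its inputs, then a transfer function."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w  # each product is independent, so hardware could compute them in parallel
    return math.tanh(acc)  # tanh is an assumed, illustrative transfer function


def layer(inputs, weight_rows, biases):
    """One layer: every neuron reads the same inputs and does not depend on its peers."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Pass the outputs of each layer to the inputs of the next."""
    values = inputs
    for weight_rows, biases in layers:
        # 'values' plays the role of the intermediate results kept in local storage
        values = layer(values, weight_rows, biases)
    return values


# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
result = forward([1.0, 2.0, 3.0], toy_layers)
print(result)
```

On a CPU the loops above run largely one step at a time. The point made in the text is that reconfigurable hardware can lay out the multiply-accumulate operations side by side and keep the intermediate values next to the logic that uses them.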

The highly parallel hardware of an FPGA also promises low latency and high throughput, essential characteristics in applications such as advanced driver assistance systems. The performance and programmability of an FPGA’s I/O capabilities also make it easier to integrate the devices into a system, and to adapt the implementation for different markets or evolving I/O standards. It would be possible, for example, to use FPGAs to apply machine-learning strategies to pre-process data as it comes off disc storage in a large data centre, before it even reaches the server farm.

Intel programmable acceleration card (PAC) with Intel Stratix 10 SX FPGA

Applying FPGAs to machine learning

Many developers working on machine learning are used to implementing their algorithms by writing software in relatively high-level languages and then running it on CPUs or GPUs, with compilers and related tools doing the job of parallelising the task across multiple processor execution threads, cores or even CPUs. Making the move to running algorithms directly on hardware, even if it is reprogrammable, may seem alien to their software-centric outlook.

This needn’t be so. The history of software development has been one of a steady rise in the level of abstraction at which developers have expressed themselves. This is also true in the field of machine learning: Google, for example, has released TensorFlow, an open-source software library for ‘numerical computation using data-flow graphs’, in other words, neural networks and related algorithms. Nodes in the graph represent mathematical operations, such as the transfer functions of the neurons we described above, while the graph edges represent the multidimensional data arrays (tensors in Google’s terms, weighted values in ours) that are communicated between them.

Intel PSG is implementing some of the key primitives used in common machine-learning algorithms so that developers who are used to working at the level of abstraction of the TensorFlow library can also do so when FPGAs are the implementation target. One way in which this is delivered is as a Deep Learning Accelerator, which enables users to implement the key primitives on an FPGA in such a way that they can then configure various network topologies within the FPGA, without having to reprogram its hardware.

If this is too constricting, Intel PSG is also implementing a software development kit for OpenCL, a common platform for programming various types of processor to work together in parallel, so that users can customise and extend the facilities of the Deep Learning Accelerator. They might do this, for example, by changing primitives or adding custom accelerators. The solution is available today for Intel PSG’s Arria 10 PCIe cards.

Interest in machine learning is growing very rapidly at the moment. Although some very successful approaches have emerged over the past five years, it is clear that there’s still room for enormous amounts of innovation in both algorithms and implementations. FPGAs can bring the advantages of dedicated hardware to machine-learning developers, as well as offering a flexible path to efficient systemic implementations once an algorithm has been finalised.

Bill Jenkins is a senior product specialist in artificial intelligence at Intel Programmable Solutions Group

@scwmagazine | www.scientific-computing.com
