HPC YEARBOOK 2012

HPC 2012 Networking

Open the bottleneck of HPC interconnects, says Mondrian Nüssle, CTO at Extoll

In HPC, cluster nodes have reached a very high level of performance. In communication-bound applications especially, the processing units spend substantial time waiting unproductively because of the limited performance of the interconnection network that links the nodes. This in turn is often the bottleneck of the entire HPC system.

Typically, HPC applications send and receive many small to medium-sized messages in order to exchange intermediate results, synchronise, etc. Here, latency turns out to be a key parameter; it is defined as the time from sending a message on one node until the message is received at the destination node. With growing parallelism, the number of messages increases more than proportionally with the size of the cluster. This is especially true for weak-scaling problems. With an increasing number of messages needing to be sent and received, the message rate becomes another important metric. Commonly available HPC interconnection network adapters (NICs) are currently capable of 10-20 million messages per second and can reach MPI latencies as low as 1µs. Optimising HPC interconnects must be

done in a holistic manner: each component or process contributing to data exchange has to be thoroughly optimised, but in a balanced way. This can be pictured as a sequence of tubes connected to form one long channel: it is important to widen each and every tube, as oversizing just one segment is useless. Hence, global optimisation is the right choice.

Further information

Arista Networks www.aristanetworks.com

Brocade www.brocade.com

Chelsio Communication www.chelsio.com

Cisco Systems www.cisco.com

DataDirect Networks www.ddn.com

Emulex www.emulex.com

Extoll www.extoll.de

Extreme Networks www.extremenetworks.com

Finisar www.finisar.com

Gigamon www.gigamon.com

Gnodal www.gnodal.com

Hot Lava Systems www.hotlavasystems.com

LSI www.lsi.com

Mellanox www.mellanox.com

Myricom www.myricom.com

Netlist www.netlist.com

Qleap Networks www.q-leap.com

Solarflare Communication www.solarflare.com

Latency can be optimised by using a very lean protocol, preferably together with an efficient implementation of the protocol in hardware. Additionally, it is paramount that cut-through forwarding is used wherever possible, while a high clock frequency can also help reduce latency. The message rate likewise depends on an efficient protocol as well as an efficient implementation in hardware. Here, a fully pipelined architecture is necessary. Scalable topologies are another important

measure to handle the ever-growing parallelism in today’s HPC. From a topological point of view, central components like switches are sub-optimal. Direct networks, where each node ‘sees’ only its neighbours and manages traffic to and from these nodes locally, are more advantageous and more scalable. This renders central switches obsolete.

Furthermore, tomorrow’s interconnection networks need to work even better with multi- and many-core processors. Highly optimised, virtualised hardware is required to enable independent progress for each process or thread on a many-core node.

Optimising the NIC alone is not enough, however. What about the signal transmission between nodes? Because of the high signal frequencies, electrical cables are only suitable for short distances of up to two metres. For mid-range distances of typically 7-20 metres, optical cables have to be used. Active optical cables are a good choice, since they present electrical interfaces to the NICs and typically convert the signal within the connector. The NIC can therefore be agnostic with respect to the signal carrier of the cable.

Finally, in order to open the bottleneck for the next few years, message rates in excess of 100 million messages per second, latency of well below 1µs and a bandwidth of more than 10GB/s per link will be necessary.
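To put these figures in perspective, here is a rough sketch rather than a model taken from the article. It assumes the common simplification that delivering a message costs a fixed latency plus the message size divided by the link bandwidth. The function names and the message sizes are illustrative assumptions; the 1µs latency and 10GB/s bandwidth are the values quoted in the text.

```python
# Simple latency/bandwidth cost model. This is an assumption made for
# illustration, not a description of any particular interconnect.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to deliver one message: fixed latency plus wire time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def half_performance_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which the wire time equals the fixed latency."""
    return latency_s * bandwidth_bytes_per_s


LATENCY = 1e-6     # 1 µs, the latency figure quoted in the text
BANDWIDTH = 10e9   # 10 GB/s per link, the bandwidth figure quoted in the text

if __name__ == "__main__":
    # Illustrative message sizes (assumed), from a few bytes to 1 MiB.
    for size in (8, 64, 1024, 65536, 1048576):
        t = transfer_time(size, LATENCY, BANDWIDTH)
        share = LATENCY / t  # fraction of the total time spent on latency
        print(f"{size:>8} B: {t * 1e6:8.3f} µs, latency share {share:5.1%}")
    size = half_performance_size(LATENCY, BANDWIDTH)
    print(f"wire time equals latency at about {size:.0f} B")
```

Under these assumptions, messages well below roughly 10kB are dominated by latency rather than bandwidth, which is in line with the emphasis on latency and message rate for the small to medium-sized messages typical of HPC applications.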

The nature of interconnects is set to change, as Gilad Shainer, vice president, market development at Mellanox Technologies, explains

As systems continue to grow in size, the major challenge has become the question of how to fully utilise the infrastructure and ensure that it runs efficiently. The performance of the interconnect is vital to this, not just in terms of throughput or even latency, but in ensuring that the CPU is never idle. Beyond that is the issue of whether the CPU is spending its cycles on computation for the applications, rather than on things not directly related to them. For example, if 50 per cent of the CPU’s time is being spent on communication work, then those cycles have been wasted. This has prompted the use of InfiniBand.

One important development has been to provide users with control of the interconnect. Software-Defined Networks (SDN) enable users, IT managers, etc. to define the routing, alter the location of jobs, optimise the application side and change from application to application in an automated manner. When you can optimise the interconnect infrastructure for a specific job in real time, you can gain more performance.

Looking forward, I believe that the nature

of interconnects will change. In a current HPC system, the CPU is the ‘smart’ element and the rest simply offer services, such as moving or storing data. As we approach exascale, the interconnect will need to reach the same level of complexity as the CPU. And we are seeing this trend develop today: interconnect technology is beginning to include a floating-point capability that enables the manipulation of data throughout the fabric, before it reaches the application. Many operations within HPC require the gathering of data from multiple processes and then the manipulation of that data in order to combine the results. If we can move those operations to the interconnect, we can save time and increase the overall efficiency of the system. Right now, that’s the challenge we’re facing.
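The idea of combining data inside the fabric can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a description of any actual interconnect product: it assumes a hypothetical tree of switches with a fixed fan-in (`radix`), each of which sums the values arriving from below and forwards a single partial result. The names `host_side_reduce` and `in_network_reduce` are invented for this illustration.

```python
# Toy comparison: combining results at the root host versus inside a
# hypothetical tree of switches. All names and the tree shape are assumptions.

def host_side_reduce(values):
    """The root receives every value and sums them itself.

    Returns (result, number of values the root has to receive and combine).
    """
    return sum(values), len(values)


def in_network_reduce(values, radix=4):
    """Each switch level sums groups of `radix` inputs and forwards one value.

    Returns (result, number of values the root has to receive and combine).
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    level = list(values)
    # Collapse the tree level by level until the remaining partial
    # results fit into a single group that the root can combine.
    while len(level) > radix:
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
    return sum(level), len(level)


if __name__ == "__main__":
    # One partial result per process (64 processes assumed for the example).
    partial_results = [float(rank) for rank in range(64)]
    print("host-side: ", host_side_reduce(partial_results))
    print("in-network:", in_network_reduce(partial_results, radix=4))
```

In this toy setup both approaches produce the same sum, but the root combines 4 incoming values instead of 64. The sketch ignores everything that makes the real problem hard, such as floating-point summation order, message timing and hardware limits.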
