HIGH PERFORMANCE COMPUTING


an energy point of view. For example, for the inferencing phase of neural network applications, where you need to recognise objects, these kinds of applications do not need the full floating point,’ said Unsal.

While GPUs can handle mixed-precision workloads, Unsal argues that FPGAs can improve on this process from an energy efficiency standpoint. ‘FPGAs take that further, in the sense that you can go down to one-bit level. There have been neural networks called “binary neural networks” that work with just one bit. They are quite efficient in their application domain,’ said Unsal.

‘FPGAs have this flexibility where, when you consider neural network applications, you can set your own optimal width of computing bits that you need. In this case, the optimum is the most energy efficient,’ added Unsal.

However, it should be noted that this is focusing on the energy efficiency of the application. That is not to say that a rack of GPUs would not run the job faster, just that FPGA resources can be used to make the job more energy efficient. As reducing power budgets is a key objective in meeting the energy requirements of exascale systems, this project could help to solve some of the challenges facing HPC users who want to run AI applications.

The Legato toolchain backend, also called the runtime system, consists of the technologies that are deployed during runtime to support the programmer’s task and to intelligently manage the resources of the hardware platform. To help meet these goals, the Legato backend is built on a task-based execution model. Tasking, as opposed to a threading model, allows the runtime system to exploit higher parallelism and to perform the advanced scheduling necessary to effectively manage heterogeneous platforms.
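To make the idea of tasking concrete, the following is a minimal sketch in Python and not the Legato runtime itself. It assumes that each task declares the tasks it depends on and an estimated energy cost per device; the `Task` class, the `schedule` function, the device names and the energy figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)    # names of tasks that must finish first
    energy: dict = field(default_factory=dict)  # hypothetical energy estimate per device


def schedule(tasks):
    """Return (task, device) pairs in an order that respects dependencies."""
    done, plan = set(), []
    while len(done) < len(tasks):
        # A task is ready once every task it depends on has completed.
        ready = [t for t in tasks
                 if t.name not in done and all(d in done for d in t.deps)]
        if not ready:
            raise ValueError("dependency cycle detected")
        for t in ready:
            # Place the task on the device with the lowest estimated energy.
            device = min(t.energy, key=t.energy.get)
            plan.append((t.name, device))
        done.update(t.name for t in ready)
    return plan


# Illustrative workload with made-up energy estimates.
tasks = [
    Task("load", [], {"cpu": 1.0, "fpga": 3.0}),
    Task("infer", ["load"], {"cpu": 5.0, "gpu": 2.0, "fpga": 0.5}),
    Task("report", ["infer"], {"cpu": 0.2}),
]
plan = schedule(tasks)
print(plan)
```

Because the scheduler sees explicit dependencies rather than opaque threads, it is free to decide where each unit of work runs, which is the property the article attributes to task-based models on heterogeneous platforms.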


Enhancing communication

But the energy efficiency increases targeted by the Legato project extend beyond just the use of FPGA technology. The project is also researching ways to reduce the voltage supplied to the processors while maintaining the stability of the application. Advances in features such as error reporting and hardware counters on modern CPUs allow the team to reduce voltages by enhancing the communication between processor, operating system (OS) and the application.

Unsal said that previous work done at the BSC looked at reducing voltages while maintaining stability of the HPC applications. ‘One need that we saw during that project was that the hardware and the software need to work together. That is to say the hardware and the software need to catch up with each other,’ said Unsal.

8 Scientific Computing World February/March 2020

‘Since we are now operating at voltages close to physical limits, the gains that could be possible from this more conservative approach is nearing its limits’

‘That was the crux of the idea for Legato, so we got together with a couple of other partners that were in a different project called M2DC. We proposed that you have all these wonderful features in hardware to help save energy, and you have these frameworks in software to help save energy, but they do not talk to each other.’

He explained that in the past only errors that were detected but not corrected were reported. ‘We detected an error that is not correctable and we sent this signal “somewhere” because the software stack was not equipped to deal with the signal,’ said Unsal. ‘There is an error that was detected but could not be corrected, so what do you do with that on the software side? What we are doing in Legato is we propagate this error to the proper place – in this case the application,’ he said.

These messages would be passed to the OS, at which point the error stops propagating. ‘In our case, we wanted to continue past the operating system to the application, because the application knows if the error is serious, or if the error could be somehow corrected or accounted for on the application side. Also, the applications can just disregard the error, because it is not important for the application,’ said Unsal.
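The propagation path Unsal describes can be sketched as follows. This is an illustrative Python model under stated assumptions, not Legato code: the `HardwareError`, `Region` and `Application` classes, the `os_forward` function and the address ranges are hypothetical. The point it shows is that the OS layer hands the error on instead of absorbing it, and the application decides whether to ignore it, recompute, or abort.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    RECOMPUTE = "recompute"
    ABORT = "abort"


@dataclass
class HardwareError:
    address: int     # memory address where the error was detected
    corrected: bool  # True if the hardware already fixed it


@dataclass
class Region:
    start: int
    end: int         # exclusive upper bound
    action: Action   # what the application wants for errors in this range


class Application:
    def __init__(self, regions):
        self.regions = regions

    def on_hardware_error(self, err):
        # Corrected errors need no recovery in this sketch.
        if err.corrected:
            return Action.IGNORE
        for region in self.regions:
            if region.start <= err.address < region.end:
                return region.action
        # Unknown location: be conservative.
        return Action.ABORT


def os_forward(err, app):
    """Hypothetical OS hook: pass the error to the application instead of stopping it."""
    return app.on_hardware_error(err)


app = Application([
    Region(0x1000, 0x2000, Action.IGNORE),     # e.g. data that tolerates noise
    Region(0x2000, 0x3000, Action.RECOMPUTE),  # e.g. results that can be regenerated
])
print(os_forward(HardwareError(0x2400, corrected=False), app))
```

The per-region policy is one possible way to encode what the article calls the application knowing whether an error is serious; a real application might make that decision in other ways.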


Making the application aware of these errors, and also of errors that were detected and corrected, allows the researchers to manipulate voltages more carefully, without pushing too far and affecting the stability of the application. ‘Hardware manufacturers made the change so that these correctable errors are also reported. They are important because to save power, one option is to go below the safe operating voltage limits.

‘We are now able to use this information like a canary in a coal mine. We use it to tell the software that it doesn’t need to lower the voltage anymore, there has been enough of an energy saving without compromising reliability,’ he added.
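The canary idea can be illustrated with a short control loop. The sketch below is a simplified Python model and makes several assumptions that are not in the article: voltages are expressed in millivolts, the step size, floor and threshold are invented, and `simulated_counter` stands in for whatever hardware counter would report corrected errors on a real system.

```python
def find_operating_voltage(read_corrected_errors, nominal_mv,
                           step_mv=10, floor_mv=600, max_corrected=0):
    """Lower the voltage step by step until corrected errors start to appear.

    read_corrected_errors is a hypothetical callback that returns the number
    of corrected errors observed at a given voltage.
    """
    voltage = nominal_mv
    while voltage - step_mv >= floor_mv:
        candidate = voltage - step_mv
        if read_corrected_errors(candidate) > max_corrected:
            # The canary has fired: stop lowering and keep the last clean voltage.
            break
        voltage = candidate
    return voltage


def simulated_counter(mv):
    # Invented behaviour: no corrected errors at 850 mV or above,
    # and a growing number of corrected errors below that.
    return 0 if mv >= 850 else (850 - mv) // 10


chosen = find_operating_voltage(simulated_counter, nominal_mv=1000)
print(chosen)
```

The loop never waits for an uncorrectable error; it treats the first corrected errors as the signal that enough energy has been saved, which mirrors the role Unsal gives them.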


@scwmagazine | www.scientific-computing.com



