Artificial Intelligence Technology


Optimising memory area and power in AI applications


By Tony Stansfield, CTO, sureCore

This article describes some of the optimisations that can be made to a memory subsystem to improve its overall power, bandwidth, area, or other relevant metrics. It focuses on those optimisations that are likely to be useful for artificial intelligence (AI) applications, as these are both memory-intensive and have different properties to other high-performance compute applications. Illustrations in the article use data from sureCore’s PowerMiser SRAM compiler, which was developed for low-power applications, and further application-specific optimisations beyond those included in this compiler are also discussed at the end of the article.


Memory metrics


When assessing a memory subsystem design, the key metrics are referred to as Power, Performance, and Area (PPA). However, the relative importance of these factors is strongly dependent on the target application – for instance, some applications have an absolute speed requirement that must be met, while others have a constrained power or area budget in order to fit with a battery lifetime or system cost target. Looking at the individual elements of PPA in more detail raises some further questions:

● Power: Is it active power that matters most, i.e., the power used to perform all necessary reads and writes for the running application, or is it standby power?

● Performance: Is it the speed of individual memory accesses that matters, or is it overall bandwidth, i.e., is it acceptable to run at a reduced speed but read or write more bits per cycle?

● Area: Is SRAM area a significant fraction of total chip area? If it is, then does the SRAM cause problems for the overall chip floorplan? Would using many small memories rather than fewer larger ones make it easier to achieve an efficient overall floorplan?


Metrics for AI


AI applications perform repetitive calculations over large data sets. There is relatively little data-dependent branching, so memory access patterns are very predictable. This means that optimising for bandwidth, rather than for the speed of access to individual addresses, is appropriate. In the remainder of this article, we will consider the behaviour of a 1Mbyte memory, organised as 65,536 words of 128 bits. This is larger than can be implemented with a single SRAM instance from most SRAM compilers, so it illustrates the options for combining smaller memories to create a larger one. Many AI-focused SoCs have parallel architectures that use multiple such memories, which means that reducing area and active power is important, in addition to bandwidth.
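As a rough illustration of how such a memory can be tiled from smaller compiler instances, the sketch below counts the instances needed for a given instance size; the instance dimensions shown are hypothetical examples, not actual PowerMiser configurations.

```python
# Sketch: how many identical SRAM instances are needed to build a 1Mbyte memory
# organised as 65,536 words of 128 bits. Instance sizes are hypothetical
# examples, not actual compiler configurations.

TARGET_WORDS = 65_536   # depth of the 1Mbyte memory
TARGET_WIDTH = 128      # bits per word

def instances_needed(inst_words: int, inst_width: int) -> int:
    """Total instances required to cover the full width and depth."""
    across = -(-TARGET_WIDTH // inst_width)   # instances in parallel to form a 128-bit word
    deep = -(-TARGET_WORDS // inst_words)     # banks needed to cover the address space
    return across * deep

# Example: 8192-word x 32-bit instances -> 4 in parallel x 8 banks = 32 instances
print(instances_needed(8192, 32))
```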


Exploring the trade-offs


This section contains graphs to illustrate the potential trade-offs between speed, area, and power. These graphs use data from a sureCore PowerMiser low-power SRAM compiler. Speed, area, read and write currents, and leakage currents were generated from the compiler for multiple memory instances with data widths of 16, 32, 64 or 128 bits, and numbers of words from 1024 to 16384, with multiple column mux factors also used. From this data, an estimate of parameters for a 1Mbyte memory subsystem with a 128-bit data word was calculated, based on using multiple copies of each of the SRAM instances. For instance: Total area is equal to the area of a single instance multiplied by the total number of instances needed to make a 1Mbyte memory. Total leakage is similarly calculated from instance leakage multiplied by the number of instances.


Total read or write current is calculated from the current for a single instance multiplied by the number of parallel instances needed for a 128-bit word.


Operating frequency is the frequency for a single instance, i.e., it is assumed that no extra time is required to combine the memory outputs or to select the memory to be accessed. This is effectively an assumption that any extra logic required can be pipelined, which is likely to be valid for AI applications with predictable memory access patterns. In all cases, to make it easy to see both the trends and the ranges of parameters, what is plotted is the value of a parameter divided by the smallest value of that parameter in the data set – for instance, frequency divided by the frequency of the slowest memory.
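The sketch below is one way to express the estimation procedure just described, assuming hypothetical per-instance figures in place of real compiler output: total area and leakage scale with the total number of instances, read and write currents scale with the number of instances accessed in parallel for a 128-bit word, and operating frequency is taken directly from the single instance. The class and field names are illustrative assumptions, not sureCore compiler data.

```python
from dataclasses import dataclass

@dataclass
class InstanceData:
    """Per-instance figures as an SRAM compiler might report them (illustrative fields)."""
    words: int            # depth of one instance
    width: int            # bits per word of one instance
    area: float           # mm^2
    leakage: float        # uA
    read_current: float   # mA
    write_current: float  # mA
    frequency: float      # MHz

def subsystem_estimate(inst: InstanceData,
                       target_words: int = 65_536,
                       target_width: int = 128) -> dict:
    """Estimate 1Mbyte, 128-bit-wide subsystem parameters from one instance."""
    across = -(-target_width // inst.width)   # instances accessed in parallel per word
    deep = -(-target_words // inst.words)     # banks needed to cover the address space
    total = across * deep
    return {
        "instances": total,
        "area": inst.area * total,                   # every instance contributes to area
        "leakage": inst.leakage * total,             # and to leakage
        "read_current": inst.read_current * across,  # only one bank is active per access
        "write_current": inst.write_current * across,
        "frequency": inst.frequency,                 # assumes extra logic can be pipelined
    }
```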


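The normalisation used in the graphs can be reproduced as follows: each parameter is divided by the smallest value of that parameter in the data set, so the slowest, smallest, or lowest-power configuration plots at 1.0. The numbers below are placeholders, not PowerMiser data.

```python
def normalise(values):
    """Express each value relative to the smallest value in the data set."""
    smallest = min(values)
    return [v / smallest for v in values]

# e.g. frequencies of candidate configurations, plotted relative to the slowest
frequencies = [250.0, 310.0, 400.0, 520.0]   # MHz, placeholder values
print(normalise(frequencies))                # [1.0, 1.24, 1.6, 2.08]
```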

