Fig 3: C code for standard RISC-V compared with the fir_push custom instruction used to accelerate the FIR computation

The FIR filtering took three clock cycles, achieved by (1) providing simultaneous access to all N signal samples required for the output result, and (2) a highly parallel implementation of the arithmetic operations on those samples. For step (1), the FIR accelerator contains an N-stage FIFO holding all “filtering window” signal samples. The FIFO is implemented with a register array: each time a new input signal sample arrives, the entire FIFO shifts, purging the “oldest” element and storing the new sample in the vacated array location. By reading the entire register array, all required signal samples are accessed within the same clock cycle (Fig 2).
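For an N-tap filter, the output computed from the N samples currently held in the filtering window is the standard sum of coefficient-sample products (written here with the fᵢ and sᵢ symbols used below; a textbook definition, not a formula taken from the article):

```latex
% Standard N-tap FIR output over the current filtering window,
% with coefficients f_i and window samples s_i.
\[
  y \;=\; \sum_{i=0}^{N-1} f_i \, s_i
\]
```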
On the first clock cycle, the FIR accelerator's set of multipliers computes the N products {fᵢ·sᵢ} in parallel, with the filter coefficients fᵢ taken from internal registers. The products are summed in the second cycle, and in the third cycle the FIR result is written back to the general-purpose register file.
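For comparison, a plain-C push-and-filter routine of the kind a standard RISC-V core would execute sequentially might look like the sketch below (assumed names and filter length; this is an illustration, not the actual listing from Fig 3):

```c
#include <stdint.h>

#define FIR_N 8   /* filter length -- an assumed value for illustration */

/* Filter coefficients and the "filtering window" of recent samples. */
static int32_t fir_coeffs[FIR_N];
static int32_t fir_window[FIR_N];

/* Push one new sample and return the FIR output.
 * On a standard RISC-V core the window shift and every
 * multiply-accumulate run as sequential operations. */
int32_t fir_push_sw(int32_t sample)
{
    /* Shift the window: the oldest sample falls off the end. */
    for (int i = FIR_N - 1; i > 0; i--)
        fir_window[i] = fir_window[i - 1];
    fir_window[0] = sample;

    /* Multiply-accumulate over the whole window. */
    int32_t acc = 0;
    for (int i = 0; i < FIR_N; i++)
        acc += fir_coeffs[i] * fir_window[i];

    return acc;
}
```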
Two additional internal register arrays, fir_fifo and fir_coeff, were defined in CodAL: the former stores the input signal samples required to compute the FIR value, the latter contains the filter coefficients. Three custom RISC-V instructions were added to the L31 core by modifying the CodAL source code: the main one, fir_push, defines the entire FIR calculation flow (Fig 3); the other two are housekeeping instructions.

The fir_push instruction semantics, shown on the right-hand side of Fig 3, start by taking the new signal sample, in 32-bit integer representation, from the general-purpose register file. A for-loop construct, similar to the C for-loop on the left, then describes the set of parallel multiplications of all signal samples in the FIFO by the corresponding FIR coefficients held in internal registers. The multiplication results are summed and the FIFO array is shifted, the latter by assigning to each FIFO element the value of its neighbour, in reverse order. After the for-loop the new sample is placed at the beginning of the FIFO array, and the result is written to the destination register. The final statement, ‘codasip_inc_clock_cycle(3);’, informs the simulator that this instruction takes three clock cycles.
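The CodAL listing of Fig 3 is not reproduced here, but the semantics just described can be approximated at C level roughly as follows. This is a sketch with assumed names and filter length, written sequentially where the hardware operates in parallel; the actual CodAL resource and timing syntax differs:

```c
#include <stdint.h>

#define FIR_N 8   /* filter length -- an assumed value for illustration */

/* Internal accelerator register arrays described in the article. */
static int32_t fir_fifo[FIR_N];    /* "filtering window" signal samples */
static int32_t fir_coeff[FIR_N];   /* filter coefficients               */

/* Approximate behaviour of the fir_push custom instruction.
 * 'sample' stands in for the source general-purpose register and the
 * return value for the destination register.  In hardware the N
 * multiplications run in parallel in cycle 1, the summation in cycle 2
 * and the register write-back in cycle 3; in CodAL the final statement
 * codasip_inc_clock_cycle(3) reports that latency to the simulator. */
int32_t fir_push_model(int32_t sample)
{
    int32_t acc = 0;

    /* Multiply each FIFO sample by its coefficient and, following the
     * order described in the text, shift the FIFO by assigning to each
     * element the value of its neighbour, working backwards. */
    for (int i = FIR_N - 1; i > 0; i--) {
        acc += fir_coeff[i] * fir_fifo[i];
        fir_fifo[i] = fir_fifo[i - 1];
    }
    acc += fir_coeff[0] * fir_fifo[0];

    /* The new sample is placed at the beginning of the FIFO array. */
    fir_fifo[0] = sample;

    /* The result is written to the destination register. */
    return acc;
}
```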
Although the microarchitecture of application-specific processors can be described with industry-standard HDLs, CodAL provides a more compact way of doing so. The entire description of the FIR accelerator fits in ~150 lines of CodAL code, covering both the hardware resources and the implementation of the custom instructions. Describing similar modules in Verilog would require ~670 lines of code for the FIR accelerator, and it would not be possible to automatically generate a compiler and simulator that are aware of the customised instruction set. The cycle-accurate simulator and profiler are generated automatically for the customised core, simplifying the evaluation of the FIR accelerator's performance.

Fig 4: Normalised comparison of the core with the FIR unit versus the original L31

Fig 4 compares the performance of the FIR filter operation on the original L31 core and on the L31 with the additional FIR unit. The FIR algorithm runs 11.45× faster on the modified core and consumes 9.10 per cent of the energy. The FIR unit adds 37 per cent to the silicon area, but this is a small increment compared with adding a DSP core or using a more complex microcontroller core.
A detailed paper describing the FIR filter implementation is available here: https://codasip.com/papers/fir-and-median-filter-accelerators-in-codal/
Conclusion
SoC designers today have the opportunity to use Custom Compute methods to accelerate critical algorithms, using existing processor cores as a starting point. As the FIR filter example shows, significantly better computational performance and energy efficiency can be achieved at an acceptable cost in silicon area. Combining design automation with existing processor source code dramatically reduces design and verification costs compared with a completely new development. These advantages apply not only to accelerating embedded algorithms but also to more complex cores.
https://codasip.com/