Machine Learning Hardware Acceleration



Specifications:


Metric                 | CPU Training | CPU Validation | CPU Testing | GPU Training | GPU Validation | GPU Testing | FPGA Testing
Time (ms)              | 1.26         | 1.25           | 1.23        | 0.65         | 0.48           | 0.22        | 2160000
Clock cycles           | 1242         | 1232           | 1227        | 640          | 475            | 215         | n/a
Throughput (samples/s) | 792.26       | 769.99         | 810.07      | 4281.29      | 4353.12        | 6315.91     | 0.0037
Latency (ms/sample)    | 1.26         | 1.25           | 1.23        | 0.23         | 0.23           | 0.16        | 270000
Accuracy (%)           | 34.76        | 34.93          | 51.57       | 44           | 44             | 44          | 57
Power Consumption (W)  | n/a          | n/a            | n/a         | 26.92        | 26.83          | 26.29       | 99.6
Temperature (°C)       | n/a          | n/a            | n/a         | 45           | 45             | 45          | 53.5

Motivation:

With the rising popularity of language models, sentiment analysis continues to gain traction for the classification of textual data. Wanting to exercise both my embedded systems and machine learning skills, I started this project with classmates from a Machine Learning Hardware Acceleration course I was taking. They provided primary support in defining and training the model, as well as the entire GPU- and Python-based implementation.

My goal was to optimize and benchmark language model inference on alternative hardware, such as a CPU or an FPGA, using GPU performance as the baseline.

Hardware:

With the help of Dr. Dayane Reis, I was provided with an ARTY S7 development board; I also used my personal machine, which has a 3.8 GHz AMD Ryzen 9 3900X 12-core CPU. Both platforms can hold the pretrained weights (around 2.85 MB) with no need for quantization.


Firmware:

Using the Python code as a basis, the model was translated completely to C++, which allowed it to run both on a CPU and on an FPGA directly through Xilinx's Vitis unified software platform.

To implement this code on an FPGA, Vivado is first used to construct a block design by instantiating all the needed modules available on our ARTY S7 board, paying particularly close attention to both the DDR3L memory and the MicroBlaze soft processor to obtain the most performance. The block design is exported as a .xsa file, which the Vitis Unified IDE 2023.2 uses to create a platform. Vitis HLS then compiles the C++ code against this platform and generates the bitstream, which is uploaded to the FPGA through UART; this also allowed us to print the inference output to a terminal.

With a primary focus on optimizing parallel matrix multiplication built around dot products, each layer is declared as its own specialized class that takes advantage of array partitioning, pipelining, and loop unrolling to maximize performance.
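As an illustrative sketch of these optimizations (not the project's actual source; names, sizes, and pragma placement are assumptions, and real placement depends on the design), a partitioned, pipelined dot product in Vitis HLS style might look like this. The pragmas are honored by Vitis HLS and treated as unknown pragmas by an ordinary C++ compiler, so the function still runs correctly on a CPU:

```cpp
#include <cassert>

// Hypothetical HLS-style dot product sketch.
// ARRAY_PARTITION exposes multiple memory ports, PIPELINE overlaps loop
// iterations, and UNROLL replicates the multiply-accumulate hardware.
constexpr int N = 64;  // assumed vector length

float dot(const float a[N], const float b[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
    float acc = 0.0f;
dot_loop:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=8
        acc += a[i] * b[i];  // multiply-accumulate, replicated 8x when unrolled
    }
    return acc;
}
```

The cyclic partition factor is chosen to match the unroll factor so that the eight parallel multiply-accumulates can each read from a separate memory bank in the same cycle.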


Results and Conclusions:

The most computationally intensive layer of the model proved to be the bidirectional LSTM layer, implemented as two separate LSTM classes operating forward and backward, as the name suggests. A single iteration of each cell is roughly equivalent to eight matrix multiplications feeding different gate functions. These eight operations depend on one another, so the FPGA's throughput and latency for this specific model lag far behind both the CPU and the GPU due to the difficulty of managing data dependencies. Profiling confirmed this: the dot() function within the LSTM class consumed most of the processing time through repeated floating-point operations.
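The dependency chain described above can be sketched as a minimal single-step LSTM cell (hypothetical scalar code for clarity, not the project's class; sizes, weight layout, and names are assumptions). Four matrix-vector products project the input and four project the previous hidden state; every gate consumes those products, and the new hidden state feeds the next time step, which is what prevents simple parallelization across steps:

```cpp
#include <cmath>

// Hypothetical minimal LSTM step (assumed sizes and layout).
constexpr int H = 4;  // hidden size
constexpr int D = 4;  // input size

static float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One matrix-vector product: out[H] = W[H][cols] * v[cols]
static void matvec(const float* W, const float* v, int cols, float* out) {
    for (int i = 0; i < H; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j) acc += W[i * cols + j] * v[j];
        out[i] = acc;
    }
}

// One LSTM time step: updates h and c in place.
void lstm_step(const float Wx[4][H * D], const float Wh[4][H * H],
               const float x[D], float h[H], float c[H]) {
    float gx[4][H], gh[4][H];
    for (int g = 0; g < 4; ++g) {       // eight matvecs per step:
        matvec(Wx[g], x, D, gx[g]);     // four input projections
        matvec(Wh[g], h, H, gh[g]);     // four recurrent projections
    }
    for (int i = 0; i < H; ++i) {
        float in_g  = sigmoidf(gx[0][i] + gh[0][i]);   // input gate
        float f_g   = sigmoidf(gx[1][i] + gh[1][i]);   // forget gate
        float cand  = std::tanh(gx[2][i] + gh[2][i]);  // candidate state
        float out_g = sigmoidf(gx[3][i] + gh[3][i]);   // output gate
        c[i] = f_g * c[i] + in_g * cand;
        h[i] = out_g * std::tanh(c[i]);  // h feeds the next time step
    }
}
```

Because the recurrent projections read h from the previous step, the eight products within a step can run in parallel but consecutive steps cannot, which matches the bottleneck observed in profiling.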

Looking ahead, the FPGA implementation shows great promise for workloads with fewer data dependencies, and the CPU-only implementation is a strong low-power option for small batch sizes. However, the GPU implementation proved best for most applications of this specific model. The lessons learned along the way were this project's biggest success.