Currently, many AI companies are adopting FPGA technology to achieve customizable, low-latency, high-performance AI inference with high performance per watt. However, several challenges remain before FPGAs can enter large-scale AI deployment, such as the high barrier to software development, limited performance optimization, and difficult power control. The goal of Inspur’s TF2 Compute Acceleration Engine is to solve these challenges for customers.
The TF2 FPGA Compute Acceleration Engine, which supports TensorFlow, helps AI customers quickly run inference on FPGAs using deep neural network (DNN) models trained with mainstream AI frameworks. It delivers high performance and low latency for AI applications through the world’s first DNN shift-computing technology on FPGAs.
The TF2 computing acceleration engine consists of two parts.
- The TF2 Transform Kit: a model optimization and conversion tool that optimizes and transforms deep neural network model data trained with frameworks such as TensorFlow. It greatly reduces the size of the model data file by compressing 32-bit floating-point model data into 4-bit integer data, making the resulting file less than one eighth of the original size while largely preserving the regular structure of the original model data.
- The TF2 FPGA Runtime Engine: it automatically converts the optimized model file into an FPGA executable. To eliminate the dependence of deep neural networks such as CNNs on FPGA floating-point computing power, Inspur designed a shift-computing technique that quantizes 32-bit floating-point data into 8-bit integers. Combined with the 4-bit integer model data described above, each floating-point multiplication in the convolution is computed as an 8-bit integer shift operation, which greatly improves FPGA inference performance and effectively reduces actual operating power consumption. This is also the world’s first implementation of shift-based DNN computation on an FPGA that maintains the accuracy of the original model.
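Inspur has not published the exact quantization scheme, but the general idea behind replacing floating-point multiplies with shifts can be sketched as follows: each weight is rounded to a signed power of two (storable in 4 bits as a sign plus a small exponent, hence the roughly 8x compression from 32-bit floats), and activations are quantized to 8-bit integers, so every convolution multiply reduces to an integer shift. The function names and the assumption that weight magnitudes are at most 1 are illustrative, not taken from TF2.

```python
import numpy as np

def quantize_weights_pow2(w, n_bits=4):
    """Round float weights to signed powers of two.

    Each weight collapses to a sign bit plus a small negative exponent,
    so a 32-bit float needs only n_bits of storage and multiplying by it
    becomes a bit shift. Assumes |w| <= 1 (weights are pre-normalized).
    """
    sign = np.sign(w).astype(int)
    exp = np.round(np.log2(np.maximum(np.abs(w), 1e-12))).astype(int)
    exp = np.clip(exp, -(2 ** (n_bits - 1)), 0)
    return sign, exp

def shift_multiply(activation_int8, sign, exp):
    """Multiply an 8-bit integer activation by a power-of-two weight
    using only an integer shift (exp <= 0, so a right shift)."""
    return sign * (np.int32(activation_int8) >> (-exp))

# A float weight of 0.26 rounds to +2^-2 = 0.25, so multiplying the
# 8-bit activation 96 by it is just 96 >> 2 = 24.
sign, exp = quantize_weights_pow2(np.array([0.26]))
print(shift_multiply(np.int8(96), sign[0], exp[0]))  # prints 24
```

On an FPGA the same substitution pays off because shifters are far cheaper in logic resources and power than floating-point multipliers, which is what drives the performance and power gains described above.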
TF2 acceleration process
The SqueezeNet model running on the Inspur F10A FPGA card – the world’s first half-height, half-length FPGA accelerator card based on the Arria 10 chip – demonstrates the TF2 engine’s computational performance. SqueezeNet is a typical convolutional neural network architecture: a streamlined model whose accuracy is nevertheless comparable to AlexNet, making it especially suitable for image-based AI applications with strict real-time requirements. Running the SqueezeNet model optimized by the TF2 engine on the F10A, inference on a single image takes 0.674 ms while maintaining the original accuracy – slightly better than the widely used NVIDIA P4 GPU accelerator card in both accuracy and latency.
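For a sense of scale, the quoted per-image latency implies the following single-stream throughput (assuming one image per inference with no batching, which the article does not specify):

```python
# Throughput implied by the reported 0.674 ms single-image latency on the F10A.
latency_ms = 0.674
images_per_second = 1000.0 / latency_ms
print(round(images_per_second))  # prints 1484
```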
TF2 with F10A vs. GPU
The Inspur TF2 computing acceleration engine improves AI computation performance on FPGAs through technical innovations such as shift computing and model optimization, and lowers the barrier to implementing AI software on FPGAs. This allows FPGAs to be used widely across the AI ecosystem and enables more AI applications.
Inspur plans to open TF2 to its AI customers and will continue to upgrade and develop optimization technologies that support multiple models, the latest deep neural network architectures, and FPGA accelerator cards built on the latest chips. The next-generation high-performance FPGA accelerator card is expected to deliver three times the performance of the F10A, which, combined with TF2, will open up many more exciting opportunities in AI business applications down the line.