Inspur AI: Framework Optimization
Inspur optimizes and innovates on a range of deep learning frameworks to maximize performance, flexibility, reliability, and scalability in parallel processing for training and inference.
Developed by Inspur on top of Caffe, Caffe-MPI is a highly scalable cluster-parallel deep learning framework that maximizes Caffe's training performance through parallel data processing and multi-tasking.
Inspur's TF2 inference framework, based on TensorFlow, compresses neural network models to as little as 1/8th of their original size, retaining accuracy while reducing power consumption.
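The text does not specify TF2's compression pipeline, but the stated 1/8th ratio is what simple 32-bit-to-4-bit weight quantization would yield. The sketch below is an illustrative assumption, not TF2's actual method: each float32 weight is mapped to one of 16 levels (a 4-bit code).

```python
# Illustrative only: one way to reach 1/8th model size is quantizing
# 32-bit float weights to 4-bit codes (TF2's real pipeline is not
# described in the text; this is a hypothetical stand-in).
def quantize_4bit(weights, w_min, w_max):
    # map each float32 weight onto one of 16 levels (a 4-bit code)
    levels = 15  # 2**4 - 1 intervals between w_min and w_max
    return [round((w - w_min) * levels / (w_max - w_min)) for w in weights]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes = quantize_4bit(weights, -1.0, 1.0)
bits_before = len(weights) * 32   # float32 storage
bits_after = len(weights) * 4    # 4-bit codes
print(codes, bits_before // bits_after)  # 8x size reduction
```

Dequantizing a code back to `w_min + code * (w_max - w_min) / 15` recovers an approximation of the original weight, which is where the (small) accuracy trade-off comes from.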
Caffe-MPI, developed by Inspur, is the world’s first cluster-parallel version of the BVLC Caffe deep learning framework. It is open source and available on GitHub.
Caffe-MPI maximizes the performance of Caffe in data training through parallel data processing and multi-tasking, and can run on large-scale cluster platforms, including GPU, KNL, and CPU clusters.
Caffe-MPI retains the characteristics and usability of the original Caffe while adding high performance and scalability.
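The cluster-parallel pattern behind frameworks like Caffe-MPI is data parallelism: each rank computes gradients on its own shard of the data, the gradients are averaged with an allreduce, and every rank applies the same update. The sketch below simulates that pattern in pure Python on a toy least-squares problem; real Caffe-MPI performs the allreduce with MPI on GPU buffers, and the function names here are hypothetical stand-ins.

```python
# Pure-Python simulation of data-parallel training (the pattern Caffe-MPI
# uses); MPI_Allreduce over GPU buffers is replaced by a plain average.

def local_gradient(shard, w):
    # toy gradient of sum((w*x - y)**2) over this worker's data shard
    return sum(2 * x * (w * x - y) for x, y in shard)

def allreduce_mean(values):
    # stand-in for MPI_Allreduce(MPI_SUM) followed by division by rank count
    return sum(values) / len(values)

# four "ranks", each holding one shard of (x, y) pairs where y = 2*x
shards = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
w = 0.0
for step in range(50):
    grads = [local_gradient(s, w) for s in shards]  # parallel on a real cluster
    g = allreduce_mean(grads)                       # every rank sees the same g
    w -= 0.01 * g                                   # identical update everywhere
print(w)  # converges toward the true slope 2.0
```

Because every rank applies the same averaged gradient, all replicas stay in sync without any parameter server, which is what lets the approach scale across nodes.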
Caffe-MPI on ResNet
Test data shows that Caffe-MPI exhibits good parallel scaling when training deep learning models on the widely used ImageNet dataset. For the ResNet model, a 4-node, 16-GPU configuration delivers 15 times the performance of a single GPU.
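The figures above imply a parallel efficiency of 15/16, a quick check worth making explicit:

```python
# Parallel efficiency implied by the ResNet result: 16 GPUs, 15x speedup.
gpus, speedup = 16, 15
efficiency = speedup / gpus     # fraction of ideal linear scaling achieved
print(f"{efficiency:.2%}")      # 93.75%
```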
TF2 FPGA Compute Acceleration Engine
The TF2 FPGA Compute Acceleration Engine, which supports TensorFlow, helps AI customers quickly deploy deep neural network (DNN) inference on FPGAs from mainstream AI training software. It delivers high performance and low latency for AI applications by using shift-based DNN computation, enabling efficient deployment of TensorFlow models on FPGAs.
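The text mentions "shift computation" without elaborating. A common reading, assumed here rather than confirmed by the source, is that weights are quantized to powers of two so that each multiply in a DNN layer becomes a cheap bit shift on FPGA hardware. A minimal sketch of that idea:

```python
# Hedged sketch of shift-based computation: quantize a weight to the
# nearest power of two, then replace multiplication with a bit shift.
# This is an assumption about the technique, not TF2's actual kernels.
import math

def to_shift_exponent(w):
    # nearest power-of-two exponent for a positive weight w
    return round(math.log2(w))

def shift_mul(x_int, exponent):
    # x * 2**exponent via shifting (negative exponents shift right)
    return x_int << exponent if exponent >= 0 else x_int >> -exponent

w = 3.7                        # real-valued weight
e = to_shift_exponent(w)       # 2, since 2**2 = 4 is closest to 3.7
print(shift_mul(10, e))        # 40, approximating 10 * 3.7 = 37
```

Shifters are far cheaper than multipliers in FPGA logic, which is why this style of quantization maps well onto FPGA inference.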
Inspur-optimized TensorFlow on ResNet
Inspur developed the AI training system for Alibaba Cloud, the world’s largest TensorFlow deployment, further optimizing it on the basis of Horovod.
On a ResNet-50 test network with a batch size of 256, the scaling efficiency of 512 GPUs relative to a single GPU is 90%, and 93% relative to a single node. This makes Inspur-optimized Horovod the world’s best-scaling distributed deep learning framework based on TensorFlow.
The framework completed ResNet-50 model training on 512 P100 GPUs in 24 minutes, breaking the previous world record of one hour held by Facebook.
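To put the 24-minute figure in perspective, the implied aggregate throughput can be estimated, assuming the standard 90-epoch schedule over ImageNet's 1,281,167 training images (both assumptions on my part; the text states neither):

```python
# Back-of-envelope throughput for the 24-minute ResNet-50 run.
# Assumed (not stated in the text): 90 epochs, 1,281,167 ImageNet images.
images, epochs, minutes = 1_281_167, 90, 24
throughput = images * epochs / (minutes * 60)   # images processed per second
print(round(throughput))                        # roughly 80,000 images/s
```

Under those assumptions, the 512-GPU cluster would sustain on the order of 80,000 images per second, about 156 images/s per GPU.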