Inspur AI: Framework Optimization

Inspur optimizes and extends a range of deep learning frameworks to maximize the performance, flexibility, reliability, and scalability of parallel processing for training and inference.


Developed by Inspur on top of Caffe, Caffe-MPI is a highly scalable, cluster-parallel deep learning framework that maximizes Caffe's training performance through parallel data processing and multi-tasking.


Inspur's TF2 inference framework, based on TensorFlow, compresses neural network models to as little as 1/8 of their original size, retaining accuracy while reducing power consumption.


Caffe-MPI, developed by Inspur, is the world’s first cluster-parallel version of the BVLC Caffe deep learning framework. It is open source and available on GitHub.

Caffe-MPI maximizes Caffe's training performance through parallel data processing and multi-tasking, and it runs on large-scale cluster platforms, including GPU, KNL, and CPU clusters.
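
The idea behind parallel data processing for training can be illustrated with a small, framework-free sketch. It is not Caffe-MPI's implementation: the toy model (y_hat = w * x), the function names, and the equal-sized shards are all assumptions made for illustration. Each simulated worker computes a gradient on its own shard of the batch, and averaging those local gradients reproduces the gradient of the full batch.

```python
def mse_gradient(w, samples):
    """Gradient of mean squared error w.r.t. w for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_gradient(w, samples, n_workers):
    """Split the batch into equal shards, compute one gradient per simulated
    worker, then average the local gradients (the role an allreduce would play
    on a real cluster)."""
    shard_size = len(samples) // n_workers
    # Assumption: the batch divides evenly, so a plain average is exact.
    assert shard_size * n_workers == len(samples), "batch must divide evenly"
    shards = [samples[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    local_gradients = [mse_gradient(w, shard) for shard in shards]
    return sum(local_gradients) / n_workers


# Hypothetical data generated from y = 3x, evaluated at w = 0.5.
samples = [(float(x), 3.0 * x) for x in range(1, 9)]
full_batch = mse_gradient(0.5, samples)
four_workers = data_parallel_gradient(0.5, samples, n_workers=4)
```

With equal shard sizes the two results match, which is why adding workers changes how fast a batch is processed rather than what is computed.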

Caffe-MPI inherits the characteristics and usability of the original Caffe while adding high performance and scalability.

Caffe-MPI on ResNet

Test data shows that Caffe-MPI scales well in parallel when training deep learning models on the widely used ImageNet dataset. For the ResNet model, a 4-node, 16-GPU configuration delivers 15 times the performance of a single GPU.
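
As a quick check on what that figure means, parallel efficiency is the measured speedup divided by the number of devices. The sketch below assumes the "15 times" figure is a speedup relative to one GPU; the function name is illustrative.

```python
def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / n_devices


# 4 nodes x 4 GPUs = 16 GPUs, reported at 15x a single GPU.
efficiency = parallel_efficiency(speedup=15, n_devices=16)
print(f"{efficiency:.2%}")  # 93.75%
```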


TF2 FPGA Compute Acceleration Engine

The TF2 FPGA Compute Acceleration Engine supports TensorFlow and helps AI customers quickly deploy inference on FPGAs for deep neural network (DNN) models built with mainstream AI training software. It uses DNN shift computation to deploy TensorFlow models efficiently on FPGAs, delivering high performance and low latency for AI applications.


Technology Innovation

  • Model Optimization
    Converts a 32-bit floating-point network model to 4-bit integers while retaining the accuracy of the original model
  • DNN Shift Computation
    Improves computational performance and reduces actual power consumption
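
The page does not describe how TF2's shift computation works internally. One common way to replace multiplications with shifts is to round each weight to a signed power of two, so the sketch below illustrates that general technique rather than Inspur's method. The 4-bit layout (1 sign bit plus a 3-bit exponent), the exponent range, and the function names are all assumptions.

```python
import math


def quantize_pow2(weight, min_exp=-7, max_exp=0):
    """Round a float weight to sign * 2**exp with exp in [min_exp, max_exp].

    Assumed layout: 1 sign bit + 3 exponent bits (8 exponent values) = 4 bits.
    Zero has no exact code here and is clamped to the smallest magnitude.
    """
    sign = -1 if weight < 0 else 1
    magnitude = abs(weight)
    if magnitude == 0.0:
        return sign, min_exp
    exp = round(math.log2(magnitude))
    return sign, max(min_exp, min(max_exp, exp))


def shift_multiply(activation, sign, exp, frac_bits=7):
    """Multiply an integer activation by sign * 2**exp using only a shift.

    The result is fixed point with frac_bits fractional bits; with the default
    exponent range the shift amount (frac_bits + exp) is never negative.
    """
    return sign * (activation << (frac_bits + exp))


sign, exp = quantize_pow2(0.26)         # nearest power of two is 2**-2 = 0.25
product = shift_multiply(12, sign, exp)  # fixed-point result for 12 * 0.25
as_float = product / (1 << 7)            # convert back only for inspection
```

Because a shift is far cheaper in hardware than a multiplier, a scheme of this kind is one plausible route to the performance and power benefits listed above.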


  • Save Computing Resources
    Compress the network model to 1/8 and the feature map to 1/4 of their original size
  • Improve Computing Performance
    Reduce single-image inference time to 0.674 ms
  • Decrease Development Difficulty
    Support the OpenCL language to shorten the development cycle
  • Promote AI Ecosystem Development
    Accelerate AI deployment on FPGA
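
Two of these figures can be sanity-checked with simple arithmetic. The 1/8 model size matches the storage ratio of 4-bit weights to 32-bit weights, and a 0.674 ms latency corresponds to a little under 1,500 images per second if images were processed strictly one at a time. That last condition is an assumption, since the page does not state batching or pipelining details; the function names are illustrative.

```python
def compressed_fraction(original_bits, quantized_bits):
    """Storage needed after quantization, as a fraction of the original."""
    return quantized_bits / original_bits


def images_per_second(latency_ms):
    """Throughput implied by a per-image latency, assuming no batching or overlap."""
    return 1000.0 / latency_ms


model_fraction = compressed_fraction(original_bits=32, quantized_bits=4)
throughput = images_per_second(0.674)
```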

Inspur-optimized TensorFlow on ResNet

Inspur developed Alibaba Cloud’s AI training system, the world’s largest built on the TensorFlow framework, and further optimized it on the basis of Horovod.

On a ResNet-50 test network with a batch size of 256, the scaling efficiency of 512 GPUs is 90% relative to a single GPU and 93% relative to a single node. On this basis, Inspur describes its optimized Horovod as the world’s best distributed deep learning framework based on TensorFlow.
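
Horovod exchanges gradients between workers with a ring-allreduce, and that communication pattern is a large part of why it scales to many GPUs. The page does not say what Inspur changed in its optimized version, so the sketch below only simulates the standard pattern in plain Python. The function name and the requirement that the vector length divide evenly by the worker count are simplifying assumptions.

```python
def ring_allreduce(worker_vectors):
    """Simulate a sum ring-allreduce: every worker ends with the elementwise sum.

    Each worker only ever sends one chunk to its right-hand neighbor per step.
    """
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    assert length % n == 0, "sketch assumes the vector splits into n equal chunks"
    chunk = length // n
    data = [list(v) for v in worker_vectors]  # private copy per worker

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: after n-1 steps worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        outgoing = [((i - step) % n, data[i][span((i - step) % n)]) for i in range(n)]
        for i, (c, payload) in enumerate(outgoing):
            dst = (i + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Phase 2, allgather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        outgoing = [((i + 1 - step) % n, data[i][span((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, payload) in enumerate(outgoing):
            data[(i + 1) % n][span(c)] = payload

    return data


# Three simulated workers, each holding a six-element gradient.
gradients = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_allreduce(gradients)
```

Each worker sends 2 * (n - 1) chunks of size length / n in total, so the data sent per worker stays roughly constant as workers are added.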

The framework completed ResNet-50 model training on 512 P100 GPUs in 24 minutes, breaking the previous world record of one hour held by Facebook.
