Accelerate and simplify deep learning development and deployment on an optimized, verified infrastructure based on Apache Spark.
In the past few years, organizations have seen a convergence of massive amounts of data with the compute power and large-capacity storage needed to process it all. The right infrastructure can provide modern businesses with new ways of harnessing data for innovative apps and services built on artificial intelligence (AI). The opportunities are nearly infinite and stretch across almost every field—from financial services to manufacturing to healthcare and beyond.
But organizations with on-premises infrastructures or using hybrid cloud models face several challenges on the road to AI. They need to research, select, deploy, and optimize infrastructure that can provide efficient resource utilization while scaling on demand to meet changing business requirements. Beyond scalability, organizations seek easier ways to implement AI initiatives. Many businesses lack sufficient in-house expertise and infrastructure to get started with AI, particularly for deep learning (DL). The road to deploying DL in production environments is time-intensive and complex. Managing the data for AI initiatives can also be a challenge: organizations struggle to extract value from their “data swamps,” and it can be complex and resource-intensive to move data from on premises to the cloud for analytics.
The Intel® Select Solution for BigDL on Apache Spark* can help businesses overcome these key challenges to achieve their AI initiatives faster and more easily. The pre-tested and tuned solution eliminates the need for organizations to research and manually optimize infrastructure to efficiently pursue their AI initiatives. The solution reduces the need for specialized in-house expertise to deploy and manage AI infrastructure. And it can help IT organizations improve infrastructure utilization, while ensuring scalability to meet the growing needs of their companies.
BigDL
Apache Spark helps solve the IT challenges of DL, data, and specialized expertise by providing standardized big-data storage and compute that scale: hundreds of nodes can be added without degrading performance and without changing the fundamental architecture.
BigDL, a distributed DL library that augments the storage and compute capabilities of Apache Spark, provides efficient, scalable, and optimized DL development. BigDL enables the development of new DL models for training and serving on the same big data cluster. It also supports models from other frameworks, including TensorFlow*, Keras*, and others, so you can import trained models from those frameworks into BigDL or use BigDL-trained models in other frameworks. BigDL is supported by Analytics Zoo, which provides a unified AI platform and pipeline with built-in reference use cases to further simplify development of your AI solutions.
BigDL is optimized for Intel®-based platforms with software libraries like Intel® Math Kernel Library (Intel® MKL) and Intel® Math Kernel Library for Deep Learning Networks (Intel® MKL-DNN) to increase computational performance. Other supporting software includes the Intel® Distribution for Python*, which accelerates popular machine learning libraries such as NumPy*, SciPy*, and scikit-learn* with integrated Intel® Performance Libraries such as Intel MKL and Intel® Data Analytics Acceleration Library (Intel® DAAL). On the hardware side, the Intel Select Solution for BigDL on Apache Spark uses Intel® Xeon® Scalable processors for high performance and Intel® Solid State Drives (SSDs) for better performance and improved reliability compared to traditional hard-disk drives (HDDs).
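In practice, a BigDL job is submitted to an existing Spark cluster with spark-submit. The sketch below illustrates the shape of such a launch; the paths, version numbers, script name, and resource sizes are illustrative placeholders, not part of the verified configuration:

```shell
# Hypothetical BigDL training launch on YARN. All paths, versions,
# resource sizes, and the script name are illustrative only.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-cores 50 \
  --executor-memory 128g \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --jars ${BIGDL_HOME}/lib/bigdl-SPARK_2.4-0.10.0-jar-with-dependencies.jar \
  --py-files ${BIGDL_HOME}/lib/bigdl-0.10.0-python-api.zip \
  train_inception.py --batchSize 800
```

The properties file distributed with BigDL sets the Spark configuration BigDL expects (for example, disabling task speculation); executor core and memory counts should match the cluster sizing chosen for the solution.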
The Intel Select Solution for BigDL on Apache Spark
The Intel Select Solution for BigDL on Apache Spark helps optimize price/performance while significantly reducing infrastructure evaluation time. The solution combines Intel Xeon Scalable processors, Intel SSDs, and Intel® Ethernet Network Adapters to empower enterprises to quickly harness a reliable, comprehensive solution that delivers:
- The ability to prepare your machine learning (ML)/DL infrastructure investments for the future with scalable storage and compute
- Excellent total cost of ownership (TCO) with multi-purpose hardware that your IT organization is used to managing in a verified, tested solution that simplifies deployment
- Accelerated time to market with a turnkey solution that includes a rich development toolset and that is optimized for crucial software libraries
- The ability to run analytics on data where it is stored
BigDL Application Scenarios
- Analyze large amounts of data on the big data (Spark) clusters where the data is stored, such as HDFS, Apache HBase*, or Apache Hive*
- Add deep learning capabilities (training or inference) to big data (Spark) programs or workflows
- Run deep learning applications on existing Hadoop/Spark clusters and easily share those clusters with other workloads (e.g., extract-transform-load, data warehousing, feature engineering, classical machine learning, graph analytics)
Inspur BigDL Solution
This test is based on the Inspur NF5280M5 server.
Network Topology
Configuration for Inspur solution based on Intel BigDL
To refer to a solution as an Intel Select Solution, a server vendor or data center solution provider must meet or exceed the defined minimum configuration ingredients and reference minimum benchmark-performance thresholds listed below.
One Master Node
Ingredient | Configuration |
---|---|
Processor | Intel® Xeon® Platinum 8160 processor (2.10 GHz, 24 cores, 48 threads) |
Memory | 384 GB or higher (12 x 32 GB DDR4-2666) |
Boot Drive | 1 x 240 GB Intel® SSD DC S4510 |
Data Tier | 1 x 1.92 TB Intel® SSD DC S4510 |
Data Network | 10 Gb Intel® Ethernet Converged Network Adapter X520-SR2 |
Management Network per Node | Integrated 1 GbE port 0/RMM port |
Four Worker Nodes
Ingredient | Configuration |
---|---|
Processor | Intel® Xeon® Platinum 8280 processor (2.60 GHz, 28 cores, 56 threads) |
Memory | 384 GB (12 x 32 GB DDR4-2933) |
Boot Drive | 1 x 240 GB Intel® SSD DC S4510 |
Data Tier | 1 x 1.92 TB Intel® SSD DC S4510 |
Data Network | 10 Gb Intel® Ethernet Converged Network Adapter X520-SR2 |
Management Network per Node | Integrated 1 GbE port 0/RMM port |
Network Switches
Ingredient | Configuration |
---|---|
Top-of-Rack (ToR) Switch | 10 Gbps 48-port switch |
Management Switch | 1 Gbps 48-port switch |
Software
The required software stack comprises the following components (versions per the Intel Select Solution reference configuration):
- Linux OS
- Apache Spark
- Apache Hadoop
- Java Development Kit (JDK)
- BigDL
- Analytics Zoo
- Intel® Distribution for Python
- Intel® Math Kernel Library (Intel® MKL)
Applies to All Nodes
Ingredient | Configuration |
---|---|
Trusted Platform Module (TPM) | TPM 1.2 discrete or firmware TPM (Intel® Platform Trust Technology [Intel® PTT]) |
Firmware and Software Optimizations | Intel® Hyper-Threading Technology (Intel® HT Technology) disabled; Intel® Turbo Boost Technology enabled; P-states enabled**; C-states enabled**; power-management settings set to performance**; workload configuration set to balanced**; Memory Latency Checker (MLC) streamer enabled**; MLC spatial prefetch enabled**; Data Cache Unit (DCU) data prefetch enabled**; DCU instruction prefetch enabled**; last-level cache (LLC) prefetch disabled**; uncore frequency scaling enabled** |
BigDL
Parameter | Value |
---|---|
Dataset | ImageNet-2012 |
Model | Inception V1 |
Benchmark | Training, Inference |
Spark Cores | 50 per worker node |
Batch Size | 800 images (4 × cores × executors) |
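BigDL distributes each mini-batch across all Spark cores, so the global batch size must be a multiple of the total core count (executors × cores per executor). A small sanity check using the figures from this configuration:

```python
# Figures from the benchmark configuration above.
executors = 4           # one executor per worker node
cores_per_executor = 50  # Spark cores per worker node
multiplier = 4          # the "4" in the table's 4 x cores x executors formula

total_cores = executors * cores_per_executor
batch_size = multiplier * total_cores

print(total_cores)  # 200
print(batch_size)   # 800

# Each mini-batch is split evenly across all cores, so the global
# batch size must divide evenly among them.
assert batch_size % total_cores == 0
```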
Performance
Metric | Result |
---|---|
ImageNet Training Throughput | 453 images/sec with Top-5 accuracy of 85.7% |
ImageNet Inference Throughput | 1,358 images/sec with Top-5 accuracy of 85.7% |
Test Results
- The worker nodes' CPU cores were not fully utilized in this test. During Inception V1 model training, the average throughput (453 images/sec) and Top-5 accuracy (85.7%) both exceeded the Intel Select Solution certification thresholds of 375 images/sec and 85%.
- For inference, the average throughput (1,358 images/sec) is roughly three times that of training.
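The claims above reduce to simple arithmetic; the check below compares the measured numbers against the certification thresholds cited in the test results:

```python
# Measured results from the Performance table.
train_throughput = 453    # images/sec
infer_throughput = 1358   # images/sec
top5_accuracy = 0.857

# Intel Select Solution certification thresholds cited above.
min_throughput = 375      # images/sec (training)
min_accuracy = 0.85       # Top-5

# Both certification conditions hold.
assert train_throughput >= min_throughput
assert top5_accuracy >= min_accuracy

# Inference runs roughly three times faster than training here.
speedup = infer_throughput / train_throughput
print(round(speedup, 1))  # 3.0
```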