GPU-Accelerated Apache Spark

For data analytics, machine learning, and deep learning pipelines.

Accelerate Apache Spark 3 data science pipelines—without code changes—and speed up data processing and model training while substantially lowering infrastructure costs.

 

Key Benefits of Spark on NVIDIA GPUs

Faster Execution Time

Accelerate the performance of data preparation tasks to quickly move to the next stage of the pipeline. This allows models to be trained faster, while freeing up data scientists and engineers to focus on the most critical activities.

Reduced Infrastructure Costs

Do more with less: Spark on NVIDIA® GPUs completes jobs faster with less hardware when compared to CPUs, saving organizations time as well as on-premises capital costs or operational costs in the cloud.

Streamlined AI Journey

Use NVIDIA AI Enterprise, an end-to-end AI software platform that includes the RAPIDS Accelerator, to speed time to production by accelerating the end-to-end AI pipeline, from data preparation and processing to model training, simulation, and inference at scale.

Analysis Tool

Run Apache Spark 5X Faster

Evaluate workloads for GPU acceleration, and learn how to configure a cluster for optimal cost savings.

Spark 3 Innovations

Given the “embarrassingly parallel” nature of many data processing tasks, it’s only natural that GPU architecture should be used for Spark data processing queries, similar to how a GPU can accelerate deep learning workloads in AI. GPU acceleration is transparent to the developer and doesn’t require code changes to obtain benefits. Three key advancements in Spark 3 have contributed to delivering transparent GPU acceleration:

RAPIDS Accelerator for Spark 3

NVIDIA CUDA® is a revolutionary parallel computing platform that supports accelerating computational operations on NVIDIA GPU architecture. RAPIDS, incubated at NVIDIA, is a suite of open-source libraries layered on top of CUDA that enables GPU acceleration of data science pipelines.

NVIDIA has created a RAPIDS Accelerator for Spark 3 that intercepts and accelerates extract, transform and load pipelines by dramatically improving the performance of Spark SQL and DataFrame operations.
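As a minimal sketch of what "without code changes" can look like in practice, the accelerator is typically switched on through Spark configuration rather than by editing the job itself. The keys `spark.plugins` (with the plugin class `com.nvidia.spark.SQLPlugin`) and `spark.rapids.sql.enabled` come from the RAPIDS Accelerator's public configuration; the helper names, the jar name `rapids-4-spark.jar`, and the script name `etl_job.py` are hypothetical placeholders used only for illustration.

```python
from typing import Dict, List


def rapids_plugin_conf(enabled: bool = True) -> Dict[str, str]:
    """Return the Spark settings that load the RAPIDS Accelerator plugin."""
    return {
        # Tells Spark 3 to load the accelerator as a plugin at startup.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Toggles GPU execution of SQL/DataFrame operators on or off.
        "spark.rapids.sql.enabled": str(enabled).lower(),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Placeholder jar and script names; the job script itself is unchanged.
    cmd = (
        ["spark-submit", "--jars", "rapids-4-spark.jar"]
        + to_submit_args(rapids_plugin_conf())
        + ["etl_job.py"]
    )
    print(" ".join(cmd))
```

Setting `enabled=False` in this sketch keeps the plugin loaded but asks it to leave operators on the CPU, which can be handy when comparing the two execution paths on the same job.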

Modifications to Spark Components

Spark 3 provides columnar processing support in the Catalyst query optimizer, which is what the RAPIDS Accelerator plugs into to accelerate SQL and DataFrame operators. When the query plan is executed, those operators can then be run on GPUs within the Spark cluster.
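To make the idea of columnar processing concrete, here is a small, purely illustrative sketch in plain Python. It is not how Spark or the RAPIDS Accelerator is implemented; it only shows the difference between handling data row by row and handing an operator a whole column at once, which is the layout that maps well onto GPU execution. The function names and sample fields are assumptions made for this example.

```python
from typing import Dict, List

Row = Dict[str, float]
ColumnBatch = Dict[str, List[float]]


def rows_to_columns(rows: List[Row]) -> ColumnBatch:
    """Pivot a list of row records into one contiguous list per column."""
    columns: ColumnBatch = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def add_columns(batch: ColumnBatch, left: str, right: str, out: str) -> ColumnBatch:
    """Apply one operator to two whole columns and attach the result column."""
    result = dict(batch)
    result[out] = [a + b for a, b in zip(batch[left], batch[right])]
    return result


if __name__ == "__main__":
    rows = [{"price": 1.0, "tax": 0.5}, {"price": 2.0, "tax": 0.25}]
    batch = add_columns(rows_to_columns(rows), "price", "tax", "total")
    print(batch["total"])
```

In a columnar engine the per-column loop in `add_columns` is the part that can be replaced by a single vectorized or GPU kernel call, since every element of the column goes through the same operation.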

NVIDIA has also created a new Spark shuffle implementation that optimizes the data transfer between Spark processes. This shuffle implementation is built on GPU-accelerated communication libraries, including UCX, RDMA, and NCCL.

GPU-Aware Scheduling in Spark

Spark 3 recognizes GPUs as a first-class resource along with CPU and system memory. This allows Spark 3 to place GPU-accelerated workloads directly onto servers containing the necessary GPU resources as they are needed to accelerate and complete a job.

NVIDIA engineers have contributed to this major Spark enhancement, enabling the launch of Spark applications on GPU resources in Spark standalone, YARN, and Kubernetes clusters.
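GPU-aware scheduling is driven by resource settings such as `spark.executor.resource.gpu.amount` and `spark.task.resource.gpu.amount`, alongside the familiar `spark.executor.cores` and `spark.task.cpus`. The sketch below builds such a configuration and estimates how many tasks one executor could run at once. The helper names are hypothetical, and the slot calculation is a simplified model (the most constrained resource wins), not a reproduction of Spark's scheduler.

```python
from fractions import Fraction
from typing import Dict


def gpu_scheduling_conf(
    executor_cores: int, gpus_per_executor: int, gpu_share_per_task: str
) -> Dict[str, str]:
    """Build resource settings that let Spark 3 schedule tasks onto GPUs."""
    return {
        "spark.executor.cores": str(executor_cores),
        "spark.task.cpus": "1",
        # Number of GPUs each executor requests from the cluster manager.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Share of a GPU each task needs; a fraction lets tasks share one GPU.
        "spark.task.resource.gpu.amount": gpu_share_per_task,
    }


def max_concurrent_tasks(conf: Dict[str, str]) -> int:
    """Estimate task slots per executor as the minimum across CPU and GPU."""
    by_cpu = int(conf["spark.executor.cores"]) // int(conf["spark.task.cpus"])
    by_gpu = Fraction(conf["spark.executor.resource.gpu.amount"]) / Fraction(
        conf["spark.task.resource.gpu.amount"]
    )
    return min(by_cpu, int(by_gpu))


if __name__ == "__main__":
    conf = gpu_scheduling_conf(executor_cores=8, gpus_per_executor=1, gpu_share_per_task="0.25")
    print(max_concurrent_tasks(conf))
```

With eight cores, one GPU, and a quarter of a GPU per task, the GPU share is the limiting resource in this model; lowering the core count below four would make the CPU the limit instead.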

Accelerated Analytics and AI on Spark

Spark 3 marks a key milestone for analytics and AI, as ETL operations are now accelerated while ML and DL applications leverage the same GPU infrastructure. The complete stack for this accelerated data science pipeline is shown below:

[Figure: Accelerated Analytics and AI on Spark]

Enterprise-Ready Spark Acceleration

RAPIDS Accelerator for Apache Spark is available with NVIDIA AI Enterprise. Get optimized performance for Spark deployments with full access to enterprise-grade support, security, and stability on certified platforms from on premises to cloud, including Amazon EMR, Google Cloud Dataproc, and Databricks. Take advantage of guaranteed response times, priority security notifications, and access to data science experts from NVIDIA.

IRS

The Cloudera and NVIDIA integration will empower us to use data-driven insights to power mission-critical use cases. We are currently implementing this integration and already seeing over 10X speed improvements at half the cost for our data engineering and data science workflows.

– Joe Ansaldi, Technical Branch Chief of Research Applied Analytics and Statistics, IRS

Adobe

We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3 compared to running Spark on CPUs. With these game-changing GPU performance gains, entirely new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.

– William Yan, Senior Director of Machine Learning, Adobe

Databricks

Our continued work with NVIDIA improves performance with RAPIDS optimizations for Apache Spark 3 and Databricks to benefit our joint customers like Adobe. These contributions lead to faster data pipelines, model training and scoring, that directly translate to more breakthroughs and insights for our community of data engineers and data scientists.

– Matei Zaharia, Original Creator of Apache Spark and Chief Technologist at Databricks

Download Free E-Book

To unlock the value of AI-powered big data and learn more about the next evolution of Apache Spark, download NVIDIA’s new e-book, “Accelerating Apache Spark 3.x – Leveraging NVIDIA GPUs to Power the Next Era of Analytics and AI.”