Snowflake Introduces Snowpark Connect for Apache Spark in Public Preview


With this release, Spark users can leverage the power of the Snowflake engine directly from their existing Spark code. Snowpark Connect for Apache Spark builds on Spark Connect, the decoupled client-server architecture the Apache Spark community introduced in version 3.4, which separates user code from the cluster where processing runs. That separation is what now makes it possible for Spark jobs to be powered by Snowflake.
With Snowpark Connect, customers can leverage Snowflake's vectorized engine for their Spark code, avoiding the complexity of maintaining and tuning separate Spark environments, including dependency management, version compatibility, and upgrades. Modern Spark DataFrame, Spark SQL, and user-defined function (UDF) code runs on Snowflake as-is.
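To make that concrete, here is a minimal sketch of the kind of PySpark code that can run unchanged. It uses the generic Spark Connect client API; the endpoint URL is a placeholder, and Snowpark Connect's own session-bootstrap helper may differ from the `SparkSession.builder.remote()` call shown here.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Placeholder Spark Connect endpoint; with Snowpark Connect, the plan is
# executed by Snowflake's engine rather than a Spark cluster.
spark = SparkSession.builder.remote("sc://<your-endpoint>:443").getOrCreate()

# Ordinary Spark DataFrame code, unchanged from a cluster deployment.
orders = spark.createDataFrame(
    [(1, "shipped", 120.0), (2, "pending", 75.5), (3, "shipped", 42.0)],
    ["order_id", "status", "amount"],
)

# Spark SQL over a temporary view.
orders.createOrReplaceTempView("orders")
totals = spark.sql(
    "SELECT status, SUM(amount) AS total FROM orders GROUP BY status"
)

# A Python UDF, defined and applied the usual way.
shout = udf(lambda s: s.upper(), StringType())
totals.withColumn("status_label", shout(col("status"))).show()
```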
Snowflake automatically handles performance optimization and scaling, freeing developers from the operational overhead of managing Spark. In addition, bringing data processing into Snowflake establishes a single, robust governance framework from the outset, helping to ensure data consistency and security throughout the data lifecycle without duplicated effort.
Developed on Spark Connect and the Snowflake architecture
Snowpark Connect for Spark leverages the decoupled architecture of Spark Connect, which allows applications to send an unresolved logical plan to a remote Spark cluster for processing. This client-server separation has been fundamental to Snowpark's design since its inception. Snowpark Connect currently supports Spark 3.5.x, ensuring compatibility with the latest features and enhancements in that release line.
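A hedged sketch of that flow, again with a placeholder endpoint: the transformations below only build an unresolved logical plan on the client, and nothing is resolved or executed until an action ships the plan to the remote server.

```python
from pyspark.sql import SparkSession

# Placeholder endpoint; with Snowpark Connect, the server side of this
# connection is Snowflake rather than a Spark cluster.
spark = SparkSession.builder.remote("sc://<your-endpoint>:443").getOrCreate()

# These transformations run nothing locally: the client merely accumulates
# an unresolved logical plan describing the computation.
df = (
    spark.range(1_000_000)
    .filter("id % 2 = 0")
    .selectExpr("id * 2 AS doubled")
)

# Only an action serializes the plan and sends it to the server, where it
# is resolved, optimized, and executed.
print(df.count())
```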
This eliminates the need to move data between Spark and Snowflake, a process that has historically added cost, latency, and governance complexity. Organizations can now run Spark DataFrame, SQL, and UDF code in Snowflake through Snowflake Notebooks, Jupyter notebooks, Snowflake stored procedures, VS Code, Airflow, or Snowpark Submit, with seamless access to data in Snowflake tables, Iceberg tables (managed by Snowflake or externally), and cloud storage.
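As an illustration, the sketch below reads a governed table by its fully qualified name; the endpoint and the analytics_db.public.sales table are hypothetical, and the same client code could be launched from a notebook, a stored procedure, or an Airflow task rather than submitted to a Spark cluster.

```python
from pyspark.sql import SparkSession

# Placeholder endpoint; the client could equally be a Jupyter notebook,
# a Snowflake stored procedure, or an Airflow task.
spark = SparkSession.builder.remote("sc://<your-endpoint>:443").getOrCreate()

# Read a table by its fully qualified name. The hypothetical
# analytics_db.public.sales could be a native Snowflake table or an
# Iceberg table; the client code is identical either way.
sales = spark.table("analytics_db.public.sales")

sales.groupBy("region").sum("revenue").show()
```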