I wasn’t sure if my poor old eyes were deceiving me or not, and yet, there it was. After years and decades of doing data work, my heart has grown cold and my fingers feel wizened and dry as I leaf through the endless supply of new digital snake oils being hawked on the street corners.
But I could not resist this miracle elixir, Spark Connect; I was drawn to it like a fly to the light, even if it meant my death. There it was, in my face, staring back at me like some mad, crazed old person … saying “Come hither, my son, feast upon my delectable delights.”
Spark Connect … behold it has come to save the world.
I’m still unsure why there hasn’t been more fanfare, general horn-blowing, and shouting from the rooftops about Spark Connect. As of Apache Spark 3.4, Spark Connect was quietly released to the mindless masses without much ado.
I happen to think it’s one of the greatest moves Apache Spark could have made to keep itself relevant and on top of the hill as Lord of us all.
With the advent of tools like Polars and DataFusion, and the ever-present threat that they could go distributed in the near future, it’s as important as ever for Spark, the de facto tool, to keep innovating.
What, pray tell, is Apache Spark Connect?
“… a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API …”
and
“It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages …”
What does this mean to the average Data Engineer? It means one could take Golang, Rust, Python, whatever, from wherever, and connect to a Spark Cluster, do a thing, and get results back.
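To make that concrete, here is a minimal sketch (with a placeholder host and port) of what the client side looks like using the community spark_connect_rs crate, which we’ll meet properly below; the full, working walkthrough comes later in this post. You point the client at an `sc://host:port` endpoint, the query runs on the cluster, and the results come back over gRPC.
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

// Minimal sketch: connect to a remote Spark Connect endpoint (port 15002 by default),
// run a query on the cluster, and print the result locally.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://your-spark-host:15002/")
        .build()
        .await?;

    let df = spark.sql("SELECT 1 AS answer").await?;
    df.show(Some(1), None, None).await?;
    Ok(())
}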
It’s hard to overstate how big a deal this is; it opens the floodgates for a whole new wave of toolsets and products built on top of Spark.
In case you still don’t quite grasp what’s going on with Spark Connect …
How Spark results are typically dealt with.
If you work in the land of Spark for any period of time, building pipelines and datasets for downstream applications and business units, one thing is taken for granted: the results have to land somewhere.
We are constantly writing result sets out to intermediate storage and ingesting those results into other systems like Postgres and MySQL, or simply doing more munging and massaging with tools like Python, etc.
We write results to …
Parquet files
CSV files
Delta Lake tables
etc.
Of course, there are some tools that allow us to dump DataFrames directly into Postgres and the like, but they’ve always been a little rickety.
The main “sticking point” has always been the large wall of JVM that has been built up around Spark, blocking off easy access from the rest of the world.
This is Spark Connect, my friend. It busts down that wall. Let’s prove the point.
Writing an Apache Spark pipeline with Rust.
So because I can, I will, that’s my way in life. Let’s remove all doubt about this 7th Wonder of the World, Spark Connect, and give it a try ourselves.
We will do this by installing a Spark Cluster of our own on a remote server in the cloud, then connecting to it with Rust and getting some results back.
Step 1:
Create a remote server. (I use Linode and Ubuntu shared instances).
Install crap like Java, UFW, etc, etc.
Step 2:
Harden access to the server with things like UFW and fail2ban, and only allow access from whitelisted IP address(es) of your choice.
root@localhost:~# sudo ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
Anywhere                   ALLOW       217.180.228.XXX
22/tcp (v6)                ALLOW       Anywhere (v6)
Step 3:
Get and install Apache Spark 3.4.0 onto the server and start the Spark cluster.
>> wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
>> tar xvf spark-3.4.0-bin-hadoop3.tgz
>> sudo mv spark-3.4.0-bin-hadoop3 spark
Get the cluster up and running …
root@localhost:~# ls
spark spark-3.4.0-bin-hadoop3.tgz
root@localhost:~# cd spark
root@localhost:~/spark# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /root/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-localhost.out
I can see the cluster is up and running from my local machine by hitting the IP address of my remote machine on port 8080 (the Spark master web UI).
Also, we need to start the Spark Connect server.
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
Step 4:
Dude, let’s write some Rust Spark! Boy, never thought I would say those words together. Amazing.
Luckily, there is spark_connect_rs waiting for us. So let’s set up a new Rust project and give this a try.
cargo new rust-test-spark
cargo add spark_connect_rs
cargo add tokio
Let’s also create a small `CSV` file on our Spark server, and see if we can read it from Rust.
mkdir /data
vim /data/test.csv
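The file is nothing fancy. Something along these lines will do (the trailing comma on each row is what shows up as that null third column in the output further down):
id,name,
1,billbo,
2,gandalf,
3,samwise,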
Here is my Rust code.
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the remote Spark Connect endpoint (gRPC on port 15002).
    let spark: SparkSession = SparkSessionBuilder::remote("sc://172.233.217.239:15002/")
        .build()
        .await?;

    // Query the CSV file sitting on the cluster, straight from SQL.
    let df = spark
        .sql("SELECT * FROM csv.`/data/test.csv`")
        .await?;

    // Pull the first 5 rows back and print them locally.
    df.show(Some(5), None, None).await?;

    Ok(())
}
Trying to build and run this project fails at first.
error: failed to run custom build command for `aws-lc-sys v0.14.1`
Caused by:
process didn't exit successfully: `/Users/danielbeach/code/rust-test-spark/target/release/build/aws-lc-sys-be37136d54f41db2/build-script-main` (exit status: 101)
--- stdout
cargo:rerun-if-env-changed=AWS_LC_SYS_INTERNAL_NO_PREFIX
cargo:rerun-if-env-changed=AWS_LC_RUST_INTERNAL_BINDGEN
cargo:rustc-cfg=aarch64_apple_darwin
cargo:rerun-if-env-changed=AWS_LC_SYS_STATIC
--- stderr
Missing dependency: cmake
After doing a `brew install cmake` on my local Mac (where I was trying to build the Rust project), I got past that issue.
Good Lord, it worked!
`cargo run --release`
danielbeach@Daniels-MacBook-Pro rust-test-spark % cargo run --release
Compiling rust-test-spark v0.1.0 (/Users/danielbeach/code/rust-test-spark)
Finished release [optimized] target(s) in 12.66s
Running `target/release/rust-test-spark`
+--------------------+
|         show_string|
+--------------------+
| +---+-------+----+ |
| |_c0|_c1    |_c2 | |
| +---+-------+----+ |
| |id |name   |null| |
| |1  |billbo |null| |
| |2  |gandalf|null| |
| |3  |samwise|null| |
| +---+-------+----+ |
|                    |
+--------------------+
That is quite amazing if I do say so myself. I used Rust on my local machine to connect to and run Spark commands on a remote Spark Cluster … and return the results.
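Reading is only half the story, of course. I haven’t wired this part up end to end, but spark_connect_rs aims to mirror the PySpark DataFrame API, so, assuming its writer follows the familiar format/save pattern, pushing a result set back out to storage on the cluster from the same Rust client would look roughly like this sketch (the query and output path are made up for illustration):
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same remote cluster as before.
    let spark: SparkSession = SparkSessionBuilder::remote("sc://172.233.217.239:15002/")
        .build()
        .await?;

    // Do some work server-side: filter out the header row from the raw CSV.
    let df = spark
        .sql("SELECT _c0 AS id, _c1 AS name FROM csv.`/data/test.csv` WHERE _c0 <> 'id'")
        .await?;

    // Sketch only: write the result set out as Parquet on the cluster,
    // assuming a PySpark-style writer (format(...).save(...)).
    df.write()
        .format("parquet")
        .save("/data/output/names")
        .await?;

    Ok(())
}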
What’s next, flying pigs?
Is a new era of ETL and Data Engineering upon us?
This is another one of those big turning points that makes me wonder if people have the wherewithal to take advantage of what’s laid before them.
Think about it. Could it be the dawn of a new era of writing data pipelines and ETL … against massive Spark Clusters in the cloud … all from the safety of your Python, Golang, or Rust code?
You have to admit, it does open up a vast new horizon of possibilities for expanding Apache Spark into other verticals, and new, powerful ways of creating Data Apps powered by Spark Connect.
What does the future hold? What new tools and apps will the open-source and other companies build on top of Spark Connect? Will it really change the way we do Data Engineering with Apache Spark? One can only hope.