I wasn’t sure if my poor old eyes were deceiving me or not, and yet, there it was. After years and decades of doing data work, my heart has grown cold and my fingers feel wizened and dry as I leaf through the endless supply of new digital snake oils being hawked on the street corners.
But I could not resist this miracle elixir, Spark Connect; I was drawn to it like a fly to the light, even if it meant my death. There it was, in my face, staring back at me like some mad, crazed old person … saying “Come hither, my son, feast upon my delectable delights.”
Spark Connect … behold it has come to save the world.
I’m still unsure why there hasn’t been more fanfare, general horn-blowing, and shouting from the rooftops about Spark Connect. As of Apache Spark 3.4, Spark Connect was quietly released to the mindless masses without much ado.
I happen to think it’s one of the greatest moves Apache Spark could have made to keep itself relevant and on top of the hill as Lord of us all.
With the advent of tools like Polars and DataFusion, and the ever-present threat that they could go distributed in the near future, it’s as important as ever for Spark, the de facto tool, to keep innovating.
What, pray tell, is Apache Spark Connect?
“… a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API …”
and
“It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages …”
What does this mean to the average Data Engineer? It means one could take Golang, Rust, Python, whatever, from wherever, and connect to a Spark Cluster, do a thing, and get results back.
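To make that concrete, here is a minimal sketch (with a placeholder host and port) of what the client side looks like using the community spark_connect_rs crate, which we’ll meet properly below; the full, working walkthrough comes later in this post. You point the client at an `sc://host:port` endpoint, the query runs on the cluster, and the results come back over gRPC.
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

// Minimal sketch: connect to a remote Spark Connect endpoint (port 15002 by default),
// run a query on the cluster, and print the result locally.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spark: SparkSession = SparkSessionBuilder::remote("sc://your-spark-host:15002/")
        .build()
        .await?;

    let df = spark.sql("SELECT 1 AS answer").await?;
    df.show(Some(1), None, None).await?;
    Ok(())
}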
It’s hard to overstate how big a deal this is; it opens the floodgates for a whole new wave of toolsets and products built on top of Spark.
In case you still don’t quite grasp what’s going on with Spark Connect …
How Spark results are typically dealt with.
If you work in the land of Spark for any period of time, building pipelines and datasets for downstream applications and business units, one thing is taken for granted: the results have to land somewhere.
We are constantly writing result sets out to intermediate storage and ingesting those results into other systems like Postgres and MySQL, or simply doing more munging and massaging with tools like Python, etc.
We write results to …
Parquet files
CSV files
Delta Lake tables
etc.
Of course, there are some tools that allow us to dump DataFrames directly into Postgres and the like, but they’ve always been a little rickety.
The main “sticking point” has always been the large wall of JVM that has been built up around Spark, blocking off easy access from the rest of the world.
This is Spark Connect, my friend. It busts down that wall. Let’s prove the point.
Writing an Apache Spark pipeline with Rust.
So because I can, I will, that’s my way in life. Let’s remove all doubt about this 7th Wonder of the World, Spark Connect, and give it a try ourselves.
We will do this by installing a Spark Cluster of our own on a remote server in the cloud, then connecting to it with Rust and getting some results back.
Step 1:
Create a remote server. (I use Linode and Ubuntu shared instances).
Install crap like Java, UFW, etc, etc.
Step 2:
Harden access to the server with things like UFW and fail2ban, and only allow access from whitelisted IP address(es) of your choice.
root@localhost:~# sudo ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
Anywhere                   ALLOW       217.180.228.XXX
22/tcp (v6)                ALLOW       Anywhere (v6)
Step 3:
Get and install Apache Spark 3.4.0 onto the server and start the Spark cluster.
>> wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
>> tar xvf spark-3.4.0-bin-hadoop3.tgz
>> sudo mv spark-3.4.0-bin-hadoop3 spark
Get the cluster up and running …
root@localhost:~# ls
spark spark-3.4.0-bin-hadoop3.tgz
root@localhost:~# cd spark
root@localhost:~/spark# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /root/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-localhost.out
I can see the cluster is up and running from my local machine by hitting the IP address of my remote machine on port 8080 (the Spark master web UI).
Also, we need to start the Spark Connect server.
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
Step 4:
Dude, let’s write some Rust Spark! Boy, never thought I would say those words together. Amazing.
Luckily, there is spark_connect_rs waiting for us. So let’s set up a new Rust project and give this a try.
cargo new rust-test-spark
cargo add spark_connect_rs
cargo add tokio
Let’s also create a small `CSV` file on our Spark server, and see if we can read it from Rust.
mkdir /data
vim /data/test.csv
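The file is nothing fancy. Something along these lines will do (the trailing comma on each row is what shows up as that null third column in the output further down):
id,name,
1,billbo,
2,gandalf,
3,samwise,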
Here is my Rust code.
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the remote Spark Connect endpoint (gRPC on port 15002).
    let spark: SparkSession = SparkSessionBuilder::remote("sc://172.233.217.239:15002/")
        .build()
        .await?;

    // Query the CSV file sitting on the cluster, straight from SQL.
    let df = spark
        .sql("SELECT * FROM csv.`/data/test.csv`")
        .await?;

    // Pull the first 5 rows back and print them locally.
    df.show(Some(5), None, None).await?;

    Ok(())
}
Trying to build and run this project fails at first.
error: failed to run custom build command for `aws-lc-sys v0.14.1`
Caused by:
process didn't exit successfully: `/Users/danielbeach/code/rust-test-spark/target/release/build/aws-lc-sys-be37136d54f41db2/build-script-main` (exit status: 101)
--- stdout
cargo:rerun-if-env-changed=AWS_LC_SYS_INTERNAL_NO_PREFIX
cargo:rerun-if-env-changed=AWS_LC_RUST_INTERNAL_BINDGEN
cargo:rustc-cfg=aarch64_apple_darwin
cargo:rerun-if-env-changed=AWS_LC_SYS_STATIC
--- stderr
Missing dependency: cmake
After doing a `brew install cmake` on my local Mac (where I was trying to build the Rust project), I got past that issue.
Good Lord, it worked!
`cargo run --release`
danielbeach@Daniels-MacBook-Pro rust-test-spark % cargo run --release
Compiling rust-test-spark v0.1.0 (/Users/danielbeach/code/rust-test-spark)
Finished release [optimized] target(s) in 12.66s
Running `target/release/rust-test-spark`
+--------------------+
|         show_string|
+--------------------+
| +---+-------+----+ |
| |_c0|_c1    |_c2 | |
| +---+-------+----+ |
| |id |name   |null| |
| |1  |billbo |null| |
| |2  |gandalf|null| |
| |3  |samwise|null| |
| +---+-------+----+ |
|                    |
+--------------------+
That is quite amazing if I do say so myself. I used Rust on my local machine to connect to and run Spark commands on a remote Spark Cluster … and return the results.
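Reading is only half the story, of course. I haven’t wired this part up end to end, but spark_connect_rs aims to mirror the PySpark DataFrame API, so, assuming its writer follows the familiar format/save pattern, pushing a result set back out to storage on the cluster from the same Rust client would look roughly like this sketch (the query and output path are made up for illustration):
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same remote cluster as before.
    let spark: SparkSession = SparkSessionBuilder::remote("sc://172.233.217.239:15002/")
        .build()
        .await?;

    // Do some work server-side: filter out the header row from the raw CSV.
    let df = spark
        .sql("SELECT _c0 AS id, _c1 AS name FROM csv.`/data/test.csv` WHERE _c0 <> 'id'")
        .await?;

    // Sketch only: write the result set out as Parquet on the cluster,
    // assuming a PySpark-style writer (format(...).save(...)).
    df.write()
        .format("parquet")
        .save("/data/output/names")
        .await?;

    Ok(())
}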
What’s next, flying pigs?
Is a new era of ETL and Data Engineering upon us?
This is another one of those big turning points that makes me wonder if people have the wherewithal to take advantage of what’s laid before them.
Think about it. Could it be the dawn of a new era of writing data pipelines and ETL … against massive Spark Clusters in the cloud … all from the safety of your Python, Golang, or Rust code?
You have to admit, it does open up a vast new horizon of possibilities for expanding Apache Spark into other verticals, and new, powerful ways of creating Data Apps powered by Spark Connect.
What does the future hold? What new tools and apps will the open-source and other companies build on top of Spark Connect? Will it really change the way we do Data Engineering with Apache Spark? One can only hope.