Welcome to the official SparkSQL.jl Blog. This blog teaches Julia developers best practices for using the SparkSQL.jl package.
SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL. Here are the top 3 reasons to use Julia with Spark for data science:
Julia is a modern programming language with state-of-the-art data science packages, and it is substantially faster than Python for numerical workloads.
Apache Spark is one of the world’s most ubiquitous open-source big data processing platforms. SparkSQL.jl allows Julia programmers to create Spark applications in Julia.
Used together, Julia and Apache Spark form an advanced data science platform. SparkSQL.jl makes it happen.
The official SparkSQL.jl project page is located here:
The official tutorial page for SparkSQL.jl is here:
The “Tutorials_SparkSQL” folder has the Julia Pluto notebook tutorials and sample data. To run the Pluto notebook tutorials, set up Apache Spark and your Julia environment:
export JAVA_HOME=/path/to/java
export SPARK_HOME=/path/to/Apache/Spark
export _JAVA_OPTIONS='-Djdk.lang.processReaperUseDefaultStackSize=true'
/path/to/Apache/Spark/sbin/start-master.sh
/path/to/Apache/Spark/sbin/start-worker.sh spark://localhost:7077
JULIA_COPY_STACKS=yes julia
or, if the JVM conflicts with Julia’s signal handling:
JULIA_COPY_STACKS=yes julia --handle-signals=no
] add SparkSQL DataFrames Decimals Dates Pluto
using Pluto; Pluto.run()
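The examples later in this blog use a Spark session handle named sprk. Once the master and worker are running, a session can be opened from the Julia REPL. The sketch below assumes the standalone master from the setup steps above; the master URL and application name are placeholder values:

```julia
using SparkSQL, DataFrames

# Start the JVM bridge and connect to the running Spark master
# (URL and application name are placeholders for your own values).
initJVM()
sprk = SparkSession("spark://localhost:7077", "Julia SparkSQL App")

# Smoke-test query: run SQL on Spark, then pull the result into a Julia DataFrame.
stmt = sql(sprk, "SELECT 'hello' AS greeting")
toJuliaDF(stmt)
```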
This post announces the release of SparkSQL.jl version 1.4.0.
New features of this release are:
This post announces the release of SparkSQL.jl version 1.3.0.
SparkSQL.jl is software that enables developers to use the Julia programming language with the Apache Spark data processing engine.
Apache Spark is one of the world’s most ubiquitous open-source big data processing engines. Spark’s distributed processing power enables it to process very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.
Released in 2012, Julia is a modern programming language ideally suited for data science and machine learning workloads. Expertly designed, Julia is a highly performant language. It sports multiple dispatch, automatic differentiation, and a rich ecosystem of packages.
SparkSQL.jl submits Structured Query Language (SQL), Data Manipulation Language (DML), and Data Definition Language (DDL) statements to Apache Spark. It has functions to move data from Spark into Julia DataFrames and from Julia DataFrames into Spark.
SparkSQL.jl delivers advanced features like dynamic horizontal autoscaling, which scales compute nodes to match workload requirements (1). This package supports structured and semi-structured data in Data Lakes and Lakehouses (Delta Lake, Iceberg), on premises and in the cloud. To maximize Java Virtual Machine (JVM) performance, SparkSQL.jl brings support for the latest Java JDK-17 to Spark 3.2.0 (2).
New features of this release are:
Install SparkSQL.jl via the Julia REPL:
] add SparkSQL
Update from earlier releases of SparkSQL.jl via the Julia REPL:
] update SparkSQL DataFrames
Example usage:
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])   # build a Julia DataFrame
onSpark = toSparkDS(sprk, JuliaDataFrame)              # copy it to a Spark Dataset
createOrReplaceTempView(onSpark, "julia_data")         # register it as a Spark view
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                             # pull the result back into Julia
describe(results)                                      # summarize the result
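The SQL in the example keeps only the rows of the Spark-side spark_data table whose ticker appears in the uploaded julia_data view. A plain-Julia sketch of that IN-subquery semantics, with hypothetical in-memory rows standing in for the Spark table:

```julia
# Hypothetical in-memory rows standing in for the Spark-side spark_data table.
spark_data = [(ticker = "CRM", price = 210.0),
              (ticker = "IBM", price = 135.0),
              (ticker = "AAPL", price = 170.0)]
julia_data = ["CRM", "IBM"]   # the uploaded filter list

# Equivalent of: SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)
results = filter(row -> row.ticker in julia_data, spark_data)
```

On Spark this filter runs distributed across the cluster; the in-memory version is only meant to show which rows survive.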
To learn more visit the Official Project Page:
(1) The SparkSQL.jl compute node autoscaling feature is based on Kubernetes. For SparkSQL.jl on Kubernetes setup instructions, see the SparkSQL.jl Kubernetes readme.
(2) JDK-17 support is provided as a Podman container file: Containerfile-JDK-17
This post announces the release of SparkSQL.jl version 1.2.0.
SparkSQL.jl is software that enables Julia programs to work with Apache Spark using just SQL.
Apache Spark is one of the world’s most ubiquitous open-source big data processing engines. Spark’s distributed processing power enables it to process very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.
Released in 2012, Julia is a modern programming language ideally suited for data science and machine learning workloads. Expertly designed, Julia is a highly performant language. It sports multiple dispatch, automatic differentiation, and a rich ecosystem of packages.
SparkSQL.jl provides the functionality that enables using Apache Spark and Julia together for tabular data. With SparkSQL.jl, Julia takes the place of Python for data science and machine learning work on Spark.
New features of this release are:
Install SparkSQL.jl via the Julia REPL:
] add SparkSQL
Update from earlier releases of SparkSQL.jl via the Julia REPL:
] update SparkSQL DataFrames
Example usage:
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)
describe(results)
Official Project Page:
This post announces the release of SparkSQL.jl version 1.1.0.
SparkSQL.jl is software that enables Julia programs to work with Apache Spark using just SQL.
Apache Spark is one of the world’s most ubiquitous open-source big data processing engines. Spark’s distributed processing power enables it to process very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.
Released in 2012, Julia is a modern programming language ideally suited for data science and machine learning workloads. Expertly designed, Julia is a highly performant language. It sports multiple dispatch, automatic differentiation, and a rich ecosystem of packages.
SparkSQL.jl provides the functionality that enables using Apache Spark and Julia together for tabular data. With SparkSQL.jl, Julia takes the place of Python for data science and machine learning work on Spark.
New features of this release are:
Install SparkSQL.jl via the Julia REPL:
] add SparkSQL
Update from earlier releases of SparkSQL.jl via the Julia REPL:
] update SparkSQL DataFrames
Example usage:
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)
describe(results)
Official Project Page:
This post announces the availability of the SparkSQL.jl package.
SparkSQL.jl is an open-source software package that enables the Julia programming language to work with Apache Spark using just SQL and Julia.
Apache Spark is one of the world’s most ubiquitous open-source big data processing engines. Spark’s distributed processing power enables it to process very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments. By utilizing SparkSQL.jl, Julia can program Spark clusters running on:
Released in 2012, Julia is a modern programming language ideally suited for data science and machine learning workloads. Expertly designed, Julia is a highly performant language. It sports multiple dispatch, automatic differentiation, and a rich ecosystem of packages.
SparkSQL.jl provides the functionality that enables using Apache Spark and Julia together for tabular data. With SparkSQL.jl, Julia takes the place of Python for data science and machine learning work on Spark. Apache Spark data science tooling that is free from the limitations of Python represents a substantial upgrade.
For decision makers, SparkSQL.jl is a safe choice for modernizing data science tooling. Julia interoperates with Python, which means legacy code investments are protected while new capabilities are gained.
The SparkSQL.jl package is designed to support many advanced features including Delta Lake. Delta Lake architecture is a best practice for multi-petabyte and trillion+ row datasets. The focus on tabular data using SQL means the Spark RDD API is not supported.
You can install SparkSQL.jl via the Julia REPL:
] add SparkSQL
Example usage:
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)
describe(results)
Official Project Page:
The introduction to SparkSQL tutorial covers:
Download the tutorial: Tutorial_00_SparkSQL_Notebook.jl
The working with data tutorial covers:
Download the tutorial: Tutorial_01_Data_Notebook.jl
The machine learning with SparkSQL tutorial covers:
Download the tutorial: Tutorial_02_MachineLearning_Notebook.jl
Take the guesswork out of big data performance tuning. SparkSQL.jl supports dynamic autoscaling that matches compute node size to your workload. Ensure maximum performance at optimal cost by having the system scale worker nodes up and down to meet demand.
You set the maximum and minimum node sizes. Based on utilization, the system will autoscale within those parameters. Meet cost targets and service level agreement (SLA) requirements.
via YAML File:
kubectl apply -f spark-worker-hpa.yaml
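The contents of the manifest are not shown in this post; a minimal sketch of what spark-worker-hpa.yaml could contain, assuming the worker Deployment is named spark-worker and using the same bounds as the command-line example:

```yaml
# Hypothetical HorizontalPodAutoscaler for the spark-worker Deployment;
# replica bounds and CPU target mirror the kubectl autoscale example.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: spark-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spark-worker
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 90
```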
via command line:
kubectl autoscale deployment spark-worker --min=1 --max=4 --cpu-percent=90
kubectl get deployments,hpa
or monitor continuously:
watch kubectl get deployments,hpa