Tutorials

SparkSQL.jl Blog

Welcome to the official SparkSQL.jl Blog. This blog teaches Julia developers best practices for using the SparkSQL.jl package.

Posts:

  1. Top 3 benefits of using the SparkSQL.jl Julia package

  2. Project links

  3. SparkSQL.jl and tutorials environment setup

Releases:

  1. SparkSQL.jl release 1.2.0 announcement

  2. SparkSQL.jl release 1.1.0 announcement

  3. SparkSQL.jl release 1.0.0 announcement

Tutorials:

  1. Introduction to SparkSQL.jl tutorial

  2. Working with data tutorial

  3. Machine learning with SparkSQL.jl tutorial

Top 3 benefits of using the SparkSQL.jl Julia package

SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL. Here are the top 3 reasons to use Julia with Spark for data science:

  1. Julia is a modern programming language with state-of-the-art data science packages, and it typically runs much faster than Python.

  2. Apache Spark is one of the world’s most widely used open-source big data processing platforms. SparkSQL.jl allows Julia programmers to create Spark applications in Julia.

  3. Used together, Julia and Apache Spark form an exceptionally capable data science platform. SparkSQL.jl makes that combination possible.

The official SparkSQL.jl project page is located here:

The official tutorial page for SparkSQL.jl is here:

SparkSQL.jl and tutorials environment setup

The “Tutorials_SparkSQL” folder contains the Julia Pluto notebook tutorials and sample data. To run the Pluto notebook tutorials, set up Apache Spark and your Julia environment:

  1. Install Apache Spark 3.2.0 or later: http://spark.apache.org/downloads.html
  2. Install either OpenJDK 8 or 11:
  3. Set up your JAVA_HOME and SPARK_HOME environment variables:
    • export JAVA_HOME=/path/to/java
    • export SPARK_HOME=/path/to/Apache/Spark
  4. If using OpenJDK 11 on Linux, set processReaperUseDefaultStackSize to true:
    • export _JAVA_OPTIONS='-Djdk.lang.processReaperUseDefaultStackSize=true'
  5. Start Apache Spark (using default values; the standalone master listens on port 7077 by default, and start-worker.sh takes the master URL as a positional argument):
    • /path/to/Apache/Spark/sbin/start-master.sh
    • /path/to/Apache/Spark/sbin/start-worker.sh spark://localhost:7077
  6. Start Julia with “JULIA_COPY_STACKS=yes”, which is required for JVM interop:
    • JULIA_COPY_STACKS=yes julia
  7. Install SparkSQL.jl along with the other required Julia packages (from the Pkg REPL):
    • ] add SparkSQL DataFrames Decimals Dates Pluto
  8. Launch the Pluto notebook environment:
    • using Pluto; Pluto.run()
  9. Download the tutorial Notebooks and sample data from the Tutorials_SparkSQL repository. In Pluto, navigate to where you saved the tutorial notebooks.
  10. Open a tutorial notebook; it will run automatically.
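The setup steps above can be condensed into a short shell sketch. The paths are placeholders; substitute your actual Java and Spark installation directories:

export JAVA_HOME=/path/to/java
export SPARK_HOME=/path/to/Apache/Spark

# Linux with OpenJDK 11 only: work around small process-reaper thread stacks.
export _JAVA_OPTIONS='-Djdk.lang.processReaperUseDefaultStackSize=true'

# Start a local standalone master (spark://localhost:7077 by default)
# and attach one worker to it.
"$SPARK_HOME"/sbin/start-master.sh
"$SPARK_HOME"/sbin/start-worker.sh spark://localhost:7077

# JULIA_COPY_STACKS is required for JVM interop.
JULIA_COPY_STACKS=yes julia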

SparkSQL.jl release 1.2.0 announcement

This post announces the release of SparkSQL.jl version 1.2.0.

SparkSQL.jl is software that enables Julia programs to work with Apache Spark using just SQL.

Apache Spark is one of the world’s most widely used open-source big data processing engines. Spark’s distributed processing power enables it to handle very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.

Released in 2012, Julia is a modern programming language ideally suited to data science and machine learning workloads. Julia is a high-performance language featuring multiple dispatch, automatic differentiation, and a rich ecosystem of packages.

SparkSQL.jl provides the functionality that enables using Apache Spark and Julia together for tabular data. With SparkSQL.jl, Julia takes the place of Python for data science and machine learning work on Spark.

New features of this release are:

Install SparkSQL.jl via the Julia REPL:

] add SparkSQL

Update from earlier releases of SparkSQL.jl via the Julia REPL:

] update SparkSQL DataFrames

Example usage:

using SparkSQL, DataFrames

# Assumes `sprk` is an active SparkSession and `spark_data` is a table
# already registered on the Spark side.
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)       # copy the Julia data to Spark
createOrReplaceTempView(onSpark, "julia_data")  # make it queryable from Spark SQL
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                      # return the result as a Julia DataFrame
describe(results)
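The example assumes an active Spark connection named sprk. A sketch of creating one is below; the initJVM and SparkSession calls follow the SparkSQL.jl README, but check the current documentation for exact signatures, and the master URL and application name here are placeholders:

using SparkSQL
initJVM()  # start the in-process JVM once per Julia session
sprk = SparkSession("spark://localhost:7077", "JuliaSparkApp")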

Official Project Page:

SparkSQL.jl release 1.1.0 announcement

This post announces the release of SparkSQL.jl version 1.1.0.

SparkSQL.jl is software that enables Julia programs to work with Apache Spark using just SQL.

Apache Spark is one of the world’s most widely used open-source big data processing engines. Spark’s distributed processing power enables it to handle very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.

Released in 2012, Julia is a modern programming language ideally suited to data science and machine learning workloads. Julia is a high-performance language featuring multiple dispatch, automatic differentiation, and a rich ecosystem of packages.

SparkSQL.jl provides the functionality that enables using Apache Spark and Julia together for tabular data. With SparkSQL.jl, Julia takes the place of Python for data science and machine learning work on Spark.

New features of this release are:

Install SparkSQL.jl via the Julia REPL:

] add SparkSQL

Update from earlier releases of SparkSQL.jl via the Julia REPL:

] update SparkSQL DataFrames

Example usage:

using SparkSQL, DataFrames

# Assumes `sprk` is an active SparkSession and `spark_data` is a table
# already registered on the Spark side.
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)       # copy the Julia data to Spark
createOrReplaceTempView(onSpark, "julia_data")  # make it queryable from Spark SQL
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                      # return the result as a Julia DataFrame
describe(results)

Official Project Page:

SparkSQL.jl release 1.0.0 announcement

This post announces the availability of the SparkSQL.jl package.

SparkSQL.jl is an open-source software package that enables the Julia programming language to work with Apache Spark using just SQL and Julia.

Apache Spark is one of the world’s most widely used open-source big data processing engines. Spark’s distributed processing power enables it to handle very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments. By utilizing SparkSQL.jl, Julia can program Spark clusters running on:

Released in 2012, Julia is a modern programming language ideally suited to data science and machine learning workloads. Julia is a high-performance language featuring multiple dispatch, automatic differentiation, and a rich ecosystem of packages.

SparkSQL.jl provides the functionality that enables using Apache Spark and Julia together for tabular data. With SparkSQL.jl, Julia takes the place of Python for data science and machine learning work on Spark. Apache Spark data science tooling that is free from the limitations of Python represents a substantial upgrade.

For decision makers, SparkSQL.jl is the safe choice in data science tooling modernization. Julia interoperates with Python. That means legacy code investments are protected while gaining new capabilities.

The SparkSQL.jl package is designed to support many advanced features including Delta Lake. Delta Lake architecture is a best practice for multi-petabyte and trillion+ row datasets. The focus on tabular data using SQL means the Spark RDD API is not supported.
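As an illustration of the Delta Lake support, querying a Delta table through SparkSQL.jl is ordinary Spark SQL once the cluster has the Delta Lake libraries configured. The path and column names below are hypothetical:

# `sprk` is an active SparkSession; the Delta table path is a placeholder.
query = sql(sprk, "SELECT ticker, close FROM delta.`/data/prices` WHERE trade_date >= '2021-01-01'")
prices = toJuliaDF(query)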

You can install SparkSQL.jl via the Julia REPL:

] add SparkSQL

Example usage:

using SparkSQL, DataFrames

# Assumes `sprk` is an active SparkSession and `spark_data` is a table
# already registered on the Spark side.
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)       # copy the Julia data to Spark
createOrReplaceTempView(onSpark, "julia_data")  # make it queryable from Spark SQL
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                      # return the result as a Julia DataFrame
describe(results)

Official Project Page:

Introduction to SparkSQL.jl tutorial

The introduction to SparkSQL tutorial covers:

Download the tutorial: Tutorial_00_SparkSQL_Notebook.jl

Notebook output:

Working with data tutorial

The working with data tutorial covers:

Download the tutorial: Tutorial_01_Data_Notebook.jl

Notebook output:

Machine learning with SparkSQL.jl tutorial

The machine learning with SparkSQL tutorial covers:

Download the tutorial: Tutorial_02_MachineLearning_Notebook.jl

Notebook output: