Incorporating Plotly into your Zeppelin notebooks with Spark and Scala
When you use Zeppelin with the Spark interpreter and Scala for data analysis, you quickly run into the limits of the standard Zeppelin graphs. To get past them while still scripting in Scala, it helps to bring in another graph library that can display whatever you need in a nice, colourful way.
In this blog post, I’m going to quickly show you how to incorporate the well-known Plotly JS library into your Zeppelin notebooks with Scala. With the approach presented here, you can use basically any JS-based graph library, as long as you prepare the data in the format that library needs.
Let’s get started.
Installing Spark and Zeppelin
To install Spark and Zeppelin, simply download the latest versions of them from the corresponding websites and execute the following steps:
Side note: you can download a smaller version of Zeppelin with only the Spark interpreter pre-installed (it is about 3 times smaller than the full version)
- Add environment variables pointing to your Spark installation and add the Spark bin folder to your path:
export SPARK_HOME="/Users/blahblah/spark-3.1.2-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin/:$PATH"
- Inside Zeppelin’s conf directory, copy the zeppelin-env template file and rename the copy to zeppelin-env.sh. Open the file in your text editor of choice, find the commented-out setting for SPARK_HOME and add your own, like: export SPARK_HOME=/Users/blahblah/spark-3.1.2-bin-hadoop3.2
- Navigate to Zeppelin’s bin directory and execute the zeppelin.sh script. Open up your browser at localhost:8080 and you should see Zeppelin’s home page.
- Play around with the provided examples, which you can find in the Notebook -> Spark Tutorial menu. If something is borking, it may be due to a version mismatch between Zeppelin and Spark; in that case you can disable the version check in Zeppelin’s configuration.
If, for example, you are running a Spark version newer than 3.2.0 with Zeppelin 0.9, you need to change a setting in the interpreters menu so that the supported version check is not executed. Open up the Interpreter menu, type ‘spark’, click Edit in the Spark interpreter, uncheck the flag zeppelin.spark.enableSupportedVersionCheck, and save.
Side note: please make sure the browser has not autofilled anything into the empty text fields before you save, otherwise the whole interpreter can start crashing (been there, done that).
Finally, you can verify that the Spark context is ready by opening a new Note, typing sc.version and executing it. You should see the currently used Spark version printed at the bottom of the Zeppelin paragraph.
Installing dependencies
In order to use Plotly, or actually any JS graphing library, you would probably like to feed it with some JSON data. To do that easily, we add a dependency allowing us to convert our Scala object instances into their corresponding JSON representations. In our case, we will simply use JSON capabilities from the Play framework.
Open up the Spark interpreter settings again and click Edit. At the bottom of the list of settings, you will see text fields for adding new artifacts. Add com.typesafe.play:play-json_2.12:jar:2.8.0 as a dependency so it will be available for imports in our Zeppelin notes.
Save and restart the interpreter for the changes to take effect.
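As a quick check that the dependency is picked up, a paragraph along these lines should run once the interpreter has restarted. It is only a minimal sketch: it assumes the play-json artifact above is on the interpreter classpath, and the sample values are made up for illustration.

```scala
import play.api.libs.json.Json

// Made-up sample values, only to show the conversion.
val labels = List("giraffes", "orangutans", "monkeys")
val counts = List(20L, 14L, 23L)

// Json.toJson needs an implicit Writes; play-json ships them for
// String, Long and standard collections such as List.
val labelsJson = Json.stringify(Json.toJson(labels))
val countsJson = Json.stringify(Json.toJson(counts))

println(labelsJson) // ["giraffes","orangutans","monkeys"]
println(countsJson) // [20,14,23]
```

The resulting strings are valid JS array literals, which is what lets us paste them straight into a script block later on.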
Your first Plotly graph with Plotly JS
One of the simplest graphs you can draw with Plotly is a bar graph. The JS Plotly documentation states that we need to provide 2 arrays of data to build this graph, one for the X values and another one for the Y axis.
var data = [
  {
    x: ['giraffes', 'orangutans', 'monkeys'],
    y: [20, 14, 23],
    type: 'bar'
  }
];
Plotly.newPlot('myDiv', data);
To use the same graph in Zeppelin, we need to create a method that prints an HTML output with our graph embedded. For the purposes of this tutorial, we will work with the highly popular seaborn-data repository and its tips.csv file, which can be downloaded from here.
Our template for the method returning our graph looks like the following:
def plotGraph() = {
  print(s"""%html
<head><script src="https://cdn.plot.ly/plotly-latest.min.js"></script></head>
<body><div id="someRandomIdInstance"></div></body>
<script>
  var data = [
    {
      x: [], // JS array of x values goes here
      y: [], // JS array of y values goes here
      type: 'bar',
      marker: {
        color: [] // JS array of colours goes here
      }
    }
  ];
  var layout = {
    title: "Some title",
    xaxis: {
      title: "some title"
    },
    yaxis: {
      title: "some title"
    }
  }
  Plotly.newPlot('someRandomIdInstance', data, layout);
</script>
""")
}
plotGraph()
As you can see, our template method simply prints an HTML string, which Zeppelin renders as HTML because the output starts with %html. The only thing we need to take care of is filling in the missing values for x and y and, where needed, customising the graph with other variables.
Let’s start with the example of a bar chart, but for our tips.csv data.
Add the necessary imports and load our data into a Spark dataset first:
%spark
import play.api.libs.json._
import play.api.libs.functional.syntax._
import org.apache.spark.sql.Dataset
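import org.apache.spark.sql.Column
// col is used by the helper methods later on; importing it explicitly is harmless
import org.apache.spark.sql.functions.col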
val tips = spark.read.
option("delimiter", ",").
option("header", "true").
option("inferSchema", "true").
csv("/Users/blahblah/Downloads/tips.csv")
Create some utility methods and group our data by category so we can generate bar charts with random colours for a chosen column of the data set:
%spark
def randomColor() = "#%06x".format(scala.util.Random.nextInt(1<<24))
def getCategoryStats(columnName: String) = tips.select(col(columnName)).groupBy(col(columnName)).count().as[(String, Long)].collect.toList
def plotCategory(columnName: String) = {
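// nextInt(Int.MaxValue) is never negative, unlike Math.abs(nextInt) for Int.MinValue
val randomIndex = scala.util.Random.nextInt(Int.MaxValue)
// Run the Spark aggregation once and split the (value, count) pairs,
// so the labels and the counts are guaranteed to line up
val (xs, ys) = getCategoryStats(columnName).unzip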
print(s"""%html
<head><script src="https://cdn.plot.ly/plotly-latest.min.js"></script></head>
<body><div id="plotCat_${randomIndex}"></div></body>
<script>
var data = [
{
x: ${Json.toJson(xs)},
y: ${Json.toJson(ys)},
type: 'bar',
marker: {
color: ${Json.toJson(xs.map(_ => randomColor()))}
}
}
];
var layout = {
title: "$columnName Grouped",
xaxis: {
title: "$columnName"
},
yaxis: {
title: "count"
}
}
Plotly.newPlot('plotCat_${randomIndex}', data, layout);
</script>
""")
}
The randomColor() method simply generates a colour as a hex string (#rrggbb). It returns a random colour every time it is called.
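To see what randomColor() produces, here is a small standalone sketch. It repeats the definition so that it runs on its own; isHexColor is a hypothetical helper added only for this check and is not part of the note.

```scala
import scala.util.Random

// Same idea as randomColor() in the note: a random 24-bit integer
// formatted as six lowercase hex digits with a leading '#'.
def randomColor(): String = "#%06x".format(Random.nextInt(1 << 24))

// Hypothetical helper (not in the original note): checks the #rrggbb shape.
def isHexColor(s: String): Boolean = s.matches("#[0-9a-f]{6}")

val sample = List.fill(5)(randomColor())
println(sample.mkString(", "))     // five random colours, different on each run
println(sample.forall(isHexColor)) // true
```

The %06x format pads with leading zeros, so small random numbers still give a full six-digit colour.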
The getCategoryStats(columnName: String) method groups the rows by the given category/column name and counts them.
Json.toJson converts our Scala values into their JSON representation.
Our dataset contains 5 columns that can be interpreted as categories: sex, smoker, day, size, and time.
In other words, when we pass a column name to our function, the x values for our graph are the distinct values in that column (e.g. for sex, those are Male and Female) and the y values are the counts (occurrences of each value inside the selected column).
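The mapping from grouped rows to the two arrays can be sketched with plain Scala collections, without Spark. The pairs below are hypothetical placeholders standing in for what getCategoryStats returns; they are not real counts from tips.csv.

```scala
// Hypothetical (value, count) pairs standing in for the output of
// getCategoryStats; the numbers are placeholders, not real results.
val stats: List[(String, Long)] = List(("A", 3L), ("B", 5L), ("C", 2L))

// unzip splits the pairs into two lists that stay aligned by position.
val (xs, ys) = stats.unzip

println(xs) // List(A, B, C)
println(ys) // List(3, 5, 2)
```

Because both lists come from the same list of pairs, the i-th label always belongs to the i-th count.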
Example graphs for selected columns:
plotCategory("sex")
plotCategory("smoker")
plotCategory("day")
plotCategory("time")
That was easy! By following this simple recipe, you can create any chart you want, as long as it is available in Plotly JS or some other graph JS library.
More graphs
Bubble Chart Example:
def plotBubbles(columnName: String) = {
val randomIndex = scala.util.Random.nextInt(Int.MaxValue)
// Run the Spark aggregation once and split the (value, count) pairs
val (xs, ys) = getCategoryStats(columnName).unzip
print(s"""%html
<head><script src="https://cdn.plot.ly/plotly-latest.min.js"></script></head>
<body><div id="plotCat_${randomIndex}"></div></body>
<script>
var data = [
{
x: ${Json.toJson(xs)},
y: ${Json.toJson(ys)},
mode: 'markers',
marker: {
color: ${Json.toJson(xs.map(_ => randomColor()))},
size: ${Json.toJson(ys)}
}
}
];
var layout = {
title: "$columnName Occurrences",
xaxis: {
title: "$columnName"
},
yaxis: {
title: "count"
}
}
Plotly.newPlot('plotCat_${randomIndex}', data, layout);
</script>
""")
}
plotBubbles("day")
Pie Chart Example:
def plotPie(columnName: String) = {
val randomIndex = scala.util.Random.nextInt(Int.MaxValue)
// Run the Spark aggregation once and split the (value, count) pairs
val (xs, ys) = getCategoryStats(columnName).unzip
print(s"""%html
<head><script src="https://cdn.plot.ly/plotly-latest.min.js"></script></head>
<body><div id="plotCat_${randomIndex}"></div></body>
<script>
var data = [
{
labels: ${Json.toJson(xs)},
values: ${Json.toJson(ys)},
type: 'pie',
hole: .4
}
];
var layout = {
title: "$columnName Occurrences",
xaxis: {
title: "$columnName"
},
yaxis: {
title: "size"
}
}
Plotly.newPlot('plotCat_${randomIndex}', data, layout);
</script>
""")
}
plotPie("day")
Final thoughts
As you can see, once you prepare the data and a template method to return HTML with your chart, you can basically create any chart you want. The list of available charts with the documentation for each one can be found on the Plotly website.
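Since every chart in this post shares the same HTML shell, one possible refactoring is to pull that shell into a single method. The plotlyHtml helper below is hypothetical (it is not part of the original notes) and only a sketch: it assumes the trace and layout have already been serialised to JSON text.

```scala
// Hypothetical helper (not part of the original notes): wraps any Plotly
// data array and layout, already serialised as JSON text, in the HTML
// shell used throughout this post.
def plotlyHtml(divId: String, dataJson: String, layoutJson: String): String =
  s"""%html
<head><script src="https://cdn.plot.ly/plotly-latest.min.js"></script></head>
<body><div id="$divId"></div></body>
<script>
Plotly.newPlot('$divId', $dataJson, $layoutJson);
</script>
"""

// Example with hand-written JSON; in a note you would build these
// strings with Json.toJson as in the chart methods shown earlier.
val html = plotlyHtml(
  "demoChart",
  """[{"x": ["a", "b"], "y": [1, 2], "type": "bar"}]""",
  """{"title": "Demo"}"""
)
print(html)
```

With a helper like this, each new chart type only has to describe its own data and layout.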