Unless noted otherwise, code is tested with Spark **2.2**
====== Non-committal testdrive ======
The minimum-effort way to test-drive Spark is the
[[https://databricks.com/spark/getting-started-with-apache-spark/quick-start#overview|Databricks tutorial]] (no local setup required).
====== Machine learning ======
Quora Q/A: [[https://www.quora.com/Why-are-there-two-ML-implementations-in-Spark-ML-and-MLlib-and-what-are-their-different-features|Why are there two ML implementations in Spark?]]
* spark.mllib contains the original API built on top of RDDs.
* spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
====== Profiling ======
A [[https://stackoverflow.com/questions/30900104/profiling-a-scala-spark-application|Stack Overflow question]] led me to
[[https://spektom.blogspot.com/2017/06/profiling-spark-applications-easy-way.html|a blog entry]] whose script is said to be platform-agnostic but did not work for me (YMMV).
I base my notes on the manual process described [[https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/|here]].
===== Installation =====
Listed in the order in which the tools are used later on.
==== influxdb ====
sudo apt install influxdb
You can manage the service with
sudo service influxdb stop
sudo service influxdb start
==== statsd ====
Build or download the jar from https://github.com/etsy/statsd-jvm-profiler
I tested successfully with version 2.1.0.
==== stacktrace export utility ====
Download this [[https://github.com/aviemzur/statsd-jvm-profiler/blob/master/visualization/influxdb_dump.py|Python script]].
==== flamegraph ====
Download this [[https://github.com/brendangregg/FlameGraph/blob/master/flamegraph.pl|Perl script]].
===== Prepare =====
Set some variables:
db_user=myuser
local_ip=$(hostname -s)
port=48081
influx_uri=http://${local_ip}:$port
flaminggraph_installation=/path/to/flamegraph/
MAINCLASS=my.Mainclass
Set up influxdb: create the database and a user with a password (all three are named after ''$db_user'')
curl -sS -X POST $influx_uri/query --data-urlencode "q=DROP DATABASE $db_user" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE DATABASE $db_user" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE USER $db_user WITH PASSWORD '$db_user' WITH ALL PRIVILEGES" # >/dev/null
Add the following lines to your spark-submit call (modify as desired):
1. the db connection configuration
--conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=your.ip.or.hostname,port=$port,reporter=InfluxDBReporter,database=$db_user,username=$db_user,password=$db_user,prefix=sparkapp,tagMapping=spark" \
2. the profiler jar, here ''$PROFJAR'', next to any other jars your job needs
--jars $GEOMESA_JAR,$PROFJAR \
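The agent option is long and easy to mistype, so it can help to build it in one place. A minimal sketch, assuming the database, user and password share one name as in the setup above; ''profiler_agent_opts'' is a hypothetical helper name and the values in the dry run are placeholders:

```shell
# Hypothetical helper: builds the -javaagent option string for statsd-jvm-profiler.
# Arguments: <influx host> <influx port> <name used for database, user and password>
profiler_agent_opts() {
  printf '%s' "-javaagent:statsd-jvm-profiler.jar=server=$1,port=$2,reporter=InfluxDBReporter,database=$3,username=$3,password=$3,prefix=sparkapp,tagMapping=spark"
}

# Dry run: print the --conf line instead of calling spark-submit.
# The values are placeholders, replace them with your own.
echo "--conf \"spark.executor.extraJavaOptions=$(profiler_agent_opts your.ip.or.hostname 48081 myuser)\""
```

The printed line can be pasted into the spark-submit call, or the function can be used inline via command substitution.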
===== Profile your code =====
- Ensure influxdb is up and running
- submit your job
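To check the first step without guessing, the HTTP API can be pinged before submitting. A minimal sketch, assuming an InfluxDB 1.x HTTP API (which answers ''/ping'' with status 204) and that ''curl'' is installed; ''influx_is_up'' is a hypothetical helper name:

```shell
# Returns 0 if the InfluxDB HTTP API at the given base URI answers /ping with 204.
# Assumption: InfluxDB 1.x behaviour; other versions may answer differently.
influx_is_up() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1/ping") || return 1
  [ "$code" = "204" ]
}

# Usage (influx_uri as set in the Prepare section):
# influx_is_up "$influx_uri" || echo "influxdb is not reachable at $influx_uri" >&2
```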
When the job has finished, dump your stacktraces:
python2.7 $flaminggraph_installation/influxdb_dump.py -o $local_ip -r $port -u $db_user -p $db_user -d $db_user -t spark -e sparkapp -x stack_traces
You can **filter**/exclude specific classes by adding an option
-f /path/to/filterfile
Your filterfile must contain lines with classnames to filter, e.g.
sun.nio
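A filterfile can be written straight from the shell. A small sketch: ''sun.nio'' is the entry from above, ''io.netty'' is a hypothetical second entry for illustration only, and the file location is an assumption:

```shell
# Write a filter file with one entry per line.
# sun.nio comes from the notes above; io.netty is a hypothetical extra entry.
filterfile="${TMPDIR:-/tmp}/flamegraph_filter.txt"
cat > "$filterfile" <<'EOF'
sun.nio
io.netty
EOF

# Print the path so it can be passed to the dump script via: -f <path>
echo "$filterfile"
```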
Now you can create your flamegraph
perl $flaminggraph_installation/flamegraph.pl --title "$MAINCLASS" stack_traces/all_*.txt > flamegraph.svg
and open it e.g. in Firefox.
The flamegraph is interactive: click into a cell to investigate.
Read more [[http://www.brendangregg.com/flamegraphs.html|here]].
{{:fg.png|}}
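If the dump step wrote nothing (for example because of wrong credentials, tag or prefix), there is nothing to render. A minimal guard, assuming the dump writes files matching ''stack_traces/all_*.txt'' as in the command above; ''have_stack_traces'' is a hypothetical helper name:

```shell
# Returns 0 if the directory contains at least one non-empty all_*.txt file.
have_stack_traces() {
  for f in "$1"/all_*.txt; do
    [ -s "$f" ] && return 0
  done
  return 1
}

# Usage (flaminggraph_installation and MAINCLASS as set in the Prepare section):
# if have_stack_traces stack_traces; then
#   perl "$flaminggraph_installation/flamegraph.pl" --title "$MAINCLASS" stack_traces/all_*.txt > flamegraph.svg
# else
#   echo "no stack traces found, check the dump step" >&2
# fi
```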
===== Submitting jobs =====
==== Providing spark jars ====
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
Download the required version [[https://spark.apache.org/downloads.html|here]].
How to set up provided jars (found [[https://mapr.com/docs/60/Spark/ConfigureSparkJARLocation_2.0.1.html|here]]):
cd /opt/spark-2.2.0-bin-hadoop2.7/jars
zip /opt/spark-2.2.0-bin-hadoop2.7/spark220-jars.zip ./*
# and then copy the archive to your HDFS
hdfs dfs -put /opt/spark-2.2.0-bin-hadoop2.7/spark220-jars.zip /user/hdfs/
Then you can make use of the provided archive by adding to spark-submit
--conf spark.yarn.archive=hdfs:///user/hdfs/spark220-jars.zip
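For orientation, this is roughly where the option sits in a full call. A dry-run sketch that only prints the command; ''submit_with_archive'' is a hypothetical helper name, ''my-app.jar'' is a placeholder, and cluster deploy mode is an assumption:

```shell
# Hypothetical dry-run helper: prints a spark-submit call that uses the uploaded archive.
# Arguments: <hdfs path of the archive> <main class> <application jar>
submit_with_archive() {
  printf '%s\n' "spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.archive=hdfs://$1 --class $2 $3"
}

# my-app.jar is a placeholder for your application jar.
submit_with_archive /user/hdfs/spark220-jars.zip my.Mainclass my-app.jar
```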
====== Testing ======
Libraries to look into:
* https://github.com/holdenk/spark-testing-base
* https://github.com/MrPowers/spark-fast-tests
====== Tuning ======