Unless noted otherwise, code is tested with Spark **2.2**.

====== Non-committal testdrive ======

The minimum-effort way to test-drive Spark is the
[[https://databricks.com/spark/getting-started-with-apache-spark/quick-start#overview|Databricks tutorial]] (no local setup required).


====== Machine learning ======

Quora Q/A: [[https://www.quora.com/Why-are-there-two-ML-implementations-in-Spark-ML-and-MLlib-and-what-are-their-different-features|Why are there two ML implementations in Spark?]]

  * ''spark.mllib'' contains the original API, built on top of RDDs.
  * ''spark.ml'' provides a higher-level API, built on top of DataFrames, for constructing ML pipelines.


====== Profiling ======


A [[https://stackoverflow.com/questions/30900104/profiling-a-scala-spark-application|Stack Overflow question]] led me to
[[https://spektom.blogspot.com/2017/06/profiling-spark-applications-easy-way.html|a blog entry]] describing a script that is said to be platform-agnostic. It did not work out for me, YMMV.

These notes are based on the manual process described [[https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/|here]].

===== Installation ===== 

The components are listed in the order in which they are used later on.

==== influxdb ====

<code>sudo apt install influxdb</code>

You can manage the service with
<code>
sudo service influxdb stop
sudo service influxdb start
</code>
    
    
==== statsd ====


Build or download the jar from [[https://github.com/etsy/statsd-jvm-profiler|statsd-jvm-profiler]].
I tested successfully with version 2.1.0.
    

==== stacktrace export utility ====

Download this [[https://github.com/aviemzur/statsd-jvm-profiler/blob/master/visualization/influxdb_dump.py|Python script]].


==== flamegraph ====

Download this [[https://github.com/brendangregg/FlameGraph/blob/master/flamegraph.pl|Perl script]].


===== Prepare =====


Set some variables:
<code>
duser=myuser   # used below as InfluxDB database name, user name and password
local_ip=$(hostname -s)
port=48081
influx_uri=http://${local_ip}:$port
flaminggraph_installation=/path/to/flamegraph/
MAINCLASS=my.Mainclass
</code>
    
    
Set up InfluxDB: create the database and a user with a password. Note that the first statement drops any existing database of that name.
<code>    
curl -sS -X POST $influx_uri/query --data-urlencode "q=DROP DATABASE $duser" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE DATABASE $duser" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE USER $duser WITH PASSWORD '$duser' WITH ALL PRIVILEGES" # >/dev/null
 </code>
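
The three calls above differ only in the query string, so they can be wrapped in a small helper. This is a minimal sketch: the function names ''influx_query'' and ''setup_profiler_db'' are made up for this page, and the ''curl'' invocation is the same one used above.

```shell
# influx_query URI QUERY: POST one InfluxQL statement to the /query endpoint.
# Hypothetical helper name; the curl options are the ones used on this page.
influx_query() {
    local uri=$1 query=$2
    curl -sS -X POST "$uri/query" --data-urlencode "q=$query"
}

# setup_profiler_db URI NAME: drop and recreate the database, then create a user.
# NAME serves as database name, user name and password, as in the notes above.
setup_profiler_db() {
    local uri=$1 name=$2
    influx_query "$uri" "DROP DATABASE $name"
    influx_query "$uri" "CREATE DATABASE $name"
    influx_query "$uri" "CREATE USER $name WITH PASSWORD '$name' WITH ALL PRIVILEGES"
}
```

With the variables set as above, it would be called as ''setup_profiler_db "$influx_uri" "$duser"''.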


Add the following lines to your ''spark-submit'' call (modify as needed):

1. the DB connection configuration
<code>
--conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=your.ip.or.hostname,port=$port,reporter=InfluxDBReporter,database=$duser,username=$duser,password=$duser,prefix=sparkapp,tagMapping=spark" \
</code>

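2. the profiler jar (here ''$PROFJAR'' is the path to the statsd-jvm-profiler jar; ''$GEOMESA_JAR'' stands for any other jar your application needs)
<code>
--jars $GEOMESA_JAR,$PROFJAR \
</code>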
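
To avoid typos in the long ''-javaagent'' string, the two options can be generated by a function. The following is only a sketch: ''profiler_agent_opts'' and ''submit_profiled'' are hypothetical names, and the wrapper assumes the profiler jar is the only jar passed via ''--jars''.

```shell
# profiler_agent_opts SERVER PORT NAME: print the -javaagent option string.
# NAME is used as database, user name and password, as elsewhere on this page.
profiler_agent_opts() {
    local server=$1 port=$2 name=$3
    printf '%s\n' "-javaagent:statsd-jvm-profiler.jar=server=$server,port=$port,reporter=InfluxDBReporter,database=$name,username=$name,password=$name,prefix=sparkapp,tagMapping=spark"
}

# submit_profiled SERVER PORT NAME PROFJAR [spark-submit args...]
# Hypothetical wrapper: adds the profiler options, then passes all remaining
# arguments (class, application jar, ...) to spark-submit unchanged.
submit_profiled() {
    local server=$1 port=$2 name=$3 profjar=$4
    shift 4
    spark-submit \
        --conf "spark.executor.extraJavaOptions=$(profiler_agent_opts "$server" "$port" "$name")" \
        --jars "$profjar" \
        "$@"
}
```

A call could then look like ''submit_profiled "$local_ip" "$port" "$duser" "$PROFJAR" --class "$MAINCLASS" /path/to/app.jar'', where ''/path/to/app.jar'' is a placeholder for your application jar.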


===== Profile your code =====

  - Ensure InfluxDB is up and running.
  - Submit your job.
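
The first step can be scripted. This sketch polls the ''/ping'' endpoint of InfluxDB 1.x, which answers with HTTP 204 when the server is up; the function names and the default of 10 attempts are arbitrary choices for illustration.

```shell
# influx_is_up URI: succeed if the InfluxDB /ping endpoint answers without an HTTP error.
influx_is_up() {
    curl -s -o /dev/null -f "$1/ping"
}

# wait_for_influx URI [TRIES]: poll once per second, give up after TRIES attempts.
wait_for_influx() {
    local uri=$1 tries=${2:-10} i=0
    while [ "$i" -lt "$tries" ]; do
        influx_is_up "$uri" && return 0
        i=$((i + 1))
        sleep 1
    done
    echo "InfluxDB not reachable at $uri" >&2
    return 1
}
```

Calling ''wait_for_influx "$influx_uri" || exit 1'' before ''spark-submit'' stops the run early if the database is down.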

When the job has finished, dump your stacktraces:


<code>python2.7 $flaminggraph_installation/influxdb_dump.py -o $local_ip -r $port -u $duser -p $duser -d $duser -t spark -e sparkapp -x stack_traces</code>

You can **filter** out (exclude) specific classes by adding the option

<code>-f /path/to/filterfile</code>

The filter file must contain the class names to filter, one per line, e.g.
<code>sun.nio</code>
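
A filter file with several entries can be written in one go. A small sketch: the helper name ''write_filterfile'' and the second entry ''io.netty'' are only illustrative assumptions.

```shell
# write_filterfile FILE ENTRY...: write one filter entry per line to FILE,
# overwriting any previous content.
write_filterfile() {
    local file=$1
    shift
    printf '%s\n' "$@" > "$file"
}
```

For example, ''write_filterfile /tmp/filterfile sun.nio io.netty'' creates a two-line file that can be passed as ''-f /tmp/filterfile''.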

Now you can create your flamegraph 
<code>perl $flaminggraph_installation/flamegraph.pl --title "$MAINCLASS" stack_traces/all_*.txt > flamegraph.svg</code>

and open it, e.g. in Firefox.

The flamegraph is interactive: click a cell to investigate it.


Read more [[http://www.brendangregg.com/flamegraphs.html|here]].

{{:fg.png|}}

====== Submitting jobs ======

===== Providing Spark jars =====
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations

Download the required version [[https://spark.apache.org/downloads.html|here]].

How to set up the provided jars (found [[https://mapr.com/docs/60/Spark/ConfigureSparkJARLocation_2.0.1.html|here]]):

<code bash>
cd /opt/spark-2.2.0-bin-hadoop2.7/jars
# pack all Spark jars into one archive
zip /opt/spark-2.2.0-bin-hadoop2.7/spark220-jars.zip ./*
# and then copy the archive to your HDFS
hdfs dfs -put /opt/spark-2.2.0-bin-hadoop2.7/spark220-jars.zip /user/hdfs/
</code>


Then you can use the provided archive by adding the following to ''spark-submit'':

<code>--conf spark.yarn.archive=hdfs:///user/hdfs/spark220-jars.zip</code>

====== Testing ======

<todo>Look into spark-testing-base and spark-fast-tests</todo>

  * https://github.com/holdenk/spark-testing-base
  * https://github.com/MrPowers/spark-fast-tests


====== Tuning ======