Unless noted otherwise, code is tested with Spark 2.2
Quora Q/A: Why are there two ML implementations in Spark?
Stack Overflow led me to a blog entry that did not work for me, although its script is said to be platform-agnostic (YMMV).
I used the manual process described here.
sudo apt install influxdb
You can manage the service with
sudo service influxdb stop
sudo service influxdb start
Install profiling utility
Build or download the jar from https://github.com/etsy/statsd-jvm-profiler. I tested successfully with version 2.1.0.
Download this Perl script (flamegraph.pl).
Download this Python script (influxdb_dump.py) into the same directory.
Set some variables:
duser=myuser
local_ip=$(hostname -s)
port=48081
influx_uri=http://${local_ip}:$port
flaminggraph_installation=/path/to/flamegraph/
MAINCLASS=my.Mainclass
Set up InfluxDB: create the database and the user
curl -sS -X POST $influx_uri/query --data-urlencode "q=DROP DATABASE $duser" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE DATABASE $duser" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE USER $duser WITH PASSWORD '$duser' WITH ALL PRIVILEGES" # >/dev/null
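To check what a statement will look like before sending it, the curl calls above can be wrapped in a small helper. This is only a sketch: `influx_query` and `DRY_RUN` are hypothetical names, and the URI is a placeholder for your own `$influx_uri`. Depending on the InfluxDB version, read-only statements such as `SHOW DATABASES` may need GET (`curl -G`) instead of POST.

```shell
# Hypothetical wrapper around the /query endpoint used above.
# With DRY_RUN=1 it only prints the request instead of contacting the server.
influx_query() {
  local uri="$1" statement="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST $uri/query q=$statement"
  else
    curl -sS -X POST "$uri/query" --data-urlencode "q=$statement"
  fi
}

# Preview the request; drop DRY_RUN=1 to send it to a running InfluxDB.
DRY_RUN=1 influx_query "http://localhost:48081" "SHOW DATABASES"
```

If the database was created, its name should appear in the JSON that the real call returns.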
Add the following lines to your spark-submit command (modify as needed):
1. the db connection configuration
--conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=your.ip.or.hostname,port=$port,reporter=InfluxDBReporter,database=$duser,username=$duser,password=$duser,prefix=sparkapp,tagMapping=spark" \
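2. the profiler jar ($PROFJAR is the path to the statsd-jvm-profiler jar; list it alongside any other jars your job needs)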
--jars $GEOMESA_JAR,$PROFJAR \
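The agent option from step 1 is long and easy to mistype, so it may help to assemble it in one place. A minimal sketch, assuming the database name, user, and password are all the same value as on this page; `build_agent_opts` and its argument order are hypothetical, and the jar name, host, and port are placeholders.

```shell
# Hypothetical helper: builds the -javaagent option from its parts.
# Arguments: profiler jar, InfluxDB host, InfluxDB port, database/user/password.
build_agent_opts() {
  local jar="$1" host="$2" port="$3" db="$4"
  printf '%s\n' "-javaagent:${jar}=server=${host},port=${port},reporter=InfluxDBReporter,database=${db},username=${db},password=${db},prefix=sparkapp,tagMapping=spark"
}

# Placeholder values; replace them with your own settings.
agent_opts=$(build_agent_opts statsd-jvm-profiler.jar localhost 48081 myuser)

# Print the option in the form expected on the spark-submit command line.
echo "--conf \"spark.executor.extraJavaOptions=${agent_opts}\""
```

The `prefix` and `tagMapping` values set here are the ones the dump script is later called with (`-e sparkapp -t spark`), so keep them in sync if you change either.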
When the job has finished, dump your stacktraces:
python2.7 $flaminggraph_installation/influxdb_dump.py -o $local_ip -r $port -u $duser -p $duser -d $duser -t spark -e sparkapp -x stack_traces
You can filter/exclude specific classes by adding
-f /path/to/filterfile
Your filterfile must contain lines with classnames to filter, e.g.
sun.nio
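A filter file can be written with a here-document. In this sketch only `sun.nio` comes from the example above; `io.netty` is an illustrative extra entry, and the temporary file location is an arbitrary choice.

```shell
# Create a filter file with one entry per line.
# "sun.nio" is from the text above; "io.netty" is only an illustrative extra.
filterfile=$(mktemp)
cat > "$filterfile" <<'EOF'
sun.nio
io.netty
EOF

# Show where the file was written so it can be passed to the dump script.
echo "filter file: $filterfile"
```

Pass the file to the dump script by appending -f "$filterfile" to the influxdb_dump.py call.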
Now you can create your flamegraph:
perl $flaminggraph_installation/flamegraph.pl --title "$MAINCLASS" stack_traces/all_*.txt > flamegraph.svg