Machine learning

Quora Q/A: Why are there two ML implementations in Spark?

spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

Profiling

Stack Overflow led me to a blog entry with a script that is said to be platform-agnostic; it did not work for me, but YMMV.

I used the manual process described here.

Installation

influxdb

    sudo apt install influxdb

You can manage the service with

    sudo service influxdb stop
    sudo service influxdb start

statsd

Install profiling utility

Build or download the jar from https://github.com/etsy/statsd-jvm-profiler (I tested successfully with version 2.1.0).

flamegraph

Download this Perl script (flamegraph.pl).

stacktrace export utility

Download this Python script (influxdb_dump.py).

Prepare

Set some variables:

db_user=myuser
local_ip=$(hostname -s)
port=48081
influx_uri=http://${local_ip}:$port
flaminggraph_installation=/path/to/flamegraph/
MAINCLASS=my.Mainclass

Set up influxdb: create the database and a user with a password

curl -sS -X POST $influx_uri/query --data-urlencode "q=DROP DATABASE $db_user" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE DATABASE $db_user" # >/dev/null
curl -sS -X POST $influx_uri/query --data-urlencode "q=CREATE USER $db_user WITH PASSWORD '$db_user' WITH ALL PRIVILEGES" # >/dev/null

Add the following lines to your spark-submit command (modify as desired):

1. the db connection configuration

--conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=your.ip.or.hostname,port=$port,reporter=InfluxDBReporter,database=$db_user,username=$db_user,password=$db_user,prefix=sparkapp,tagMapping=spark" \

2. the profiler jar (here $PROFJAR is assumed to hold the path to the statsd-jvm-profiler jar; $GEOMESA_JAR stands for any other jar your job needs)

--jars $GEOMESA_JAR,$PROFJAR \
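
As a rough sketch, the two options above can be assembled from shell variables so that the database name and credentials stay consistent. The values of `server` and `PROFJAR` are placeholders assumed here for illustration; the variable names `agent_jar` and `agent_opts` are not part of any tool.

```shell
# Placeholder values (assumptions); replace them with your own.
db_user=myuser
port=48081
server=your.ip.or.hostname
PROFJAR=/path/to/statsd-jvm-profiler.jar

# The agent is referenced by its bare file name, as in the --conf line above.
agent_jar=$(basename "$PROFJAR")

# Build the -javaagent option once from the variables.
agent_opts="-javaagent:${agent_jar}=server=${server},port=${port},reporter=InfluxDBReporter,database=${db_user},username=${db_user},password=${db_user},prefix=sparkapp,tagMapping=spark"

# Print the resulting options for inspection before adding them to your submit.
echo "--conf \"spark.executor.extraJavaOptions=${agent_opts}\""
echo "--jars ${PROFJAR}"
```

Printing the options first makes it easy to spot a missing or misspelled variable before the job is submitted.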

Profile your code

Ensure influxdb is up and running.
Submit your job.
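
One way to check that influxdb is reachable before submitting is to query its `/ping` endpoint, which InfluxDB 1.x answers with HTTP status 204. This is only a minimal sketch; the function name `influx_is_up` and the fallback URI are assumptions made here.

```shell
# Hypothetical helper: succeeds if the server at the given URI answers /ping with 204.
influx_is_up() {
    status=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1/ping" 2>/dev/null)
    [ "$status" = "204" ]
}

# Use the URI from the Prepare step; the fallback value is only a placeholder.
influx_uri=${influx_uri:-http://localhost:48081}

if influx_is_up "$influx_uri"; then
    echo "influxdb is reachable at $influx_uri"
else
    echo "influxdb is not reachable at $influx_uri, start it before submitting"
fi
```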

When the job has finished, dump your stacktraces:

python2.7 $flaminggraph_installation/influxdb_dump.py -o $local_ip -r $port -u $db_user -p $db_user -d $db_user -t spark -e sparkapp -x stack_traces

You can filter/exclude specific classes by adding

 -f /path/to/filterfile

Your filterfile must contain lines with classnames to filter, e.g.

sun.nio
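
A filter file can be written from the shell, one entry per line. In this sketch the file location and the second entry are illustrative assumptions; only `sun.nio` comes from the example above.

```shell
# Hypothetical location for the filter file.
filterfile=${TMPDIR:-/tmp}/stack_filter.txt

# One entry per line; java.util.concurrent is only an illustrative second entry.
printf '%s\n' 'sun.nio' 'java.util.concurrent' > "$filterfile"

# Show the result; pass the file to influxdb_dump.py with -f "$filterfile".
cat "$filterfile"
```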

Now you can create your flamegraph:

perl $flaminggraph_installation/flamegraph.pl --title "$MAINCLASS" stack_traces/all_*.txt > flamegraph.svg

Small heaps of code
