====== Apache Kafka ======

[[https://kafka.apache.org/intro|Apache Kafka]] is a fast, scalable, durable, and fault-tolerant [[https://hortonworks.com/apache/kafka/#section_2|publish-subscribe messaging system]]. Kafka runs as a cluster that stores streams of **records** in categories called topics. A **topic** is a category or feed name to which records are published.

Kafka has four core APIs:

  * **Producers** publish a stream of records to one or more Kafka topics.
  * **Consumers** subscribe to one or more topics and process the stream of records produced to them.
  * **Stream** processors allow an application to consume input streams and produce output streams.
  * **Connectors** are producers or consumers that connect Kafka topics to existing applications or data systems.

More information about Kafka's [[https://kafka.apache.org/documentation/#design|design philosophy]] and [[https://kafka.apache.org/uses|use cases]].

Implementation examples for [[https://dzone.com/articles/kafka-producer-and-consumer-example|Producers and Consumers]] and [[https://dzone.com/articles/live-dashboard-using-apache-kafka-and-spring-webso|Kafka with Spring WebSockets]].

====== Apache Storm ======

[[https://storm.apache.org/|Storm]]

====== Apache Flume ======

[[https://github.com/apache/flume/tree/flume-1.5|Version 1.5 repository]] for looking into implementation details.

===== Agents =====

[[https://flume.apache.org/releases/content/1.5.0/FlumeUserGuide.html#a-simple-example|Simple example]]

===== Custom source =====

Example for a [[https://flume.apache.org/releases//content/1.5.0/FlumeDeveloperGuide.html#source|PollableSource]].

==== EventDrivenSource ====

A tiny example for an EventDrivenSource that

  * implements Configurable: configure(Context ctx)
  * implements EventDrivenSource
  * overrides start() and stop()
  * is based on WebsocketClient, an [[https://docs.oracle.com/javaee/7/tutorial/websocket004.htm|Endpoint]] implementation.

<code java>
// WebsocketClient and MessageHandler are our own helper classes (a javax.websocket
// Endpoint implementation, see the linked tutorial); Json/JsonValue are assumed to
// come from minimal-json (com.eclipsesource.json).
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

import javax.websocket.DeploymentException;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.FlumeException;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import com.eclipsesource.json.Json;
import com.eclipsesource.json.JsonValue;

public class CustomSource extends AbstractSource implements Configurable, EventDrivenSource {

  private WebsocketClient clientEndPoint;

  @Override
  public void configure(Context ctx) {
    try {
      String endpoint = ctx.getString("endpoint");
      String user = ctx.getString("user");
      String pw = ctx.getString("password");
      String subscribe = ctx.getString("subscriptionString");
      clientEndPoint = new WebsocketClient(new URI(endpoint), user, pw, subscribe);
    } catch (DeploymentException | IOException | URISyntaxException e) {
      // fail fast instead of continuing with a null client
      throw new FlumeException("Could not connect to websocket endpoint", e);
    }
    clientEndPoint.addMessageHandler(new MessageHandler() {
      public void handleMessage(String message) {
        // wrap every websocket message into a Flume event and hand it to the channel(s)
        ChannelProcessor processor = getChannelProcessor();
        Event event = new SimpleEvent();
        JsonValue value = Json.parse(message);
        event.setHeaders(createEventHeaders(value));
        event.setBody(message.getBytes(StandardCharsets.UTF_8));
        processor.processEvent(event);
      }

      private Map<String, String> createEventHeaders(JsonValue value) {
        // expose the object id as an event header, e.g. for use in HDFS sink file names
        Map<String, String> headers = new TreeMap<>();
        String oid = value.asObject().get("payload").asObject().get("id").asString();
        headers.put("oid", oid);
        return headers;
      }
    });
  }

  @Override
  public void start() {
    super.start();
  }

  @Override
  public void stop() {
    super.stop();
  }
}
</code>
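To wire this source into an agent, Flume refers to custom sources by their fully qualified class name and passes the parameters read in configure() as context properties. A minimal sketch of the source portion only (agent name **ais** reused from the HDFS sink examples below; source and channel names, package, and all values are placeholders):

<code>
ais.sources = ws1

# custom sources are referenced by their fully qualified class name (package is assumed)
ais.sources.ws1.type = org.example.flume.CustomSource
ais.sources.ws1.endpoint = wss://stream.example.org/feed
ais.sources.ws1.user = someuser
ais.sources.ws1.password = somepassword
ais.sources.ws1.subscriptionString = {"subscribe":"positions"}
ais.sources.ws1.channels = c1
</code>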
===== HDFS sinks =====

[[https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/ds_flume/FlumeUserGuide.html#hdfs-sink|HDFS sinks]] can produce different file formats: SequenceFile, DataStream or CompressedStream.

I used human-readable SequenceFiles

<code>
ais.sinks.hdfs1.hdfs.fileType = SequenceFile
ais.sinks.hdfs1.hdfs.writeFormat = Text
</code>

which results in:

<code>
1552319652676   {"id":"219014435","position":[11.858878135681152,57.645938873291016]}
1552319652684   {"id":"205484390","position":[3.687695026397705,51.09416961669922]}
</code>

The key (the first part of each line) is, by default, the event header timestamp (read e.g. [[https://stackoverflow.com/questions/32440926/flume-how-to-create-a-custom-key-for-a-hdfs-sequencefile|here for custom keys]]).

If a **key is not required**, use a DataStream instead:

<code>
ais.sinks.hdfs1.hdfs.fileType = DataStream
ais.sinks.hdfs1.hdfs.writeFormat = Text
</code>

Using a CompressedStream saves storage space:

<code>
ais.sinks.hdfs1.hdfs.fileType = CompressedStream
ais.sinks.hdfs1.hdfs.codeC = gzip
ais.sinks.hdfs1.hdfs.writeFormat = Text
</code>

Creating a more customised file name is possible via event header items. As [[bigdata#eventdrivensource|implemented above]], the header contains an object id called "oid" that can be used in the agent configuration:

<code>
ais.sinks.hdfs1.hdfs.filePrefix = %{oid}
</code>

===== Troubleshooting =====

==== ChannelFullException: Space for commit to queue couldn't be acquired ====

I tried to write to a [[https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/ds_flume/FlumeUserGuide.html#hdfs-sink|HDFS sink]] and got this exception in the logs:

<code>
org.apache.flume.ChannelException: Unable to put event on required channel: org.apache.flume.channel.MemoryChannel{name: c1}
        at org.apache.flume.channel.ChannelProcessor.processEvent(ChannelProcessor.java:275)
        ...
Caused by: org.apache.flume.ChannelFullException: Space for commit to queue couldn't be acquired. Sinks are likely not keeping up with sources, or the buffer size is too tight
        at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:132)
        at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
        at org.apache.flume.channel.ChannelProcessor.processEvent(ChannelProcessor.java:267)
        ... 33 more
</code>

After researching, increasing [[https://bigdatadiscussionspr.blogspot.com/|JVM_OPTS]], and unsuccessfully tuning the channel parameters, the problem turned out to be that the flume user was **not allowed to write** to the hdfs.path.
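Since the root cause was a missing write permission rather than channel sizing, it is worth checking the ownership of the configured hdfs.path early on. A sketch with placeholder path and group (adjust to your cluster; if Ranger controls HDFS authorisation, its policies must allow the write as well):

<code bash>
# inspect owner, group and permission bits of the sink's target directory (placeholder path)
hdfs dfs -ls -d /data/ais

# as the HDFS superuser, hand the directory over to the flume user (group name is an assumption)
sudo -u hdfs hdfs dfs -chown -R flume:hadoop /data/ais
</code>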
===== Installing plugins in Ambari =====

[[https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/ds_flume/FlumeUserGuide.html#the-plugins-d-directory|Flume plugins]] have their own dedicated directory. In Ambari it's:

<code>
mkdir /usr/hdp/current/flume-server/plugins.d/
</code>

Dependencies go into a subdirectory:

<code>
plugins.d/
plugins.d/MySource/
plugins.d/MySource/lib/my-source.jar
plugins.d/MySource/libext/a-dependency-6.6.6.jar
</code>

or provide a fat jar instead.

===== Managing agents in Ambari =====

[[https://developer.ibm.com/hadoop/2017/01/06/start-flume-agents-using-ambari-web-interface/|Flume in Ambari]]

====== GeoMesa ======

===== Simple Feature Types =====

[[https://www.geomesa.org/documentation/user/cli/sfts.html?|SFTs]]

===== Ingesting into a table =====

General advice for developing converters:

  * use the **convert** command (faster, doesn't modify tables) instead of ingest; see the sketch after the ingest example below
  * use a small subset of your data
  * use --run-mode local
  * enable debugging in conf/log4j.properties: org.locationtech.geomesa.convert=debug
  * set the error mode to 'raise-errors' in the converter options: https://www.geomesa.org/documentation/user/convert/parsing_and_validation.html#error-mode

==== CSV ====

The [[https://www.geomesa.org/documentation/user/convert/delimited_text.html|CSV converter docs]] contain a lot of useful information.

==== JSON ====

The [[https://www.geomesa.org/documentation/user/convert/json.html|JSON converter docs]] already explain a lot. To ingest a ship data file that contains **multiple** JSON documents:

<code>
{"payload":{"mmsi":"11111","position":[5.061738442342835,52.4892342343],"SOG":0.0,"COG":0.0,"navStatus":"engaged in fishing","recordedAt":"2019-01-14T00:00:00.072Z"}}
{"payload":{"mmsi":"66666","position":[10.869939804077148,59.077205657958984],"SOG":0.0,"COG":290.3,"navStatus":"engaged in fishing","recordedAt":"2019-03-14T00:00:02.084Z"}}
{"payload":{"mmsi":"88888","position":[30.880006790161133,70.88723754882812],"SOG":4.9,"COG":232.6,"navStatus":"engaged in fishing","recordedAt":"2019-03-14T00:00:02.103Z"}}
</code>

you can use this converter definition:

<code>
cat conf/sfts/shipdata/reference.conf

geomesa {
  sfts {
    shipdata = {
      fields = [
        { name = "mmsi",       type = "Integer", index = "full" }
        { name = "geom",       type = "Point",   srid = 4326, index = true }
        { name = "recordedAt", type = "Date",    index = "full" }
        { name = "navStatus",  type = "String" }
        { name = "sog",        type = "Double" }
        { name = "cog",        type = "Double" }
      ]
      user-data = {
        geomesa.table.sharing = "false"
      }
    }
  }
  converters {
    shipdata = {
      type = "json"
      id-field = "concatenate($mmsi,$recordedAt)"
      options {
        error-mode = "raise-errors"
      }
      fields = [
        { name = "mmsi",       json-type = "string", path = "$.payload.mmsi",       transform = "$0::int" }
        { name = "lon",        json-type = "double", path = "$.payload.position[0]" }
        { name = "lat",        json-type = "double", path = "$.payload.position[1]" }
        { name = "geom",       transform = "point($lon, $lat)" }
        { name = "recordedAt", json-type = "string", path = "$.payload.recordedAt", transform = "dateTime($0)" }
        { name = "navStatus",  json-type = "string", path = "$.payload.navStatus" }
        { name = "sog",        json-type = "double", path = "$.payload.SOG" }
        { name = "cog",        json-type = "double", path = "$.payload.COG" }
      ]
    }
  }
}
</code>

Note:

  * the **feature-path** parameter is optional and not needed for our simple documents
  * to create points from a JSON array, an intermediate transformation step extracts **lat** and **lon**

The JSON files can be gzipped and can reside in HDFS:

<code>
ingest -c geomesa.shipdata -C shipdata -s shipdata --user xx --password xx --zookeepers a,b,c --input-format json --instance myinstance --run-mode local hdfs://server:port/ship/archive/2019-01.gz
</code>
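As mentioned in the converter advice above, the **convert** command can exercise the converter without writing to any table. A rough sketch of such a dry run on a small local sample (sample.json is a placeholder; the flags are assumed to mirror the ingest command and may differ between GeoMesa versions, so check the tools' built-in help):

<code bash>
# parse a small local sample with the shipdata converter and SFT without ingesting anything
convert -C shipdata -s shipdata --input-format json sample.json
</code>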
==== Troubleshooting ====

The [[https://gitter.im/locationtech/geomesa|chat]] is a helpful place to deposit questions.

====== Hortonworks ======

[[https://community.hortonworks.com/questions/80005/confusion-with-hdf-and-hdp-and-their-capabilities.html|HDP vs HDF]]

====== Hortonworks Data Platform (HDP) ======

===== Alerts =====

https://community.hortonworks.com/questions/140101/what-does-hdfs-storage-capacity-usage-weekly-alert.html

===== Logging =====

[[https://www.ibm.com/support/knowledgecenter/en/SSC2HF_3.0.1/insurance/t_hdfs_audit_logging.html|hdfs audit log]]

===== Time series =====

==== Libraries ====

  * [[https://github.com/twosigma/flint|twosigma flint]] (Python only)
  * [[https://github.com/sryza/spark-timeseries|spark-timeseries]] (no longer active; the developer [[https://groups.google.com/d/msg/spark-ts/iIhJBgQuhr8/bpzgj1kHBAAJ|recommends flint]])

====== Data protection ======

===== HDFS =====

hdfs-audit.log files

====== Clouds ======

https://aws.amazon.com/pricing/

https://www.cloudera.com/products/pricing.html

https://azure.microsoft.com/en-us/pricing/

====== Trainings and Certifications ======

From [[https://www.simplilearn.com/top-big-data-certifications-article|here]] and [[https://www.cio.com/article/3222879/15-data-science-certifications-that-will-pay-off.html|here]].

===== Cloudera (has swallowed Hortonworks) =====

[[https://www.computerworld.com/article/3428025/everything-you-need-to-know-about-the-new-cloudera-data-platform--vision--migration-and-roadmap.html|Roadmap for merging HDP into Cloudera]]

==== HDP Spark Developer DEV-343 ====

[[https://www.cloudera.com/about/training/courses/hdp-spark-developer.html#?classType=virtual|HDP Spark Developer DEV-343]]: DataFrames/Datasets/RDDs, shuffling, transformations and performance tuning, streaming

  * dataOps, analysts
  * 4 days
  * 3200 $ virtual classroom, or
  * 3500 € in Paris

==== HDP Data Science SCI-241 ====

[[https://www.cloudera.com/about/training/courses/applying-data-science-using-apache-hadoop.html|HDP Data Science SCI-241]]: key concepts like neural networks, classification, regression and clustering; tools and frameworks like Spark MLlib and TensorFlow

  * analysts, scientists
  * 4 days
  * on request

==== Cloudera Certified Associate (CCA) ====

The [[https://www.cloudera.com/about/training/certification.html|Cloudera Certified Associate (CCA)]] certification consists of 3 parts:

  * [[https://university.cloudera.com/content/cca175|CCA Spark and Hadoop Developer exam]]
  * [[https://www.cloudera.com/about/training/certification/cca-data-analyst.html|CCA Data Analyst exam]]
  * [[https://www.cloudera.com/about/training/certification/cca-admin.html|CCA Administrator exam]]

Each exam costs $295. [[https://www.cloudera.com/about/training/course-listing.html#?course=all|Course locations]]

===== Coursera =====

https://www.coursera.org/courses?query=apache%20spark

===== DataBricks =====

https://academy.databricks.com/catalog

[[https://academy.databricks.com/instructor-led-training/DB301|DB 301 - Apache Spark™ for Machine Learning and Data Science]]

  * 3 days
  * virtual classroom
  * $2500

Short, self-paced courses based on AWS or Azure: $75. [[https://academy.databricks.com/category/certifications|Certifications]]

===== Dell =====

[[https://education.dellemc.com/index_downtime.htm|Site down]]; example: [[https://education.dellemc.com/content/dam/dell-emc/documents/en-us/E20_065_Advanced_Analytics_Specialist_Exam.pdf|Specialist - Data Scientist, Advanced Analytics Version 1.0]]

===== Google =====

===== Heinlein =====

[[https://www.heinlein-support.de/schulung/big-data-mit-hadoop|Big Data mit Hadoop]]: introduction, overview, first steps

  * analysts, dataOps
  * 3 days
  * 2000 €
  * Berlin

===== IBM =====
[[https://www.coursera.org/professional-certificates/ibm-data-science|Data Science Professional Certificate]]

[[https://www.coursera.org/specializations/advanced-data-science-ibm|Advanced Data Science with IBM Specialization]]

===== Hewlett Packard Enterprise (has swallowed MapR) =====

([[https://www.hpe.com/us/en/newsroom/press-release/2019/08/hpe-advances-its-intelligent-data-platform-with-acquisition-of-mapr-business-assets.html|merger announcement]])

[[https://mapr.com/training/certification/|"The MapR Academy Certification Program is closed to new registration as we work to update the exams."]]

===== Microsoft Azure =====

===== SimpliLearn =====

[[https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training|Big Data Hadoop Certification Training Course]], aligned to the Cloudera [[https://www.cloudera.com/about/training/certification/cca-spark.html|CCA175]] certification exam

====== Meetups ======

https://www.meetup.com/Wien-Cloud-Computing-Meetup/

https://www.meetup.com/Big-Data-Vienna/

https://www.meetup.com/Vienna-Data-Science-Group-Meetup/

https://www.ocg.at/cloud-computing-big-data