Developer Blog

Tips and tricks for developers and IT enthusiasts

Apache Spark | Getting started

Apache Spark is a cluster computing framework designed for fast computation. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.

This post is an extract from a brief tutorial that explains the basics of Spark Core programming.

Environment / Requirements

Installation on Mac OS X

Check or install Java

$ java -version
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

Check or install Scala

$ brew install scala
$ scala -version
Scala code runner version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.

Check or install Apache Spark

Setup environment in .bashrc

export PATH="$PATH:$SPARK_HOME/bin"
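
The PATH entry above only works if SPARK_HOME is set first. A minimal sketch of the two lines together, assuming Spark was unpacked to $HOME/spark (a hypothetical location, adjust it to your installation):

```shell
# Hypothetical install location: change this to wherever Spark was unpacked.
export SPARK_HOME="$HOME/spark"

# Make spark-shell, spark-submit and friends available on the command line.
export PATH="$PATH:$SPARK_HOME/bin"
```

After opening a new shell (or sourcing .bashrc), `echo $SPARK_HOME` should print the chosen directory.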

Installation on Ubuntu

Prepare Ubuntu

$ apt update
$ apt upgrade
$ apt-get install openjdk-8-jdk
$ java -version

Links and Resources

Jekyll | Build a Jekyll Template based on Bootstrap 4

TL;DR

Combine two amazing open source tools: Jekyll and Bootstrap. The final template is here.

Bootstrap Template and Jekyll: two powerful tools

Start Point

Because I wanted to learn about and work with Bootstrap, I decided to build a Jekyll template so that I could generate the site's pages from data rather than from hand-written HTML.

Asking Google for initial inspiration led me to this wonderful blog entry:

Choose a Bootstrap Template

Quite nice. So I decided to use one of the free templates from Start Bootstrap: Modern Business.

When I downloaded the template from GitHub and examined its content, I found that for each component (Pricing, Services, Contact) there is a corresponding HTML file with all the content and all the formatting code:

  • about.html
  • blog-home-1.html
  • blog-home-2.html
  • blog-post.html
  • contact.html
  • faq.html
  • full-width.html
  • index.html
  • portfolio-1-col.html
  • portfolio-2-col.html
  • portfolio-3-col.html
  • portfolio-4-col.html
  • portfolio-item.html
  • pricing.html
  • services.html
  • sidebar.html

The Plan

My plan was to separate the presentation layer (what you will see) from the business layer (what creates the content for the presentation layer).

To achieve this with Jekyll, I converted the Bootstrap pages to Jekyll include pages. The final result should look like this:

The front page for the component

The Jekyll include file with the component

---
layout: page
title: Services
---
<div class="container">
    <h1 class="mt-4 mb-3">{{ page.title }}</h1>
    
</div>

<h2>Services: {{ site.services.title }}</h2>

<!-- Image Header -->
<img class="img-fluid rounded mb-4" src="{{ images }}/header.jpg" alt="">

<div class="row">
    {% for item in site.services %}
        <div class="col-lg-4 mb-4">
            <div class="card h-100">
                <h4 class="card-header">{{ item.title }}</h4>
                <div class="card-body">
                    <p class="card-text">{{ item.text | markdownify }}</p>
                </div>
                <div class="card-footer">
                    <a href="#" class="btn btn-primary">Learn More</a>
                </div>
            </div>
        </div>
    {% endfor %}
</div>

The next step was to convert every Bootstrap template page to a Jekyll include file:

  • About Page
  • FAQ Page
  • Portfolio Page with 1 Column
  • Portfolio Page with 2 Columns
  • Services Page
  • Pricing Page

The main challenge in separating the presentation layer from the business layer was where to place the data to be displayed.

Depending on the type of component, I chose one of three solutions:

  1. Place the data in the corresponding include file of the component
  2. Place the data in the page that calls the corresponding include file of the component
  3. Place the data in a Jekyll collection file

Data in corresponding include file of the component

I used this approach for components that are used only once on the website and have mostly static content, e.g. the FAQ page.

The component page

The frontend page

Data in the page that calls the corresponding include file of the component

I used this approach for components that are used more than once on the website, e.g. a blog post.

The component page

The frontend page

Data in a Jekyll collection file

I used this approach for components that are used only once on the website but need more configuration information, e.g. the Services or Portfolio page.

This step needs an additional configuration task: create the Jekyll Collections.

Jekyll collections are a great way to group related content like members of a team or talks at a conference.

To use a Collection you first need to define it in your _config.yml.

#
collections_dir: collections # folder, where collections files are stored
collections:
  services:
    title: "Services"
    output: true # store output files for each item under the collections folder

Then create the collection files, one file per item in your collection.

These files look like this:

---
img: 1.jpg
title: Development
subtitle: 
footer: 
text: Lorem ipsum dolor sit amet, consectetur adipisicing elit. Possimus aut mollitia eum ipsum fugiat odio officiis odit.
---
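
Creating such a file from the command line could look like the sketch below. It assumes the current directory is the Jekyll project root and that _config.yml sets collections_dir: collections as in the configuration above, so the services collection lives in collections/_services. The file name development.md is a made-up example; any name works.

```shell
# Assumption: run from the Jekyll project root, with
# `collections_dir: collections` set in _config.yml.
mkdir -p collections/_services

# "development.md" is a hypothetical file name for one collection item.
cat > collections/_services/development.md <<'EOF'
---
img: 1.jpg
title: Development
subtitle:
footer:
text: Lorem ipsum dolor sit amet, consectetur adipisicing elit.
---
EOF

# List the items that now make up the collection.
ls collections/_services
```

Repeating the heredoc with a different file name and title adds further items to the collection.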

The data in these files can be accessed in the Jekyll include file with this code fragment, which loops over all items of the collection (site.services):

    {% for item in site.services %}
        <div class="col-lg-4 mb-4">
            <div class="card h-100">
                <h4 class="card-header">{{ item.title }}</h4>
                <div class="card-body">
                    <p class="card-text">{{ item.text | markdownify }}</p>
                </div>
                <div class="card-footer">
                    <a href="#" class="btn btn-primary">Learn More</a>
                </div>
            </div>
        </div>
    {% endfor %}

The final result

Bootstrap Template and Jekyll: two powerful tools

Jekyll | Cookbook

Working with Arrays

Define the array

---
layout: post
title:  "Universe"
date:   2019-06-17 10:00:00
planets:
    - mercury
    - venus
    - earth
---

Access the array

    {% for planet in page.planets %}
    <a href="https://{{ planet }}.universe">{{ planet }}</a>
    {% endfor %}

Liquid

Links

https://github.com/Shopify/liquid/wiki/Liquid-for-Designers#optional-arguments

Code Snippets

for-loop-sorted-collection

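<ul>
    {% comment %} Assumption: the original collection name and sort key were lost; site.services and "title" are placeholders. {% endcomment %}
    {% assign sorted_items = site.services | sort: "title" %}
    {% for item in sorted_items %}
    <li>{{ item.title }}</li>
    {% endfor %}
</ul>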



Code Snippets and Recipes

https://gist.github.com/ryerh/b2fa73829f1b7b1c39988f09a65eb227

Learning | Path for Data Scientist

  • Portfolio
  • Python: pandas / NumPy / SciPy
  • Apache Spark
  • Apache Hadoop

Learning

Mathematics for Data Science

Linear Algebra

  1. Khan Academy Linear Algebra series (beginner friendly).
  2. Coding the Matrix course (and book).
  3. 3Blue1Brown Linear Algebra series.
  4. fast.ai Linear Algebra for coders course, highly related to modern ML workflow.
  5. First course in Coursera Mathematics for Machine Learning specialization.
  6. “Introduction to Applied Linear Algebra — Vectors, Matrices, and Least Squares” book.
  7. MIT Linear Algebra course, highly comprehensive.
  8. Stanford CS229 Linear Algebra review.

Calculus

  1. Khan Academy Calculus series (beginner friendly).
  2. 3Blue1Brown Calculus series.
  3. Second course in Coursera Mathematics for Machine Learning specialization.
  4. The Matrix Calculus You Need For Deep Learning paper.
  5. MIT Single Variable Calculus.
  6. MIT Multivariable Calculus.
  7. Stanford CS224n Differential Calculus review.

Statistics and Probability

  1. Khan Academy Statistics and probability series (beginner friendly).
  2. A visual introduction to probability and statistics, Seeing Theory.
  3. Intro to Descriptive Statistics from Udacity.
  4. Intro to Inferential Statistics from Udacity.
  5. Statistics with R Specialization from Coursera.
  6. Stanford CS229 Probability Theory review.

Bonus materials

  1. Part one of Deep Learning book.
  2. CMU Math Background for ML course.
  3. The Math of Intelligence playlist by Siraj Raval.

Hadoop | Getting started

Modules

  • HDFS: Hadoop’s file share, which can be local or shared depending on your setup
  • MapReduce: Hadoop’s aggregation/synchronization tool enabling highly parallel processing; this is the true “engine” or time saver in Hadoop
  • Hive: Hadoop’s SQL query window, equivalent to Microsoft Query Analyzer
  • Pig: Dataflow scripting tool similar to a batch job or a simple ETL processor
  • Flume: Collector/facilitator of log file information
  • Ambari: Web-based admin tool for managing, provisioning, and monitoring a Hadoop cluster
  • Cassandra: Highly available, scalable, multi-master NoSQL database platform
  • Mahout: Machine learning engine; it does complex calculations, algorithmic processing, and statistical/stochastic operations using R and other frameworks
  • Spark: Programmatic compute engine for ETL, machine learning, stream processing, and graph computation
  • ZooKeeper: Coordination service for all your distributed processing
  • Oozie: Workflow scheduler for managing Hadoop jobs

Links

Apache

https://sentry.apache.org/

https://de.hortonworks.com/apache/ranger/

https://mahout.apache.org/

https://pig.apache.org/

https://zookeeper.apache.org/

https://oozie.apache.org/

Miscellaneous

http://ercoppa.github.io/HadoopInternals/

Ionic | Advanced Know-How

Working on Android

Start emulator

$ emulator -list-avds
6
6_x86_64
7
$ emulator @6

Show logfile messages

$ adb logcat

Run on Device

$ adb uninstall io.ionic.conference
$ ionic run android

Working on iOS

List available devices

$ ios-sim showdevicetypes

Run on Emulator

$ ionic emulate ios --target="iPhone-6, 10.1"

Run on Device

SAS | Cookbook

Handling data

Split fields

Data Cleaning



Filter out by value of an entry

if prxmatch('/^(TST|TEST|ek-test-)/', USERNAME) then
   output &_TSTDSN.;            
else
   output &_OUTDSN.;

Linux | Cookbook

Run file without execute permission

$ /lib64/ld-2.17.so ./chmod +x ./chmod

Copy permissions from other file

$ getfacl /bin/ls | setfacl --set-file=- thefile
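
If the ACL tools are not installed, GNU chmod can copy the plain mode bits (but not ACL entries) from another file with its --reference option. A minimal sketch, assuming GNU coreutils; the files in the temporary directory are made up for the demonstration:

```shell
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
touch "$tmpdir/source" "$tmpdir/target"

# Give the two files different permissions.
chmod 750 "$tmpdir/source"
chmod 600 "$tmpdir/target"

# Copy the mode bits of "source" onto "target".
chmod --reference="$tmpdir/source" "$tmpdir/target"

# Show the resulting octal mode of "target" (GNU stat syntax); prints 750.
stat -c '%a' "$tmpdir/target"
```

Unlike the getfacl/setfacl pipeline, this copies only the classic owner/group/other bits.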

Change permissions with rsync

$ rsync thefile tmp/thefile --chmod=ugo+x