Developer Blog

Tips and tricks for developers and IT enthusiasts

Test-Driven Development with Python

Python | Test-Driven Development

  • Part 1: Create a TDD Python Project
  • Part 2: Use Jenkins to automatically test your App

Part 1: Create a TDD Python Project

The final source code is on GitHub.

Introduction

Writing an error-free program is not easy. And once your program runs without errors, keeping it error-free after an update or change is even harder. You don't want to introduce new errors or replace correct code with faulty code.

The answer to this situation (directly from the Oracle of Delphi) is: testing, testing, testing.

And the best way to test is to start with the tests.

This means: think about what the result should be, then create a test that checks it. Imagine you have to write a function that adds two values, and you are asked to describe its functionality.

Your description will probably contain one or two examples:

My function adds two numbers, e.g. 5 plus 7 is 12 (or at least should be 12 :))

The TDD procedure is:

  • think about and define what the function should do
  • write a stub for the function, e.g. only the function parameters and return type
  • write a function that tests your function with defined parameters and a known result

For our example above, this means:

Write the python script with the desired functionality: src/main.py

def add(val1,val2):
    return 0 # this is only a dummy return value

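Write the Python test script: tst/main.p

# add() is the stub from src/main.py; import it or define it above this test
def test_add():
    result = add(5, 7)

    if result == 12:  # compare with ==, a single = is an assignment
        print("everything fine")
    else:
        print("ups, problems with base arithmetics")

test_add()  # run the check when the script is executed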

Now, with these in your toolbox, you can always verify your code by running the tests.

$ python test_add.py
ups, problems with base arithmetics
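
To close the loop, here is a minimal sketch of the next TDD step: the dummy return value is replaced by a real implementation, and the check is written as an assert, so a failure stops the run instead of only printing a message. The names mirror the example above. Keeping the function and its test in one file is an assumption made for brevity.

```python
def add(val1, val2):
    # Real implementation replacing the dummy "return 0"
    return val1 + val2


def test_add():
    # Known inputs and the known result from the description above
    result = add(5, 7)
    assert result == 12, "expected 12, got {}".format(result)


if __name__ == "__main__":
    test_add()
    print("everything fine")
```

Because the test is a plain function whose name starts with test_, a test runner such as pytest should be able to collect it without further changes.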


Setup virtual environment

Tests are usually repeated after every change. To make sure that each test runs the same way and in the same environment, we use Python's virtual environment feature to create a fresh Python environment for the tests.

Create virtual environment

$ python3 -m venv .env/python

Activate environment

Add the following line to .bashrc (or .envrc if you are using direnv)

$ . .env/python/bin/activate
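
If you are unsure whether the environment is really active, a small check from inside Python can help. The following is only a sketch: the helper name in_virtualenv is made up for this example, and it relies on the fact that inside a venv sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv():
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```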

Install required packages

$ pip install pytest

Create a sample Application

Prepare folder

Create folder for sources

$ mkdir src

Create sample package

$ mkdir src/CalculatorLib
$ touch src/CalculatorLib/__init__.py
$ touch src/CalculatorLib/Calculator.py

Finally, create a simple Calculator: src/CalculatorLib/Calculator.py

class Calculator:
    def __init__(self):
        print("Init Calculator")

    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b

    def multiply(self, a, b):
        return a * b

    def divide(self, a, b):
        return a / b

    def power(self, base, exp):
        return base ** exp

Create the Main App for your Calculator: src/main.py

from CalculatorLib.Calculator import Calculator

class Main(object):

    def run(self):
        c = Calculator()

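        # format strings reconstructed from the sample output below
        print("5 + 3 = {:5d}".format(c.add(5, 3)))
        print("8 - 4 = {:5d}".format(c.subtract(8, 4)))
        print("5 * 3 = {:5d}".format(c.multiply(5, 3)))
        print("8 / 4 = {:5.0f}".format(c.divide(8, 4)))

        print("8 ^ 4 = {:5d}".format(c.power(8, 4)))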

if __name__ == '__main__':
    Main().run()

You're done with the first development step. Try your app:

$ python src/main.py
Init Calculator
5 + 3 =     8
8 - 4 =     4
5 * 3 =    15
8 / 4 =     2
8 ^ 4 =  4096

Add Unit Tests

We start with our first test. Create a folder for the tests and a file tst/main.py

$ mkdir tst
$ touch tst/main.py

Use the following for your test script tst/main.py

from CalculatorLib.Calculator import Calculator
import unittest

class CalculatorTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # one shared Calculator instance for all tests in this class
        cls.c = Calculator()

    def test_add(self):
        self.assertEqual(8, self.c.add(5, 3))

    def test_subtract(self):
        self.assertEqual(4, self.c.subtract(8, 4))

    def test_multiply(self):
        self.assertEqual(32, self.c.multiply(8, 4))

    def test_divide(self):
        self.assertEqual(2, self.c.divide(8, 4))
            
    def test_power(self):
        self.assertEqual(16, self.c.power(2, 4))
                                    
if __name__ == '__main__':
    unittest.main()
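
The script above covers only valid input. As a sketch of how an edge case could be tested, the following example checks division by zero with assertRaises and makes the float result of Python 3 division explicit. The small Calculator class here is a stand-in so the snippet runs on its own. In the project you would import the real one instead.

```python
import unittest


class Calculator:
    # Minimal stand-in for CalculatorLib.Calculator so the sketch runs on its own
    def divide(self, a, b):
        return a / b


class CalculatorEdgeCaseTest(unittest.TestCase):

    def setUp(self):
        self.c = Calculator()

    def test_divide_by_zero(self):
        # Python raises ZeroDivisionError for a / 0
        with self.assertRaises(ZeroDivisionError):
            self.c.divide(8, 0)

    def test_divide_returns_float(self):
        # In Python 3, "/" is true division: 8 / 4 gives the float 2.0
        self.assertIsInstance(self.c.divide(8, 4), float)
        self.assertEqual(2, self.c.divide(8, 4))
```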

Finally try your test script:

$ PYTHONPATH=./src python -m pytest tst/main.py  --verbose
================================= test session starts ================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- <Testproject_Python-Calculator/.env/python/bin/python>
cachedir: .pytest_cache
rootdir: <Testproject_Python-Calculator>
plugins: cov-2.6.1
collected 5 items

tst/main.py::CalculatorTest::test_add PASSED             [ 20%]
tst/main.py::CalculatorTest::test_divide PASSED          [ 40%]
tst/main.py::CalculatorTest::test_multiply PASSED        [ 60%]
tst/main.py::CalculatorTest::test_power PASSED           [ 80%]
tst/main.py::CalculatorTest::test_subtract PASSED        [100%]

The command to run the test is python -m pytest tst/main.py, so why the leading PYTHONPATH variable?

Try it without:

$ python -m pytest tst/main.py
=================================== test session starts ==================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- ##/Testproject_Python-Calculator/.env/python/bin/python
cachedir: .pytest_cache
rootdir: ##/Testproject_Python-Calculator
plugins: cov-2.6.1
collected 0 items / 1 errors

========================================= ERRORS =========================================
____________________________________ ERROR collecting tst/main.py ________________________
ImportError while importing test module '##/Testproject_Python-Calculator/tst/main.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tst/main.py:2: in <module>
    from CalculatorLib.Calculator import Calculator
E   ModuleNotFoundError: No module named 'CalculatorLib'
!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!
================================== 1 error in 1.84 seconds ==================================

Note the ModuleNotFoundError in line 16 of the output! It means that Python could not find the CalculatorLib package.

Look at your folder structure:

$ tree .
.
├── src
│   ├── CalculatorLib
│   │   ├── Calculator.py
│   │   └── __init__.py
│   └── main.py
└── tst
    └── main.py


In the test script, we import the CalculatorLib with this statement:

from CalculatorLib.Calculator import Calculator

Python interprets this roughly as follows:

  • Look in the folders on the module search path (for example the folder of the test script) for a subfolder named CalculatorLib
  • There, look for a file Calculator.py
  • And in this file, use the class Calculator

Obviously, the folder CalculatorLib is NOT in the same folder as the test script: it is part of the src folder.

So, with the environment variable PYTHONPATH, we tell Python where else to search for Python scripts and folders.
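
Directories listed in PYTHONPATH end up in sys.path, the list Python walks through when it resolves an import. The following sketch makes this visible and shows a helper that adds the src folder in the same spirit. The function name add_source_dir is an assumption for illustration and not part of the project.

```python
import os
import sys


def add_source_dir(project_root, source_dir="src"):
    """Prepend <project_root>/<source_dir> to sys.path if it is missing."""
    path = os.path.abspath(os.path.join(project_root, source_dir))
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


if __name__ == "__main__":
    # Directories listed in PYTHONPATH appear in sys.path at interpreter start
    print("PYTHONPATH:", os.environ.get("PYTHONPATH", "<not set>"))
    print("src on sys.path:", add_source_dir(os.getcwd()) in sys.path)
```

Setting PYTHONPATH on the command line keeps the test script free of path handling, which is why this post uses that approach.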

Add additional functionality

Add a function at the end of your Calculator: src/CalculatorLib/Calculator.py

    ....
    def factorial(self, n):
        return 0

Add a call of the new function to your main app: src/main.py

    ...
    def run(self):
        ...
        print("4!    = {:5d}".format(c.factorial(4)))



Add a test for the new function to your test script: tst/main.py

    ...
    def test_factorial(self):
        self.assertEqual(24, self.c.factorial(4))

Try it:

$ python src/main.py
Init Calculator
5 + 3 =     8
8 - 4 =     4
5 * 3 =    15
8 / 4 =     2
8 ^ 4 =  4096
$ PYTHONPATH=./src python -m pytest tst/main.py
==================================== test session starts =====================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: ##/Testproject_Python-Calculator
plugins: cov-2.6.1
collected 6 items

tst/main.py ..F...                                                                      [100%]

========================================== FAILURES ==========================================
_______________________________ CalculatorTest.test_factorial ________________________________

self = <main.CalculatorTest testMethod=test_factorial>

    def test_factorial(self):
>       self.assertEqual(24, self.c.factorial(4))
E       AssertionError: 24 != 0

tst/main.py:31: AssertionError
============================= 1 failed, 5 passed in 0.14 seconds =============================

The test failed, as we expected.

Now implement the function correctly and run the tests again.

Replace the stub in your Calculator: src/CalculatorLib/Calculator.py

import math

class Calculator:
    ...
    def factorial(self, n):
        if not n >= 0:
            raise ValueError("n must be >= 0")

        if math.floor(n) != n:
            raise ValueError("n must be exact integer")

        if n+1 == n:  # catch a value like 1e300
            raise OverflowError("n too large")

        result, factor = 1, 2
        
        while factor <= n:
            result *= factor
            factor += 1

        return result

$ PYTHONPATH=./src python -m pytest tst/main.py  --verbose
==================================== test session starts =====================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- ##/Testproject_Python-Calculator/.env/python/bin/python
cachedir: .pytest_cache
rootdir: ##/Testproject_Python-Calculator
plugins: cov-2.6.1
collected 6 items

tst/main.py::CalculatorTest::test_add PASSED                                             [ 16%]
tst/main.py::CalculatorTest::test_divide PASSED                                          [ 33%]
tst/main.py::CalculatorTest::test_factorial PASSED                                       [ 50%]
tst/main.py::CalculatorTest::test_multiply PASSED                                        [ 66%]
tst/main.py::CalculatorTest::test_power PASSED                                           [ 83%]
tst/main.py::CalculatorTest::test_subtract PASSED                                        [100%]

================================== 6 passed in 0.01 seconds ==================================
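
The single test above only exercises the happy path, while the implementation also raises errors for invalid input. A sketch of tests for those branches could look like this. The factorial logic is copied into a plain function so the snippet is self-contained. In the project it is a method of Calculator.

```python
import math
import unittest


def factorial(n):
    # Same logic as Calculator.factorial above, as a plain function
    if not n >= 0:
        raise ValueError("n must be >= 0")
    if math.floor(n) != n:
        raise ValueError("n must be exact integer")
    if n + 1 == n:  # catch a value like 1e300
        raise OverflowError("n too large")
    result, factor = 1, 2
    while factor <= n:
        result *= factor
        factor += 1
    return result


class FactorialErrorTest(unittest.TestCase):

    def test_small_values(self):
        self.assertEqual(1, factorial(0))
        self.assertEqual(24, factorial(4))

    def test_negative(self):
        with self.assertRaises(ValueError):
            factorial(-1)

    def test_non_integer(self):
        with self.assertRaises(ValueError):
            factorial(4.5)

    def test_too_large(self):
        with self.assertRaises(OverflowError):
            factorial(1e300)
```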

Testing Frameworks

https://wiki.python.org/moin/PythonTestingToolsTaxonomy

Unit testing framework

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
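
unittest.main() runs the tests and then exits the interpreter. If you want the outcome as a value instead, for example in a build script, you can load and run the suite yourself. This is a sketch: the helper name run_case is made up for this example, and the test class is a shortened copy of the one above.

```python
import io
import unittest


class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())


def run_case(test_case_class):
    """Run one TestCase class and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case_class)
    # Write the report into a buffer instead of stderr
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=2)
    return runner.run(suite)


if __name__ == "__main__":
    result = run_case(TestStringMethods)
    print("tests run:", result.testsRun, "successful:", result.wasSuccessful())
```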

pytest – helps you write better programs

# content of test_sample.py
def inc(x):
    return x + 1

def test_answer():
    assert inc(3) == 5  # this assertion fails, because inc(3) returns 4

$ pytest

nose – is nicer testing for python

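def multiply(a, b):
    # helper under test, defined here so the example runs on its own
    return a * b

def test_numbers_3_4():
    assert multiply(3, 4) == 12

def test_strings_a_3():
    assert multiply('a', 3) == 'aaa'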

Python BDD Pattern

class MangoUseCase(TestCase):
  def setUp(self):
    self.user = 'placeholder'

  @mango.given('I am logged-in')
  def test_profile(self):
    self.given.profile = 'profile'
    self.given.photo = 'photo'

    self.given.notifications = 3
    self.given.notifications_unread = 1

    @mango.when('I click profile')
    def when_click_profile():
      print('click')

      @mango.then('I see profile')
      def then_profile():
        self.assertEqual(self.given.profile, 'profile')

      @mango.then('I see my photo')
      def then_photo():
        self.assertEqual(self.given.photo, 'photo')

radish is not just another BDD tool … the root from red to green

from radish import given, when, then

@given("I have the numbers {number1:g} and {number2:g}")
def have_numbers(step, number1, number2):
    step.context.number1 = number1
    step.context.number2 = number2

@when("I sum them")
def sum_numbers(step):
    step.context.result = step.context.number1 + \
        step.context.number2

@then("I expect the result to be {result:g}")
def expect_result(step, result):
    assert step.context.result == result

doctest

"""
The example module supplies one function, factorial().  For example,

>>> factorial(5)
120
"""

def factorial(n):
    """Return the factorial of n, an exact integer >= 0.

    >>> [factorial(n) for n in range(6)]
    [1, 1, 2, 6, 24, 120]
    >>> factorial(30)
    265252859812191058636308480000000
    >>> factorial(-1)
    Traceback (most recent call last):
        ...
    ValueError: n must be >= 0

    Factorials of floats are OK, but the float must be an exact integer:
    >>> factorial(30.1)
    Traceback (most recent call last):
        ...
    ValueError: n must be exact integer
    >>> factorial(30.0)
    265252859812191058636308480000000

    It must also not be ridiculously large:
    >>> factorial(1e100)
    Traceback (most recent call last):
        ...
    OverflowError: n too large
    """

    import math
    if not n >= 0:
        raise ValueError("n must be >= 0")
    if math.floor(n) != n:
        raise ValueError("n must be exact integer")
    if n+1 == n:  # catch a value like 1e300
        raise OverflowError("n too large")
    result = 1
    factor = 2
    while factor <= n:
        result *= factor
        factor += 1
    return result

if __name__ == "__main__":
    import doctest
    doctest.testmod()
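
doctest.testmod() checks all docstrings of a module at once. To check a single function and get the counts back as values, the parser and runner classes of the doctest module can be used directly. The following is a sketch: the function square and the helper run_docstring are made up for this example.

```python
import doctest


def square(x):
    """Return x squared.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x


def run_docstring(func):
    """Run the examples in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The examples only see the names passed in this globals dict
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


if __name__ == "__main__":
    results = run_docstring(square)
    print("failed:", results.failed, "attempted:", results.attempted)
```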

Sample Session with Test Frameworks

$ py.test -v
========================================================= test session starts ==========================================================
platform darwin -- Python 3.7.3, pytest-4.3.1, py-1.8.0, pluggy-0.9.0 -- /CLOUD/Development.Anaconda/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /CLOUD/Development.Python/Repositories.FromGithub/repositories/python-toolbox/Working-with-TDD/app, inifile:
plugins: remotedata-0.3.1, openfiles-0.3.2, doctestplus-0.3.0, arraydiff-0.3
collected 4 items

test_base.py::test_should_pass PASSED                                                                                            [ 25%]
test_base.py::test_should_raise_error PASSED                                                                                     [ 50%]
test_base.py::test_check_if_true_is_true PASSED                                                                                  [ 75%]
test_base.py::test_check_if_inc_works PASSED

$ nosetests -v
test_base.test_should_pass ... ok
test_base.test_should_raise_error ... ok
test_base.test_check_if_true_is_true ... ok
test_base.test_check_if_inc_works ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK

Links and additional information

http://pythontesting.net/

https://www.xenonstack.com/blog/test-driven-development-big-data/

https://realpython.com/python-testing/

Flask | Cookbook

Installation

$ pip install flask
$ flask --version
Python 3.7.3
Flask 1.1.1
Werkzeug 0.15.5

Creating an App

Create the base Python script app.py

from flask import Flask

app = Flask(__name__)

@app.route('/')
def example():
    return '{"name":"Bob"}'

if __name__ == '__main__':
    app.run()
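
The route above returns a hand-written JSON string. As a sketch, the same payload can be built from a dictionary with the json module of the standard library, which avoids quoting mistakes once the data grows. Flask itself is left out here so that the snippet runs without additional packages, and the function name example_payload is an assumption for illustration.

```python
import json


def example_payload():
    # Build the response body from a dict instead of a hand-written string
    return json.dumps({"name": "Bob"})


if __name__ == "__main__":
    print(example_payload())
```

In app.py, the example() view could return the result of this function instead of the literal string.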

Start Flask

flask run
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [01/Aug/2019 12:19:00] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [01/Aug/2019 12:19:00] "GET /favicon.ico HTTP/1.1" 404 -

Apache Zeppelin | Getting Started

First Steps with Zeppelin

Zeppelin and MySQL

Create a new Interpreter

Create a new interpreter, or configure the existing MySQL interpreter.

Configure the MySQL Interpreter

Under artifact, add the absolute path of mysql-connector-java-8.0.19.jar.

Add/modify properties for

default.user

default.password

Prepare MySQL Database

Create a database user spark with the password spark

Create a database spark and grant all permissions to the user spark

Add demo values

Test the MySQL Connection

Create a new notebook with the mysql interpreter

Write sample code

select * from spark.demo;

Installation

Install with Docker

docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.8.1

To persist notebooks and logs, add Docker volume options, for example:

docker run -p 8080:8080 --rm -v $PWD/logs:/logs -v $PWD/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin apache/zeppelin:0.8.1

Install in a vagrant box

Setup base Vagrant Box

vagrant init ubuntu/trusty64
vagrant up
vagrant ssh

Update Operating System

sudo apt-get update -y
sudo apt-get upgrade -y

Install the Vagrant Key

The only way that all the vagrant commands will be able to communicate over ssh from the host machine to the guest server is if the guest server has this “insecure vagrant key” installed. It’s called “insecure” because essentially everyone has this same key and anyone can hack into everyone’s vagrant box if you use it.

mkdir -p /home/vagrant/.ssh
chmod 0700 /home/vagrant/.ssh
wget --no-check-certificate \
    https://raw.github.com/mitchellh/vagrant/master/keys/vagrant.pub \
    -O /home/vagrant/.ssh/authorized_keys
chmod 0600 /home/vagrant/.ssh/authorized_keys
chown -R vagrant /home/vagrant/.ssh

Install Zeppelin and required Software

Detailed description can be found here.

sudo apt-get install -y gcc build-essential linux-headers-server
sudo apt-get install git
sudo apt-get install openjdk-7-jdk
sudo apt-get install npm
sudo apt-get install libfontconfig
sudo apt-get install r-base-dev
sudo apt-get install r-cran-evaluate
git clone https://github.com/apache/zeppelin.git
sudo apt-get -y install maven
mvn clean package -DskipTests -Pspark-2.0 -Phadoop-2.4 -Pr -Pscala-2.11

Configure Zeppelin

Apache Spark | Getting started

Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It is based on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing.

This is an extract from this brief tutorial that explains the basics of Spark Core programming.

Environment / Requirements

Installation on Mac OS X

Check or install java

$ java -version
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

Check or install Scala

$ brew install scala
$ scala -version
Scala code runner version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.

Check or install Apache Spark

Setup environment in .bashrc

export PATH="$PATH:$SPARK_HOME/bin"

Installation on Ubuntu

Prepare Ubuntu

apt update
apt upgrade
apt-get install openjdk-8-jdk
java -version

Links and Resources

Jekyll | Build a Jekyll Template based on Bootstrap 4

TL;DR

Combine two amazing open source tools: Jekyll and Bootstrap. The final template is here.

Bootstrap Template and Jekyll: two powerful tools

Start Point

Because I wanted to learn about and work with Bootstrap, I decided to build a Jekyll template, so that I could build a dynamic website.

Asking Google for first inspiration led me to this wonderful blog entry:

Choose a Bootstrap Template

Quite nice. So I decided to use one of the free templates from Start Bootstrap: Modern Business

When I downloaded the template from GitHub and examined the content, I found that for each component (Pricing, Services, Contact) there is a corresponding HTML file with all the content and all the formatting code:

  • about.html
  • blog-home-1.html
  • blog-home-2.html
  • blog-post.html
  • contact.html
  • faq.html
  • full-width.html
  • index.html
  • portfolio-1-col.html
  • portfolio-2-col.html
  • portfolio-3-col.html
  • portfolio-4-col.html
  • portfolio-item.html
  • pricing.html
  • services.html
  • sidebar.html

The Plan

My plan was to separate the presentation layer (what you will see) from the business layer (what creates the content for the presentation layer).

To achieve this with Jekyll, I converted the Bootstrap pages to Jekyll include pages. The final result should look like this:

The frontpage for the component

The jekyll include file with the component

---
layout: page
title: Services
---
<div class="container">
    <h1 class="mt-4 mb-3">{{ page.title }}</h1>
    
</div>

<h2>Services: {{ site.services.title }}</h2>

<!-- Image Header -->
<img class="img-fluid rounded mb-4" src="{{ images }}/header.jpg" alt="">

<div class="row">
    {% for item in site.services %}
        <div class="col-lg-4 mb-4">
            <div class="card h-100">
                <h4 class="card-header">{{ item.title }}</h4>
                <div class="card-body">
                    <p class="card-text">{{ item.text | markdownify }}</p>
                </div>
                <div class="card-footer">
                    <a href="#" class="btn btn-primary">Learn More</a>
                </div>
            </div>
        </div>
    {% endfor %}
</div>

The next step was to convert every Bootstrap template page to a Jekyll include file:

  • About Page
  • FAQ Page
  • Portfolio Page with 1 Column
  • Portfolio Page with 2 Columns
  • Services Page
  • Pricing Page

The main challenge in separating the presentation layer from the business layer was: where to place the data to be displayed?

Depending on the type of the component, I chose one of three solutions:

  1. Place the data in the corresponding include file of the component
  2. Place the data in the page that calls the corresponding include file of the component
  3. Place the data in a Jekyll collection file

Data in corresponding include file of the component

I used this approach for components that are used only once on the website and have mostly static content, e.g. the FAQ page.

The component page

The frontend page

Data in the page that calls the corresponding include file of the component

I used this approach for components that are used more than once on the website, e.g. a blog post.

The component page

The frontend page

Data in a Jekyll collection file

I used this approach for components that are used only once on the website but need more configuration information, e.g. the Services or Portfolio page.

This step needs an additional configuration task: create the Jekyll Collections.

Jekyll collections are a great way to group related content like members of a team or talks at a conference.

To use a Collection you first need to define it in your _config.yml.

#
collections_dir: collections # folder, where collections files are stored
collections:
  services:
    title: "Services"
    output: true # store output files for each item under the collections folder

Then you have to create the collection files, one file for each item in your collection:

These files look like this:

---
img: 1.jpg
title: Development
subtitle: 
footer: 
text: Lorem ipsum dolor sit amet, consectetur adipisicing elit. Possimus aut mollitia eum ipsum fugiat odio officiis odit.
---

The data of these files can be accessed in the Jekyll include file with this code fragment:

  • all items of the collection: site.services

        <div class="col-lg-4 mb-4">
            <div class="card h-100">
                <h4 class="card-header">{{ item.title }}</h4>
                <div class="card-body">
                    <p class="card-text">{{ item.text | markdownify }}</p>
                </div>
                <div class="card-footer">
                    <a href="#" class="btn btn-primary">Learn More</a>
                </div>
            </div>
        </div>

The final result

Bootstrap Template and Jekyll: two powerful tools

Jekyll | Cookbook

Working with Arrays

Define the array

---
layout: post
title:  "Universe"
date:   2019-06-17 10:00:00
planets:
    - mercury
    - venus
    - earth
---

Access the array

{% for planet in page.planets %}
    <a href="https://{{planet}}.universe">{{planet}}</a>
{% endfor %}




Liquid

Links

https://github.com/Shopify/liquid/wiki/Liquid-for-Designers#optional-arguments

Code Snippets

for-loop-sorted-collection

<ul>
    
    
    <li>{{ item.title }}</li>
    
</ul>

Code Snippets and recipes

https://gist.github.com/ryerh/b2fa73829f1b7b1c39988f09a65eb227

Learning | Path for Data Scientist

  • Portfolio
  • Python Pandas / NumPy / SciPy
  • Apache Spark
  • Apache Hadoop

Learning

Mathematics for Data Science

Linear Algebra

  1. Khan Academy Linear Algebra series (beginner friendly).
  2. Coding the Matrix course (and book).
  3. 3Blue1Brown Linear Algebra series.
  4. fast.ai Linear Algebra for coders course, highly related to modern ML workflow.
  5. First course in Coursera Mathematics for Machine Learning specialization.
  6. “Introduction to Applied Linear Algebra — Vectors, Matrices, and Least Squares” book.
  7. MIT Linear Algebra course, highly comprehensive.
  8. Stanford CS229 Linear Algebra review.

Calculus

  1. Khan Academy Calculus series (beginner friendly).
  2. 3Blue1Brown Calculus series.
  3. Second course in Coursera Mathematics for Machine Learning specialization.
  4. The Matrix Calculus You Need For Deep Learning paper.
  5. MIT Single Variable Calculus.
  6. MIT Multivariable Calculus.
  7. Stanford CS224n Differential Calculus review.

Statistics and Probability

  1. Khan Academy Statistics and probability series (beginner friendly).
  2. A visual introduction to probability and statistics, Seeing Theory.
  3. Intro to Descriptive Statistics from Udacity.
  4. Intro to Inferential Statistics from Udacity.
  5. Statistics with R Specialization from Coursera.
  6. Stanford CS229 Probability Theory review.

Bonus materials

  1. Part one of Deep Learning book.
  2. CMU Math Background for ML course.
  3. The Math of Intelligence playlist by Siraj Raval.

Hadoop | Getting started

Modules

  • HDFS: Hadoop's file share, which can be local or shared depending on your setup
  • MapReduce: Hadoop's aggregation/synchronization tool enabling highly parallel processing. This is the true "engine" or time saver in Hadoop
  • Hive: Hadoop's SQL query window, equivalent to Microsoft Query Analyzer
  • Pig: Dataflow scripting tool similar to a batch job or simplistic ETL processor
  • Flume: Collector/facilitator of log file information
  • Ambari: Web-based admin tool for managing, provisioning, and monitoring a Hadoop cluster
  • Cassandra: High-availability, scalable, multi-master database platform, an RDBMS on steroids
  • Mahout: Machine learning engine. It does complex calculations, algorithmic processing, and statistical/stochastic operations using R and other frameworks. It does serious math!
  • Spark: Programmatic compute engine allowing for ETL, machine learning, stream processing, and graph computation
  • ZooKeeper: Coordinator service for all your distributed processing
  • Oozie: Workflow scheduler managing Hadoop jobs
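
To give a feeling for the MapReduce idea mentioned in the list, here is a word count as a minimal sketch in plain Python. It runs in a single process and does not use any Hadoop API. The function names are made up for this example. In Hadoop, the map and reduce phases run distributed across the nodes of the cluster.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    # Shuffle: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine the values of each key into one result
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```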

Links

Apache

https://sentry.apache.org/

https://de.hortonworks.com/apache/ranger/

https://mahout.apache.org/

https://pig.apache.org/

https://zookeeper.apache.org/

https://oozie.apache.org/

Miscellaneous

http://ercoppa.github.io/HadoopInternals/