Apache Spark – Developer Blog

PySpark: Getting Started

by Ralph Allgemein, Apache Spark, PySpark, Python

PySpark and Jupyter Notebook

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Then, run

pyspark

31Mar

PySpark | Cookbook

by Ralph Apache Spark, Cookbook, PySpark

Websites

The Blaze Ecosystem (Blaze)
Dask: Flexible library for parallel computing in Python.
DataShape: Data layout language for array programming.
Odo: Shapeshifting for your data
It efficiently migrates data from the source to the target through a network of conversions.

Reading Textfiles

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Demo") \
    .config("spark.demo.config.option", "demo-value") \
    .getOrCreate()

df = spark.read.csv("input.csv",header=True,sep="|");

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example') 
sql_sc = SQLContext(sc)

p_df= pd.read_csv('file.csv')  # assuming the file contains a header
# p_df = pd.read_csv('file.csv', names = ['column 1','column 2']) # if no header
s_df = sql_sc.createDataFrame(pandas_df)

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>1) \
    .map(lambda line: (line[0],line[1])) \
    .collect()

sc.textFile("input.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)<=1) \
    .collect()

spark.read.csv(
    "input.csv", header=True, mode="DROPMALFORMED", schema=schema
)

(spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("input.csv"))

Read CSV file with known structure

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", DoubleType()),
    StructField("C", StringType())
])

(sqlContext
    .read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("input.csv"))

28Jun

Apache Spark | Getting started

by Ralph Apache, Apache Spark

Apache Spark is a lightning-fast cluster computing designed for fast computation. It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations which includes Interactive Queries and Stream Processing.

This is an extract from this brief tutorial that explains the basics of Spark Core programming.

Environment / Requirements

Installation on Mac OS X

Check or install java

$ java -version
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

Check or install Scala

$ brew install scala

$ scala -version
Scala code runner version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.

Check or install Apache Spark

Setup environment in .bashrc

export PATH="$PATH:$SPARK_HOME/bin"

Installation on Ubuntu

Prepate Upuntu

apt update
apt upgrade

 apt-get install openjdk-8-jdk
 java -version

Links and Resources

Apache Spark in Python: Beginner’s Guide

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.