PySpark: Getting Started
PySpark and Jupyter Notebook
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Then, run
pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Demo") \
    .config("spark.demo.config.option", "demo-value") \
    .getOrCreate()

df = spark.read.csv("input.csv", header=True, sep="|")
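To see what header=True and a custom separator mean before running Spark, the same parsing can be previewed with Python's standard csv module. The sample data below is hypothetical, chosen only to mirror a pipe-separated input.csv:

```python
import csv
import io

# Hypothetical pipe-separated data mirroring input.csv
raw = "name|age\nAlice|34\nBob|29\n"

# DictReader treats the first row as a header, like header=True in Spark
reader = csv.DictReader(io.StringIO(raw), delimiter="|")
rows = list(reader)

print(rows[0]["name"])  # Alice
print(len(rows))        # 2
```

Spark applies the same convention: with header=True the first line names the columns instead of becoming a data row.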
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local', 'example')
sql_sc = SQLContext(sc)

p_df = pd.read_csv('file.csv')  # assuming the file contains a header
# p_df = pd.read_csv('file.csv', names=['column 1', 'column 2'])  # if no header
s_df = sql_sc.createDataFrame(p_df)
sc.textFile("file.csv") \
  .map(lambda line: line.split(",")) \
  .filter(lambda line: len(line) > 1) \
  .map(lambda line: (line[0], line[1])) \
  .collect()
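The map/filter chain above can be previewed in plain Python to see what each step does to a line. The sample lines below are hypothetical stand-ins for what sc.textFile would yield, one string per line:

```python
# Hypothetical lines, as sc.textFile would yield them
lines = ["a,1", "b,2", "malformed"]

parsed = [line.split(",") for line in lines]       # .map: split each line on commas
kept = [row for row in parsed if len(row) > 1]     # .filter: drop rows without a second field
pairs = [(row[0], row[1]) for row in kept]         # .map: build (key, value) pairs

print(pairs)  # [('a', '1'), ('b', '2')]
```

The Spark version runs the same logic, but lazily and distributed across partitions; collect() is what finally pulls the results back to the driver.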
sc.textFile("input.csv") \
  .map(lambda line: line.split(",")) \
  .filter(lambda line: len(line) <= 1) \
  .collect()
spark.read.csv(
    "input.csv",
    header=True,
    mode="DROPMALFORMED",
    schema=schema
)
(spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("input.csv"))
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", DoubleType()),
    StructField("C", StringType())
])

(sqlContext
    .read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("input.csv"))
Apache Spark is a lightning-fast cluster-computing framework designed for fast computation. It builds on the Hadoop MapReduce model and extends it to support more types of computation efficiently, including interactive queries and stream processing.
What follows is an extract from a brief tutorial covering the basics of Spark Core programming.
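The MapReduce model that Spark extends can be sketched in plain Python: data is split into partitions, a map phase emits key/value pairs from each partition independently, and a reduce phase merges them by key. The partitioned input below is a hypothetical word-count example:

```python
from collections import Counter
from itertools import chain

# Hypothetical input split into partitions, as a cluster would hold it
partitions = [["spark is fast", "spark scales"], ["hadoop inspired spark"]]

# Map phase: each partition independently emits (word, 1) pairs
mapped = [[(word, 1) for line in part for word in line.split()]
          for part in partitions]

# Reduce phase: merge the pairs across partitions by key
counts = Counter()
for word, one in chain.from_iterable(mapped):
    counts[word] += one

print(counts["spark"])  # 3
```

Spark expresses the same pattern with RDD operations such as flatMap, map, and reduceByKey, but runs the map phase in parallel on the cluster and shuffles only the intermediate pairs between nodes.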
$ java -version
java version "12.0.1" 2019-04-16
Java(TM) SE Runtime Environment (build 12.0.1+12)
Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)
$ brew install scala
$ scala -version
Scala code runner version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.
export PATH="$PATH:$SPARK_HOME/bin"
Prepare Ubuntu
apt update
apt upgrade
apt-get install openjdk-8-jdk
java -version