From SAS to Python: Road to Analytics


Upload: datio-big-data

Post on 07-Apr-2017


From SAS to PySpark: Road to Analytics

Contents

1. SAS vs Spark

2. SAS Proc SQL vs Spark SQL

3. Advantage Analytics


1. SAS vs Spark


OVERVIEW

SAS

○ The largest independent vendor in “advanced analytics”

○ SAS Institute was founded in 1976 in Cary, North Carolina

○ Commercial software product

SPARK

○ A fast and general engine for large-scale data processing

○ Started in 2009 as a research project at the UC Berkeley AMPLab

○ Open source


CODE

SAS

Basic programming model consists of code blocks:

○ SAS DATA step
    ■ generation of data
    ■ concatenation of data

○ SAS PROCedures
    ■ special functionalities

SPARK

"Line-based" programming. The native language is Scala, but the programming model is flexible:

○ Scala

○ Java

○ Python

○ R


DATA

SAS: DATASET

○ Computed in memory (RAM)

○ A data set contains:
    ● observations: organized in rows
    ● variables: organized in columns

SPARK: DATAFRAME

○ A distributed collection of data organized into named columns.

○ It is conceptually equivalent to a table in a relational database and to a data frame in R/Python

○ It is a programming abstraction


IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE

Transformations such as map, filter, union, join and group by do not modify the dataset they are applied to; each one results in another dataset.
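The "results in another dataset" idea can be tried without a cluster. What follows is a minimal plain-Python sketch, not Spark code: `with_column` is a hypothetical helper that only mimics the behaviour of a transformation by building a new collection and leaving its input untouched.

```python
from typing import Callable, Dict, Tuple

Row = Dict[str, float]


def with_column(rows: Tuple[Row, ...], name: str,
                fn: Callable[[Row], float]) -> Tuple[Row, ...]:
    """Return a NEW tuple of rows with one extra column.

    The input rows are copied, never modified (hypothetical helper).
    """
    return tuple({**row, name: fn(row)} for row in rows)


original = ({"Fare": 7.25}, {"Fare": 71.28})
derived = with_column(original, "Fare2", lambda r: r["Fare"] + 2)

# 'original' still has a single column; only 'derived' carries Fare2.
print(original[0])
print(derived[0])
```

Because the input is never changed, the same starting dataset can safely feed several independent transformations.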


SAS:

data sasData;
    set sasData;
    Fare2 = Fare + 2;
run;

Python Pandas:

pandasDF['Fare2'] = pandasDF['Fare'] + 2

Spark:

sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)


READ SAS DATASETS

A SAS file (sas7bdat) is a binary file with a special structure, created by SAS.

● PYTHON: SAS7BDAT PACKAGE

● R: HAVEN LIBRARY


2. SAS Proc SQL vs Spark SQL


SQL statements

SAS PROC SQL

A SAS procedure that combines the functionality of DATA and PROC steps. It can sort, summarize, subset, join and concatenate datasets, create new variables...

Spark SQL

○ Spark's interface for working with structured and semi-structured data, queried using SQL

○ Load data from JSON, Hive, Parquet

○ Evaluated “lazily”
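"Evaluated lazily" means that describing a query does no work until a result is actually requested. As a rough analogy only, the sketch below uses plain Python generators rather than Spark; every name in it is hypothetical, and Spark's own mechanism is different in detail.

```python
calls = []  # records when a row is actually read


def read_rows():
    """Pretend data source; appends to 'calls' so evaluation is observable."""
    for fare in (5.0, 20.0, 80.0):
        calls.append(fare)
        yield fare


# "Transformations": building the pipeline reads nothing yet.
pipeline = (fare + 2 for fare in read_rows() if fare > 10)
evaluated_before = len(calls)

# "Action": consuming the pipeline triggers the reads and the computation.
result = list(pipeline)

print(evaluated_before, len(calls), result)
```

The practical consequence is the same in both worlds: chaining many steps is cheap, and the cost is paid when the result is materialised.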


SQL statements

SAS PROC SQL

PROC SQL;
    CREATE TABLE newTable AS
    SELECT Columns
    FROM Table
    WHERE Column > Value
    GROUP BY Columns;
QUIT;

Spark SQL

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc: an existing SparkContext
newTable = sqlContext.sql("""
    SELECT Columns
    FROM Table
    WHERE Column > Value
    GROUP BY Columns
""")
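Since the SQL text is essentially the same in both tools, the shape of the query can be tried on a laptop before moving to a cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine (it is not Spark), and the table `passengers` and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (pclass INTEGER, fare REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [(1, 80.0), (1, 5.0), (2, 20.0), (3, 12.0), (3, 15.0)],
)

# Same pattern as on the slide: CREATE TABLE ... AS SELECT ... WHERE ... GROUP BY
conn.execute("""
    CREATE TABLE new_table AS
    SELECT pclass, COUNT(*) AS n, AVG(fare) AS avg_fare
    FROM passengers
    WHERE fare > 10
    GROUP BY pclass
""")

rows = conn.execute(
    "SELECT pclass, n, avg_fare FROM new_table ORDER BY pclass"
).fetchall()
print(rows)
```

In Spark the SELECT statement would be passed as a string to `sqlContext.sql(...)`, and the result would be a DataFrame rather than a stored table.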


AGGREGATE FUNCTION IN SPARK SQL

sum, avg, mean, count, max, min, first, last, stddev, variance, skewness, kurtosis…

Act on each group of data and return a single value per group as a result


WINDOW FUNCTION IN SPARK SQL

Ranking: rank, dense_rank, percent_rank, ntile, row_number

Analytics: cume_dist, lag, first_value, last_value, lead

Aggregate: aggregate functions

Calculate a return value for every row over a set of rows, called a window, that are somehow related to the current row
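The difference between an aggregate and a window function is easiest to see side by side: the aggregate collapses each group to one row, while the window function keeps every row and adds a value computed over its group. This is a small sketch using `sqlite3` as a stand-in for Spark SQL; it assumes the bundled SQLite is version 3.25 or later (needed for window functions), and the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (name TEXT, pclass INTEGER, fare REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?, ?)",
    [("A", 1, 80.0), ("B", 1, 50.0), ("C", 2, 20.0), ("D", 2, 20.0), ("E", 2, 10.0)],
)

# Aggregate: one output row per group.
per_class = conn.execute(
    "SELECT pclass, MAX(fare) FROM passengers GROUP BY pclass ORDER BY pclass"
).fetchall()

# Window: one output row per input row, ranked within its class.
ranked = conn.execute("""
    SELECT name, pclass, fare,
           RANK() OVER (PARTITION BY pclass ORDER BY fare DESC) AS rnk
    FROM passengers
    ORDER BY pclass, rnk, name
""").fetchall()

print(per_class)
print(ranked)
```

Note how tied fares share a rank and the next rank is skipped; `dense_rank` would not leave that gap.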


BUILT-IN FUNCTIONS, UDFs

User Defined Functions (UDFs): define new Column-based functions that extend the vocabulary of Spark

Act on a single row as input and return a single value for every input row
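The "one row in, one value out" contract of a UDF can be sketched with `sqlite3`, which also lets a Python function be registered under a SQL name. This is a stand-in rather than Spark code, and `fare_band` and the `passengers` table are hypothetical; in PySpark the comparable step is wrapping the Python function with `pyspark.sql.functions.udf`.

```python
import sqlite3


def fare_band(fare):
    """Row-wise function: one input value in, one output value out."""
    return "high" if fare >= 50 else "low"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it by name (1 = number of arguments).
conn.create_function("fare_band", 1, fare_band)

conn.execute("CREATE TABLE passengers (name TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)", [("A", 80.0), ("B", 12.5)]
)

banded = conn.execute(
    "SELECT name, fare_band(fare) FROM passengers ORDER BY name"
).fetchall()
print(banded)
```

A function this simple could also be written with built-in conditional expressions, which is usually the cheaper option when one exists.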


TIPS

○ Do not think in terms of sorted data. In parallel processing we cannot access the data row by row.

○ Cache tables/DFs when they are used more than once

○ A merge (join) does not need sorted data, as it does in SAS

○ Use functions already defined instead of creating your own UDF

○ Save data in a columnar format such as Parquet

○ Avoid collecting data when you are working with Big Data; take a sample instead


3. Advantage Analytics


ADVANTAGE ANALYTICS

SAS Stats

Traditional add-on package to SAS for statistics

○ Analysis of variance

○ Bayesian analysis

○ Categorical data analysis

○ Distribution analysis

○ Mixed models

○ Predictive modeling...

Spark MLlib

Scalable machine learning library

○ Basic statistics

○ Classification and regression

○ Collaborative filtering

○ Clustering

○ Dimensionality reduction

○ Feature extraction and transformation...

