Using MySQL, Hadoop and Spark for Data Analysis
Alexander Rubin, Principal Architect, Percona — September 21, 2015
www.percona.com
About Me
• Working with MySQL for over 10 years
– Started at MySQL AB, Sun Microsystems, Oracle (MySQL Consulting)
– Worked at Hortonworks (Hadoop company)
– Joined Percona in 2013

Alexander Rubin, Principal Consultant, Percona
Agenda
• Why Hadoop?
• Why Spark?
• Hadoop
– Big Data Analytics with Hadoop
– Star Schema benchmark
– MySQL and Hadoop Integration
• Spark examples
Why Hadoop and not MySQL?
• Petabytes of data
• Unstructured/raw data
• No normalization in the first place
• Data is collected at a high rate
http://en.wikipedia.org/wiki/Big_data#Definition
Why Spark?
• Claimed to be faster: in-memory processing
• Direct access to data sources (e.g. MySQL):
  >>> df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost?user=root", dbtable="ontime.ontime_sm")
• Native R (and Python) integration
Inside Hadoop
Where is my data?
• HDFS = data is spread across MANY machines
• Amazon S3 is also supported
HDFS (Distributed File System)
Data Nodes
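The HDFS layout described above can be sketched in plain Python. This is a toy model (not HDFS's real, rack-aware placement algorithm); the block size and replication factor are the common defaults, and the node names are made up:

```python
# Toy model of HDFS block placement: illustrates "data is spread
# across MANY machines", not the actual placement algorithm.
BLOCK_SIZE = 128 * 2**20          # 128 MB, a common HDFS default
REPLICATION = 3                   # default replication factor

def place_blocks(file_size, nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        # round-robin placement; real HDFS is rack-aware and smarter
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node%d" % i for i in range(1, 6)]               # 5 data nodes
placement = place_blocks(file_size=1 * 2**30, nodes=nodes)  # a 1 GB file
print(len(placement))   # 8 blocks of 128 MB
print(placement[0])     # each block lives on 3 different nodes
```

A lost node loses only one replica of each block it held; the other two copies stay readable.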
The Famous Picture!
• MapReduce (Distributed Programming Framework)
• HDFS (Distributed File System)
• Hive (SQL)
• Impala (Faster SQL)
• Other components (Pig, HBase, etc.)
• HDFS = data is spread across MANY machines
• Files are written in "append-only" mode
Hive
• SQL layer for Hadoop
• Translates SQL to MapReduce jobs
• Schema on Read – does not check the data on load
Hive Example
hive> create table lineitem (
  l_orderkey int,
  l_partkey int,
  l_suppkey int,
  l_linenumber int,
  l_quantity double,
  l_extendedprice double,
  ...
  l_shipmode string,
  l_comment string)
row format delimited fields terminated by '|';
Hive Example
hive> create external table lineitem (
  ...
) row format delimited fields terminated by '|'
location '/ssb/lineorder/';
Impala: Faster SQL – not based on MapReduce; reads data directly from HDFS
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/Installing-and-Using-Impala.html
Hadoop vs MySQL
Hadoop vs. MySQL for Big Data

  MySQL         |  Hadoop
  --------------|-----------------
  Indexes       |  Full table scan
  Partitioning  |  Partitioning
  "Sharding"    |  Map/Reduce
Hadoop (vs. MySQL)
• No indexes
• All processing is a full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)
MySQL: 1 query = 1 CPU core
Indexes (B-Tree) for Big Data: the challenge
• Creating an index over petabytes of data?
• Updating an index over petabytes of data?
• Reading a terabyte-sized index?
• Random reads over a petabyte?
Full scan in parallel is better for big data
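The claim above (a distributed full scan beats index maintenance at this scale) can be illustrated with a toy Python sketch: partition the data and scan the partitions concurrently, MapReduce-style. The data and the predicate are made up; this stands in for the idea, not for Hadoop itself:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table": 1 million rows; a full scan filters without any index.
rows = list(range(1_000_000))

def scan_chunk(chunk):
    """Scan one partition; each worker does a plain sequential read."""
    return sum(1 for r in chunk if r % 1000 == 0)

# Split into 4 partitions and scan them in parallel, MapReduce-style:
# the "map" is scan_chunk, the "reduce" is the final sum().
n = len(rows) // 4
chunks = [rows[i*n:(i+1)*n] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(scan_chunk, chunks))
print(total)  # 1000 matching rows
```

No index to build or update: adding data only adds partitions to scan, which is why this model tolerates petabytes where a B-tree does not.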
ETL vs ELT
ETL:
1. Extract data from the external source
2. Transform before loading
3. Load data into MySQL

ELT:
1. Extract data from the external source
2. Load data into Hadoop (as is)
3. Transform/analyze/visualize the data, with parallelism
Example: Loading wikistat into MySQL
1. Extract data from the external source (uncompress!)
2. Load data into MySQL and transform

Wikipedia page counts – download, >10 TB:

load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
    title_md5 = unhex(md5(title));
http://dumps.wikimedia.org/other/pagecounts-raw/
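What that LOAD DATA statement does per line can be mirrored in plain Python. The field names follow the slide's column list and SET clause; the sample line and timestamp below are made up:

```python
import hashlib
from datetime import datetime

def parse_wikistat_line(line, datestr):
    """Mirror the LOAD DATA ... SET clause: split on spaces, then derive
    request_date from the file's timestamp and title_md5 from the title."""
    project_name, title, num_requests, content_size = line.split(" ")
    return {
        "project_name": project_name,
        "title": title,
        "num_requests": int(num_requests),
        "content_size": int(content_size),
        # equivalent of STR_TO_DATE('$datestr', '%Y%m%d %H%i%S')
        "request_date": datetime.strptime(datestr, "%Y%m%d %H%M%S"),
        # equivalent of unhex(md5(title)): the raw 16-byte digest
        "title_md5": hashlib.md5(title.encode("latin1")).digest(),
    }

row = parse_wikistat_line("en Main_Page 42 1500", "20080112 120000")
print(row["request_date"])    # 2008-01-12 12:00:00
print(len(row["title_md5"]))  # 16 bytes
```

Storing the fixed-width md5 digest instead of the raw title is what makes the title usable in a compact index.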
Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ indexes)
• 1 hour of wikistats = ~1 minute to load
• 1 year will load in ~6 days (8765.81 hours in 1 year)
• 6 years: more than 1 month to load
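The arithmetic behind those estimates, as a quick check (figures taken from the slide):

```python
HOURS_PER_YEAR = 8765.81           # from the slide
MINUTES_PER_HOUR_OF_DATA = 1.0     # "1 hour of wikistats = 1 minute" to load

# Loading a year of data takes HOURS_PER_YEAR minutes of wall time.
days_per_year = HOURS_PER_YEAR * MINUTES_PER_HOUR_OF_DATA / 60 / 24
print(round(days_per_year, 1))      # ~6.1 days to load one year
print(round(6 * days_per_year, 1))  # ~36.5 days for six years: over a month
```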
Loading wikistat into Hadoop
• Just copy the files to HDFS and create the Hive table structure
• How fast is a search? Depends on the number of nodes
• 1000-node Spark cluster:
– 4.5 TB, 104 billion records
– Execution time: 45 sec to scan 4.5 TB of data
• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
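A back-of-envelope sanity check on those figures (the derived throughput numbers are mine, not from the slide):

```python
TB = 10**12
data_bytes = 4.5 * TB     # scanned data, from the slide
exec_seconds = 45         # execution time, from the slide
nodes = 1000              # cluster size, from the slide

cluster_throughput = data_bytes / exec_seconds    # bytes/sec, whole cluster
per_node_throughput = cluster_throughput / nodes  # bytes/sec per node
print(cluster_throughput / 10**9)   # 100 GB/s aggregate
print(per_node_throughput / 10**6)  # 100 MB/s per node
```

About 100 MB/s per node is roughly the sequential read rate of a single spinning disk, which is why adding nodes scales the scan almost linearly.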
Amazon Elastic MapReduce (EMR)
Hive on Amazon S3
hive> create external table lineitem (
  l_orderkey int,
  l_partkey int,
  l_suppkey int,
  l_linenumber int,
  l_quantity double,
  l_extendedprice double,
  l_discount double,
  l_tax double,
  l_returnflag string,
  l_linestatus string,
  l_shipdate string,
  l_commitdate string,
  l_receiptdate string,
  l_shipinstruct string,
  l_shipmode string,
  l_comment string)
row format delimited fields terminated by '|'
location 's3n://data.s3ndemo.hive/tpch/lineitem';
Amazon Elastic MapReduce (EMR)
• Store data on S3
• Prepare a SQL file (create table, select, etc.)
• Run Elastic MapReduce: it starts N instances, runs the job, then stops them
• Results are written back to S3
Hadoop and MySQL Together: Integrating MySQL and Hadoop
Archiving to Hadoop

[Diagram: MySQL (OLTP / web site; goal: keep 100 GB – 1 TB) -> ELT -> Hadoop cluster (BI / data analysis; can store petabytes for archiving)]
Integration: Hadoop -> MySQL
MySQL -> Hadoop: Sqoop
$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import
MySQL to Hadoop: Hadoop Applier
• Only inserts are supported

[Diagram: the Hadoop Applier connects to the replication master like a regular MySQL slave, reads its binlogs, and writes the events into the Hadoop cluster]
MySQL to Hadoop: Hadoop Applier
• Download from: http://labs.mysql.com/
• Still an alpha version at the time of writing
• You have to write your own code to process the incoming data
Apache Spark
Start Spark (no Hadoop)
root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./sbin/start-master.sh
less ../logs/spark-root-org.apache.spark.deploy.master.Master-1-thor.out

15/08/25 11:21:21 INFO Master: Starting Spark master at spark://thor:7077
15/08/25 11:21:21 INFO Master: Running Spark version 1.4.1
15/08/25 11:21:21 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/08/25 11:21:21 INFO MasterWebUI: Started MasterWebUI at http://10.60.23.188:8080

root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./sbin/start-slave.sh spark://thor:7077
Prepare Spark
root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# cat env.sh
export MASTER=spark://`hostname`:7077
export SPARK_CLASSPATH=/root/spark/mysql-connector-java-5.1.36-bin.jar
# optional
export SPARK_WORKER_MEMORY=6g
export SPARK_MEM=6g
export SPARK_DAEMON_MEMORY=6g
export SPARK_DAEMON_JAVA_OPTS="-Dspark.executor.memory=6g"
PySpark and MySQL
root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Python version 2.7.3 (default, Jun 22 2015 19:33:41)
SparkContext available as sc, HiveContext available as sqlContext.
PySpark and MySQL
df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost?user=root", dbtable="world.City")
df.registerTempTable("City")
res = sqlContext.sql("select * from City order by Population desc limit 10")
for x in res.collect():
    print x.Name, x.Population
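Without a Spark or MySQL server at hand, what that query computes can be sketched with sqlite3 as a stand-in. The table and column names follow MySQL's `world` sample database; the rows below are made up:

```python
import sqlite3

# In-memory stand-in for MySQL's world.City table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City (Name TEXT, Population INTEGER)")
conn.executemany("INSERT INTO City VALUES (?, ?)", [
    ("Mumbai", 10500000),
    ("Seoul", 9981619),
    ("Sao Paulo", 9968485),
    ("Shanghai", 9696300),
])
# The same "top 10 by population" query the Spark example sends to MySQL:
res = conn.execute(
    "SELECT Name, Population FROM City ORDER BY Population DESC LIMIT 10"
).fetchall()
for name, population in res:
    print(name, population)
```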
SparkSQL and MySQL
root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./bin/spark-sql 2>error.log

CREATE TEMPORARY TABLE City
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost?user=root",
  dbtable "world.City"
);
select * from City order by Population desc limit 10;
PySpark and GZIP file

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/home/consultant/wikistats/dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/pagecounts-20080112-120000.gz")
parts = lines.map(lambda l: l.split(" "))
wiki = parts.map(lambda p: Row(project=p[0], url=p[1], uniq_visitors=int(p[2]), total_visitors=int(p[3])))
# Infer the schema, and register the DataFrame as a table.
schemaWiki = sqlContext.createDataFrame(wiki)
schemaWiki.registerTempTable("wikistats")
res = sqlContext.sql("SELECT * FROM wikistats limit 100")
for x in res.collect():
    print x.project, x.url, x.uniq_visitors
PySpark and GZIP file

# Load a whole directory of files and convert each line to a Row.
lines = sc.textFile("/home/consultant/wikistats/dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/")
parts = lines.map(lambda l: l.split(" "))
wiki = parts.map(lambda p: Row(project=p[0], url=p[1]))
# Infer the schema, and register the DataFrame as a table.
schemaWiki = sqlContext.createDataFrame(wiki)
schemaWiki.registerTempTable("wikistats")
res = sqlContext.sql("SELECT url, count(*) as cnt FROM wikistats where project='en' group by url order by cnt desc limit 10")
for x in res.collect():
    print x.url, x.cnt
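The aggregation in that query (top URLs by row count for the "en" project) can be checked in plain Python with collections.Counter; the sample lines below are made up:

```python
from collections import Counter

# Sample wikistat lines: "project url requests bytes" (illustrative only).
lines = [
    "en Main_Page 42 1000",
    "en Main_Page 17 1000",
    "en Spark 5 200",
    "fr Accueil 99 500",
]
# Equivalent of: SELECT url, count(*) ... WHERE project='en' GROUP BY url
counts = Counter(l.split(" ")[1] for l in lines if l.split(" ")[0] == "en")
for url, cnt in counts.most_common(10):
    print(url, cnt)
```

Note it counts rows per URL (like count(*)), not the sum of the request column.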
PySpark and GZIP file

Cpu0  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1  :  5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3  : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  :  0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9  : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 94.4%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 :  1.3%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 95.0%us, 0.0%sy, 0.0%ni,  5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us, 0.0%sy, 0.0%ni,  5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us, 0.0%sy, 0.0%ni,  5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers