Transcript
Page 2: When big data meet python @ COSCUP 2012

2012

About Me

• 賴弘哲 (Jimmy Lai)

• Interests: Data mining, Machine Learning, Natural Language Processing, Distributed Computing, Python

• LinkedIn profile: http://goo.gl/XTEM5

• Currently works at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索公司), doing semantic analysis of big data

Page 3: When big data meet python @ COSCUP 2012

Outline

1. Big Data

a. Concept

b. Technical issues

2. Big Data + Python

a. Related open source tools

b. Example

Page 4: When big data meet python @ COSCUP 2012

Benefits of Big Data

1. Creating transparency

2. Enabling experimentation to discover needs, expose variability, and improve performance

3. Segmenting populations to customize actions

4. Replacing/supporting human decision making with automated algorithms

5. Innovating new business models, products, and services

Source: McKinsey Global Institute (May 2011). Big Data: The next frontier for innovation, competition, and productivity.

e.g. http://www.data.gov/

Shortage of deep data-analysis talent

Page 5: When big data meet python @ COSCUP 2012

Initiative from the White House

• (Mar 2012) Big Data Research and Development Initiative, the White House.

• National Science Foundation encourages education on Big Data.

• The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.

Page 6: When big data meet python @ COSCUP 2012

Big Data Issues

• Collecting (sources: User Generated Content, Machine Generated Data)

• Storage

• Computing

• Analysis

• Visualization

Page 7: When big data meet python @ COSCUP 2012

Big Data Techniques

Stage: Collecting

• Crawler

– Collect raw data

– E.g. Heritrix, Nutch

• Scraping

– Parse information from raw data

– E.g. Yahoo! Pipes, Scrapy
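The scraping step, parsing information out of raw data, can be sketched with the Python standard library alone. This is a minimal illustration, not how Scrapy or Yahoo! Pipes work internally; the `LinkScraper` class and the `RAW_HTML` page are hypothetical names made up for the example.

```python
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical raw page, standing in for what a crawler would download.
RAW_HTML = """
<html><body>
  <h1>Food blog</h1>
  <a href="/post/1">Beef noodle review</a>
  <a href="/post/2">Night market <b>snacks</b></a>
</body></html>
"""

scraper = LinkScraper()
scraper.feed(RAW_HTML)
links = scraper.links
```

The crawler supplies the raw pages; the scraper turns each page into structured records (here, a list of link tuples) that later stages can store and analyze.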

Page 8: When big data meet python @ COSCUP 2012

Big Data Techniques

Stage: Storage

• Big Table

– Distributed key-value storage

– E.g. HBase, Cassandra

• NoSQL

– Does not use SQL for manipulation

– Does not use the relational database model

– E.g. MongoDB, Redis, CouchDB

Page 9: When big data meet python @ COSCUP 2012

Big Data Techniques

Stage: Computing

• Batch

– MapReduce

– E.g. Hadoop

• Real-time

– Stream processing

– E.g. S4, Storm
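The difference between the two computing styles can be sketched in plain Python. This is only an illustration under simple assumptions: the `EVENTS` log, `batch_count`, and `StreamCounter` are hypothetical names, and nothing here uses Hadoop, S4, or Storm.

```python
from collections import Counter

# Hypothetical event log: one (user, action) pair per event.
EVENTS = [("amy", "click"), ("bob", "view"), ("amy", "view"), ("amy", "click")]


def batch_count(events):
    """Batch style: the whole dataset is available, so process it in one job."""
    return Counter(action for _, action in events)


class StreamCounter:
    """Stream style: keep running state and update it as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, action = event
        self.counts[action] += 1
        return self.counts[action]  # an up-to-date answer after every event


batch_result = batch_count(EVENTS)

stream = StreamCounter()
for event in EVENTS:  # in a real system the events would arrive over time
    stream.on_event(event)

stream_result = stream.counts
```

Both approaches reach the same totals; the batch job answers once after reading everything, while the stream processor has a current answer after every event.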

Page 10: When big data meet python @ COSCUP 2012

Big Data Techniques

Stage: Analysis

• Data mining – Weka

• Machine learning – scikit-learn

• Natural language processing – NLTK, Stanford NLP

• Statistics – R

Page 11: When big data meet python @ COSCUP 2012

Big Data Techniques

Stage: Visualization

• Abstract

• Interactive

• E.g. Processing, Gephi, D3.js

Page 12: When big data meet python @ COSCUP 2012

Why Python?

• Good code readability for fast development.

• Scripting language: less code means higher productivity.

• Fast growing among open source communities.

– Commit statistics from ohloh.net

Page 13: When big data meet python @ COSCUP 2012

When Big Data meet Python

• Collecting (sources: User Generated Content, Machine Generated Data)

– Scrapy: scraping framework

• Storage

– PyMongo: Python client for MongoDB

• Computing

– Hadoop streaming: Linux pipe interface

– Disco: lightweight MapReduce in Python

• Analysis

– Pandas: data analysis/manipulation

– Statsmodels: statistics

– NLTK: natural language processing

– Scikit-learn: machine learning

• Visualization

– Matplotlib: plotting

– NetworkX: graph visualization

(Diagram label: Infrastructure)
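Hadoop streaming is described above as a Linux pipe interface: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, with a sort by key in between. A minimal word-count sketch of that contract follows. The function names and the `run_local` helper are hypothetical and simulate the pipe in one process; a real job would run the mapper and reducer as separate scripts reading standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate `cat input | mapper | sort | reducer` inside one process."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["big data meet python", "Python meet big data"])
```

Because the reducer only relies on its input being sorted by key, the same two functions can be tested locally on a small sample before being run on a cluster.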

Page 14: When big data meet python @ COSCUP 2012

When Big Data meet Python

Scrapy: web scraping framework

• Simple and Extensible

• Components:

– Scheduler

– Downloader

– Spider (Scraper)

– Item pipeline
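How the four components hand work to each other can be sketched with a toy crawler in plain Python. This is not Scrapy's API: `FAKE_WEB`, `downloader`, `spider`, `item_pipeline`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("  Review A ", ["/b"]),
    "/b": ("Review B", []),
}


def downloader(url):
    """Downloader: fetch the raw response for a URL (here, a dict lookup)."""
    return FAKE_WEB[url]


def spider(url, response):
    """Spider: parse a response into scraped items and follow-up URLs."""
    text, links = response
    items = [{"url": url, "text": text}] if url != "/" else []
    return items, links


def item_pipeline(item):
    """Item pipeline: clean up each scraped item."""
    return {**item, "text": item["text"].strip()}


def crawl(start_url):
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}
    results = []
    while scheduler:
        url = scheduler.popleft()
        items, links = spider(url, downloader(url))
        results.extend(item_pipeline(item) for item in items)
        for link in links:
            if link not in seen:  # avoid requesting the same URL twice
                seen.add(link)
                scheduler.append(link)
    return results


items = crawl("/")
```

Keeping the stages separate is what makes such a design extensible: each function can be replaced without touching the others.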

Stage: Collecting

http://scrapy.org/

Page 15: When big data meet python @ COSCUP 2012

When Big Data meet Python

MongoDB: NoSQL database

• PyMongo: client for Python

• Document(JSON)-oriented

• No schema

• Scalable

– Auto-sharding

– Replica set

• File storage

• MapReduce aggregation
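The document-oriented, schema-free model can be illustrated without a database server. The sketch below is a plain-Python stand-in, not MongoDB or PyMongo: the `restaurants` data and the `find` helper are hypothetical, and `find` only mimics the idea of querying by example document.

```python
import json

# Documents in one collection need not share the same fields (no schema).
restaurants = [
    {"name": "Noodle House", "city": "Taipei", "tags": ["noodles", "beef"]},
    {"name": "Tea Corner", "city": "Tainan", "rating": 4.5},
    {"name": "Dumpling Bar", "city": "Taipei", "rating": 4.0},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in `query`."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in query.items())
    ]


taipei = find(restaurants, {"city": "Taipei"})
names = [doc["name"] for doc in taipei]

# Documents map directly onto JSON text and back.
restored = json.loads(json.dumps(restaurants))
```

With PyMongo, a comparable query would be passed as a dictionary to a collection's `find` method, and the server rather than a list comprehension would do the matching.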

Stage: Storage

http://www.mongodb.org/

Page 16: When big data meet python @ COSCUP 2012

When Big Data meet Python

Disco

• Distributed computing:

– MapReduce

– Disco distributed file system

• Write code in Python

– Easy and fast to profile

– Easy and fast to debug

Stage: Computing

http://discoproject.org/

Page 17: When big data meet python @ COSCUP 2012

When Big Data meet Python

Pandas

• Data analysis library

• Data structures for fast data manipulation

– Slicing

– Indexing

– Subsetting

• Handling missing data

• Aggregation

• Time series
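The operations listed above can be tried on a small table. The sketch assumes pandas and NumPy are installed; the review counts and column names are hypothetical data made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily review counts for two shops; one value is missing.
dates = pd.date_range("2012-08-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "shop": ["A", "B", "A", "B", "A", "B"],
        "reviews": [3, 5, np.nan, 2, 4, 1],
    },
    index=dates,
)

first_three = df.iloc[:3]                     # slicing by position
one_day = df.loc[pd.Timestamp("2012-08-02")]  # indexing by label
shop_a = df[df["shop"] == "A"]                # subsetting with a boolean mask

filled = df["reviews"].fillna(0)              # handling missing data
per_shop = filled.groupby(df["shop"]).sum()   # aggregation
two_day = filled.resample("2D").sum()         # time series resampling
```

Treating a missing count as zero is only one possible policy; dropping the row or interpolating may suit other data better.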

Stage: Analysis

http://pandas.pydata.org/

Page 18: When big data meet python @ COSCUP 2012

When Big Data meet Python

Statsmodels

• Statistical analysis

• Statistical models

• Fit data with model

• Statistical tests

• Data exploration

• Time series analysis

Stage: Analysis

http://statsmodels.sourceforge.net/

Page 19: When big data meet python @ COSCUP 2012

When Big Data meet Python

scikit-learn

• Machine learning algorithms

• Supervised learning

• Unsupervised learning

• Dataset

• Preprocessing

• Feature extraction

• Model

• Selection

• Pipeline
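Several of the items above (dataset, feature extraction, supervised learning, pipeline) fit together in a few lines. The sketch assumes scikit-learn is installed; the four-review dataset and its labels are hypothetical, and the choice of a bag-of-words model with naive Bayes is only one plausible combination.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy dataset: short reviews labeled positive (1) or negative (0).
texts = [
    "delicious noodles great service",
    "great soup delicious dumplings",
    "terrible service cold food",
    "cold noodles terrible soup",
]
labels = [1, 1, 0, 0]

# Pipeline: feature extraction followed by a supervised model.
model = Pipeline([
    ("features", CountVectorizer()),  # bag-of-words counts
    ("classifier", MultinomialNB()),  # naive Bayes classifier
])
model.fit(texts, labels)

predictions = model.predict(["delicious dumplings", "terrible cold food"])
```

Wrapping both steps in one `Pipeline` means the same vocabulary learned during `fit` is reused automatically at prediction time.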

Stage: Analysis

http://scikit-learn.org/

Page 20: When big data meet python @ COSCUP 2012

When Big Data meet Python

NLTK: Natural Language Toolkit

• Natural language processing

• Annotated corpora and resources

Stage: Analysis

http://nltk.org/

Information extraction workflow: Sentence Segmentation → Tokenization → POS Tagging → Named Entity Recognition → Relation Recognition
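The first two steps of this workflow can be approximated with regular expressions. This is a deliberately naive sketch using only the standard library, not NLTK's implementation; `segment_sentences` and `tokenize` are hypothetical helpers. The later steps (POS tagging, named entity recognition, relation recognition) need trained models, which is what NLTK provides through functions such as `nltk.pos_tag` and `nltk.ne_chunk`.

```python
import re


def segment_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: words (optionally with an apostrophe) or punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


text = "The noodles were great. Was the soup cold? Try the dumplings!"
tokens_per_sentence = [tokenize(s) for s in segment_sentences(text)]
```

Rules this simple break on abbreviations such as "Dr." and on languages without spaces between words, which is one reason to prefer a trained tokenizer.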

Page 21: When big data meet python @ COSCUP 2012

When Big Data meet Python

Matplotlib

• Plotting

– Histograms

– Power spectra

– Bar charts

– Error charts

– Scatter plots

• Full control over plotting details
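A histogram, the first plot type listed, takes only a few calls. The sketch assumes Matplotlib is installed; the review scores are randomly generated hypothetical data, and the figure is rendered into an in-memory buffer so that no display or output file is required.

```python
import io
import random

import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: 500 review scores drawn from a normal distribution.
random.seed(0)
scores = [random.gauss(3.5, 0.8) for _ in range(500)]

fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(scores, bins=20, color="steelblue", edgecolor="white")

# Labels, grid lines, and other details are individually adjustable.
ax.set_title("Distribution of review scores")
ax.set_xlabel("Score")
ax.set_ylabel("Number of reviews")
ax.grid(axis="y", alpha=0.3)

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # pass a file name instead to write to disk
```

The other plot types follow the same pattern, for example `ax.bar` for bar charts, `ax.errorbar` for error charts, and `ax.scatter` for scatter plots.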

Stage: Visualization

http://matplotlib.sourceforge.net/

Page 22: When big data meet python @ COSCUP 2012

When Big Data meet Python

NetworkX

• Graph algorithms and visualization

• Draw graph with layout:

– Circular

– Random

– Spectral

– Spring

– Shell

– Graphviz
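Most of the layouts listed above are available as functions that compute node positions. The sketch assumes NetworkX (with NumPy) is installed; the small example graph is hypothetical, and the Graphviz layout is left out because it needs an extra Graphviz binding.

```python
import networkx as nx

# Hypothetical small graph: a 5-node cycle plus one chord.
graph = nx.cycle_graph(5)
graph.add_edge(0, 2)

# Each layout function returns a dict mapping node -> (x, y) position.
layouts = {
    "circular": nx.circular_layout(graph),
    "random": nx.random_layout(graph, seed=1),
    "spectral": nx.spectral_layout(graph),
    "spring": nx.spring_layout(graph, seed=1),
    "shell": nx.shell_layout(graph),
}

# A graph algorithm on the same object: shortest path between two nodes.
path = nx.shortest_path(graph, source=1, target=3)
```

Any of these position dictionaries can be passed to `nx.draw(graph, pos=...)` to render the graph with Matplotlib.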

Stage: Visualization

http://networkx.lanl.gov/

Page 23: When big data meet python @ COSCUP 2012

聚寶評 www.ezpao.com

Food search engine

Searches food reviews across major blogs

Page 24: When big data meet python @ COSCUP 2012

聚寶評 www.ezpao.com

Semantic analysis search engine

Page 25: When big data meet python @ COSCUP 2012

Analysis of dishes shared by users

Positive/negative review analysis

Review topic analysis

Page 26: When big data meet python @ COSCUP 2012

Thank you for your attention. Q & A

We are hiring!

• Core engine algorithm R&D engineer

• System R&D engineer

• Web application R&D engineer

Oxygen Intelligence Taiwan Limited

引京聚點 知識結構搜索股份有限公司

• About the company: http://www.ezpao.com/about/

• Job openings: http://www.ezpao.com/join/

• Please send your résumé to [email protected]

When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

