Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
DESCRIPTION
Big data analysis relies on handy tools for gaining insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using several Python tools. The flow consists of feature extraction/selection, model training/tuning and evaluation. The tools used in the flow include Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
TRANSCRIPT
![Page 1: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/1.jpg)
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/02/17
![Page 2: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/2.jpg)
Critical Technologies for Big Data Analysis
• Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
Collecting
User Generated Content
Machine Generated Data
Storage
Computing
Analysis
Visualization
Infrastructure C/JAVA
Python/R
Javascript
![Page 3: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/3.jpg)
Fast prototyping - IPython Notebook
• Write Python code in the browser:
– Exploit the remote server resources
– View the graphical results in the web page
– Sketch code pieces as blocks
– Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for more introduction.
![Page 4: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/4.jpg)
Demo Code
• Demo Code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
– Install
$ pip install ipython
– Execution (under ipython_demo dir)
$ ipython notebook --pylab=inline
– Open notebook with browser, e.g. http://127.0.0.1:8888
![Page 5: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/5.jpg)
Machine learning classification
• $X_i = [x_1, x_2, \dots, x_n],\ x_n \in \mathbb{R}$
• $y_i \in \mathbb{N}$
• $\mathit{dataset} = (X, Y)$
• $\mathit{classifier}\ f: y_i = f(X_i)$
![Page 6: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/6.jpg)
Text classification
Feature Generation
Feature Selection
Classification Model Training
Model Parameter Tuning
![Page 7: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/7.jpg)
From: [email protected] (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu

In article <[email protected]>, [email protected] (Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>

I agree with you. Of cause I'll try to be a daemon :-)

Yeh
USC
Dataset: 20 newsgroups dataset
Figure labels: Structured Data (the header fields), Text (the message body)
![Page 8: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/8.jpg)
Dataset in sklearn
• sklearn.datasets
– Toy datasets
– Download data from http://mldata.org repository
• Data format of classification problem
– Dataset
• data: [raw_data or numerical]
• target: [int]
• target_names: [str]
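The three fields above can be pictured with a small stand-in. This is only a sketch: a plain dict replaces the object that the scikit-learn loaders return, and the two articles and their labels are made up for illustration.

```python
# Hypothetical miniature dataset in the data / target / target_names layout.
dataset = {
    "data": [
        "Which graphics card renders polygons fastest?",
        "WHAT car is this!?",
    ],
    # One integer label per article, indexing into target_names.
    "target": [0, 1],
    "target_names": ["comp.graphics", "rec.autos"],
}

# Map each integer label back to its human-readable class name.
label_names = [dataset["target_names"][label] for label in dataset["target"]]

for text, name in zip(dataset["data"], label_names):
    print(name, "<-", text)
```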
![Page 9: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/9.jpg)
Feature extraction from structured data (1/2)
• Count the frequency of each header keyword and select the most frequent keywords as features: ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
• E.g. From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Organization: University of Maryland, College Park
Distribution: None
Lines: 15
| Keyword | Count |
|---|---|
| Distribution | 2549 |
| Summary | 397 |
| Disclaimer | 125 |
| File | 257 |
| Expires | 116 |
| Subject | 11612 |
| From | 11398 |
| Keywords | 943 |
| Originator | 291 |
| Organization | 10872 |
| Lines | 11317 |
| Internet | 140 |
| To | 106 |
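Keyword counts like these could be gathered roughly as follows. This is a sketch under assumptions: the two messages are made up, the regular expression is one plausible definition of a header keyword, and each keyword is counted once per message, which may differ from how the demo notebook counts.

```python
import re
from collections import Counter

# Two made-up messages in the "Keyword: value" header style of the example.
messages = [
    "From: alice@example.com\nSubject: WHAT car is this!?\nLines: 15\n\nbody text",
    "From: bob@example.com\nSubject: Re: Newsgroup Split\n"
    "Organization: USC\nLines: 18\n\nmore text",
]

# Assumption: a header keyword is a capitalised word at the start of a line,
# immediately followed by a colon.
HEADER_RE = re.compile(r"^([A-Z][\w-]*):", re.MULTILINE)

def count_keywords(docs):
    """Count in how many documents each header keyword appears."""
    counts = Counter()
    for doc in docs:
        counts.update(set(HEADER_RE.findall(doc)))
    return counts

keyword_counts = count_keywords(messages)
print(keyword_counts.most_common())
```

Keywords that occur in nearly every message are the natural candidates for features, since rare ones would be missing from most rows.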
![Page 10: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/10.jpg)
Feature extraction from structured data (2/2)
• Separate structured data and text data
– Text data starts from the "Lines:" header
• Transform the token dictionaries into a numerical matrix with sklearn.feature_extraction.DictVectorizer
• E.g.
[{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
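To make the mapping in the example concrete, here is a pure-Python sketch of the idea behind a dictionary vectorizer. It is not scikit-learn's implementation; the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
def fit_vocabulary(records):
    """Assign a column index to every key seen in the records (sorted order)."""
    keys = sorted({key for record in records for key in record})
    return {key: column for column, key in enumerate(keys)}

def transform(records, vocabulary):
    """Turn each dict into a dense row; keys absent from a record stay 0."""
    rows = []
    for record in records:
        row = [0] * len(vocabulary)
        for key, value in record.items():
            if key in vocabulary:  # ignore keys not seen while fitting
                row[vocabulary[key]] = value
        rows.append(row)
    return rows

records = [{"a": 1, "b": 1}, {"c": 1}]
vocabulary = fit_vocabulary(records)
matrix = transform(records, vocabulary)
print(matrix)  # [[1, 1, 0], [0, 0, 1]]
```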
![Page 11: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/11.jpg)
Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
– Transform articles into token-count matrix
• TfidfVectorizer
– Transform articles into token-TFIDF matrix
• Usage:
– fit(): construct token dictionary given dataset
– transform(): generate numerical matrix
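The fit/transform split can be illustrated with a toy token counter. This is a simplified sketch, not scikit-learn's CountVectorizer: the functions below are hypothetical stand-ins, the output is a dense list of lists, and the two articles are made up.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit(articles):
    """Construct the token dictionary: token -> column index."""
    tokens = sorted({token for article in articles for token in tokenize(article)})
    return {token: column for column, token in enumerate(tokens)}

def transform(articles, vocabulary):
    """Generate the token-count matrix, one row per article."""
    matrix = []
    for article in articles:
        counts = Counter(tokenize(article))
        row = [0] * len(vocabulary)
        for token, column in vocabulary.items():
            row[column] = counts.get(token, 0)
        matrix.append(row)
    return matrix

articles = ["the cat sat", "the cat saw the dog"]
vocabulary = fit(articles)
count_matrix = transform(articles, vocabulary)
print(sorted(vocabulary))  # ['cat', 'dog', 'sat', 'saw', 'the']
print(count_matrix)        # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Keeping the two steps separate means the dictionary learned from the training articles can be reused unchanged on the test articles.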
![Page 12: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/12.jpg)
Text Feature extraction
• Analyzer
– Preprocessor: str -> str
• Default: lowercase
• Extra: strip_accents – handle unicode chars
– Tokenizer: str -> [str]
• Default: re.findall(r"(?u)\b\w\w+\b", string)
– Analyzer: str -> [str]
1. Call preprocessor and tokenizer
2. Filter stopwords
3. Generate n-gram tokens
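The three analyzer steps can be sketched in plain Python. This illustrates the pipeline only and is not scikit-learn's code: the stopword set is a made-up example, and accent stripping is done here with `unicodedata`, which is an assumption about one reasonable way to do it.

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "of"}  # hypothetical stopword list

def preprocess(text):
    """str -> str: lowercase, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text):
    """str -> [str]: words of at least two word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text)

def analyze(text, ngram_range=(1, 2)):
    """str -> [str]: preprocess, tokenize, filter stopwords, build n-grams."""
    tokens = [t for t in tokenize(preprocess(text)) if t not in STOPWORDS]
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

result = analyze("The Café of Paris")
print(result)  # ['cafe', 'paris', 'cafe paris']
```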
![Page 13: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/13.jpg)
![Page 14: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/14.jpg)
Feature Selection
• Decrease the number of features:
– Reduce the resource usage for faster learning
– Remove the most common and the rarest tokens (words that carry little information):
• Parameters for Vectorizer:
– max_df
– min_df
– max_features
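The effect of the three parameters can be sketched with document frequencies computed by hand. This is an illustration under assumptions, not the vectorizer's code: `select_tokens` is a hypothetical helper, `min_df` is treated as an absolute document count and `max_df` as a proportion of documents, and `max_features` keeps the tokens with the highest total count.

```python
from collections import Counter

def select_tokens(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    """Return the sorted tokens that survive document-frequency filtering."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # number of documents containing each token
    total_freq = Counter() # number of occurrences across the corpus
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
        total_freq.update(doc)
    kept = [
        token for token, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    ]
    # Most frequent tokens first; ties broken alphabetically.
    kept.sort(key=lambda token: (-total_freq[token], token))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "flew"],
]
selected = select_tokens(docs, min_df=2, max_df=0.75)
top_two = select_tokens(docs, min_df=2, max_df=0.75, max_features=2)
print(selected)  # ['cat', 'flew', 'sat']
print(top_two)   # ['cat', 'flew']
```

Here "the" is dropped for appearing in every document, while "dog" and "bird" are dropped for appearing in only one.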
![Page 15: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/15.jpg)
Classification Model Training
• Common classifiers in sklearn:
– sklearn.linear_model
– sklearn.svm
• Usage:
– fit(X, Y): train the model
– predict(X): get predicted Y
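The fit/predict convention can be shown with a toy classifier. The class below is a hypothetical nearest-centroid stand-in written for illustration; it is not one of the models in sklearn.linear_model or sklearn.svm, and the data is made up.

```python
class ToyCentroidClassifier:
    """Predicts the class whose mean training vector is closest."""

    def fit(self, X, Y):
        sums, counts = {}, {}
        for x, y in zip(X, Y):
            if y not in sums:
                sums[y] = [0.0] * len(x)
                counts[y] = 0
            sums[y] = [s + v for s, v in zip(sums[y], x)]
            counts[y] += 1
        # One centroid (mean vector) per class label.
        self.centroids_ = {
            y: [s / counts[y] for s in sums[y]] for y in sums
        }
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_,
                key=lambda y: squared_distance(x, self.centroids_[y]))
            for x in X
        ]

X_train = [[1, 0], [2, 0], [0, 1], [0, 2]]
Y_train = [0, 0, 1, 1]
model = ToyCentroidClassifier().fit(X_train, Y_train)
predicted = model.predict([[3, 0], [0, 3]])
print(predicted)  # [0, 1]
```

Because every classifier exposes the same two methods, one model can be swapped for another without changing the rest of the flow.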
![Page 16: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/16.jpg)
Cross Validation
• When tuning the parameters of a model, let each article serve alternately as training and testing data, so that the chosen parameters are not fitted to a few specific articles.
– from sklearn.cross_validation import KFold
– for train_index, test_index in KFold(10, 2):
• train_index = [5 6 7 8 9]
• test_index = [0 1 2 3 4]
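The index split shown above can be reproduced without any library. This is a sketch of unshuffled k-fold splitting, with `k_fold_indices` as a hypothetical helper name. Note that newer scikit-learn releases import the class from sklearn.model_selection and call it as `KFold(n_splits=2).split(X)` instead of the sklearn.cross_validation form on the slide.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_index, test_index) pairs for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_index = indices[start:start + size]
        train_index = indices[:start] + indices[start + size:]
        yield train_index, test_index
        start += size

folds = list(k_fold_indices(10, 2))
for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

Every sample lands in the test set exactly once, so averaging the score over the folds uses all of the data for evaluation.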
![Page 17: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/17.jpg)
Performance Evaluation
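• $\mathit{precision} = \dfrac{tp}{tp + fp}$
• $\mathit{recall} = \dfrac{tp}{tp + fn}$
• $f_1\text{-score} = \dfrac{2 \times \mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}$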
• sklearn.metrics
– precision_score
– recall_score
– f1_score
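The three formulas can be checked by hand on a small binary example. This is a plain-Python sketch for one positive class; `precision_recall_f1` is a hypothetical helper, the labels are made up, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
# tp = 2, fp = 1, fn = 1, so all three scores equal 2/3.
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```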
Source: http://en.wikipedia.org/wiki/Precision_and_recall
![Page 18: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/18.jpg)
Visualization
1. Matplotlib
2. plot() function of Series, DataFrame
![Page 19: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib](https://reader030.vdocuments.site/reader030/viewer/2022013105/554a178ab4c90507558b5273/html5/thumbnails/19.jpg)
Experiment Result
• Future work:
– Feature selection by statistics or dimension reduction
– Parameter tuning
– Ensemble models