2016 Cymer Intern

SR Text Mining Akhilesh Aji 8/5/16

Upload: akhilesh-aji

Post on 13-Apr-2017


TRANSCRIPT

Page 1: 2016 Cymer Intern

SR Text Mining
Akhilesh Aji
8/5/16

Page 2: 2016 Cymer Intern

Slide 2

Educational Background

Education:
• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date May 2019
• Big Data Club: entity tagging news sources

Page 3: 2016 Cymer Intern

Project Goals
Slide 3

Objective:
To build a text mining model that indicates when the rate of a top keyword changes or when a new keyword emerges.

Background:
• A Service Request (SR) is generated whenever an FSE works on a laser
• Some SRs do not replace any part
• An SR's main free-text fields are: Customer Description, Problem Found, Task Description, and Resolution

Page 4: 2016 Cymer Intern

Project Goal
Slide 4

[Flowchart. Labels: SR; Part Replacement (Yes / No); Analysis: EWI, Text Mining; Automated Monitoring]

Page 5: 2016 Cymer Intern

Data Pipeline
Slide 5

User filters which SRs to process

Extract SR's text
• Customer Description
• Problem Found
• Task Description

Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ

Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW

Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP

Save SR number with terms

Calculate frequency

Display in Spotfire

Page 6: 2016 Cymer Intern

Pre-Processing

Tokenize
• Example: "DOSE ERROR COMMUNICATION …"
• Result: ["DOSE", "ERROR", "COMMUNICATION", …]

Group
• Some words mean more as a group
• ["DOSE_ERROR", "ERROR_COMMUNICATION", …]

Stem
• Many words mean roughly the same thing
• Optimizing, optimized, optimal, optimize all become optimiz
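The three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the deck's actual script: the deck does not name its stemmer, so `SUFFIXES` and the toy `stem` function are assumptions that only cover simple suffix forms such as OPTIMIZING, OPTIMIZED, and OPTIMIZES.

```python
import re

# Hypothetical suffix list for a toy stemmer (assumption: the deck does not
# say which stemming algorithm was used).
SUFFIXES = ("ING", "ED", "ES", "E", "S")

def tokenize(text):
    """Split free text into upper-case word tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())

def group(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("DOSE ERROR COMMUNICATION")
pairs = group(tokens)
stems = [stem(t) for t in ["OPTIMIZING", "OPTIMIZED", "OPTIMIZES"]]
```

Here `tokens` holds the three words from the slide's example, `pairs` holds the two grouped terms, and every entry of `stems` is OPTIMIZ.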

© 2016 Cymer, LLC


Page 7: 2016 Cymer Intern

Replace

Stemming doesn't handle all derivations of a word
• NEON, NE -> NE
• SOFTWARE, SW, SOFT_WARE -> SW

Hand selection of similar words

Deep learning spell correction
• Not all words in an SR have a dictionary spelling
• Find similarly used words according to word2vec (Python API)
• Compare spelling according to Levenshtein distance
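The replacement step can be sketched as a hand-built lookup table plus a Levenshtein comparison. This is an illustration under assumptions: `similar_words` is a hypothetical stand-in for the neighbours a word2vec model would return (no word2vec library is called here), and the `max_distance` threshold of 2 is an arbitrary choice, not a value from the deck.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hand-selected replacements, based on the slide examples.
REPLACEMENTS = {"NEON": "NE", "SOFTWARE": "SW", "SOFT_WARE": "SW"}

def closest_spelling(word, candidates, max_distance=2):
    """Return the candidate with the smallest edit distance, or None if none is close."""
    best = min(candidates, key=lambda c: levenshtein(word, c))
    return best if levenshtein(word, best) <= max_distance else None

def normalize(word, similar_words=()):
    """Map a word to its canonical form, trying a spelling fix first."""
    if word not in REPLACEMENTS and similar_words:
        word = closest_spelling(word, similar_words) or word
    return REPLACEMENTS.get(word, word)

# Hypothetical misspelling; the candidate list stands in for word2vec neighbours.
fixed = normalize("SFOTWARE", ["SOFTWARE", "LASER"])
```

Restricting the spelling comparison to similarly used words keeps short, unrelated terms from being merged just because their spellings happen to be close.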


Page 8: 2016 Cymer Intern

Remove
Slide 8

Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex

Hand selected words that should be removed: GROUP, END

Words only to be used in pairs: INCREASE, MO
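A possible shape for the removal filter is sketched below. The word lists come from the slides; the two regular expressions are assumptions, since the deck does not show the patterns it used for dates and numbers.

```python
import re

STOPWORDS = {"AND", "IS", "BUT"}     # stopword examples from the pipeline slide
REMOVED = {"DUE", "END", "GROUP"}    # hand-selected words from the slides
PAIR_ONLY = {"INCREASE", "MO"}       # kept only inside grouped terms such as MO_PRA

# Assumed patterns (not from the deck): simple numeric dates and plain numbers.
DATE = re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")
NUMBER = re.compile(r"\d+(\.\d+)?")

def keep(term):
    """Return True if the term should stay in the analysis."""
    if DATE.fullmatch(term) or NUMBER.fullmatch(term):
        return False
    return term not in STOPWORDS | REMOVED | PAIR_ONLY

terms = ["MO", "MO_PRA", "IS", "8/5/16", "42", "GROUP", "DOSE_ERROR", "INCREASE"]
kept = [t for t in terms if keep(t)]
```

Because grouped terms are separate strings, a pair-only word such as MO is dropped on its own while MO_PRA survives.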


Page 9: 2016 Cymer Intern

Methodology
Slide 9

Recurring Keywords:
• Python script embedded in Spotfire
• Each word stored once for overall usage and once for its given month
• Word maps to a unique set of SRs that the word is used in
• Number of total and monthly SRs are kept
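The bookkeeping described in these bullets could look like the sketch below. The class name `KeywordIndex`, the `rate` method, and the SR numbers and month strings in the usage lines are all hypothetical; the deck only says what is stored, not how.

```python
from collections import defaultdict

class KeywordIndex:
    """Tracks which SRs use each word, overall and per month."""

    def __init__(self):
        self.overall = defaultdict(set)       # word -> SR numbers, all months
        self.monthly = defaultdict(set)       # (word, month) -> SR numbers
        self.all_srs = set()                  # every SR seen
        self.srs_by_month = defaultdict(set)  # month -> SR numbers

    def add(self, sr_number, month, words):
        self.all_srs.add(sr_number)
        self.srs_by_month[month].add(sr_number)
        for word in words:
            # Sets make repeated words within one SR count once.
            self.overall[word].add(sr_number)
            self.monthly[(word, month)].add(sr_number)

    def rate(self, word, month=None):
        """Fraction of SRs (overall, or within one month) that use the word."""
        if month is None:
            total = len(self.all_srs)
            hits = len(self.overall.get(word, ()))
        else:
            total = len(self.srs_by_month.get(month, ()))
            hits = len(self.monthly.get((word, month), ()))
        return hits / total if total else 0.0

index = KeywordIndex()
index.add("SR001", "2016-06", ["DOSE_ERROR", "NE"])
index.add("SR002", "2016-07", ["NE", "NE", "SW"])
index.add("SR003", "2016-07", ["SW"])
```

Storing sets of SR numbers rather than raw counts means a word repeated inside one SR does not inflate its rate.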

Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
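The deck implements this test in R and does not say how it is parameterised. The Python sketch below shows one plausible setup, labelled as an assumption: pool the SRs of the two months, treat the latest month as a draw without replacement, and ask how likely it is to see at least the observed number of SRs containing the keyword. The function names and the example counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items without replacement from N items, K of them marked."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

def keyword_p_value(prev_hits, prev_total, curr_hits, curr_total):
    """Assumed setup: the pooled two months form the population, the latest month is the draw."""
    N = prev_total + curr_total   # SRs in both months
    K = prev_hits + curr_hits     # SRs that contain the keyword
    return hypergeom_upper_tail(curr_hits, N, K, curr_total)

# Hypothetical counts: 2 of 100 SRs last month versus 12 of 100 this month.
p_rising = keyword_p_value(prev_hits=2, prev_total=100, curr_hits=12, curr_total=100)
```

A small p-value suggests the keyword appears more often in the latest month than chance alone would explain; the cut-off for flagging a keyword is a separate choice that the deck does not state.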

Page 10: 2016 Cymer Intern

Project Outcomes
Slide 10

Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display

SR Script

Page 11: 2016 Cymer Intern

Text Mining Extension: Background
Slide 11

Reliability manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date

Objective: To create and train a model that predicts the category for a given SR.

Page 12: 2016 Cymer Intern

Text Mining Extension: Methodology
Slide 12

Methodology
• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs

Achieved 75% accuracy using a training set of 12,000 SRs and a testing set of 1,000 SRs

Example documents:
• "This is an example document. This document means something"
• "This second document represents something else"

Term counts:
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

TF-IDF weights:
[0.34, 0.48, 0., 0.34, …]
[0., 0.33, 0.47, 0., …]
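The count and TF-IDF steps can be reproduced on the two example documents in plain Python. The exact weighting formula is not given in the deck, so this sketch assumes a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. With an alphabetically sorted vocabulary the counts match the vectors above, and the weights come out close to the listed values.

```python
import math
import re

docs = [
    "This is an example document. This document means something",
    "This second document represents something else",
]

def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Term counts: one row per document, one column per vocabulary term.
counts = [[doc.count(term) for term in vocab] for doc in tokenized]

# Assumed weighting: smoothed inverse document frequency.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight counts by IDF, then scale the row to unit length."""
    weights = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights] if norm else weights

tfidf = [tfidf_row(row) for row in counts]
```

Rows like these would be the feature vectors passed to the SVM classifier, with the manually assigned category of each SR as the label.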

Page 13: 2016 Cymer Intern