Natural Language Processing @ Emory
Eugene Agichtein (Math & Computer Science and CCI), Andrew Post (CCI and Biomedical Engineering (?))


Post on 27-Dec-2015


TRANSCRIPT

1

Natural Language Processing @ Emory

Eugene Agichtein
Math & Computer Science and CCI

Andrew Post
CCI and Biomedical Engineering (?)

2

Projects in the IR Lab (Agichtein Lab)

and Question Answering

Patterns in Text (Author Behavior)

Patterns in Search (Searcher Behavior)

Structuring Information in Bio- and Medical text

Discovering Implicit Networks: Entity, Relation, and Event Extraction

Content Creation and Discovery in Social Media

Understanding Searcher Inference and Decision Process

Question Answering

3

NLP & Text Mining Projects in IRLab

EMText: Information Extraction from Text in Electronic Medical Records

Other projects:
– Collaborative filtering for Med. Literature
– Recognizing textual entailment (TAC 2008 RTE track)
– Web-scale semantic network extraction

4

Information Extraction From EMR Text

Electronic Medical Records (EMRs) contain important metadata for analysis, data mining, and decision support
– Example: a patient who has had diabetes should have a different interpretation of MPI results; it depends on how long, how severe, and how long since it has been controlled
– This information often resides in the text of the EMR (physician/nurse reports, notes, discharge summaries)

Challenges:
– Access to data
– Inconsistent information
– Little or no manually labeled data

5

I2B2 NLP 2008 Obesity Challenge (SUNY/MIT/Partners Healthcare)

Participated in the I2B2 2008 NLP Obesity Challenge
– The Challenge: to build systems that will correctly replicate the textual and intuitive judgments of the obesity experts on obesity and [15] co-morbidities based on the narrative patient records.

Our approach: machine learning over lexical, semantic, and statistical features
– Words, phrases, UMLS terms in text
– Negation
– Corpus co-occurrence statistics
– SVM, boosting, TBL to combine predictions

Outcome:
– Much room for improvement exists both for accuracy and efficiency; great learning experience

I2B2 NLP Challenge 2010

7

User Behavior: The 3rd Dimension of the Web

Amount exceeds web content and structure
– Published: 4Gb/day; Social Media: 10Gb/day
– Page views: 100Gb/day

[Andrew Tomkins, Yahoo! Search, 2007]

8

Web search user behavior: goldmine of noisy data

[Figure: Relative clickthrough for queries with known relevant results in position 1 and 3, respectively. X-axis: Result Position (1, 2, 3, 5, 10); y-axis: Relative Click Frequency; series: All queries, PTR=1, PTR=3]

Higher clickthrough at top non-relevant than at top relevant document

9

Approach: go beyond clickthrough/download counts

Presentation
  ResultPosition        Position of the URL in current ranking
  QueryTitleOverlap     Fraction of query terms in result title

Clickthrough
  DeliberationTime      Seconds between query and first click
  ClickFrequency        Fraction of all clicks landing on page
  ClickDeviation        Deviation from expected click frequency

Browsing
  DwellTime             Result page dwell time
  DwellTimeDeviation    Deviation from expected dwell time for query
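A few of the features in this table can be sketched in code. This is only an illustration under assumptions that are not on the slide: whitespace tokenization, a "deviation" computed as a plain difference from an expected value supplied by the caller, and hypothetical function names.

```python
def query_title_overlap(query, title):
    """Fraction of query terms that also appear in the result title.

    Assumption: lowercase whitespace tokenization (the slide does not say).
    """
    q_terms = query.lower().split()
    if not q_terms:
        return 0.0
    title_terms = set(title.lower().split())
    return sum(term in title_terms for term in q_terms) / len(q_terms)


def click_frequency(clicks_on_page, total_clicks):
    """Fraction of all clicks for a query that landed on this page."""
    return clicks_on_page / total_clicks if total_clicks else 0.0


def deviation(observed, expected):
    """Assumed form of ClickDeviation / DwellTimeDeviation: observed minus expected."""
    return observed - expected
```

The expected value could, for instance, be the average click frequency observed at the same result position over all queries; that choice is likewise an assumption.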

10

Example results: Predicting User Preferences

[Figure: Precision (0.6 to 0.8) vs. Recall (0 to 0.4); series: SA+N, CD, UserBehavior, Baseline]

• Baseline < SA+N < CD << UserBehavior
• Rich user behavior features result in dramatic improvement

11

User Behavior Complements Content and Web Topology

[Figure: Precision (0.45 to 0.7) at K = 1, 3, 5, 10; series: RN, RN+All, BM25, BM25+All]

Method                      P@1     Gain
RN (Content + Links)        0.632
RN + All (User Behavior)    0.693   0.061 (10%)
BM25                        0.525
BM25+All                    0.687   0.162 (31%)
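The Gain column appears to report the absolute P@1 difference followed by the improvement relative to the method without behavior features. A quick arithmetic check using the numbers from the table (the helper name `gain` is only illustrative):

```python
def gain(base, improved):
    """Return (absolute gain, relative gain) of `improved` over `base`."""
    absolute = improved - base
    return absolute, absolute / base


for name, base, improved in [("RN", 0.632, 0.693), ("BM25", 0.525, 0.687)]:
    absolute, relative = gain(base, improved)
    print(f"{name}: +{absolute:.3f} ({relative:.0%})")
# RN: +0.061 (10%)
# BM25: +0.162 (31%)
```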

12

Instrumenting the Emory Library and Beyond

Evaluate effectiveness of search/discovery with behavioral metrics (task-specific)
– Perform aggregate, longitudinal studies

Develop tools for usability studies “in the wild”
– Scale (hundreds/thousands of “participants”)
– Realistic behavior and tasks
– On-demand playback of “interesting” sessions

Unified analysis/query framework for internal and external resource access and usage statistics
– Web-based query and statistics interface
– Access auditing, privacy, anonymity enforced

13

Emory User Behavior Analysis System (EUBA)

EUBA:
– Client-side instrumentation (Firefox toolbar)
– Data mining/machine learning components
– Log DB management system, web-based interface for querying, playback, annotation

Plan: to release the system to research/library community (Q2 2009)?

14

Simple features

Basic Features
– Trajectory length
– Horizontal range
– Vertical range

[Figure labels: Horizontal range, Vertical range, Trajectory length]
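The three basic features can be computed from a recorded sequence of pointer positions. A minimal sketch, assuming the trajectory is a non-empty list of (x, y) pixel coordinates in recording order; the input format and the function name are assumptions, not details from the slides.

```python
import math


def basic_features(points):
    """points: non-empty list of (x, y) positions in the order they were recorded."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Trajectory length: sum of straight-line distances between consecutive samples.
    length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return {
        "trajectory_length": length,
        "horizontal_range": max(xs) - min(xs),
        "vertical_range": max(ys) - min(ys),
    }
```

For example, `basic_features([(0, 0), (3, 4), (3, 0)])` gives a trajectory length of 9, a horizontal range of 3, and a vertical range of 4.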

15

Intelligent Information Access Lab
http://ir.mathcs.emory.edu/

Mouse Movement Representation Features

Second representation:
– 5 segments: initial, early, middle, late, and end
– Each segment: speed, acceleration, rotation, slope, etc.

[Figure: trajectory divided into segments 1 to 5]
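One way the segment-based representation might be computed is sketched below. The slides do not say how the five segments are delimited, so this assumes an even split by sample count; it also covers only speed and a slope angle, and the input format of (t, x, y) samples is hypothetical.

```python
import math

SEGMENT_NAMES = ["initial", "early", "middle", "late", "end"]


def segment_features(samples):
    """samples: list of (t, x, y) tuples in time order.

    Assumption: the five segments are equal-sized chunks of the sample sequence.
    """
    n = len(samples)
    k = len(SEGMENT_NAMES)
    features = {}
    for i, name in enumerate(SEGMENT_NAMES):
        start = i * (n - 1) // k
        stop = (i + 1) * (n - 1) // k
        t0, x0, y0 = samples[start]
        t1, x1, y1 = samples[stop]
        # Path length inside the segment, summed over consecutive samples.
        path = sum(
            math.dist(a[1:], b[1:])
            for a, b in zip(samples[start:stop], samples[start + 1:stop + 1])
        )
        dt = t1 - t0
        features[name] = {
            "speed": path / dt if dt > 0 else 0.0,
            # Direction from segment start to segment end, in radians.
            "slope_angle": math.atan2(y1 - y0, x1 - x0),
        }
    return features
```

Acceleration and rotation could be derived similarly from differences between consecutive samples; they are left out to keep the sketch short.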

16

Summary of Experimental Results

Client-side behavior mining significantly outperforms aggregate, server-side measures for user intent detection and satisfaction tasks

Can be used even if user does not generate server-trackable action (e.g., click or download)

Feasible to perform inference on search instance vs. aggregating across different users/searchers

17

Outline

Overview of Intelligent Information Access Lab Research
– Information retrieval & extraction, text mining, and data integration
– User behavior modeling, interactions, and collaborative filtering

Mining User-generated content

Current and Future Collaborations

18

User Generated Content

19

http://answers.yahoo.com/question/index;_ylt=3?qid=20071008115118AAh1HdO

20

Some goals of mining social media

Find high-quality content
Find relevant and high quality content
Use millions of interactions to
– Understand complex information needs
– Model subjective information seeking
– Understand cultural dynamics


29

Community


35

Editorial Quality != User Perception!

36

Lifecycle of a Question

[Flowchart: User -> Choose a category -> Compose the question -> Open question -> Examine -> Find the answer?
  Yes: Close question, choose best answers, give ratings
  No: Question is closed by system. Best answer is chosen by voters
 Other users contribute answers, which receive + and - votes]

37

Yahoo! Answers: The Good News

Active community of millions of users in many countries and languages

Accumulated a great number of questions and answers

Effective for subjective information needs
– Great forum for socialization/chat

(Can be) invaluable for hard-to-find information not available on web


39

Yahoo! Answers: The Bad News

May have to wait a long time to get a satisfactory answer

May never obtain a satisfying answer

[Figure: Time to close a question (hours) for sample question categories. Y-axis: Time to close (0 to 40); x-axis: categories 1 to 10]
1. 2006 FIFA World Cup
2. Optical
3. Poetry
4. Football (American)
5. Scottish Football (Soccer)
6. Medicine
7. Winter Sports
8. Special Education
9. General Health Care
10. Outdoor Recreation

40

The Problem of Asker Satisfaction

Given a question submitted by an asker in CQA, predict whether the user will be satisfied with the answers contributed by the community.

– Where “Satisfied” is defined as:
  The asker personally has closed the question AND
  Selected the best answer AND
  Provided a rating of at least 3 “stars” for the best answer

– Otherwise, the asker is “Unsatisfied”
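The definition above translates directly into a labeling rule. A minimal sketch; the `Question` record and its field names are hypothetical stand-ins for whatever the crawled data actually provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Question:
    # Hypothetical field names, chosen for illustration only.
    closed_by_asker: bool
    best_answer_selected_by_asker: bool
    best_answer_rating: Optional[int]  # stars given by the asker; None if unrated


def is_satisfied(q: Question) -> bool:
    """All three conditions from the slide must hold; otherwise 'Unsatisfied'."""
    return (
        q.closed_by_asker
        and q.best_answer_selected_by_asker
        and q.best_answer_rating is not None
        and q.best_answer_rating >= 3
    )
```

Note that a question closed by the system, or a best answer rated below 3 stars, counts as "Unsatisfied" under this rule even if the asker found the answers useful.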

41

Satisfaction Prediction Framework

Approach: Classification algorithms from machine learning

[Diagram: feature groups -> Classifier -> asker is satisfied / asker is not satisfied]

Feature groups:
– Textual Features
– Category Features
– Answerer History Features
– Asker History Features
– Answer Features
– Question Features

Classifier:
– Support Vector Machines
– Decision Tree
– Boosting
– Naïve Bayes

42

Question-Answer Features

Q: length, posting time…
QA: length, KL divergence
Q: Votes
Q: Terms
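The KL-divergence feature could be computed between the term distributions of a question and an answer. The slide does not specify the direction, tokenization, or smoothing, so the sketch below assumes D(question || answer), whitespace tokens, and add-alpha smoothing over the joint vocabulary.

```python
import math
from collections import Counter


def kl_divergence(question, answer, alpha=1.0):
    """D(P_question || P_answer) over smoothed unigram distributions (natural log)."""
    q_counts = Counter(question.lower().split())
    a_counts = Counter(answer.lower().split())
    vocab = set(q_counts) | set(a_counts)
    # Add-alpha smoothing keeps every probability non-zero (an assumption).
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    a_total = sum(a_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (q_counts[term] + alpha) / q_total
        r = (a_counts[term] + alpha) / a_total
        kl += p * math.log(p / r)
    return kl
```

The value is 0 when the two texts have identical term distributions and grows as the answer's vocabulary drifts away from the question's.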

43

User Features

U: Member since
U: Total points
U: #Questions
U: #Answers

44

Category Features

CA: Average time to close a question
CA: Average # answers per question
CA: Average asker rating
CA: Average voter rating
CA: Average # questions per hour
CA: Average # answers per hour

Category         #Q    #A    #A per Q   Satisfied   Avg asker rating   Time to close by asker
General Health   134   737   5.46       70.4%       4.49               1 day and 13 hours

45

Classification Algorithms

Weka implementation
– http://www.cs.waikato.ac.nz/ml/weka

Decision Tree
– C4.5: confidence factor 0.05. Ross Quinlan (1993)
– RandomForest: Leo Breiman (2001)

Support Vector Machine: J. Platt (1999)

Boosting (AdaBoost): Yoav Freund, Robert E. Schapire (1996)

Naïve Bayes: George H. John, Pat Langley (1995)

46

Methods

Heuristic: # answers
Baseline: Simply predicts the majority class (satisfied).
ASP_SVM: Our system with the SVM classifier
ASP_C4.5: with the C4.5 classifier
ASP_RandomForest: with the RandomForest classifier
ASP_Boosting: with the AdaBoost algorithm combining weak learners
ASP_NaiveBayes: with the Naive Bayes classifier

47

Evaluation metrics

Precision
– The fraction of the predicted satisfied asker information needs that were indeed rated satisfactory by the asker.

Recall
– The fraction of all rated satisfied questions that were correctly identified by the system.

F-score
– The harmonic mean of the Precision and Recall measures,
– Computed as 2*(precision*recall)/(precision+recall)

Accuracy
– The overall fraction of instances classified correctly into the proper class.
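These four metrics can be computed from true and predicted labels as follows. This is a generic sketch with "satisfied" as the positive class; the function name and the boolean label encoding are assumptions.

```python
def evaluate(y_true, y_pred):
    """y_true, y_pred: equal-length sequences of booleans (True = satisfied)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

As a usage example, a predictor that always answers "satisfied" on a balanced sample gets recall 1.0 and precision 0.5, so its F-score is 2/3 while its accuracy is only 0.5. This is why an always-satisfied baseline can look strong on F-score alone.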

48

Dataset

Crawled from Yahoo! Answers in early 2008

Data is available at http://ir.mathcs.emory.edu/

Questions   Answers     Askers    Categories   % Satisfied
216,170     1,963,615   158,515   100          50.7%

49

Dataset (cont.)

Realistic prediction task: given an asker’s previous history, we try to predict satisfaction with her current (most recent) question

[Diagram labels: 216,170 questions, 1,963,615 answers, 158,515 askers, 100 categories; most recent 10,000 questions; random 5000 questions; training; test; randomize]

50

Dataset Statistics

Category                  #Q     #A      #A per Q   Satisfied   Avg asker rating   Time to close by asker
2006 FIFA World Cup(TM)   1194   35659   329.86     55.4%       2.63               47 minutes
Mental Health             151    1159    7.68       70.9%       4.30               1 day and 13 hours
Mathematics               651    2329    3.58       44.5%       4.48               33 minutes
Diet & Fitness            450    2436    5.41       68.4%       4.30               1.5 days

Asker satisfaction varies significantly across different categories.

#Q, #A, Time to close… -> Asker Satisfaction

51

Human Satisfaction Prediction

Truth: asker’s rating
A random sample of 130 questions
Annotated by researchers to calibrate the asker satisfaction
– Agreement: 0.82
– F1: 0.45

52

Human Satisfaction Prediction (Cont’d): Amazon Mechanical Turk

A service provided by Amazon. Workers submit responses to a Human Intelligence Task (HIT) for a small fee

HIT:
– Used the same 130 questions
– For each question, list the best answer, as well as four other answers ordered by votes
– Five independent raters for each question.
– Agreement: 0.9; F1: 0.61.
– Best accuracy achieved when at least 4 out of 5 raters predicted asker to be “satisfied” (otherwise, labeled as “unsatisfied”).
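The rater-aggregation rule in the last bullet amounts to a thresholded vote. A minimal sketch, assuming each rater's judgment is stored as a boolean; the function name is illustrative.

```python
def aggregate_raters(votes, min_agree=4):
    """votes: one boolean per rater (True = rater predicts the asker is satisfied).

    Labels the question 'satisfied' only if at least `min_agree` raters say so.
    """
    return sum(votes) >= min_agree
```

With five raters, `aggregate_raters([True, True, True, True, False])` yields a "satisfied" label, while three positive votes out of five do not. The threshold is stricter than a simple majority.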

53

Amazon Mechanical Turk

54

Comparison of Classifiers (F-score)

Classifier         With Text   Without Text   Selected Features
ASP_SVM            0.69        0.72           0.62
ASP_C4.5           0.75        0.76           0.77
ASP_RandomForest   0.70        0.74           0.68
ASP_Boosting       0.67        0.67           0.67
ASP_NB             0.61        0.65           0.58
Human              0.61
Baseline           0.66

C4.5 is the most effective classifier in this task

Human F1 performance is lower than the naïve baseline!

55

F1 (Satisfied) with varying training sizes

ASP_C4.5 substantially outperforms others

2000 questions are sufficient to achieve 0.75 F1

56

Features by Information Gain (Satisfied)

0.14219   Q: Askers’ previous rating
0.13965   Q: Average past rating by asker
0.10237   UH: Member since (interval)
0.04878   UH: Average # answers for by past Q
0.04878   UH: Previous Q resolved for the asker
0.04381   CA: Average asker rating for the category
0.04306   UH: Total number of answers received
0.03274   CA: Average voter rating
0.03159   Q: Question posting time
0.02840   CA: Average # answers per Q
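Information gain measures how much knowing a feature's value reduces the entropy of the satisfied/unsatisfied label. The sketch below uses the standard definition for a discrete feature; continuous features such as ratings would first need to be binned, and the binning scheme behind the numbers above is not given in the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

A feature that splits the labels perfectly on a balanced sample has a gain of 1 bit, and a feature unrelated to the labels has a gain of 0.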

57

“Offline” vs. “Online” Prediction

Offline prediction:
– All features (question, answer, asker & category)
– F1: 0.77

Online prediction:
– all answer features
– question features (stars, #comments, sum of votes…)
– F1: 0.74

58

Feature Ablation

                              Precision   Recall   F1
Selected features             0.80        0.73     0.77
No question-answer features   0.76        0.74     0.75
No answerer features          0.76        0.75     0.75
No category features          0.75        0.76     0.75
No asker features             0.72        0.69     0.71
No question features          0.68        0.72     0.70

Asker & Question features are most important.

Answer quality/Answerer expertise/Category characteristics:
– may not be important
– caring or supportive answers might be preferred sometimes

59

Satisfaction with varying experience

Group together questions from askers with the same number of previous questions
Accuracy of prediction increases dramatically
Reaching F1 of 0.9 for askers with >= 5 questions
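The grouping described here can be sketched as an F1 score computed separately for each level of asker experience. The record format is hypothetical, and pooling all askers with 5 or more previous questions into one group is an assumption suggested by the ">= 5" on the slide.

```python
from collections import defaultdict


def f1_by_experience(records, cap=5):
    """records: iterable of (n_previous_questions, actual, predicted),
    where actual/predicted are booleans (True = satisfied).

    Askers with `cap` or more previous questions share one group.
    """
    groups = defaultdict(list)
    for n_prev, actual, predicted in records:
        groups[min(n_prev, cap)].append((actual, predicted))
    scores = {}
    for key, pairs in sorted(groups.items()):
        tp = sum(1 for a, p in pairs if a and p)
        fp = sum(1 for a, p in pairs if not a and p)
        fn = sum(1 for a, p in pairs if a and not p)
        denom = 2 * tp + fp + fn
        # F1 written as 2TP / (2TP + FP + FN), equivalent to 2PR / (P + R).
        scores[key] = 2 * tp / denom if denom else 0.0
    return scores
```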

60

Summary

Asker satisfaction is predictable
– Can achieve higher than human accuracy by exploiting history

User’s experience is important

General model: one-size-fits-all
– 2000 questions for training model are enough

Current work
– Personalized satisfaction prediction
– Y. Liu, E. Agichtein. You've Got Answers: Towards Personalized Models for Predicting Success in Community Question Answering (ACL 2008)

61

ACL08

Textual features only become helpful for users with more than 20 questions

Personalized classifier achieves surprisingly good accuracy

For users with only 1 previous question, personalized classifiers work very well

Simple strategy of grouping users by number of previous questions is even more effective than other methods for users with moderate amount of history

For users with few questions, non-textual features are dominant

For users with lots of questions, textual features are more significant

Some personalized models


63

Other tasks

Subjectivity, sentiment analysis
– B. Li, Y. Liu, and E. Agichtein, CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation, in EMNLP 2008

Discourse analysis
Cross-cultural comparisons
CQA vs. web search comparison


65

Outline

Overview of Intelligent Information Access Lab Research
– Information retrieval & extraction, text mining, and data integration
– User behavior modeling, interactions, and collaborative filtering

Mining User-generated content

Current and Future Research