Text Analytics National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

Upload: darrell-ferguson

Post on 29-Dec-2015

Text Analytics

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Outline

• Text Analytics Applications

• Text Analysis Overview

• Meandre Server Interface

• Hands-On

SEASR @ Work – MONK

• Executes flows for each analysis requested

– Predictive modeling using Naïve Bayes

– Predictive modeling using Support Vector Machines (SVM)

– Feature comparison (Dunning log-likelihood)

SEASR @ Work – DISCUS

• On-demand usage of analytics while surfing

– While navigating, request analytics to be performed on the page

– Text extraction and cleaning

• Summarization and keyword extraction

– List the important terms on the page being analyzed

– Provide relevant short summaries

• Visual maps

– Provide a visual representation of the key concepts

– Show the graph of relations between concepts

SEASR @ Work – Entity Mash-up

• Entity Extraction with OpenNLP

• Locations viewed on Google Map

• Dates viewed on Simile Timeline

Feature Lens

“The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading.”

SEASR @ Work – Emotion Tracking

Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

SEASR Text Analytics Goals

Address scholarly text analytics needs by:

• Efficiently managing distributed literary and historical textual assets

• Structuring extracted information to facilitate knowledge discovery

• Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich and efficient for analysis

• Devising a representation for the extracted information

• Devising algorithms for question answering and inference

• Developing a UI for effective visual knowledge discovery and data exploration, with query logic kept separate from application logic

• Leveraging existing machine learning approaches for text

• Enabling the text analytics through SEASR components

Text Analytics Definition

Many definitions in the literature

• The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data

• An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge

Text Analytics Process

• Text Preprocessing

– Syntactic Text Analysis

– Semantic Text Analysis

• Features Generation

– Bag of Words

– Ngrams

• Feature Selection

– Simple Counting

– Statistics

– Selection based on POS

• Text/Data Analytics

– Classification - Supervised Learning

– Clustering - Unsupervised Learning

– Information Extraction

• Analyzing Results

– Visual Exploration, Discovery and Knowledge Extraction

– Query-based – question answering

Text Representation

• Many machine learning algorithms need numerical data, so text must be transformed

• Determining this representation can be challenging

Text Characteristics (1)

• Large textual database

– Enormous wealth of textual information on the Web

– Publications are electronic

• High dimensionality

– Consider each word/phrase as a dimension

• Noisy data

– Spelling mistakes

– Abbreviations

– Acronyms

• Text messages are very dynamic

– Web pages are constantly being generated (and removed)

– Web pages are generated from database queries

• Not well structured text

– Email/Chat rooms

• “r u available ?”

• “Hey whazzzzzz up”

– Speech

Text Characteristics (2)

• Dependency

– Relevant information is a complex conjunction of words/phrases

– Order of words in the query

• hot dog stand in the amusement park

• hot amusement stand in the dog park

• Ambiguity

– Word ambiguity

• Pronouns (he, she …)

• Synonyms (buy, purchase)

• Multiple meanings (bat: related to baseball or to the mammal)

– Semantic ambiguity

• The king saw the monkey with his glasses. (multiple meanings)

• Authority of the source

– IBM is more likely to be an authoritative source than my distant cousin

Text Preprocessing

• Syntactic analysis

– Tokenization

– Lemmatization

– Part Of Speech (POS) tagging

– Shallow parsing

– Custom literary tagging

• Semantic analysis

– Information Extraction

• Named Entity tagging

• Unnamed Entity tagging

– Co-reference resolution

– Ontological association (WordNet, VerbNet)

– Semantic Role analysis

– Concept-Relation extraction

Feature Selection

• Reduce Dimensionality

– Learners have difficulty addressing tasks with high dimensionality

• Irrelevant Features

– Not all features help!

– Remove features that occur in only a few documents

– Reduce features that occur in too many documents
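The two pruning rules above can be sketched as a document-frequency filter. This is a minimal illustration, not the SEASR implementation: the function name `select_features` and the thresholds `min_df` and `max_df_ratio` are assumptions chosen for the example.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms that occur in at least min_df documents
    and in at most max_df_ratio of all documents.

    docs: list of token lists. Thresholds are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    }

docs = [
    ["the", "whale", "sea"],
    ["the", "whale", "ship"],
    ["the", "captain", "ship"],
    ["the", "sea", "storm"],
    ["the", "storm", "harpoon"],
]
selected = select_features(docs)
print(sorted(selected))
```

Here "the" is dropped because it appears in every document, and "captain" and "harpoon" are dropped because each appears in only one.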

Syntactic Analysis

• Tokenization

– Text document is represented by the words it contains (and their occurrences)

– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}

– Highly efficient

– Makes learning far simpler and easier

– Order of words is not that important for certain applications

• Lemmatization/Stemming

– Involves the reduction of corpus words to their respective headwords (i.e., lemmas)

– Means removing suffixes, prefixes, and infixes to get to the root

– Reduces dimensionality

– Identifies a word by its root

– e.g., flying, flew → fly

• Bigrams and trigrams

– Retain semantic content
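A minimal sketch of tokenization, the bag-of-words representation, and n-grams, using only the Python standard library. The helper names (`tokenize`, `bag_of_words`, `ngrams`) are hypothetical, and the regular expression is a deliberately simple assumption that keeps only alphabetic words. Lemmatization is not shown because it needs a dictionary or a stemmer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(tokens):
    """Count occurrences of each token; word order is discarded."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Return every run of n consecutive tokens (n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Lord of the rings")
print(bag_of_words(tokens))
print(ngrams(tokens, 2))
```

The bag of words loses the order of "Lord of the rings", while the bigrams keep local order, which is why n-grams retain more of the meaning.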

Syntactic Analysis

• Stop words

– Identifies the most common words that are unlikely to help with text analytics, e.g., “the”, “a”, “an”, “you”

– Identifies context-dependent words to be removed, e.g., “computer” from a collection of computer science documents

• Scaling words

– Important words should be scaled upwards, and vice versa

– TF-IDF is the product of Term Frequency and Inverse Document Frequency

• Parsing / Part of Speech (POS) tagging

– Generates a parse tree (graph) for each sentence

– Each sentence is a stand-alone graph

– Find the corresponding POS for each word

– e.g., John (noun) gave (verb) the (det) ball (noun)

– Shallow parsing: analysis of a sentence that identifies the constituents (noun groups, verbs, ...) but specifies neither their internal structure nor their role in the main sentence
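Stop-word removal and TF-IDF scaling can be sketched as follows. This uses one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); other weightings exist, and the slides do not say which one SEASR uses. The stop-word set is just the slide's four examples, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

# The slide's example stop words; real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "you"}

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    filtered = [[t for t in doc if t not in STOP_WORDS] for doc in docs]
    n_docs = len(filtered)
    doc_freq = Counter(term for doc in filtered for term in set(doc))
    weights = []
    for doc in filtered:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["the", "computer", "runs", "code"],
    ["the", "computer", "plays", "chess"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection "computer" occurs in every document, so its weight is zero. That is the context-dependent case from the stop-word bullet, handled by scaling rather than by a fixed list.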

Semantic Analysis

• Deep Parsing

– more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer

• Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)

• Extract the relevant information and ignore non-relevant information (important!)

• Link related information and output in a predetermined format

Information Extraction

Information type, with state-of-the-art accuracy:

• Entities: an object of interest such as a person or organization (90-98%)

• Attributes: a property of an entity such as its name, alias, descriptor, or type (80%)

• Facts: a relationship held between two or more entities, such as the position of a person in a company (60-70%)

• Events: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction (50-60%)

Source: “Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, Israel

Information Extraction Approaches

• Terminology (name) lists

– This works very well if the list of names and name expressions is stable and available

• Tokenization and morphology

– This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)

• Use of characteristic patterns

– This works fairly well for novel entities

– Rules can be created by hand or learned via machine learning or statistical algorithms

– Rules capture local patterns that characterize entities from instances of annotated training data
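As an illustration of recognizing an entity by its internal format, the sketch below finds DD/MM/YY dates with a regular expression. The pattern and the plausibility check on day and month are assumptions made for the example; a hand-written rule set for real text would need to cover many more date formats.

```python
import re

# Two digits, slash, two digits, slash, two digits, bounded on both sides.
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

def extract_dates(text):
    """Return (day, month, year) string triples for each DD/MM/YY match
    whose day and month fall in a plausible range."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        if 1 <= int(day) <= 31 and 1 <= int(month) <= 12:
            found.append((day, month, year))
    return found

dates = extract_dates("Opened 29/12/15, closed 45/13/15.")
print(dates)
```

The second candidate matches the surface pattern but is rejected by the range check, which shows why format rules usually need a small amount of validation on top.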

Relation (Event) Extraction

• Identify (and tag) the relation between two entities

– A person is_located_at a location (news)

– A gene codes_for a protein (biology)

• Relations require more information

– Identification of two entities and their relationship

– Predicted relation accuracy

• Pr(E1) * Pr(E2) * Pr(R) ≈ (.93) * (.93) * (.93) = .80

• Information in relations is less local

– Contextual information is a problem: the right word may not be explicitly present in the sentence

– Events involve more relations and are even harder
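The accuracy estimate above is a product of three probabilities, which implicitly treats the two entity decisions and the relation decision as independent. A small sketch of that arithmetic follows; `relation_accuracy` is a hypothetical helper, and independence is the assumption that makes the product valid.

```python
def relation_accuracy(p_entity1, p_entity2, p_relation):
    """Probability that both entities and the relation are all correct,
    assuming the three decisions are independent."""
    return p_entity1 * p_entity2 * p_relation

estimate = relation_accuracy(0.93, 0.93, 0.93)
print(f"{estimate:.2f}")
```

Even with 93% accuracy at every step, the combined result drops to about 80%, and an event that chains several relations compounds the loss further.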

Semantic Analysis

• Named Entity (NE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

– NE:Person: Rex Luthor

– NE:Time: today

– NE:Location: Alderwood

– NE:Organization: Boynton Laboratory

Semantic Analysis

• Semantic Category (unnamed entity, UNE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

– UNE:Organization: a new research facility

Semantic Analysis

• Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

[Figure: the sentences above annotated with UNE:Organization]

Semantic Analysis

• Semantic Role Analysis

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

[Figure: the sentences above annotated with the semantic roles ACTOR, ACTION, WHEN, OBJECT, WHERE, and COMPL]

Semantic Analysis

• Concept-Relation Extraction

[Figure: concept graph with the nodes Rex Luthor (person), announce (action), establishment (event), Boynton Lab (organization), today (time), and Alderwood (location), linked by the relations actor (who), object (what), time (when), and location (where)]

(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson …….

The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.

The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …

Information Extraction: Template Extraction

<Facility>Finsbury Park Mosque</Facility>

<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>

<Country>England</Country>

<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>

<Country>England</Country>

<Country>France</Country>

<Country>United States</Country>

<Country>Belgium</Country>

<Person>Abu Hamza al-Masri</Person>

<City>London</City>
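Templates in this form are convenient because downstream code can read them with any XML parser. The sketch below loads the PersonArrest template from the slide into a flat dictionary using Python's standard `xml.etree.ElementTree`. The function name `template_to_record` and the choice of a flat dictionary are assumptions for illustration, not part of the extraction tool that produced the output.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """
<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
"""

def template_to_record(xml_text):
    """Turn one extracted template into a flat dict of slot name -> value."""
    root = ET.fromstring(xml_text.strip())
    record = {"template": root.tag}
    for child in root:
        if child.tag == "OFFLEN":
            # Character offset and length of the source span in the article.
            record["offset"] = int(child.get("OFFSET"))
            record["length"] = int(child.get("LENGTH"))
        else:
            record[child.tag] = child.text.strip()
    return record

record = template_to_record(TEMPLATE)
print(record)
```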

Streaming Text: Knowledge Extraction

• Leveraging some earlier work on information extraction from text streams

Information extraction

• process of using advanced automated machine learning approaches

• to identify entities in text documents

• extract this information along with the relationships these entities may have in the text documents

The visualization above demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
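The co-occurrence rule in the last sentence can be sketched as follows: two entities are linked whenever they appear within a fixed number of tokens of each other. The function name, the window size, and the simplification that each entity is a single token are all assumptions for this example; multiword names would first have to be merged by the entity extractor.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tokens, entities, window=10):
    """Count pairs of distinct entities that appear within `window`
    tokens of each other. Returns a Counter keyed by sorted name pairs."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    edges = Counter()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            edges[tuple(sorted((a, b)))] += 1
    return edges

tokens = "Luthor announced today a facility in Alderwood named Boynton".split()
entities = {"Luthor", "Alderwood", "Boynton"}
edges = cooccurrence_edges(tokens, entities, window=6)
print(edges)
```

The resulting pair counts are the weighted edges of a graph like the social network shown on the next slide.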

Results: Social Network (Tom in Red)

Ontological Association (WordNet)

• As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs

POS         Unique Strings   Synsets   Total Word-Sense Pairs
Noun        117798           82115     146312
Verb        11529            13767     25047
Adjective   21479            18156     30002
Adverb      4481             3621      5580
Totals      155287           117659    206941

Ontological Association (WordNet)

Search for “table”

• Noun

– S: (n) table, tabular array (a set of data arranged in rows and columns) "see table 1”

– S: (n) table (a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs) "it was a sturdy table”

– S: (n) table (a piece of furniture with tableware for a meal laid out on it) "I reserved a table at my favorite restaurant”

– S: (n) mesa, table (flat tableland with steep edges) "the tribe was relatively safe on the mesa but they had to descend into the valley for water”

– S: (n) table (a company of people assembled at a table for a meal or game) "he entertained the whole table with his witty remarks”

– S: (n) board, table (food or meals in general) "she sets a fine table"; "room and board”

• Verb

– S: (v) postpone, prorogue, hold over, put over, table, shelve, set back, defer, remit, put off (hold back to a later time) "let's postpone the exam”

– S: (v) table, tabularize, tabularise, tabulate (arrange or enter in tabular form)

Text Analytics: General Application Areas

• Information Retrieval

– Indexing and retrieval of textual documents

– Finding a set of (ranked) documents that are relevant to the query

• Information Extraction

– Extraction of partial knowledge in the text

• Web Mining

– Indexing and retrieval of textual documents and extraction of partial knowledge using the web

• Classification

– Predict a class for each text document

• Clustering

– Generating collections of similar text documents

Text Analytics: Supervised vs. Unsupervised

• Supervised learning (Classification)

– Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

– Split into training data and test data for the model building process

– New data is classified based on the model built with the training data

– Techniques

• Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines

• Unsupervised learning (Clustering)

– Class labels of the training data are unknown

– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Text Analytics: Classification

• Given: Collection of labeled records

– Each record contains a set of features (attributes) and the true class (label)

– Create a training set to build the model

– Create a testing set to test the model

• Find: Model for the class as a function of the values of the features

• Goal: Assign a class (as accurately as possible) to previously unseen records

• Evaluation: What Is Good Classification?

– Correct classification

• Known label of test example is identical to the predicted class from the model

– Accuracy ratio

• Percent of test set examples that are correctly classified by the model

– Distance measure between classes can be used

• e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
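The train, predict, and evaluate loop above can be sketched with a small multinomial Naïve Bayes classifier, one of the Bayesian techniques named earlier. This is a from-scratch illustration with add-one smoothing, not the classifier used in MONK. The function names and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs.
    Returns (class_counts, word_counts, vocabulary)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, test_examples):
    """Fraction of test examples whose predicted label equals the known label."""
    correct = sum(predict(model, toks) == label for toks, label in test_examples)
    return correct / len(test_examples)

train = [
    (["goal", "match", "team"], "sports"),
    (["team", "score", "coach"], "sports"),
    (["police", "arrest", "theft"], "crime"),
    (["court", "theft", "trial"], "crime"),
]
test = [(["coach", "team"], "sports"), (["police", "trial"], "crime")]
model = train_naive_bayes(train)
print(accuracy(model, test))
```

The accuracy ratio here treats every error as equally bad. A class-distance measure, as in the football/basketball/crime example, would replace the equality test with a cost per pair of classes.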

Text Analytics: Clustering

• Given: Set of documents and a similarity measure among documents

• Find: Clusters such that

– Documents in one cluster are more similar to one another

– Documents in separate clusters are less similar to one another

• Similarity Measures:

– Euclidean distance if attributes are continuous

– Other problem-specific measures

• e.g., how many words are common in these documents

• Evaluation: What Is Good Clustering?

– Produce high quality clusters with

• high intra-class similarity

• low inter-class similarity

– Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
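Two of the similarity measures mentioned above, sketched in plain Python. Euclidean distance applies to numeric feature vectors. For the "words in common" idea, the example uses Jaccard similarity (shared distinct words divided by all distinct words), which is one possible normalization and an assumption here, since the slide does not name a specific formula.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def shared_word_similarity(doc_a, doc_b):
    """Jaccard similarity over the distinct words of two token lists."""
    a, b = set(doc_a), set(doc_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(euclidean_distance([0, 0], [3, 4]))
print(shared_word_similarity(["a", "b", "c"], ["b", "c", "d"]))
```

A clustering algorithm then groups documents so that within-cluster similarity is high and between-cluster similarity is low under whichever measure is chosen.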

Text Analytics: Frequent Patterns

• Given: Set of documents

• Find: Frequent patterns, i.e., common word patterns used in the collection

• Evaluation: What are good patterns?

• Results: 1060 patterns discovered.

322: Lincoln; 147: Abe; 117: man; 100: Mr.; 100: time; 98: Lincoln Abe; 91: father; 85: Lincoln Mr.; 85: Lincoln man; 75: day; 70: Abraham

70: President; 68: boy; 67: Lincoln time; 65: Lincoln Abraham; 65: life; 63: Lincoln father; 57: men; 57: work; 52: Lincoln day; …
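Counts such as "98: Lincoln Abe" can be read as the number of text units in which both words occur. A brute-force sketch of that counting is shown below. It enumerates every word set up to a small size, which is an assumption made for clarity; the SEASR flow may use a more efficient frequent-pattern algorithm, and the names `frequent_patterns`, `min_support`, and `max_size` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(docs, min_support=2, max_size=2):
    """Count word sets of size 1..max_size by the number of documents
    that contain them, keeping those seen in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc))
        for size in range(1, max_size + 1):
            counts.update(combinations(words, size))
    return {pattern: n for pattern, n in counts.items() if n >= min_support}

docs = [
    ["lincoln", "abe", "man"],
    ["lincoln", "abe"],
    ["lincoln", "father"],
]
patterns = frequent_patterns(docs)
print(patterns)
```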

Text Analytics: Discus

• Given: Set of documents

• Find: Top Sentences and Top Tokens

– Top sentences contain top tokens

– Top tokens exist in top sentences

• Results:
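The mutual definition of top sentences and top tokens can be computed by a HITS-style iteration, where sentence scores and token scores reinforce each other until they settle. The sketch below is one reading of the two bullets above and an assumption about how such a summarizer could work, not a description of the Discus flow's actual code; `hits_rank` and the fixed iteration count are hypothetical.

```python
def hits_rank(sentences, iterations=20):
    """sentences: list of token lists.
    Returns (sentence_scores, token_scores) after mutual reinforcement."""
    tokens = sorted({tok for sent in sentences for tok in sent})
    sent_score = [1.0] * len(sentences)
    tok_score = {tok: 1.0 for tok in tokens}
    for _ in range(iterations):
        # A token is important if it occurs in important sentences.
        tok_score = {
            tok: sum(s for s, sent in zip(sent_score, sentences) if tok in sent)
            for tok in tokens
        }
        # A sentence is important if it contains important tokens.
        sent_score = [sum(tok_score[tok] for tok in set(sent)) for sent in sentences]
        # Normalize both score sets so the values stay bounded.
        tok_norm = sum(v * v for v in tok_score.values()) ** 0.5 or 1.0
        sent_norm = sum(v * v for v in sent_score) ** 0.5 or 1.0
        tok_score = {tok: v / tok_norm for tok, v in tok_score.items()}
        sent_score = [v / sent_norm for v in sent_score]
    return sent_score, tok_score

sentences = [
    ["lincoln", "was", "president"],
    ["lincoln", "was", "born"],
    ["rain", "fell"],
]
sent_score, tok_score = hits_rank(sentences)
print(sent_score)
```

Sentences that share vocabulary with many others rise to the top, so the highest-scoring sentences can serve as a short extractive summary and the highest-scoring tokens as key terms.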

Simile Timeline

• Constructed by Hand

Simile Timeline in SEASR

• Dates are automatically extracted with their sentences

Meandre Server Interface

• Made mainly for administrators

• Provides one-click access to most of the services

• Sections

– Repository

– Publish

– Execution

– Cluster

– Location

– Security

– Server Logs

– Public

– About

Repository

• Navigates the current user’s repository

• Lists components and flows

• Searches components or flows

• Navigates by tags

• Clears all the components and flows

• Regenerates the repository using the registered locations

Publish

• Allows one to manipulate the public repository of components and flows

• Allows one to publish and unpublish components and flows

• Publish and Unpublish All functionality

Execution

• Exposes the execution engine

• Allows one to list current running flows

• Allows one to upload and execute all the flows in a self-contained repository

• Allows one to tune and execute by changing property settings

• Exposes the list of jobs tracked by the server

• Gives access to the output consoles of the executing flows

Cluster

• Exposes the single image cluster log

• List the status of the servers forming the cluster

• Expose basic information about the servers

• List internal properties used by the Meandre cluster

• Allows the shutdown of the current Meandre Server being accessed

Location

• Allows addition of locations to import components and flows

• Allows removal of a location which deletes components and flows

• List the current locations used

Security

• Manage users

– Create

– Remove

– Update

– List

• Grant and revoke roles

– Execution

– Location

Server Logs

• Allows the logs from the server to be seen via the web application

Public

• Expose basic public services

– Published repository

– A simple demo repository of components and flows

– A ping/pong service

About

• Expose server information

– Installation information

– Version

– Plugins installed on this server

Demonstration

• Meandre Server Interface

–Tag Cloud Viewer

–Text Clustering,

–Entity Extraction

Learning Exercises

1. Explore Meandre Server Interface

A. Open browser and point to http://SERVER:1714/public/services/ping.html

B. Explore flows

C. Execute flows

D. Tune and execute flows

E. Other functionality

Learning Exercises: Text Clustering

2. Execute the "Text Clustering" flow on a hard-coded web page

A. Click "flows"

B. Under "Action", click "run"

3. Execute the "Text Clustering" flow on a web page of your choice

A. Click "flows"

B. Under "Action", click "tune&run"

C. On the web page, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you

D. Click on "Running flows"

E. Click on the URL of the flow you just executed

Learning Exercises: Text Frequent Patterns

4. Execute the "Text Frequent Patterns" flow on a hard-coded web page

A. Click "flows"

B. Under "Action", click "run"

5. Execute the "Text Frequent Patterns" flow on a web page of your choice

A. Click "flows"

B. Under "Action", click "tune&run"

C. On the web page, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you

D. Results will be displayed in the console

Learning Exercises: Date Entities to Simile

6. Execute the "Date Entities to Simile Timeline" flow on a hard-coded web page

A. Click "flows"

B. Under "Action", click "run"

7. Execute the "Date Entities to Simile Timeline" flow on a web page of your choice

A. Click "flows"

B. Under "Action", click "tune&run"

C. On the web page, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you

D. Click on "Running flows"

E. Click on the URL of the flow you just executed

Learning Exercises: HITS Summarizer

8. Execute the "HITS Summarizer" flow on a hard-coded web page

A. Click "flows"

B. Under "Action", click "run"

9. Execute the "HITS Summarizer" flow on a web page of your choice

A. Click "flows"

B. Under "Action", click "tune&run"

C. On the web page, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you

D. Results will be displayed in the console

Discussion Questions

• Identify and discuss three other text tools that could be useful in the Humanities.

• What are the obstacles to using this technology for text analysis? What will your colleagues say?