TRANSCRIPT
Text Analytics
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Outline
• Text Analytics Applications
• Text Analysis Overview
• Meandre Server Interface
• Hands-On
SEASR @ Work – MONK
• Executes flows for each analysis requested
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
– Feature Comparison (Dunning Log-likelihood)
SEASR @ Work – DISCUS
• On-demand use of analytics while surfing
– While navigating, request analytics to be performed on the page
– Text extraction and cleaning
• Summarization and keyword extraction
– List the important terms on the page being analyzed
– Provide relevant short summaries
• Visual maps
– Provide a visual representation of the key concepts
– Show the graph of relations between concepts
SEASR @ Work – Entity Mash-up
• Entity Extraction with OpenNLP
• Locations viewed on Google Map
• Dates viewed on Simile Timeline
Feature Lens
“The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading.”
SEASR @ Work – Emotion Tracking
Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)
SEASR Text Analytics Goals
Address scholarly text analytics needs by:
• Efficiently managing distributed literary and historical textual assets
• Structuring extracted information to facilitate knowledge discovery
• Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich and efficient for analysis
• Devising a representation for the extracted information
• Devising algorithms for question answering and inference
• Developing UIs for effective visual knowledge discovery and data exploration that separate query logic from application logic
• Leveraging existing machine learning approaches for text
• Enabling text analytics through SEASR components
Text Analytics Definition
Many definitions exist in the literature:
• The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data
• The exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
Text Analytics Process
• Text Preprocessing
– Syntactic Text Analysis
– Semantic Text Analysis
• Feature Generation
– Bag of Words
– N-grams
• Feature Selection
– Simple Counting
– Statistics
– Selection based on POS
• Text/Data Analytics
– Classification – Supervised Learning
– Clustering – Unsupervised Learning
– Information Extraction
• Analyzing Results
– Visual Exploration, Discovery, and Knowledge Extraction
– Query-based – Question Answering
Text Representation
• Many machine learning algorithms need numerical data, so text must be transformed
• Determining this representation can be challenging
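As an illustration of such a transformation, here is a minimal bag-of-words sketch; the two-document corpus is made up for the example:

```python
# Bag-of-words: each document becomes a numeric vector of word counts
# over a shared vocabulary built from the whole corpus.
from collections import Counter

docs = ["the lord of the rings", "the lord gave the ball"]

# Vocabulary = every distinct token seen anywhere in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [vectorize(d) for d in docs]
print(vocab)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, so all documents share the same (high-dimensional) feature space.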
Text Characteristics (1)
• Large textual data base
– Enormous wealth of textual information on the Web
– Publications are electronic
• High dimensionality
– Consider each word/phrase as a dimension
• Noisy data
– Spelling mistakes
– Abbreviations
– Acronyms
• Text messages are very dynamic
– Web pages are constantly being generated (and removed)
– Web pages are generated from database queries
• Not well-structured text
– Email/Chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
– Speech
Text Characteristics (2)
• Dependency
– Relevant information is a complex conjunction of words/phrases
– Order of words in the query matters
• hot dog stand in the amusement park
• hot amusement stand in the dog park
• Ambiguity
– Word ambiguity
• Pronouns (he, she, …)
• Synonyms (buy, purchase)
• Multiple meanings (bat – related to baseball or to the mammal)
– Semantic ambiguity
• The king saw the monkey with his glasses. (multiple meanings)
• Authority of the source
– IBM is more likely to be an authoritative source than my second cousin
Text Preprocessing
• Syntactic analysis
– Tokenization
– Lemmatization
– Part-of-Speech (POS) tagging
– Shallow parsing
– Custom literary tagging
• Semantic analysis
– Information Extraction
• Named Entity tagging
• Unnamed Entity tagging
– Co-reference resolution
– Ontological association (WordNet, VerbNet)
– Semantic Role analysis
– Concept-Relation extraction
Feature Selection
• Reduce Dimensionality
– Learners have difficulty addressing tasks with high dimensionality
• Irrelevant Features
– Not all features help!
– Remove features that occur in only a few documents
– Reduce features that occur in too many documents
Syntactic Analysis
• Tokenization
– Text document is represented by the words it contains (and their occurrences)
– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Lemmatization/Stemming
– Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
– Means removing suffixes, prefixes, and infixes to reach the root
– Reduces dimensionality
– Identifies a word by its root
– e.g., flying, flew → fly
• Bigrams and trigrams
– Retain semantic content
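The steps above can be sketched in a few lines of Python; the suffix-stripping "stemmer" here is a deliberately crude stand-in for a real algorithm such as Porter's:

```python
# Tokenization, naive suffix stripping, and n-gram generation.
def tokenize(text):
    return text.lower().split()

def crude_stem(word):
    # Toy rule: strip a common suffix if enough of the word remains
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Lord of the rings")
print(tokens)
print([crude_stem(t) for t in ["flying", "jumped", "tables"]])
print(ngrams(tokens, 2))  # bigrams retain some word order
```

Real lemmatizers use dictionaries and POS information rather than bare suffix rules, but the dimensionality-reduction effect is the same: several surface forms collapse to one feature.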
Syntactic Analysis
• Stop words
– Identify the most common words that are unlikely to help with text analytics, e.g., “the”, “a”, “an”, “you”
– Identify context-dependent words to be removed, e.g., “computer” from a collection of computer science documents
• Scaling words
– Important words should be scaled upwards, and vice versa
– TF-IDF stands for the Term Frequency – Inverse Document Frequency product
• Parsing / Part-of-Speech (POS) tagging
– Generates a parse tree (graph) for each sentence
– Each sentence is a stand-alone graph
– Finds the corresponding POS for each word
– e.g., John (noun) gave (verb) the (det) ball (noun)
– Shallow parsing: analysis of a sentence that identifies the constituents (noun groups, verbs, …), but does not specify their internal structure, nor their role in the main sentence
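A minimal sketch of the TF-IDF scaling mentioned above, over a made-up three-document corpus; note that a word appearing in every document gets an IDF of log(N/N) = 0, which is exactly how stop-word-like terms are scaled down:

```python
# TF-IDF = term frequency * log(N / document frequency)
import math

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
N = len(docs)

def idf(word):
    df = sum(1 for doc in docs if word in doc)  # document frequency
    return math.log(N / df)

def tf_idf(word, doc):
    return doc.count(word) * idf(word)

print(tf_idf("the", docs[0]))  # in every doc -> weight 0.0
print(round(tf_idf("dog", docs[1]), 3))
```

Many variants exist (smoothed IDF, log-scaled TF); this is the basic product form the slide names.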
Semantic Analysis
• Deep Parsing
– more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
• Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output in a predetermined format
Information Extraction
Information Type | Description | State of the art (Accuracy)
Entities | an object of interest such as a person or organization | 90–98%
Attributes | a property of an entity, such as its name, alias, descriptor, or type | 80%
Facts | a relationship held between two or more entities, such as the position of a person in a company | 60–70%
Events | an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction | 50–60%
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
Information Extraction Approaches
• Terminology (name) lists
– This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
– This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
• Use of characteristic patterns
– This works fairly well for novel entities
– Rules can be created by hand or learned via machine learning or statistical algorithms
– Rules capture local patterns that characterize entities from instances of annotated training data
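The internal-format approach can be sketched with a single hand-written rule, e.g., a regular expression for the DD/MM/YY dates mentioned above (the sentence is made up):

```python
# A hand-written extraction rule: dates recognized purely by their
# internal DD/MM/YY format, with no training data required.
import re

DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

text = "The meeting on 14/03/09 was moved to 21/03/09."
dates = DATE_PATTERN.findall(text)
print(dates)
```

Learned pattern rules generalize this idea: instead of one fixed regex, local context patterns are induced from annotated examples.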
Relation (Event) Extraction
• Identify (and tag) the relation between two entities
– A person is_located_at a location (news)
– A gene codes_for a protein (biology)
• Relations require more information
– Identification of two entities and their relationship
– Predicted relation accuracy
• Pr(E1) * Pr(E2) * Pr(R) ≈ 0.93 * 0.93 * 0.93 ≈ 0.80
• Information in relations is less local
– Contextual information is a problem: the right word may not be explicitly present in the sentence
– Events involve more relations and are even harder
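The accuracy estimate above is just the product of the three component accuracies, which a quick check confirms:

```python
# Errors compound: even with 93% accuracy on each entity and on the
# relation label, the joint prediction is right only ~80% of the time.
p_e1 = p_e2 = p_r = 0.93
p_relation = p_e1 * p_e2 * p_r
print(round(p_relation, 2))  # 0.8
```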
Semantic Analysis
• Named Entity (NE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
[Tagged: Rex Luthor – NE:Person; today – NE:Time; Alderwood – NE:Location; Boynton Laboratory – NE:Organization]
Semantic Analysis
• Semantic Category (unnamed entity, UNE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
[Tagged: “a new research facility” – UNE:Organization]
Semantic Analysis
• Co-reference Resolution for entities and unnamed entities
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
[“It” resolves to “a new research facility” (UNE:Organization), i.e., Boynton Laboratory]
Semantic Analysis
• Semantic Role Analysis
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
[Role graph (figure): announce (action) – actor (who): Rex Luthor (person); time (when): today (time); object (what): establ. (event); location (where): Alderwood (location); object/complement: Boynton Lab (organization)]
Semantic Analysis
• Concept-Relation Extraction
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson …….
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.
“The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.” …
Information Extraction: Template Extraction
<Facility>Finsbury Park Mosque</Facility>
<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>
<Country>England</Country>
<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
<Country>England</Country>
<Country>France </Country>
<Country>United States</Country>
<Country>Belgium</Country>
<Person>Abu Hamza al-Masri</Person>
<City>London</City>
Streaming Text: Knowledge Extraction
• Leveraging some earlier work on information extraction from text streams
• Information extraction is the process of using advanced automated machine learning approaches to:
– identify entities in text documents
– extract this information, along with the relationships these entities may have, from the text documents
The visualization (a figure in the original slides) demonstrates information extraction of names, places, and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
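The windowed co-occurrence rule described above can be sketched as follows; entity mentions are assumed to be already identified (e.g., by an NE tagger), and the sentence and window size are made up for illustration:

```python
# Two entities are considered related when their mentions fall within
# `window` tokens of each other in the token stream.
def cooccurring_pairs(tokens, entities, window=5):
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = set()
    for i, (pos_a, ent_a) in enumerate(positions):
        for pos_b, ent_b in positions[i + 1:]:
            if ent_a != ent_b and pos_b - pos_a <= window:
                pairs.add(tuple(sorted((ent_a, ent_b))))
    return pairs

tokens = "Luthor announced a facility in Alderwood near London".split()
print(cooccurring_pairs(tokens, {"Luthor", "Alderwood", "London"}, window=5))
```

In a streaming setting the same check runs incrementally over a sliding buffer of recent tokens rather than a whole document.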
Ontological Association (WordNet)
• As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
POS | Unique Strings | Synsets | Word-Sense Pairs
Noun | 117,798 | 82,115 | 146,312
Verb | 11,529 | 13,767 | 25,047
Adjective | 21,479 | 18,156 | 30,002
Adverb | 4,481 | 3,621 | 5,580
Totals | 155,287 | 117,659 | 206,941
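As a sanity check, the per-POS rows in the table sum exactly to the stated totals:

```python
# Column-wise sums over the WordNet statistics table above.
rows = {
    "Noun":      (117798, 82115, 146312),
    "Verb":      (11529, 13767, 25047),
    "Adjective": (21479, 18156, 30002),
    "Adverb":    (4481, 3621, 5580),
}
totals = [sum(col) for col in zip(*rows.values())]
print(totals)  # [155287, 117659, 206941]
```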
Ontological Association (WordNet)
Search for “table”
• Noun
– S: (n) table, tabular array (a set of data arranged in rows and columns) "see table 1”
– S: (n) table (a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs) "it was a sturdy table”
– S: (n) table (a piece of furniture with tableware for a meal laid out on it) "I reserved a table at my favorite restaurant”
– S: (n) mesa, table (flat tableland with steep edges) "the tribe was relatively safe on the mesa but they had to descend into the valley for water”
– S: (n) table (a company of people assembled at a table for a meal or game) "he entertained the whole table with his witty remarks”
– S: (n) board, table (food or meals in general) "she sets a fine table"; "room and board”
• Verb
– S: (v) postpone, prorogue, hold over, put over, table, shelve, set back, defer, remit, put off (hold back to a later time) "let's postpone the exam”
– S: (v) table, tabularize, tabularise, tabulate (arrange or enter in tabular form)
Text Analytics: General Application Areas
• Information Retrieval
– Indexing and retrieval of textual documents
– Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
– Extraction of partial knowledge from the text
• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
– Predict a class for each text document
• Clustering
– Generate collections of similar text documents
Text Analytics: Supervised vs. Unsupervised
• Supervised learning (Classification)
– Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– Split into training data and test data for the model-building process
– New data is classified based on the model built with the training data
– Techniques
• Bayesian classification, decision trees, neural networks, instance-based methods, support vector machines
• Unsupervised learning (Clustering)
– Class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Text Analytics: Classification
• Given: Collection of labeled records
– Each record contains a set of features (attributes) and the true class (label)
– Create a training set to build the model
– Create a testing set to test the model
• Find: A model for the class as a function of the values of the features
• Goal: Assign a class (as accurately as possible) to previously unseen records
• Evaluation: What is good classification?
– Correct classification
• The known label of a test example is identical to the predicted class from the model
– Accuracy ratio
• Percent of test-set examples that are correctly classified by the model
– A distance measure between classes can be used
• e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
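The accuracy ratio above is simply the fraction of correctly classified test examples; the labels below are made up for illustration:

```python
# Accuracy = (# test examples where prediction matches known label) / (# examples)
def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

true_labels      = ["football", "crime", "basketball", "football"]
predicted_labels = ["football", "crime", "football", "basketball"]
print(accuracy(true_labels, predicted_labels))  # 0.5
```

A class-distance-weighted score would additionally penalize the football→crime kind of error more than football→basketball, as the slide suggests.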
Text Analytics: Clustering
• Given: Set of documents and a similarity measure among documents
• Find: Clusters such that
– Documents in one cluster are more similar to one another
– Documents in separate clusters are less similar to one another
• Similarity measures:
– Euclidean distance if attributes are continuous
– Other problem-specific measures
• e.g., how many words these documents have in common
• Evaluation: What is good clustering?
– Produce high-quality clusters with
• high intra-class similarity
• low inter-class similarity
– The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
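The "words in common" measure mentioned above is often normalized as Jaccard similarity between the word sets of two documents:

```python
# Jaccard similarity: |shared words| / |all distinct words in either doc|
def jaccard(doc_a, doc_b):
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    return len(a & b) / len(a | b)

print(jaccard("the cat sat", "the cat ran"))  # 0.5
```

Normalizing by the union keeps the score in [0, 1] so that long documents do not dominate just by containing more words.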
Text Analytics: Frequent Patterns
• Given: Set of documents
• Find: Frequent patterns such that
– Common word patterns are used across the collection
• Evaluation: What are good patterns?
• Results: 1,060 patterns discovered, including:
322: Lincoln
147: Abe
117: man
100: Mr.
100: time
98: Lincoln Abe
91: father
85: Lincoln Mr.
85: Lincoln man
75: day
70: Abraham
70: President
68: boy
67: Lincoln time
65: Lincoln Abraham
65: life
63: Lincoln father
57: men
57: work
52: Lincoln day
…
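A sketch of frequent-pattern counting in the spirit of the results above: count single words and co-occurring word pairs per document, then keep patterns that meet a support threshold (the mini-corpus and threshold are made up):

```python
# Count, per document, each word and each unordered word pair, then
# keep patterns appearing in at least `min_support` documents.
from collections import Counter
from itertools import combinations

docs = [
    "abe lincoln was a man",
    "lincoln was president",
    "abe lincoln the man",
]

counts = Counter()
for doc in docs:
    words = set(doc.split())              # presence per document, not raw frequency
    counts.update(words)                  # single-word patterns
    counts.update(frozenset(p) for p in combinations(sorted(words), 2))  # pairs

min_support = 2
frequent = {pattern for pattern, c in counts.items() if c >= min_support}

print(counts["lincoln"])                           # 3
print(frozenset({"abe", "lincoln"}) in frequent)   # True
```

Real frequent-pattern miners (e.g., Apriori-style algorithms) prune the pair/triple search space instead of enumerating it exhaustively.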
Text Analytics: Discus
• Given: Set of documents
• Find: Top sentences and top tokens
– Top sentences contain top tokens
– Top tokens exist in top sentences
• Results:
Meandre Server Interface
• Made mainly for administrators
• Provides one-click access to most of the services
• Sections
– Repository
– Publish
– Execution
– Cluster
– Location
– Security
– Server Logs
– Public
– About
Repository
• Navigate the current user’s repository
• List components and flows
• Search for components or flows
• Navigate by tags
• Clear all components and flows
• Regenerate the repository using the registered locations
Publish
• Allows one to manipulate the public repository of components and flows
• Allows one to publish and unpublish components and flows
• Publish and Unpublish All functionality
Execution
• Exposes the execution engine• Allows one to list current running flows• Allows one to upload and execute all the flows in a
self-contained repository• Allows one to tune and execute by changing
property settings• Exposes the list of jobs tracked by the server• Give access to the executing flows output consoles
Cluster
• Exposes the single-image cluster log
• Lists the status of the servers forming the cluster
• Exposes basic information about the servers
• Lists internal properties used by the Meandre cluster
• Allows shutdown of the Meandre Server currently being accessed
Location
• Allows addition of locations from which to import components and flows
• Allows removal of a location, which deletes its components and flows
• Lists the locations currently in use
Security
• Manage users
– Create
– Remove
– Update
– List
• Grant and revoke roles
– Execution
– Location
Public
• Exposes basic public services
– Published repository
– A simple demo repository of components and flows
– A ping/pong service
About
• Exposes server information
– Installation information
– Version
– Plugins installed on this server
Learning Exercises
1. Explore the Meandre Server Interface
A. Open a browser and point to http://SERVER:1714/public/services/ping.html
B. Explore flows
C. Execute flows
D. Tune and execute flows
E. Other functionality
Learning Exercises: Text Clustering
2. Execute the "Text Clustering" flow on a hard-coded web page
A. Click "flows"
B. Under "Action", click "run"
3. Execute the "Text Clustering" flow on a webpage of your choice
A. Click "flows"
B. Under "Action", click "tune&run"
C. On the webpage, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you
D. Click on "Running flows"
E. Click on the URL of the flow you just executed
Learning Exercises: Text Frequent Patterns
4. Execute the "Text Frequent Patterns" flow on a hard-coded web page
A. Click "flows"
B. Under "Action", click "run"
5. Execute the "Text Frequent Patterns" flow on a webpage of your choice
A. Click "flows"
B. Under "Action", click "tune&run"
C. On the webpage, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you
D. Results will be displayed in the console
Learning Exercises: Date Entities to Simile
6. Execute the "Date Entities to Simile Timeline" flow on a hard-coded web page
A. Click "flows"
B. Under "Action", click "run"
7. Execute the "Date Entities to Simile Timeline" flow on a webpage of your choice
A. Click "flows"
B. Under "Action", click "tune&run"
C. On the webpage, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you
D. Click on "Running flows"
E. Click on the URL of the flow you just executed
Learning Exercises: HITS Summarizer
8. Execute the "HITS Summarizer" flow on a hard-coded web page
A. Click "flows"
B. Under "Action", click "run"
9. Execute the "HITS Summarizer" flow on a webpage of your choice
A. Click "flows"
B. Under "Action", click "tune&run"
C. On the webpage, replace "http://www.gutenberg.org/files/22925/22925.txt" with a web URL of interest to you
D. Results will be displayed in the console