TRANSCRIPT
Portable Classification Tools
Mark Shewhart
LexisNexis
21 June 2001
Overview
• Classification Tools and Types
• Consistent Controlled Classification Schemes Across All Content
• Benefits of C.C.C.S.
• Approaches to “Portable” Classification
• Challenges
• Examples
• Q & A
Introduction
Mark Shewhart
LexisNexis
One of the early innovators in building online databases and search tools with classification
Currently provides an increasing range of tools, solutions, and services to support the information needs of government organizations, companies, and individuals
Uncontrolled Classification
PROS
No manual development of classification algorithms or searches
Aids in knowledge discovery & taxonomy development
Adapts to changing terminology and topics
CONS
Difficult to provide meaningful labels for the taxonomy
Problematic for fine-grained rules
Examples
Verity, Semio, SRA’s NetOwl Extractor, Inxight’s ThingFinder, LEXIS-NEXIS core-terms
Controlled Classification
Machine Learning
Provide several hundred “on-point” samples per topic
Most systems do not allow for manual intervention
Examples - Verity, Semio, Autonomy, InXight, Purple Yogi,
Webmind, Fulcrum, SmartLogik.
Manually Created “Algorithms”
Human Indexers manually create the algorithm for each topic
Examples - Any Boolean Search Engine, Verity, InXight
Classifier, LEXIS-NEXIS SmartIndexing, Factiva Intelligent
Indexing, Metacode, Sageware.
Basic search tools with complex queries created by domain experts are a form of controlled classification
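As a rough illustration (not any vendor's engine), a hand-built Boolean rule set can be applied as a controlled classifier; the topic names and rules below are invented for the sketch:

```python
# A minimal sketch of domain-expert Boolean queries used as a controlled
# classifier: a document is tagged with every topic whose rule matches.

def matches(doc: str, all_of=(), any_of=()) -> bool:
    """AND the all_of terms, OR the any_of terms, over a bag of words."""
    words = set(doc.lower().split())
    return all(t in words for t in all_of) and (
        not any_of or any(t in words for t in any_of))

def classify(doc: str, rules: dict) -> list:
    """Return every topic whose hand-built rule matches the document."""
    return [topic for topic, rule in rules.items() if matches(doc, **rule)]

# Hypothetical rules written by a domain expert
rules = {
    "ANTITRUST": {"all_of": ("antitrust",), "any_of": ("lawsuit", "doj")},
    "STOCKS": {"any_of": ("shares", "nasdaq", "stock")},
}
print(classify("microsoft antitrust lawsuit moves to appeals court", rules))
# ['ANTITRUST']
```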
Natural Language
Verity, Alta-Vista, LexisNexis, West ...
Boolean
MS Site Server, Alta-Vista, LexisNexis, West, Factiva,
Dialog ...
Enhanced - additional “beyond boolean” operators/control
Verity, Semio ...
Controlled Classification
Taxonomy Development
Several companies market tools focused on taxonomy
development
Knowledge Discovery
Relationships between terms
New or changing terms
Uses for Uncontrolled Classification
Consistent Classification Scheme Everywhere
Your Intranet, The Web, and Premium Content Providers
Search all three using the same taxonomy
A consistent, controlled, classification scheme facilitates
data analysis & visualization - BIZ360, I2
Intra-document linking by taxonomy nodes
Investigative Analysis of content
Consistent Classification - One Stop Search
Premium Content
Your Intranet
Web Content
One Stop Search
Mining
Consistent Classification - Locate & Link
Dossier
Case Law
Patents
Computer Company News
Computing & Tech News
Microsoft News
Cases with Microsoft as a Party
Explore LEXIS-NEXIS for Microsoft
Microsoft Web Site
Company Tracking and Analysis
MICROSOFT CORP
INTEL
DELL COMPUTER CORP
Your Companies
User pre-selects companies to track.
Microsoft Corp News Coverage
[Chart: articles per day, 3/1/01 through 3/8/01]
Company Tracking and Analysis
MICROSOFT CORP
INTEL
DELL COMPUTER CORP
Your Companies
MSFT Stock Closing
[Chart: MSFT closing price, 3/1/01 through 3/8/01]
User selects Microsoft Corp.
Higher than average coverage flagged
Company Tracking and Analysis
MICROSOFT CORP
INTEL
DELL COMPUTER CORP
Your Companies
Microsoft Corp News Coverage
[Chart: articles per day, 3/1/01 through 3/9/01]
MSFT Stock Closing
[Chart: MSFT closing price, 3/1/01 through 3/9/01]
The next day - User is back again
Extremely high coverage flagged
Company Tracking and Analysis
MICROSOFT CORP
INTEL
DELL COMPUTER CORP
Your Companies
Microsoft Corp News Coverage
[Chart: articles per day, 3/1/01 through 3/9/01]
MSFT Stock Closing
[Chart: MSFT closing price, 3/1/01 through 3/9/01]
Click on the red circle for News Topic Analysis
Topic Analysis
[Bar chart: articles per topic - EXECUTIVE CHANGES, STOCKS, LAWSUITS, BILL GATES, US DEPARTMENT OF JUSTICE]
Company Tracking and Analysis
MICROSOFT CORP
INTEL
DELL COMPUTER CORP
Your Companies
Microsoft Corp News Coverage
[Chart: articles per day, 3/1/01 through 3/9/01]
MSFT Stock Closing
[Chart: MSFT closing price, 3/1/01 through 3/9/01]
User clicks on the “STOCKS” bar for the news
Topic Analysis
[Bar chart: articles per topic - EXECUTIVE CHANGES, STOCKS, LAWSUITS, BILL GATES, US DEPARTMENT OF JUSTICE]
Answer Set Navigation
Executive Changes
Stocks
Lawsuits
Topic Analysis
[Bar chart: articles per topic - EXECUTIVE CHANGES, STOCKS, LAWSUITS, BILL GATES, US DEPARTMENT OF JUSTICE]
User clicks on Topic Analysis
More Executive Changes
More Stocks
More Lawsuits
Consistent Classification - Trending
Trend Analysis of Metadata
[Chart: articles per quarter, 3Q 99 through 2Q 00 - Online Trading, Electronic Commerce, Internet Crime]
NEXercise
User Selected Indexing Terms:
Download into Excel Spreadsheet
Online Trading
Electronic Commerce
Internet Crime
                      3Q 99   4Q 99   1Q 00   2Q 00
Online Trading          192    1354    3303   15121
Electronic Commerce    5160    8788   13300    8558
Internet Crime          680     918    1565    1426
Consistent Classification - Press Trending
Trending in the News
• International Herald Tribune (Neuilly-sur-Seine, France), July 4, 2000, Tuesday ... The National Security Agency certainly features regularly in Mr. Gertz's coverage. A Lexis-Nexis search lists 132 Gertz stories in The Washington Times going back to 1989 that have mentioned the agency.
• The Washington Post, June 28, 2000, ... easily discern one of the issues of greatest concern to voters: George W. Bush's position on the death penalty. A Nexis search Monday for stories mentioning Bush at least three times and the words "death penalty" or "executions" or "capital punishment" at least three ...
• The New York Times, June 14, 2000, ... tally the Hotline political tip sheet keeps of how often possible vice-presidential choices merit a major media mention. Mr. Danforth had 10 mentions, compared with 49 for Gov. Tom Ridge of Pennsylvania, No. 1 on the 53-name list.
• The Washington Times, May 05, 2000, ... "A Nexis search of 'extreme right' over the past month scored 212 mentions; a Nexis search of 'extreme left' over the past month yielded 58 items.
• MC Technology Marketing Intelligence, December 1, 1999 ... We looked at such quantitative data as stock performance in 1999 and the number of press mentions (as shown in a Lexis-Nexis search),
• Fortune, October 12, 1998, ... Just how addicted to cliches are financial media editors? Here's a list of fave words and the number of stock market stories in which they appeared, generated by a Lexis-Nexis search from the end of August to Sept. 11: Turmoil: 1,559; plunge: 1,260; crash: 965; correction: 860; bear market: 750; ...
Consistent Classification - Source Suggestion
Automatic Suggestion of Sources
LEXIS-NEXIS Suggest-a-Source
User Selected Indexing Term
LEXIS-NEXIS top Sources for Denver Broncos
• Rocky Mountain News
• Denver Post
• Sports Network
• Associated Press
• Seattle Post-Intelligencer
• USA Today
• Washington Post
• Orlando Sentinel
• Kansas City Star
• Regal-Fort Worth Star
• San Diego Union Tribune
LEXIS-NEXIS top Sources for IPO’s
• Cable News Network F
• M&A Journal
• AFX-Extel News
• PR Newswire
• Business Wire
• Phillips Newsletter
• Financial Times
• Institutional Invest
• IAC News
• Business Times
• Cable News Network
• Asia Intelligence Wire
• Financial Post
• New York Post
IPOs
LEXIS-NEXIS Suggest-a-Source
User Selected Indexing Term
Denver Broncos
•What are these?
Consistent Classification - More Than a Cite List
Source Analyzer
NEXIS Source Analyzer™
Dayton Daily News Topics
2697 Sports, 2616 Athletes, 2181 Basketball, 1871 Campaigns & Elections, 1772 College Sports, 1503 Cities, 1476 Lawyers, 1473 Baseball & Softball, 1438 High School Sports, 1345 Violent Crime, 1258 Litigation, 1207 Sentencing, 1158 Judges, 1132 American Football, 1086 Fundraising, 937 Television Programming, 931 Deaths & Obituaries, 857 Diseases & Disorders, 852 Settlements & Decisions, 837 Arrests
Source Analyzer™
User Selected Sources:
Download into Excel Spreadsheet
Dayton Daily News
Washington Post
LA Times
NEXIS Source Analyzer™
Washington Post Topics
11410 Sports, 8567 Campaigns & Elections, 7439 Athletes, 6415 Lawyers, 4665 Basketball, 4498 Violent Crime, 4393 Banking & Finance, 4265 Entertainment & Arts, 4155 Baseball & Softball, 3938 Judges, 3753 International Relations, 3703 Budget, 3675 College Sports, 3557 Cities, 3397 Litigation, 3384 Sentencing, 3243 Candidates, 3202 American Football, 3109 Television Programming, 2758 Fundraising
NEXIS Source Analyzer™
Los Angeles Times Topics
6080 Sports, 3375 Cities, 3101 Campaigns & Elections, 2915 High School Sports, 2815 Athletes, 2800 Lawyers, 2360 Basketball, 2347 Baseball & Softball, 2341 Letters & Comments, 2241 College Sports, 2188 Violent Crime, 2113 San Fernando Valley, 1918 Television Programming, 1851 Litigation, 1793 Judges, 1711 Deaths & Obituaries, 1504 Editorials & Opinions, 1410 Environment, 1391 Television Industry, 1380 Sentencing
• Source Analyzer highlights Common Terms
Consistent Classification - More Than a Cite List
Source Analyzer
NEXIS Source Analyzer™
Financial Times Topics
61039 Banking & Finance, 32061 Mergers & Acquisitions, 18869 Telecommunications, 18112 Trade Agreements, 17499 Campaigns & Elections, 13484 Currencies, 11458 Computing & Technology, 11121 International Relations, 11056 Exchange Rates, 11009 Privatization, 10229 Emerging Markets, 10160 Energy, 9015 Joint Ventures, 8959 Stock Indexes, 8680 Debt, 8609 Budget, 8606 Automakers, 8424 Engineering, 8347 Central Banks, 8110 Taxes
Source Analyzer™
User Selected Sources:
Download into Excel Spreadsheet
Financial Times
USA Today
NEXIS Source Analyzer™
USA Today Topics
30235 Sports, 17591 Athletes, 9006 Baseball & Softball, 9003 College Sports, 8989 Basketball, 8287 Television Programming, 7501 American Football, 7355 Campaigns & Elections, 6485 Lawyers, 6370 Banking & Finance, 5662 Olympics, 4975 Entertainment & Arts, 4884 Television Industry, 4469 Polls & Surveys, 3975 Litigation, 3832 Airlines, 3363 Judges, 3335 Violent Crime, 3331 International Relations, 2933 Network Television
• Source Analyzer™ highlights Common Terms
• The New Republic, July 26, 1999 ... The U.S. section is lambasted for repeating what was reported in the American press. To prove it, Sullivan does a Nexis search on the topic of each article in a random issue and compares what he finds to The Economist. The results are not surprising.
Reporter Analysis
What is a reporter covering?
NEXIS ByLine Analyzer™
Topics reported by Steve Schmidt
13 CITIES, 10 NATIONAL PARKS, 10 CAMPAIGNS & ELECTIONS, 8 SUBURBS, 8 MARRIAGE, 7 THEME PARKS, 6 VIOLENT CRIME, 6 SECONDARY SCHOOLS, 5 SPORTS, 5 PUBLIC TRANSPORTATION
ByLine Analyzer™
User Selected Reporter:
Download into Excel Spreadsheet
Steve Schmidt
NEXIS ByLine Analyzer™
Companies reported by Steve Schmidt
5 MICROSOFT CORP, 1 WALT DISNEY CO INC, 1 PACIFIC LUMBER CO, 1 PACIFIC BELL, 1 MAPES HOTEL, 1 DESTINATION PALM BEACH, 1 ALTURAS CASINO, 1 ALASKA AIR GROUP INC
NEXIS ByLine Analyzer™
People reported by Steve Schmidt
4 DAVID KNIGHT, 3 SHAWN STINSON, 3 EMILIO ESTEVEZ, 3 CHARLIE SHEEN, 3 BILL GATES, 3 ALBERT GORE JR, 2 WILLIE L BROWN, 2 SCOTT HINSON, 2 PETE KNIGHT, 2 MICHAEL GONZALEZ
NEXIS ByLine Analyzer™
Organizations reported by Steve Schmidt
4 SAN DIEGO STATE UNIVERSITY, 4 FEDERAL BUREAU OF INVESTIGATION, 3 SAN DIEGO CITY COUNCIL, 3 NATIONAL PARK SERVICE, 2 WILD HORSE ORGANIZED ASSISTANCE, 2 VALLEY MIDDLE SCHOOL, 2 UNIVERSITY OF CALIFORNIA (LOS ANGELES), 2 SAN DIEGO PADRES, 2 HELIX HIGH SCHOOL, 1 YOSEMITE INSTITUTE
Topic Analysis
Who's involved, and who's reporting, on the recent rash of bacteria-related product recalls?
NEXIS Topics Analyzer™
Top Reporters
2 ROBERT WALKER, 2 NICOLE BAILEY, 2 LYNNE KOZIEY, 1 SHAWN OHLER, 1 SARAH GREEN, 1 QUINTIN ELLISON, 1 MATTHEW P BLANCHARD, 1 MARTHA M. HAMILTON, 1 MARLENE HABIB, 1 MARK BROWN, 1 LYLE HARVEY, 1 KATHERINE HARDING, 1 KAREN CLARK LEPOOLE, 1 JOHN TAYLOR, 1 JESSICA HANSEN, 1 IAN MCDOUGALL, 1 FRED ANKLAM JR, 1 DONNA CASEY, 1 DINA CAPPIELLO, 1 CHU SHOWWEI, 1 CHRISTINE WINTER, 1 BILL EGBERT, 1 BARBARA DURBIN
Topic Analyzer™
User Selected Topics:
Download into Excel Spreadsheet
Product Recalls
Bacteria
NEXIS Topic Analyzer™
Top related Companies
29 MOYER PACKING CO, 16 IBP INC, 12 PACKERLAND PACKING CO INC, 11 KRAFT FOODS, 6 LAKESIDE FARM INDUSTRIES, 5 PHILIP MORRIS COS INC, 5 FOOD SAFETY & INSPECTION SERVICE, 4 SNOW BRAND MILK PRODUCTS CO LTD, 3 GARDEN BOTANIKA INC, 2 XL FOODS, 2 STOP & SHOP SUPERMARKET CO, 2 LAKESIDE PACKERS, 2 GIANT FOOD STORES INC, 2 DEL GOULD MEATS INC, 2 COSTCO WHOLESALE CORP
Approaches
• “ASP” Service Model - the customer sends documents over the Internet to the service provider, which returns categories
Approaches
• Port the classification application to run in the user’s environment
• Software
• Intellectual Capital
Approaches
• Port the Intellectual Capital to another classification system’s format & logic
Verity Users
Semio Users
Autonomy Users
Hummingbird Users
Inxight Users
Challenges
• Operator Incompatibility
• Parsing vs Inverted Word Index Tools
• Document Length Adjustments
Search Operator Compatibility
• Many Boolean search systems do not have a frequency operator - ATLEASTn( term ) at LexisNexis
• Years ago, LexisNexis noticed that many experienced searchers were simulating a frequency operator by cascading an existing proximity operator:
– cat W/9999 cat W/9999 cat
– to simulate ATLEAST3( cat )
• How do we port an ATLEASTn() search to a system without a proximity operator, or to a system that does not cascade proximity operators?
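The searchers' cascading trick can be expressed as a tiny query rewriter. ATLEASTn and W/n are the LexisNexis operators named above; the helper function itself is hypothetical:

```python
# Rewrite a frequency requirement as a cascade of wide proximity operators,
# the way experienced searchers simulated ATLEASTn before it existed.

def atleast_to_cascade(term: str, n: int, window: int = 9999) -> str:
    """ATLEASTn(term) -> 'term W/window term W/window ... term' (n copies)."""
    return f" W/{window} ".join([term] * n)

print(atleast_to_cascade("cat", 3))   # cat W/9999 cat W/9999 cat
```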
Porting Boolean Searches - Verity Example
ATLEASTn Operator
LNG Boolean: ATLEASTn( expr )
Verity:
<COMPLEMENT>( <YESNO>( <COMPLEMENT>( <AND>( <MULT/[10000/n]>( <FREQ>( expr ) ) ) ) ) )
NOTE:
• ATLEASTn( expr1 or expr2 or ... or exprX ) is equivalent to ATLEASTn( expr1 ) or ATLEASTn( expr2 ) or ... or ATLEASTn( exprX )
• ATLEASTn( expr1 and expr2 and ... and exprX ) is equivalent to ATLEASTn( expr1 ) and ATLEASTn( expr2 ) and ... and ATLEASTn( exprX )
Automatic Stemming - Precision Issues
Many search engines perform automatic stemming, which is needed for depluralization and was assumed when the Search Advisor searches were created and tested. Unfortunately, this "stemming" allows words to match morphological variants other than singular/plural forms. For example, a search on CONSTITUTION may match CONSTITUTIONAL. This causes the ported searches to retrieve documents that the LN Boolean search does not. Some possible solutions:
• Do nothing. The words are often similar in concept. This would require more detailed domain-by-domain analysis.
• Some search tools allow the user to put "quotes" around terms to turn off stemming. If so, put quotes around all terms and generate additional terms in our search to simulate depluralization.
• Put quotes around all terms and do NOT generate new terms. This omits depluralization as well - likely a huge recall hit.
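The second option - quote the terms and regenerate only the singular/plural variants ourselves - might look like this naive sketch; real depluralization must handle many more cases (irregulars, etc.):

```python
# Naive sketch: quote a term to disable stemming, then OR in a generated
# plural so depluralization is preserved. The pluralization rules here are
# deliberately simplistic and only illustrative.

def quoted_with_plurals(term: str) -> str:
    variants = {term}
    if term.endswith(("s", "x", "z", "ch", "sh")):
        variants.add(term + "es")
    elif term.endswith("y") and term[-2:-1] not in "aeiou":
        variants.add(term[:-1] + "ies")
    else:
        variants.add(term + "s")
    return " or ".join(f'"{v}"' for v in sorted(variants))

print(quoted_with_plurals("constitution"))
# "constitution" or "constitutions" -- CONSTITUTIONAL no longer matches
```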
Porting Boolean Searches - Recall Issues
Proximity operators are affected by differences in the set of non-searchable "noise" words. Porting LexisNexis searches to a system with fewer noise words will cause some documents matched by LexisNexis' search engine not to be retrieved.
For example, the search ATTACHED w/5 POLE matches in LN but may not in the following text:
"cable attached to the hopper which the gin-pole".
This also occurs in phrases, which are W/1 (really a phrase). We may also miss documents on the term SURETY CONTRACT when LN matched it in the phrase SURETY TO THE CONTRACT.
Possible solution - increase n by 1 or 2 in the ported search. This could have precision impacts.
Porting Uncontrolled Classification Tools To Yours
.4 cat
.2 dog
.3 puppy
.4 mouse
Natural Language Search :
cat, dog, puppy, mouse
Natural Language Search :
cat, cat, cat, cat, dog, dog, puppy, puppy, puppy, mouse, mouse, mouse, mouse
New Weighted Natural Language Search that does not use TFIDF:
cat(0.4), dog(0.2), puppy(0.3), mouse(0.4)
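The repeated-term query above (cat, cat, cat, cat, ...) can be generated from the cluster weights mechanically; a sketch, assuming roughly one copy of each term per 0.1 of weight:

```python
# Approximate per-term weights on a plain natural-language engine by
# repeating each term in proportion to its weight.

def weights_to_query(weights: dict, scale: int = 10) -> list:
    """{'cat': 0.4} -> ['cat', 'cat', 'cat', 'cat'] with the default scale."""
    terms = []
    for term, w in weights.items():
        terms.extend([term] * round(w * scale))
    return terms

print(weights_to_query({"cat": 0.4, "dog": 0.2, "puppy": 0.3, "mouse": 0.4}))
```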
• Many companies market uncontrolled classification tools that automatically create categories
• Many cluster terms and assign weights differently than TFIDF
LN Topical Indexing to Verity Example
#SUBJECT:
#CVTS:
#SUBJ=CATS & DOGS EXAMPLE
#TERMS:
#WEIGHT=1
#THRESH=5
#FREQLMT=4  {fl01 = 4}
#TERM01=cat
#TERM01=cats
#FREQLMT=4  {fl02 = 4}
#TERM02=dog
#TERM02=dogs
Word Concept Buckets
The #TERM01 word concept count, with a frequency limit of 4, on a scale of 0.0 to 1.0, can be represented in Verity as:
<SUM>( <AND>( <MULT/2500>( <FREQ>( "cat" ) ) ),
       <AND>( <MULT/2500>( <FREQ>( "cats" ) ) ) )
The #TERM02 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 is represented in Verity as:
<SUM>( <AND>( <MULT/2500>( <FREQ>( "dog" ) ) ),
       <AND>( <MULT/2500>( <FREQ>( "dogs" ) ) ) )
Word Concept Buckets
Examples of the TERM01 word concept counts (FL=4)
# cat/cats    <SUM>( <AND>( <MULT/2500>( <FREQ>( "cat" ) ) ), <AND>( <MULT/2500>( <FREQ>( "cats" ) ) ) )
0             0.00
1             0.25
2             0.50
3             0.75
4             1.00
5+            1.00
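The bucket table above is just a capped linear ramp, min(count, FL) / FL; a one-line sketch:

```python
# Frequency-limited concept score: 0.25 per occurrence, capped at the
# frequency limit (FL=4 in the CATS & DOGS example).

def concept_score(count: int, freq_limit: int = 4) -> float:
    """min(count, FL) / FL, on a scale of 0.0 to 1.0."""
    return min(count, freq_limit) / freq_limit

print([concept_score(c) for c in range(6)])
# [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
```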
Blocking Effect
#SUBJECT:
#CVTS:
#SUBJ=CAT DOG EXAMPLE
#TERMS:
#THRESH=4
#FREQLMT=5  {fl01 = 5}
#TERM01=cat dog
#FREQLMT=3  {fl02 = 3}
#TERM02=cat
#TERM02=dog
#BLOCK=cat food
#BLOCK=dog food
• In SmartIndexing, we do not count "cat" if it is in the phrase "cat dog"
• This is the Blocking Effect
• This is not natural in inverted-word-index-based search systems
• Very unnatural - "cats and dogs, sleeping together - total hysteria"
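A sketch of the Blocking Effect on a flat token list (a simplification of SmartIndexing's actual matching): an occurrence of "cat" is not counted when it falls inside a blocked phrase such as "cat dog" or "cat food".

```python
# Count a term's occurrences, skipping any occurrence that lies inside a
# blocked phrase - the Blocking Effect described above.

def blocked_count(text: str, term: str, blocks: list) -> int:
    tokens = text.lower().split()
    covered = set()                       # token positions inside some block
    for phrase in blocks:
        p = phrase.lower().split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] == p:
                covered.update(range(i, i + len(p)))
    return sum(1 for i, t in enumerate(tokens) if t == term and i not in covered)

text = "cat food spilled as the cat dog chased one stray cat"
print(blocked_count(text, "cat", ["cat dog", "cat food"]))   # 1
```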
Blocking Effect
• Verity has the <FREQ> operator, which counts term frequency without the Blocking Effect.
• So the "cat" in "cat dog" is counted. But ...
• <LN-FREQ>( "cat" ) = <FREQ>( "cat" ) - <FREQ>( "cat dog" ) - <FREQ>( "cat food" )
We have term counts with the blocking effect ….
… Whoops! Verity does not have a <SUBTRACT> operator!
Learning to Subtract
• Introducing <LNG_SUBTRACT>( b, a ), defined as b - a =
<COMPLEMENT>( <SUM>( <COMPLEMENT>( b ), a ) )
where 0 <= a <= b <= 1
Follow the math ....
<COMPLEMENT>( <SUM>( <COMPLEMENT>( b ), a ) )
= <COMPLEMENT>( <COMPLEMENT>( b ) + a )
= <COMPLEMENT>( ( 1 - b ) + a )
= 1 - ( 1 - b + a )
= 1 - 1 + b - a
= b - a
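A quick numeric check of the derivation, assuming Verity's <COMPLEMENT> is 1 - x and <SUM> is addition capped at 1.0 (the cap never triggers when 0 <= a <= b <= 1):

```python
# Model the two Verity operators and verify that the composed expression
# behaves as subtraction on the valid range.

def complement(x: float) -> float:
    return 1.0 - x

def verity_sum(x: float, y: float) -> float:
    return min(1.0, x + y)

def lng_subtract(b: float, a: float) -> float:
    """<COMPLEMENT>( <SUM>( <COMPLEMENT>( b ), a ) ) == b - a for 0 <= a <= b <= 1."""
    return complement(verity_sum(complement(b), a))

print(lng_subtract(0.75, 0.25))   # 0.5
```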
Actual Results from CATS & DOGS EXAMPLE
Cats & Dogs Test Summary expected results
Score (Doc#)   0 cat/cats    1 cat/cats    2 cat/cats    3 cat/cats    4 cat/cats    5+ cat/cats
0 dog/dogs     0.0   (CD1)   0.125 (CD7)   0.25  (CD11)  0.375 (CD14)  0.50  (CD16)  0.50  (CD17)
1 dog/dogs     0.125 (CD2)   0.25  (CD8)   0.375 (CD12)  0.50  (CD15)  0.625 (CD27)  0.625 (CD32)
2 dog/dogs     0.25  (CD3)   0.375 (CD9)   0.50  (CD13)  0.625 (CD23)  0.750 (CD28)  0.750 (CD33)
3 dog/dogs     0.375 (CD4)   0.50  (CD10)  0.625 (CD20)  0.750 (CD24)  0.875 (CD29)  0.875 (CD34)
4 dog/dogs     0.50  (CD5)   0.625 (CD18)  0.750 (CD21)  0.875 (CD25)  1.00  (CD30)  1.00  (CD35)
5+ dog/dogs    0.50  (CD6)   0.625 (CD19)  0.750 (CD22)  0.875 (CD26)  1.00  (CD31)  1.00  (CD36)
Cats & Dogs Test Actual Results
Score (Doc#)   0 cat/cats     1 cat/cats     2 cat/cats     3 cat/cats     4 cat/cats     5+ cat/cats
0 dog/dogs     0.0000 (CD1)   0.1247 (CD7)   0.2494 (CD11)  0.3746 (CD14)  0.4997 (CD16)  0.5000 (CD17)
1 dog/dogs     0.1247 (CD2)   0.2494 (CD8)   0.3742 (CD12)  0.4993 (CD15)  0.6244 (CD27)  0.6247 (CD32)
2 dog/dogs     0.2494 (CD3)   0.3742 (CD9)   0.4989 (CD13)  0.6240 (CD23)  0.7492 (CD28)  0.7494 (CD33)
3 dog/dogs     0.3746 (CD4)   0.4993 (CD10)  0.6240 (CD20)  0.7492 (CD24)  0.8743 (CD29)  0.8746 (CD34)
4 dog/dogs     0.4997 (CD5)   0.6244 (CD18)  0.7492 (CD21)  0.8743 (CD25)  0.9994 (CD30)  0.9997 (CD35)
5+ dog/dogs    0.5000 (CD6)   0.6247 (CD19)  0.7494 (CD22)  0.8743 (CD26)  0.9997 (CD31)  1.0000 (CD36)
Verity Threshold = THRESH/MAX = 5/8 = 0.625