georgetown lecture 2012 6 2 full
Guest lecture I gave at Georgetown Law eDiscovery class.
04/13/2023 1
“Triggers,” Preservation & Search
June 2, 2012
Georgetown Law
Sonya L. Sigler
04/13/2023 2
Overview
Triggers & Preservation
• What is it?
• Why Does it Matter?
Search
Keyword Search
Clustering
Ontologies
Technology Enhanced Review - Sampling
Social Networking Analysis
Relationship Analysis
“Triggers” & Preservation
What is a Trigger?
– Litigation reasonably anticipated
– Who decides
Litigation Hold Continuum
– Established in hindsight
– Threat
– Letter about litigation
– Filing Suit
Cases
– Pippins, Zubulake, Pension Committee
04/13/2023 3
Pippins v. KPMG
How much data to Preserve?
– All hard drives (Pippins’ position)
– 100 Sample Hard drives (KPMG’s position)
To Cooperate or NOT to Cooperate?
How Judges React to Lack of Cooperation
04/13/2023 4
Zubulake
Litigation Holds
– Cannot send a request into the ether
Preservation
Have to follow up
Take affirmative steps to monitor compliance
In-house Counsel Duty
Cannot leave it to employees’ discretion
Document what was done
04/13/2023 5
Pension Committee
No intentional destruction of data
Careless & indifferent
No Latchkey Custodians (alone & unsupervised)
– Identify Custodians
– Monitor their efforts
– Including former employees and third parties
Proactive
Consistent
Reasonable Approach
04/13/2023 6
Triggers
When does a duty to preserve arise?
04/13/2023 7
What To Do?
Who to include?
– Not about data volume
– Not about contact with underlying “litigation”
Key Players (Zubulake opinions)
– Likely to have relevant information
– CEO, Board, Committees, employees, etc.
Produce it from the Key Player (not others)
– Nursing Home Pension Fund v. Oracle
– Produce emails from the CEO (15) not others (1,650)
04/13/2023 8
Spoliation
Failure to Preserve
– Didn’t Ask
• Right person
• Right Place
– Didn’t follow up
Destruction of Data
– Intentional
– Inadvertent destruction
What can happen
– Sanctions
– Adverse Inferences
04/13/2023 9
Search
How to Use it To Find Information
How to Use it to Ignore Information
When to use which search methodology
04/13/2023 11
Search - Data Assessment
Where is the Data?
– Data Mapping - databases, servers, desktops, laptops, IMs, smart phones, voicemail, other records
Defining Process from Collection to Review to Production
Collection Strategy, Process, Approach
– Scope of collection: custodians, date ranges, topics
Reports on the Data Processing
– File types, encrypted files, de-duplication rates, password protected files, etc.
Not Reasonably Accessible data
Assessing Risk of Data Loss
04/13/2023 12
Search - Case Assessment
Who - Cast of Characters
What - What the Heck Happened?
Where - Where did it take place?
When - What time period are we concerned with?
How - fraud, antitrust violation, etc.
WHY - What were the motives involved?
Data Assessment ≠ Effective Case Assessment
04/13/2023 13
Keyword Search Under Scrutiny
United States v. O’Keefe (Facciola)
– Questioned lawyers’ ability to decide which search terms are more likely to produce relevant information
– Facciola has also suggested that litigants take a look at advanced search methodologies
Victor Stanley, Inc. v. Creative Pipe, Inc. (Grimm)
– Defensibility of process AND execution lies with the party relying upon the search protocol to meet its obligations; that party needs to be able to explain search rationale, appropriateness, and proper implementation
– Advocates quality assurance, e.g. by sampling
– Searches should be designed by a competent practitioner
04/13/2023 14
Keyword Specific Case
William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Company
SDNY, Judge Andrew Peck
Keyword list was in the thousands
Use the actual data set and custodians to figure out keywords
“This case is just the latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails. Prior decisions from Magistrate Judges in the Baltimore-Washington Beltway have warned counsel of this problem, but the message has not gotten through to the Bar in this District.”
04/13/2023 15
$6M Keyword Mistake
In re Fannie Mae Securities Litigation
3rd Party - OFHEO
DC Circuit - Judge David Tatel
Attorney agreed to something he did NOT understand
Long list of key terms
Taxpayers suffered the consequence
04/13/2023 16
What This Means
• The Courts are finally catching up
• Courts actively ruling on Standards of Care and Process
• Lawyers are Getting Wise
04/13/2023 17
Case Law Effects on Discovery
Defensibility of Review Process is now a focus
– Culling now can kill you later
– Cooperation is a hot topic
– Tussle between inside & outside counsel
– Beginning to see planning as a necessity
Increased focus on Quality
– Heightened involvement expected from corporate clients in the overall process
– Cases pushing this: Qualcomm, Creative Pipe
04/13/2023 18
What Else Is There?
Effort to establish & codify uniform “Best Practices”
– Quickly becoming roadmap for uneducated industry
– Increasingly relied upon by judges as measure of reasonable or standard behavior
Publications have addressed:
– Document retention & production
– Email management
– Search & Retrieval
– Protective orders & confidentiality
– ESI admissibility
04/13/2023 19
Getting to a Manageable Review Set
[Chart segments: Intake Data 100%; Duplicates 25%; Junk/Spam/Porn 20%; Non-Responsive 20%; NR/Priv 20%; Responsive & Priv 15%; Produced 12.25%]
These figures vary based upon the data set received
Focus on finding, reviewing & using the “right” data, not just filtering data
04/13/2023 20
Search Methodologies
– Keyword: specific exact words, proximity searches, stemming
– Ontology: generalized words or phrases
– Clustering: similarity of salient features
– Social Network Analysis: relationships among relevant people
– Relationship Analysis: documents with causal or sequential relationship
Content
Concept
Context
Visualization
Measurement
04/13/2023 21
Keyword Accuracy Example
8,553 responsive documents missed by keyword search (Almost 8% of responsive documents missed by keyword search - Under-inclusive)
Keyword search reduced the document set by only 47%
And 88% of the documents returned by keyword search were not responsive (Over-inclusive)
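The two percentages on this slide are the complements of recall and precision. The sketch below shows the arithmetic only. The counts are hypothetical, chosen so the ratios land near the slide's figures, and are not the data from the case behind the slide; `precision_recall` is a name introduced here for illustration.

```python
def precision_recall(retrieved_responsive, retrieved_total, responsive_total):
    """Return (precision, recall) for one keyword search run."""
    precision = retrieved_responsive / retrieved_total   # share of hits that are responsive
    recall = retrieved_responsive / responsive_total     # share of responsive docs that were hit
    return precision, recall

# Hypothetical counts, for illustration only.
precision, recall = precision_recall(
    retrieved_responsive=120,   # responsive documents the keywords returned
    retrieved_total=1000,       # everything the keywords returned
    responsive_total=130,       # all responsive documents in the collection
)

over_inclusive_share = 1 - precision   # returned but not responsive
under_inclusive_share = 1 - recall     # responsive but missed by the keywords
```

With these assumed counts, 88% of what the search returned is not responsive and a little under 8% of the responsive documents are missed, which is the pattern the slide describes.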
04/13/2023 22
Myth
Keyword Searching is the Way to Go
If I agree to keyword terms, I am OK
Keyword Search Cases
Keyword replacement example
Keyword substitution
Missing in Action (Under-inclusive)
Unwanted Extras (Over-inclusive)
Multiple subject/persons (Disambiguate)
04/13/2023 23
Manual review by humans of large amounts of information is as accurate and complete as possible - perhaps even perfect - and constitutes the gold standard by which all searches should be measured
This is “The reigning Myth of ‘perfect’ retrieval using traditional means”
Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, The Sedona Conference Journal (2007) p. 199
Human beings retrieved less than 20% of the relevant documents when they believed they were retrieving over 75%
An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, Blair & Maron (1985)
Fact or Myth?
IS 240 – Spring 2011
Blair and Maron 1985
A classic study of retrieval effectiveness
– earlier studies were on unrealistically small collections
Studied an archive of documents for a legal suit
– ~350,000 pages of text
– 40 queries
– focus on high recall
– Used IBM’s STAIRS full-text system
Main Result:
– The system retrieved less than 20% of the relevant documents for a particular information need; lawyers thought they had 75%
But many queries had very high precision
Blair and Maron, cont.
How they estimated recall
– generated partially random samples of unseen documents
– had users (unaware these were random) judge them for relevance
Other results:
– two lawyers’ searches had similar performance
– lawyers’ recall was not much different from paralegals’
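The recall estimate described above can be sketched as follows: draw a random sample from the documents the search did not retrieve, have a reviewer judge the sample, and extrapolate the number of missed relevant documents. This is a simplified illustration, not the study's exact procedure. `estimate_recall` and its parameters are hypothetical names, and `is_relevant` stands in for a human relevance judgment.

```python
import random

def estimate_recall(retrieved_relevant, unretrieved_ids, is_relevant,
                    sample_size, seed=0):
    """Estimate recall by sampling documents the search did NOT retrieve."""
    rng = random.Random(seed)
    sample = rng.sample(unretrieved_ids, sample_size)
    hits = sum(1 for doc_id in sample if is_relevant(doc_id))
    # Extrapolate the sample rate to the whole unretrieved set.
    estimated_missed = hits / sample_size * len(unretrieved_ids)
    total_relevant = retrieved_relevant + estimated_missed
    if total_relevant == 0:
        return 0.0
    return retrieved_relevant / total_relevant

# Toy data: 1,000 unretrieved documents, of which ids 0-399 are relevant.
unretrieved = list(range(1000))
recall = estimate_recall(
    retrieved_relevant=100,
    unretrieved_ids=unretrieved,
    is_relevant=lambda doc_id: doc_id < 400,
    sample_size=1000,   # judging every document here keeps the toy result exact
)
```

In practice the sample would be much smaller than the unretrieved set, so the result is an estimate with a margin of error rather than an exact figure.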
Blair and Maron, cont.
Why recall was low
– users can’t foresee exact words and phrases that will indicate relevant documents
• “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” …
• differing technical terminology
• slang, misspellings
– Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
04/13/2023 27
Keyword Search Summary
Pro
– Word Stemming (Hous* - house, housemate, household)
– Easy to use/explain/agree
– Familiar
– Fast results
Con
– Over-inclusive (Disambiguate)
– Under-inclusive
– Word must be present
– Hard to craft
– Ineffective with short messages, IMs
04/13/2023 28
Keyword Truths
Under-inclusive - missing relevant or important info
Over-inclusive - costly to review
“Reasonable Keyword Search” doesn’t exist
Effective keyword search is difficult/impossible
– Index Data, Analyze Index
– Suggest keywords or approach
Keywords may not be appropriate for the data
Keyword Search is ONE Tool in Your Arsenal
04/13/2023 30
Search Methodology Continuum
Review Methodology - Decided Upfront
Identify Issues in the Case
– Formulate Queries and Approaches for Finding Responsive Documents
– Formulate Relevancy and Responsiveness Guidelines
Identify Primary Participants
Select or Triage Documents for Review
04/13/2023 31
Review Tools for Relevancy Assessment
Keyword Searches, Culling
– Slices of Data are Reviewed
Categorization of Data
– Entire Dataset is Categorized
– Review Targeted Data
Automated Review
– Categorization of Dataset
– Random Sampling (Statistically Significant)
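The slide does not say how a "statistically significant" random sample is sized. One common approach, assumed here rather than taken from the lecture, is the standard sample-size formula for estimating a proportion, n = z² p (1 - p) / e², with an optional finite population correction. `sample_size` and its defaults are illustrative choices.

```python
import math
from statistics import NormalDist

def sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Documents to sample to estimate a proportion within +/- margin."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction for a collection of known size.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

n_default = sample_size()                  # 95% confidence, +/-5% margin
n_tight = sample_size(margin=0.02)         # tighter margin needs more documents
n_small = sample_size(population=10_000)   # known collection size
```

Using `proportion=0.5` is the conservative choice because it gives the largest sample for a given margin.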
04/13/2023 32
Categorization of Data for Review
Categorize Entire Data Set
– Spam/Porn/System Files
– Personal/Private Data
– Non-relevant Business Data
Business Data
– Relevancy Assessment by Topic
– Privilege Review
Keyword, Topic Analysis - Overlap, Holes
04/13/2023 34
Categorization Methods
Statistical Methods (#s based)
– Topic Clustering
• Statistical Similarity
• Counting #s of words, appearance together
– Latent Semantic Indexing
– Supervised v. Unsupervised Clustering
Linguistic Methods (Word Based)
– Keyword (Culling Method)
– Ontologies
04/13/2023 35
Clustering
Clustering just means putting documents into groups that have something in common.
Manually (that's what manual review is)
Keyword Searches
Ontologies (linguistic filters)
Automated clustering (using technology)
– Automated clustering by document type (all the Word documents go into one basket)
– Automated clustering by creation date
– Automated clustering by Actor
– Automated clustering by statistical similarity (statistical clustering)
– ... and many other approaches
04/13/2023 36
Clustering -- “Options”
1 Cluster or 4 Clusters
Financial/energy trading options
Email/computer menu-driven options
Stock options (ISO's)
The generic idea of an available choice of action
04/13/2023 37
Clustering
Software implements statistical methods of finding groups of “similar” documents
– “Similar” must be defined appropriately for the application
Documents are categorized with very little effort by the user
May help with document review
– A single reviewer can look at similar documents together, produce consistent review decisions
– Tight clustering can be used to detect “near duplicates” caused by OCR errors
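One way to make "similar" concrete for near-duplicate detection is Jaccard similarity over character n-grams, which tolerates the isolated character errors that OCR introduces. This is a minimal sketch of that one definition, not the method of any particular review tool. The function names, the n-gram length, and the threshold are assumptions.

```python
from itertools import combinations

def char_ngrams(text, n=4):
    """Set of overlapping character n-grams, case-folded."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b, n=4):
    """Jaccard similarity of two texts' n-gram sets (0.0 to 1.0)."""
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    return len(sa & sb) / len(sa | sb)

def near_duplicates(docs, threshold=0.7):
    """Return index pairs whose similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(docs), 2)
        if jaccard(a, b) >= threshold
    ]

docs = [
    "The board approved the stock option grant on Monday",
    "The board approved the stock optlon grant on Monday",   # OCR read "i" as "l"
    "Please schedule lunch for Friday at noon",
]
pairs = near_duplicates(docs)
```

Lowering the threshold loosens the clusters. As the slide notes, what counts as "similar" has to be tuned to the application.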
04/13/2023 38
Clustering vs. queries
Clustering is unpredictable compared to keywords or taxonomies
The items that look very similar (to the clustering algorithm) may not actually be similar in ways that matter
– Relevancy may depend upon fine legal distinctions
– May vary in the same matter by subpoena and/or jurisdiction
04/13/2023 39
Ontologies
Implement ontologies for directed searches.
– Approach searching from a knowledge-representation viewpoint
– Field is 25 years old, lots of work done
– Advantages:
• Disambiguate different meanings of the same word from their context: more accurate
• Encapsulate many ways of saying the same thing: more thorough
• Search for concepts, not individual words: more intuitive, more reusable, and faster
Can be combined with other methods (unsupervised clustering, discussions).
04/13/2023 40
Subjectivity
GOOD WEATHER
– Sun
– Calm
BAD WEATHER
– Rain
– Snow
– Wind
04/13/2023 41
A More Realistic Ontology
ROYALTY CONCEPT
• royalty
• royalties
• rty
• commission
• commissions
• comm.
• honorarium
• honorariums
• honoraria
• usage fee
• usage charge
• usg fee
• use fee
• fee for use
• fee for usage
• incent*
• insent*
• earn a fee
• eam a fee
• charge for use
• charged for use
• charging for use
• charges for use
• licence fee
• license fee
• lisense fee
• “take cut”~2
• “takes cut”~2
• “took cut”~2
• “slice pie”~5
• “piece pie”~5
• “piece action”~5
• “slice action”~5
• -king
• -queen
• -prince
• -princess
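The concept list above mixes four kinds of entries: plain terms (including misspellings), wildcard prefixes such as `incent*`, proximity phrases such as `“take cut”~2`, and excluded words such as `-king`. The sketch below shows one way such a concept could be evaluated against a document. It is a simplification written for illustration, not the query engine used in the lecture. The dictionary layout, the function names, and the reading of `~N` as "at most N words in between, in order" are all assumptions.

```python
import re

# A small subset of the ROYALTY CONCEPT entries, restructured for this sketch.
ROYALTY = {
    "terms": ["royalty", "royalties", "license fee", "lisense fee", "usage fee"],
    "prefixes": ["incent", "insent"],                         # incent*, insent*
    "proximity": [(("take", "cut"), 2), (("slice", "pie"), 5)],
    "exclude": ["king", "queen", "prince", "princess"],       # -king, -queen, ...
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def within(tokens, first, second, slop):
    """True if `second` follows `first` with at most `slop` words in between."""
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= slop + 1 for i in firsts for j in seconds)

def matches(text, concept):
    tokens = tokenize(text)
    if any(t in concept["exclude"] for t in tokens):
        return False                      # an excluded word vetoes the document
    joined = " ".join(tokens)
    if any(re.search(rf"\b{re.escape(term)}\b", joined) for term in concept["terms"]):
        return True
    if any(t.startswith(p) for t in tokens for p in concept["prefixes"]):
        return True
    return any(within(tokens, a, b, slop) for (a, b), slop in concept["proximity"])
```

For example, `matches("He wants to take a small cut of sales", ROYALTY)` is true through the proximity entry, while a sentence about a royal visit by the queen is vetoed by the exclusion list even though it contains "royalty".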
04/13/2023 42
Ontology as a Query
But it can be slightly cumbersome to deal with directly in that form
q ((+(std:%CapacityReports_% std:%DINCapacity_%) +(std:%ACMEEPPlant_% std:%ProductName_%)) (+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%)) (+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (std:%Audit_% actor:%Audit_%) (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% ) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACME UBOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%)) (std:%FTC_% actor:%FTC_%) ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name
(About a quarter of its regular size)
04/13/2023 43
Ontology Pros & Cons
Pros
– Identify acronyms
– Normalize variants
– Disambiguate terms
– Identify overly broad keywords
– Identify and correct keywords with errors
– Create extensive libraries of ontologies
– Can be used as a clustering method
– Topics can appear in more than one language
– Reusable for different types of litigation, e.g. anti-trust, product liability, etc. (and for both offense and defense)
Cons
– As with Keyword - word based
– Labor intensive, upfront
“Search” Terminology
Technology-Enhanced Review
Technology Assisted Review
Automated Review
Predictive Coding
04/13/2023 44
Automated Review Methodology
Setup
Sample
Expert judges sample
– Responsive
– Non-responsive
Model learns
Model predicts
Model categorizes all remaining documents
– Responsive
– Non-responsive
Repeat as needed
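The sample, judge, learn, predict cycle can be sketched with a very small word-weight model. This is only an illustration of the loop's shape. The slides do not describe the model a real automated review tool uses, and `train`, `predict`, and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def train(labeled):
    """labeled: list of (text, is_responsive). Returns smoothed per-word log-odds."""
    counts = {True: Counter(), False: Counter()}
    for text, label in labeled:
        counts[label].update(set(text.lower().split()))
    n_pos = sum(1 for _, label in labeled if label)
    n_neg = len(labeled) - n_pos
    vocab = set(counts[True]) | set(counts[False])
    return {
        w: math.log((counts[True][w] + 1) / (n_pos + 2))
           - math.log((counts[False][w] + 1) / (n_neg + 2))
        for w in vocab
    }

def predict(weights, text):
    """Responsive if the summed weights of the document's known words are positive."""
    return sum(weights.get(w, 0.0) for w in set(text.lower().split())) > 0

# Round 1: an expert judges a small sample (toy data).
sample = [
    ("stock option grant approved", True),
    ("backdated option grant memo", True),
    ("lunch menu for friday", False),
    ("fantasy football pool", False),
]
weights = train(sample)

# The model then categorizes the remaining documents.
remaining = ["new option grant for sheila", "friday lunch pool"]
predictions = {doc: predict(weights, doc) for doc in remaining}
```

"Repeat as needed" would mean having the expert judge a further sample, including documents the model got wrong, adding those judgments to `sample`, and retraining.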
04/13/2023 47
Technology Enhanced Review: Speed, Predictable Costs, and Accuracy
Automate any portion of the review
Example from a real case
[Chart stages: Source Data; Eliminate Duplicates & System Files; Non-Responsive Isolation (ontologies); NR by Technology Enhanced Review (removed another 18%); Responsive by Technology Enhanced Review (removed another 7%); Priv by High-Speed Manual Review. Percentages shown: 30%, 30%, 15%, 22%, 100%, 3%]
04/13/2023 49
From Document Analysis to Social Network Analysis
04/13/2023 50
From Social Network Analysis to Discussions
04/13/2023 52
Analytics
Analytics are Based on the Model and on Discussions
04/13/2023 53
Better Answers and Better Questions
When were customary work practices circumvented?
When did established norms of behavior change?
Who knew, or likely knew, what facts?
Who interacted with whom and how intimately?
Who was involved in what types of decisions or meetings?
Who are the real ‘insiders’?
What data is hidden or missing?
When were electronically documented conversations “taken off line,” possibly in an attempt to avoid detection?
How did the importance of different actors change over time?
04/13/2023 54
Bear Stearns
Two hedge fund managers arrested
Charged with securities and wire fraud, and one with insider trading
Internal emails:
– “I'm fearful of these markets. ... As we discussed it may not be a meltdown for the general economy but in our world it will be.”
– “I think we should close the funds now.”
External communications:
– “We are very comfortable with exactly where we are.”
– “The funds are performing exactly as they were designed to.”
Lower Bar For Fraud?
04/13/2023 55
Sentiment Analysis Visualization
04/13/2023 56
Analysis of Anomalous Communication Patterns
Unusual levels relative to a particular type of activity pop out
Color-coded graphs show relative communication densities for apples to apples comparisons
04/13/2023 57
Spread of Information
04/13/2023 58
Emotive Tone
Whistle-blower Scenario
04/13/2023 59
“Call Me” Events
Sequence Viewer used for analytics-driven review
04/13/2023 60
Search Risks
Failure to find responsive documents
Failure to recognize responsive documents
Failure to recognize privileged documents
Inconsistent treatment of documents (e.g., duplicates)
Failure to complete project in a timely manner
Sophisticated Tools
– Understand What They Do and Don’t Do Well
– Inform Yourself, Speak to References, Consultants
04/13/2023 61
Transparency of Process
Discussing Review Protocols
– Provide transparent, defensible, sophisticated search based on document content
– Clustering, Ontologies, Analytics, and yes, sometimes Keywords too
Develop search methodologies for each case
– Use technology experts in consultation with case / legal experts
Results verifiable by Quality Control
– Defensible sampling
04/13/2023 62
Thank you!
Sonya L. Sigler
Vice President, Product Strategy
SFL Data
415-321-8385
[email protected]
www.sfldata.com
04/13/2023 63
Review Protocol
≠ Agreeing to Search Terms
Data Culling (upfront or backend)
Search Methodologies - Continuum
– Keyword Positive List
– Ontologies
– Clustering
– Technology Enhanced Review
– Relationship Analysis
Quality Control Process & Procedures
Privilege Review, Sensitivities
Production Format & Timing
04/13/2023 64
Search
The Courts are Finally Starting to Catch up to Technology
Making more aggressive rulings:
– Forcing attorneys to live with the results of bad searches
– Sanctioning those who screw up, even if no allegation of fraud
– Demanding repeatable, demonstrable process – using terms like “quality assurance”
04/13/2023 65
Search Under Scrutiny
Facciola’s Opinions - United States v. O’Keefe
“for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than [other] search terms … is truly to go where angels fear to tread.”
He has also suggested that litigants take a good look at more advanced search methodologies, including the use of computational linguistics and technology assisted review
04/13/2023 66
Reasonableness of Search Methods
Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008).
"Common sense suggests that even a properly designed and executed keyword search may prove to be over-inclusive or under-inclusive...the only prudent way to test the reliability of the keyword search is to perform some appropriate sampling."
“Selection of the appropriate search and information retrieval technique requires careful advance planning by persons qualified to design effective search methodology. The implementation of the methodology selected should be tested for quality assurance; and the party selecting the methodology must be prepared to explain the rationale for the method chosen to the court, demonstrate that it is appropriate for the task, and show that it was properly implemented.”
04/13/2023 67
From Pre-Discovery to Production Completeness
Henry v. Quicken Loans --> 26(f) consulting
– Lawyers agreed to keyword lists and process
– Ran own (unsanctioned) searches with expert
– Told to live with bad results, and pay for it
Qualcomm --> Smell Test; Dig Deeper
– In-house counsel (Qualcomm) v. Outside Counsel (Day Casebeer)
– Sanctions, Attorney-Client Privilege Problems
– Associate found docs and was told they weren’t relevant; found out the hard way that those and 230,000 other pages were relevant
Judge Rader’s Protocol in TX for Patent cases
– 5 custodians
– 5 search terms (can you say over broad…)
04/13/2023 68
Missing abbreviations / acronyms / clippings:
– incentive stock option but not ISO
– Board of Directors but not BOD
– 1998 plan but not 98 plan
Missing inflectional variants:
– grant but not grants, granted, granting
Missing spellings or common misspellings:
– gray but not grey
– privileged but not priviliged, priviledged, privilidged, priveliged, privelidged, priveledged, …
Under-inclusive - Missing in Action
04/13/2023 69
Missing in Action II
Missing syntactic variants:
board of directors meeting
but not
meeting of the board of directors
BOD meeting
board meeting
BOD mtg
board mtg
directors’ meeting
directors’ mtg
mtg of the BOD
mtg of the directors
BOD meetings
board meetings
BOD mtgs
board mtgs
directors’ meetings
directors’ mtgs
mtgs of the BOD
mtgs of the directors
04/13/2023 70
Missing in Action III
Missing synonyms / paraphrases:
hire date but not start date
approved by Smith
but not
Smith’s approval
the approval of Smith
Smith’s ok
Smith’s go-ahead
Smith’s goahead
the go-ahead from Smith
the goahead from Smith
the nod from Smith
Smith’s signature
Smith’s sign-off
the sign-off of Smith
the signoff of Smith
04/13/2023 71
As a keyword item, the address
101 E. Bergen Ave., Temple, CA 90200
does not match any of:
101 East Bergen Avenue
the Bergen site
the Temple location
our 90200 outlet
Missing in Action IV
04/13/2023 72
Options
Target: Sheila was granted 100,000 options at $10
Match: What are our options for lunch?
Match in a signature line:
Amanda Wacz
Acme Stock Options Administrator
Destroy
Target: destroy evidence
Match in a disclaimer: The information in this email, and any attachments, may contain confidential and/or privileged information and is intended solely for the use of the named recipient(s). Any disclosure or dissemination in whatever form, by anyone other than the recipient is strictly prohibited. If you have received this transmission in error, please contact the sender and destroy this message and any attachments. Thank you.
Over-inclusive - Unwanted Extras
04/13/2023 73
Unwanted Extras II
alter*
Target: alter, alters, altered, altering
Matches: alternate, alternative, alternation, altercate, altercation, alterably, …
grant
Target: stock option grant
Matches names: Grant Woods, Howard Grant
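The `alter*` problem is easy to reproduce. The sketch below expands a trailing-wildcard keyword against a small vocabulary and separates the intended targets from the unwanted extras. `wildcard_hits` and the vocabulary are illustrative; a real search index would apply the wildcard to its own term list.

```python
import re

def wildcard_hits(pattern, words):
    """Expand a trailing-* keyword the way a simple prefix search would."""
    prefix = re.escape(pattern.rstrip("*"))
    rx = re.compile(rf"^{prefix}\w*$", re.IGNORECASE)
    return [w for w in words if rx.match(w)]

# Illustrative vocabulary, as if taken from a search index.
vocabulary = ["alter", "altered", "altering", "alternate",
              "alternative", "altercation", "halter"]
targets = {"alter", "alters", "altered", "altering"}

hits = wildcard_hits("alter*", vocabulary)
unwanted = [w for w in hits if w not in targets]
```

Here `unwanted` holds "alternate", "alternative", and "altercation": words the wildcard pulls in that have nothing to do with altering a document.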
04/13/2023 74
Tuning an Ontology
Linguists briefed as reviewers
Linguists read the data
Linguists study complaint and other relevant documents
Linguists analyze the search index
Legal Team provides input, feedback
04/13/2023 75
A Simple Linguistic Ontology
ROYALTY CONCEPT
– Royalty
– Commission
– Honorarium
– Usage Fee
– Slice of the Pie
04/13/2023 76
A Simple Pricing Concept
PRICING CONCEPT
– Purchase Order
– PO
– Dollar amount
– Invoice
04/13/2023 77
Adding Subjective Content
PRICING CONCEPT
– Purchase Order
– PO
– Dollar amount
– Invoice
– Cylinder
– Canister
– Bottle
Ontology Usage
Identifying Misspellings, Slang, Nicknames, etc.
Variant Generation – help the user find what he meant (names, words, suggestions)
– Buy*: Buying, Buys, Bought, etc.
– Kenneth Lay, Ken Lay, klay, kenneth.lay
View variations in context to choose topics
Document segmentation – text blocks, signatures
Finding Words in Context, Frequency
– at serious risk of losing 25
– are certain risks inherent in 16
04/13/2023 79
Identifying misspellings, slang, etc.
1. Match the index against electronic dictionary.
2. From the remaining material (not in dictionary), remove any items that are merely numbers.
3. Find (in the ontologies) any words that are similar to what remains.
4. Add the similar words to the ontology
This increases coverage (i.e., ensures that we retrieve documents that otherwise would have been missed)
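The four steps above can be sketched with Python's standard `difflib` module standing in for whatever similarity measure the actual tooling uses. The similarity cutoff, the function name, and the toy word lists are assumptions made for illustration.

```python
import difflib
import re

def expand_ontology(index_terms, dictionary, ontology, cutoff=0.8):
    """Add index terms that look like misspellings of ontology words."""
    # Step 1: keep only index terms the dictionary does not recognize.
    unknown = [t for t in index_terms if t not in dictionary]
    # Step 2: drop items that are merely numbers.
    unknown = [t for t in unknown if not re.fullmatch(r"[\d.,]+", t)]
    # Step 3: keep unknown terms that closely resemble an ontology word.
    similar = [
        t for t in unknown
        if difflib.get_close_matches(t, ontology, n=1, cutoff=cutoff)
    ]
    # Step 4: add them to the ontology.
    return sorted(set(ontology) | set(similar))

index_terms = ["privileged", "priviliged", "priveledged", "2012", "lunch", "zzyzx"]
dictionary = {"privileged", "lunch"}
expanded = expand_ontology(index_terms, dictionary, ontology=["privileged"])
```

A human would normally review the candidates before they are added, since a loose cutoff also pulls in unrelated words that merely look alike.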
04/13/2023 80
Variant Generation
Help the user search for what he meant
Take names, numbers, and other entities for which the user wants to search
Automatically generate likely synonyms
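For personal names, variant generation can be as simple as combining a first name, its known nicknames, and a last name into the forms seen in text and in email addresses. The sketch below covers only the patterns in the "Kenneth Lay" example from the lecture. `name_variants` is a hypothetical helper, and the nickname list is supplied by the caller rather than looked up.

```python
def name_variants(first, last, nicknames=()):
    """Generate likely written and email-style variants of a person's name."""
    variants = set()
    for given in (first, *nicknames):
        variants.add(f"{given} {last}")            # Kenneth Lay
        variants.add(f"{last}, {given}")           # Lay, Kenneth
        variants.add(f"{given}.{last}".lower())    # kenneth.lay
        variants.add(f"{given[0]}{last}".lower())  # klay
    return variants

lay = name_variants("Kenneth", "Lay", nicknames=["Ken"])
```

The result includes "Kenneth Lay", "Ken Lay", "klay", and "kenneth.lay", plus a few forms the slide does not list, such as "Lay, Ken".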
04/13/2023 81
Variant Generation
Show the context of these variations, so the user can evaluate them.
04/13/2023 82
Document Segmentation
Examples of signatures
Jean-Louis Koenig
President GGDA Region
MegaCorp International SA
Rue de Concours 2280
Bern, Switzerland
Robert Guilliam
Product Regulatory Affairs & Compliance
MegaCorp International
Neuchatel
Switzerland
Tél. +41 (31) 125 2366
Alberto Goreman
Manager Printing & Packaging, Eastern Region
+57 3 451 7195, [email protected]
04/13/2023 83
Finding words in context
Phrase Total Instances
risks alienating some 37
at serious risk of losing 25
are certain risks inherent in 16
are at risk of running 15
it be risking anything by 15
difference a risk o why 14
and the risks inherent in 12
without assuming any risk 8
we could risk losing next 7
avoid transferring risk to the 5
requires taking risks and the 4
can t risk not living 3
and unknown risks and uncertainties 2
a potential risk that was 2
avoid transfering risk to the 2
This increases coverage AND precision
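A phrase-frequency table like the one above can be produced by taking a window of words around every occurrence of a stem and counting the distinct windows. This is a minimal sketch. The window size, the tokenizer, and the function name are assumptions, not the settings used to build the slide.

```python
import re
from collections import Counter

def phrases_in_context(texts, stem, before=2, after=2):
    """Count the word windows surrounding every token that starts with `stem`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, token in enumerate(tokens):
            if token.startswith(stem):
                window = tokens[max(i - before, 0): i + after + 1]
                counts[" ".join(window)] += 1
    return counts.most_common()

texts = [
    "We are at serious risk of losing the account.",
    "They are at serious risk of losing funding",
    "There are certain risks inherent in this plan",
]
table = phrases_in_context(texts, "risk")
```

Reading the windows in context lets a reviewer see which uses of a word matter for the case and which are noise, such as the "risks and uncertainties" disclaimers.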
04/13/2023 84
Multi-Lingual Issues
Does language matter?
– Lucerne
– Luzerne
– Lucerna
These places were all the same city
Name of city not necessarily expressed in the same language as rest of document
In Europe, many email threads and documents are mixed language, and must be properly categorized as such
04/13/2023 85
Automated Ontology Expansion Tools
Currently implemented expansion modules:
Spelling variants:
color >> colour, defense >> defence, labeled >> labelled
Lemmatization (recovering uninflected form):
walking >> walk, ate >> eat
Morphological variants:
eat >> eats, eating, eaten, ate
hablar >> hablo, hablas, habla, hablan, habláis, hablamos
Number expansion:
$2.5B >> two point five billion dollars
2,567 >> two thousand five hundred sixty seven
13 >> 13th, thirteenth
Name variants:
Elizabeth Van der Beek >> “Liz Van der Beek”, “Liz Vander Beek”, “Van der Beek, Elizabeth”, “Beth Vanderbeek”, etc.
Email variants (mined from alias clusters file):
Elizabeth Van der Beek >> evanderbeek, liz.vanderbeek, vanderbeekl, emvanderbeek, etc.
Abbreviations:
administrative project meeting >> admin project meeting, admin project mtg, admin proj mtg, etc.