georgetown lecture 2012 6 2 full

06/18/2022 1 “Triggers,” Preservation & Search June 2, 2012 Georgetown Law Sonya L. Sigler


DESCRIPTION

Guest lecture I gave at Georgetown Law eDiscovery class.

TRANSCRIPT

Page 1: Georgetown lecture 2012 6 2 full

04/13/2023 1

“Triggers,” Preservation & Search

June 2, 2012, Georgetown Law

Sonya L. Sigler

Page 2: Georgetown lecture 2012 6 2 full


Overview

Triggers & Preservation
• What is it?
• Why Does it Matter?

Search

Keyword Search

Clustering

Ontologies

Technology Enhanced Review - Sampling

Social Networking Analysis

Relationship Analysis

Page 3: Georgetown lecture 2012 6 2 full

“Triggers” & Preservation

What is a Trigger?
– Litigation reasonably anticipated
– Who decides

Litigation Hold Continuum
– Established in hindsight
– Threat
– Letter about litigation
– Filing Suit

Cases
– Pippins, Zubulake, Pension Committee

Page 4: Georgetown lecture 2012 6 2 full

Pippins v. KPMG

How much data to Preserve?
– All hard drives (Pippins’ position)
– 100 Sample Hard drives (KPMG’s position)

To Cooperate or NOT to Cooperate?

How Judges React to Lack of Cooperation

Page 5: Georgetown lecture 2012 6 2 full

Zubulake

Litigation Holds
– Cannot send a request into the ether

Preservation

Have to follow up

Take affirmative steps to monitor compliance

In-house Counsel Duty

Cannot leave it to employees’ discretion

Document what was done

Page 6: Georgetown lecture 2012 6 2 full

Pension Committee

No intentional destruction of data

Careless & indifferent

No Latchkey Custodians (alone & unsupervised)
– Identify Custodians
– Monitor their efforts
– Including former employees and third parties

Proactive

Consistent

Reasonable Approach

Page 7: Georgetown lecture 2012 6 2 full

Triggers

When does a duty to preserve arise?


Page 8: Georgetown lecture 2012 6 2 full

What To Do?

Who to include?
– Not about data volume
– Not about contact with underlying “litigation”

Key Players (Zubulake opinions)
– Likely to have relevant information
– CEO, Board, Committees, employees, etc.

Produce it from the Key Player (not others)
– Nursing Home Pension Fund v. Oracle
– Produce emails from the CEO (15) not others (1,650)

Page 9: Georgetown lecture 2012 6 2 full

Spoliation

Failure to Preserve
– Didn’t Ask
• Right person
• Right Place
– Didn’t follow up

Destruction of Data
– Intentional
– Inadvertent destruction

What can happen
– Sanctions
– Adverse Inferences

Page 10: Georgetown lecture 2012 6 2 full

Search

How to Use it To Find Information

How to Use it to Ignore Information

When to use which search methodology


Page 11: Georgetown lecture 2012 6 2 full

Search - Data Assessment

Where is the Data?
– Data Mapping - databases, servers, desktops, laptops, IMs, smart phones, voicemail, other records

Defining Process from Collection to Review to Production

Collection Strategy, Process, Approach
– Scope of collection: custodians, date ranges, topics

Reports on the Data Processing
– File types, encrypted files, de-duplication rates, password protected files, etc.

Not Reasonably Accessible data

Assessing Risk of Data Loss

Page 12: Georgetown lecture 2012 6 2 full


Search - Case Assessment

Who - Cast of Characters

What - What the Heck Happened?

Where - Where did it take place?

When - What time period are we concerned with?

How - fraud, antitrust violation, etc.

WHY - What were the motives involved?

Data Assessment ≠ Effective Case Assessment

Page 13: Georgetown lecture 2012 6 2 full

Keyword Search Under Scrutiny

United States v. O’Keefe (Facciola)
– Questioned lawyers’ ability to decide which search terms are more likely to produce relevant information
– Facciola has also suggested that litigants take a look at advanced search methodologies

Victor Stanley, Inc. v. Creative Pipe, Inc. (Grimm)
– Defensibility of process AND execution lies with the party relying upon the search protocol to meet its obligations; that party needs to be able to explain search rationale, appropriateness, and proper implementation
– Advocates quality assurance, e.g. by sampling
– Searches should be designed by a competent practitioner

Page 14: Georgetown lecture 2012 6 2 full


Keyword Specific Case

William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Company

SDNY, Judge Andrew Peck

Keyword list was in the thousands

Use the actual data set and custodians to figure out keywords

“This case is just the latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails. Prior decisions from Magistrate Judges in the Baltimore-Washington Beltway have warned counsel of this problem, but the message has not gotten through to the Bar in this District.”

Page 15: Georgetown lecture 2012 6 2 full


$6M Keyword Mistake

In re Fannie Mae Securities Litigation

3rd Party - OFHEO

DC Circuit - Judge David Tatel

Attorney agreed to something he did NOT understand

Long list of key terms

Taxpayers suffered the consequence

Page 16: Georgetown lecture 2012 6 2 full


What This Means

• The Courts are finally catching up

• Courts actively ruling on Standards of Care and Process

• Lawyers are Getting Wise

Page 17: Georgetown lecture 2012 6 2 full

Case Law Effects on Discovery

Defensibility of Review Process is now a focus
– Culling now can kill you later
– Cooperation is a hot topic
– Tussle between inside & outside counsel
– Beginning to see planning as a necessity

Increased focus on Quality
– Heightened involvement expected from corporate clients in the overall process
– Cases pushing this: Qualcomm, Creative Pipe

Page 18: Georgetown lecture 2012 6 2 full

What Else Is There?

Effort to establish & codify uniform “Best Practices”
– Quickly becoming roadmap for uneducated industry
– Increasingly relied upon by judges as measure of reasonable or standard behavior

Publications have addressed:
– Document retention & production
– Email management
– Search & Retrieval
– Protective orders & confidentiality
– ESI admissibility

Page 19: Georgetown lecture 2012 6 2 full

Getting to a Manageable Review Set

Chart segments:
– Intake Data 100%
– Duplicates 25%
– Non-Responsive 20%
– Produced 12.25%
– NR/Priv 20%
– Responsive & Priv 15%
– Junk/Spam/Porn 20%

These figures vary based upon the data set received

Focus on finding, reviewing & using the “right” data, not just filtering data

Page 20: Georgetown lecture 2012 6 2 full

Search Methodologies

Keyword: specific exact words, proximity searches, stemming

Ontology: generalized words or phrases

Clustering: similarity of salient features

Social Network Analysis: relationships among relevant people

Relationship Analysis: documents with causal or sequential relationship

Diagram labels: Content, Concept, Context, Visualization, Measurement

Page 21: Georgetown lecture 2012 6 2 full

Keyword Accuracy Example

8,553 responsive documents missed by keyword search (almost 8% of responsive documents missed by keyword search - Under-inclusive)

Keyword search reduced the document set by only 47%

And 88% of the documents returned by keyword search were not responsive (Over-inclusive)
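The under-inclusive and over-inclusive percentages on this slide correspond to the usual recall and precision measures. The sketch below shows one way to compute them; the function name and the counts in the usage lines are hypothetical and were chosen only to mirror the slide's proportions, not taken from the case.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a search result set.

    true_pos  = responsive documents the search returned
    false_pos = non-responsive documents the search returned (over-inclusive)
    false_neg = responsive documents the search missed (under-inclusive)
    """
    returned = true_pos + false_pos
    responsive = true_pos + false_neg
    precision = true_pos / returned if returned else 0.0
    recall = true_pos / responsive if responsive else 0.0
    return precision, recall


# Hypothetical counts: about 12% of returned documents are responsive
# (88% over-inclusive) and 8% of responsive documents are missed.
precision, recall = precision_recall(true_pos=9_200, false_pos=67_467, false_neg=800)
```

Low precision drives review cost; low recall is the risk of missing relevant material.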

Page 22: Georgetown lecture 2012 6 2 full

Myth

Keyword Searching is the Way to Go

If I agree to keyword terms, I am OK

Keyword Search Cases

Keyword replacement example

Keyword substitution

Missing in Action (Under-inclusive)

Unwanted Extras (Over-inclusive)

Multiple subject/persons (Disambiguate)

Page 23: Georgetown lecture 2012 6 2 full

Fact or Myth?

Manual review by humans of large amounts of information is as accurate and complete as possible - perhaps even perfect - and constitutes the gold standard by which all searches should be measured

This is “The reigning Myth of ‘perfect’ retrieval using traditional means”

Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, The Sedona Conference Journal (2007) p. 199

Human beings retrieved less than 20% of the relevant documents when they believed they were retrieving over 75%

An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, Blair & Maron (1985)

Page 24: Georgetown lecture 2012 6 2 full

IS 240 – Spring 2011

Blair and Maron 1985

A classic study of retrieval effectiveness
– earlier studies were on unrealistically small collections

Studied an archive of documents for a legal suit
– ~350,000 pages of text
– 40 queries
– focus on high recall
– Used IBM’s STAIRS full-text system

Main Result:
– The system retrieved less than 20% of the relevant documents for a particular information need; lawyers thought they had 75%

But many queries had very high precision

Page 25: Georgetown lecture 2012 6 2 full

Blair and Maron, cont.

How they estimated recall
– generated partially random samples of unseen documents
– had users (unaware these were random) judge them for relevance

Other results:
– two lawyers’ searches had similar performance
– lawyers’ recall was not much different from paralegals’
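The sampling idea on this slide can be sketched as follows. This is a simplified illustration using a plain random sample, not the study's actual "partially random" procedure; the function name, parameters, and the `is_relevant` callback (standing in for a human reviewer's judgment) are all assumptions.

```python
import random


def estimate_recall(retrieved_relevant, unretrieved_ids, is_relevant, sample_size, seed=0):
    """Estimate recall by judging a random sample of documents the search did not retrieve.

    retrieved_relevant: number of relevant documents the search found
    unretrieved_ids:    list of ids for documents the search did not return
    is_relevant:        callable standing in for a reviewer's relevance call
    """
    rng = random.Random(seed)
    sample = rng.sample(unretrieved_ids, min(sample_size, len(unretrieved_ids)))
    if not sample:
        return 1.0 if retrieved_relevant else 0.0
    # Share of the sample judged relevant, projected onto all unretrieved documents.
    hit_rate = sum(1 for doc_id in sample if is_relevant(doc_id)) / len(sample)
    estimated_missed = hit_rate * len(unretrieved_ids)
    total_relevant = retrieved_relevant + estimated_missed
    return retrieved_relevant / total_relevant if total_relevant else 0.0
```

With 25 relevant documents retrieved and an estimated 100 relevant documents left behind, the estimate is 25 / 125 = 20%.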

Page 26: Georgetown lecture 2012 6 2 full

Blair and Maron, cont.

Why recall was low
– users can’t foresee exact words and phrases that will indicate relevant documents
• “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” …
• differing technical terminology
• slang, misspellings
– Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

Page 27: Georgetown lecture 2012 6 2 full

Keyword Search Summary

Pro
– Word Stemming
• Hous* - house, housemate, household
– Easy to use/explain/agree
– Familiar
– Fast results

Con
– Over-inclusive
• Disambiguate
– Under-inclusive
– Word must be present
– Hard to craft
– Ineffective with short messages, IMs

Page 28: Georgetown lecture 2012 6 2 full


Keyword Truths

Under-inclusive - missing relevant or important info

Over-inclusive - costly to review

“Reasonable Keyword Search” doesn’t exist

Effective keyword search is difficult/impossible
– Index Data, Analyze Index
– Suggest keywords or approach

Keywords may not be appropriate for the data

Keyword Search is ONE Tool in Your Arsenal

Page 29: Georgetown lecture 2012 6 2 full

Keyword Accuracy Example

8,553 responsive documents missed by keyword search (almost 8% of responsive documents missed by keyword search - Under-inclusive)

Keyword search reduced the document set by only 47%

And 88% of the documents returned by keyword search were not responsive (Over-inclusive)

Page 30: Georgetown lecture 2012 6 2 full


Search Methodology Continuum

Review Methodology - Decided Upfront

Identify Issues in the Case
– Formulate Queries and Approaches for Finding Responsive Documents
– Formulate Relevancy and Responsiveness Guidelines

Identify Primary Participants

Select or Triage Documents for Review

Page 31: Georgetown lecture 2012 6 2 full

Review Tools for Relevancy Assessment

Keyword Searches, Culling
– Slices of Data are Reviewed

Categorization of Data
– Entire Dataset is Categorized
– Review Targeted Data

Automated Review
– Categorization of Dataset
– Random Sampling (Statistically Significant)
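How large a random sample needs to be can be estimated with the standard sample-size formula for a proportion. The sketch below is a textbook calculation, not a method described in the lecture; the function name and the defaults (95% confidence, 5% margin of error, worst-case proportion of 0.5) are assumptions.

```python
import math


def sample_size(population: int, margin: float = 0.05, z: float = 1.96, p: float = 0.5) -> int:
    """Documents to sample so that a proportion is estimated within +/- margin.

    z is the normal critical value (1.96 for roughly 95% confidence) and
    p is the assumed proportion; 0.5 is the most conservative choice.
    """
    # Sample size for an effectively unlimited population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks the sample for small collections.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)
```

Under these assumptions the required sample grows very slowly with collection size, which is why sampling stays practical even for large data sets.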

Page 32: Georgetown lecture 2012 6 2 full

Categorization of Data for Review

Categorize Entire Data Set
– Spam/Porn/System Files
– Personal/Private Data
– Non-relevant Business Data

Business Data
– Relevancy Assessment by Topic
– Privilege Review

Keyword, Topic Analysis - Overlap, Holes

Page 33: Georgetown lecture 2012 6 2 full

Search Methodologies

Keyword: specific exact words, proximity searches, stemming

Ontology: generalized words or phrases

Clustering: similarity of salient features

Social Network Analysis: relationships among relevant people

Relationship Analysis: documents with causal or sequential relationship

Diagram labels: Content, Concept, Context, Visualization, Measurement

Page 34: Georgetown lecture 2012 6 2 full

Categorization Methods

Statistical Methods (#s based)
– Topic Clustering
• Statistical Similarity
• Counting #s of words, appearance together
– Latent Semantic Indexing
– Supervised v. Unsupervised Clustering

Linguistic Methods (Word Based)
– Keyword (Culling Method)
– Ontologies
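One common way to turn "counting words" into a statistical similarity score is cosine similarity over word-count vectors. The sketch below illustrates that general technique; it is not the specific method used by any tool mentioned in the lecture, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lower-cased word tokens in a document."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A clustering tool would group documents whose pairwise scores are high; what counts as "high" has to be tuned for the data set.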

Page 35: Georgetown lecture 2012 6 2 full

Clustering

Clustering just means putting documents into groups that have something in common.

Manually (that's what manual review is)

Keyword Searches

Ontologies (linguistic filters)

Automated clustering (using technology)
– Automated clustering by document type (all the Word documents go into one basket)
– Automated clustering by creation date
– Automated clustering by Actor
– Automated clustering by statistical similarity (statistical clustering)
– ... and many other approaches

Page 36: Georgetown lecture 2012 6 2 full


Clustering -- “Options”

1 Cluster or 4 Clusters

Financial/energy trading options   

Email/computer menu-driven options   

Stock options (ISO's)   

The generic idea of an available choice of action

Page 37: Georgetown lecture 2012 6 2 full

Clustering

Software implements statistical methods of finding groups of “similar” documents
– “Similar” must be defined appropriately for the application

Documents are categorized with very little effort by the user

May help with document review
– A single reviewer can look at similar documents together, produce consistent review decisions
– Tight clustering can be used to detect “near duplicates” caused by OCR errors
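Near-duplicate detection of the kind mentioned above is often illustrated with character shingles and Jaccard similarity. The sketch below shows that general approach under assumed settings (5-character shingles, a 0.8 threshold); it is not the algorithm of any particular product, and all names are hypothetical.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-character substrings of the normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Treat two documents as near duplicates when their shingle sets mostly overlap."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

A single OCR misread changes only the few shingles that contain the bad character, so the two copies still score close to 1.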

Page 38: Georgetown lecture 2012 6 2 full

Clustering vs. queries

Clustering is unpredictable compared to keywords or taxonomies

The items that look very similar (to the clustering algorithm) may not actually be similar in ways that matter
– Relevancy may depend upon fine legal distinctions
– May vary in the same matter by subpoena and/or jurisdiction

Page 39: Georgetown lecture 2012 6 2 full

Ontologies

Implement ontologies for directed searches.
– Approach searching from a knowledge-representation viewpoint
– Field is 25 years old, lots of work done
– Advantages:
• Disambiguate different meanings of the same word from their context: more accurate
• Encapsulate many ways of saying the same thing: more thorough
• Search for concepts, not individual words: more intuitive, more reusable, and faster

Can be combined with other methods (unsupervised clustering, discussions).

Page 40: Georgetown lecture 2012 6 2 full

Subjectivity

GOOD WEATHER
– Sun
– Calm

BAD WEATHER
– Rain
– Snow
– Wind

Page 41: Georgetown lecture 2012 6 2 full

A More Realistic Ontology

ROYALTY CONCEPT
• royalty
• royalties
• rty
• commission
• commissions
• comm.
• honorarium
• honorariums
• honoraria
• usage fee
• usage charge
• usg fee
• use fee
• fee for use
• fee for usage
• incent*
• insent*
• earn a fee
• eam a fee
• charge for use
• charged for use
• charging for use
• charges for use
• licence fee
• license fee
• lisense fee
• “take cut”~2
• “takes cut”~2
• “took cut”~2
• “slice pie”~5
• “piece pie”~5
• “piece action”~5
• “slice action”~5
• -king
• -queen
• -prince
• -princess
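The entry types in this list (plain phrases, prefix wildcards such as incent*, two-word proximity terms such as “take cut”~2, and exclusions such as -king) can be evaluated with a small matcher. The sketch below is only an illustration under stated assumptions: ~N is read as "at most N other words between the two terms", an exclusion suppresses the whole document, proximity entries use straight quotes, and every function name is hypothetical. The slide does not define these semantics.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def entry_matches(entry: str, tokens: list[str]) -> bool:
    """Check one ontology entry: two-word proximity, prefix wildcard, or phrase."""
    proximity = re.fullmatch(r'"(\S+) (\S+)"~(\d+)', entry)
    if proximity:
        first, second = proximity.group(1).lower(), proximity.group(2).lower()
        gap = int(proximity.group(3))
        first_pos = [i for i, t in enumerate(tokens) if t == first]
        second_pos = [i for i, t in enumerate(tokens) if t == second]
        # Assumed meaning of ~N: at most N other words between the two terms.
        return any(0 < abs(a - b) <= gap + 1 for a in first_pos for b in second_pos)
    if entry.endswith("*"):
        prefix = entry[:-1].lower()
        return any(t.startswith(prefix) for t in tokens)
    phrase = tokenize(entry)
    width = len(phrase)
    return width > 0 and any(
        tokens[i:i + width] == phrase for i in range(len(tokens) - width + 1)
    )


def concept_matches(entries: list[str], text: str) -> bool:
    """True when any positive entry matches and no excluded ("-term") entry does."""
    tokens = tokenize(text)
    excluded = [e[1:] for e in entries if e.startswith("-")]
    included = [e for e in entries if not e.startswith("-")]
    if any(entry_matches(e, tokens) for e in excluded):
        return False
    return any(entry_matches(e, tokens) for e in included)


# A shortened, illustrative subset of the ROYALTY CONCEPT entries.
ROYALTY = ["royalty", "royalties", "usage fee", "incent*", '"take cut"~2', "-king", "-queen"]
```

For example, `concept_matches(ROYALTY, "They take a 5% cut of every sale.")` matches through the proximity entry, while a sentence about a king collecting a royalty is suppressed by the exclusion.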

Page 42: Georgetown lecture 2012 6 2 full


Ontology as a Query

But it can be slightly cumbersome to deal with directly in that form

q ((+(std:%CapacityReports_% std:%DINCapacity_%) +(std:%ACMEEPPlant_% std:%ProductName_%)) (+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%)) (+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (std:%Audit_% actor:%Audit_%) (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% ) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACME UBOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%)) (std:%FTC_% actor:%FTC_%) ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name

(About a quarter of its regular size)

Page 43: Georgetown lecture 2012 6 2 full

Ontology Pros & Cons

Pros
– Identify acronyms
– Normalize variants
– Disambiguate terms
– Identify overly broad keywords
– Identify and correct keywords with errors
– Create extensive libraries of ontologies
– Can be used as a clustering method
– Topics can appear in more than one language
– Reusable for different types of litigation, e.g. anti-trust, product liability etc. (and for both offense and defense)

Cons
– As with Keyword - word based
– Labor intensive, upfront

Page 44: Georgetown lecture 2012 6 2 full

“Search” Terminology

Technology-Enhanced Review

Technology Assisted Review

Automated Review

Predictive Coding


Page 45: Georgetown lecture 2012 6 2 full

Setup

Sample

Expert judges sample: Responsive / Non-responsive

Model learns

Model predicts

Model categorizes all remaining documents: Responsive / Non-responsive

Repeat as needed
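The loop on this slide (sample, expert judgment, model learns, model predicts, repeat) can be sketched in a few lines. The slide does not name a model, so the example swaps in a tiny naive Bayes text classifier purely as a stand-in; the class, the function names, and the `judge` callback (representing the expert reviewer) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Minimal two-class (responsive / non-responsive) naive Bayes classifier."""

    def __init__(self) -> None:
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab: set[str] = set()

    def learn(self, text: str, responsive: bool) -> None:
        tokens = tokenize(text)
        self.word_counts[responsive].update(tokens)
        self.doc_counts[responsive] += 1
        self.vocab.update(tokens)

    def predict(self, text: str) -> bool:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            # Laplace-smoothed log prior plus log likelihood of each token.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return scores[True] > scores[False]


def review_round(model: NaiveBayes, sample: list[str], judge) -> None:
    """Expert judges the sample; the model learns from each judgment."""
    for doc in sample:
        model.learn(doc, judge(doc))


def categorize(model: NaiveBayes, remaining: list[str]) -> dict[str, bool]:
    """Model predicts responsive (True) or non-responsive (False) for the rest."""
    return {doc: model.predict(doc) for doc in remaining}
```

In practice the predictions would be spot-checked by sampling and the loop repeated until the error rate is acceptable.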

Page 46: Georgetown lecture 2012 6 2 full

Automated Review Methodology

Page 47: Georgetown lecture 2012 6 2 full

Technology Enhanced Review: Speed, Predictable Costs, and Accuracy

Automate any portion of the review

Example from a real case

Stages shown in the chart: Source Data; Eliminate Duplicates & System Files; Non-Responsive Isolation (ontologies); NR by Technology Enhanced Review (removed another 18%); Responsive by Technology Enhanced Review (removed another 7%); Priv by High-Speed Manual Review

Percentages shown in the chart: 100%, 30%, 30%, 22%, 15%, 3%

Page 48: Georgetown lecture 2012 6 2 full

Search Methodologies

Keyword: specific exact words, proximity searches, stemming

Ontology: generalized words or phrases

Clustering: similarity of salient features

Social Network Analysis: relationships among relevant people

Relationship Analysis: documents with causal or sequential relationship

Diagram labels: Content, Concept, Context, Visualization, Measurement

Page 49: Georgetown lecture 2012 6 2 full


From Document Analysis to Social Network Analysis

Page 50: Georgetown lecture 2012 6 2 full

From Social Network Analysis to Discussions

Page 51: Georgetown lecture 2012 6 2 full

Search Methodologies

Keyword: specific exact words, proximity searches, stemming

Ontology: generalized words or phrases

Clustering: similarity of salient features

Social Network Analysis: relationships among relevant people

Relationship Analysis: documents with causal or sequential relationship

Diagram labels: Content, Concept, Context, Visualization, Measurement

Page 52: Georgetown lecture 2012 6 2 full


Analytics

Analytics are Based on the Model and on Discussions

Page 53: Georgetown lecture 2012 6 2 full


Better Answers and Better Questions

When were customary work practices circumvented?

When did established norms of behavior change?

Who knew, or likely knew, what facts?

Who interacted with whom and how intimately?

Who was involved in what types of decisions or meetings?

Who are the real ‘insiders’?

What data is hidden or missing?

When were electronically documented conversations “taken off line,” possibly in an attempt to avoid detection?

How did the importance of different actors change over time?

Page 54: Georgetown lecture 2012 6 2 full

Bear Stearns

Two hedge fund managers arrested

Charged with securities and wire fraud, and one with insider trading

Internal emails:
– “I'm fearful of these markets. ... As we discussed it may not be a meltdown for the general economy but in our world it will be.”
– “I think we should close the funds now.”

External communications:
– “We are very comfortable with exactly where we are.”
– “The funds are performing exactly as they were designed to.”

Lower Bar For Fraud?

Page 55: Georgetown lecture 2012 6 2 full


Sentiment Analysis Visualization

Page 56: Georgetown lecture 2012 6 2 full

Analysis of Anomalous Communication Patterns

Unusual levels relative to a particular type of activity pop out

Color-coded graphs show relative communication densities for apples to apples comparisons
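One simple way to make unusual activity levels "pop out" is to flag periods whose message counts sit far from the average. The sketch below uses a z-score over per-period counts; this is an assumed, generic technique rather than the analytics shown on the slide, and the function name, threshold, and sample data are hypothetical.

```python
from statistics import mean, pstdev


def unusual_periods(counts: dict[str, int], threshold: float = 2.0) -> list[str]:
    """Return the periods whose count is at least `threshold` standard deviations from the mean."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every period looks the same, so nothing stands out
    return [period for period, n in counts.items() if abs(n - mu) / sigma >= threshold]


# Hypothetical weekly email counts between two custodians.
weekly = {"w1": 10, "w2": 12, "w3": 11, "w4": 9, "w5": 10, "w6": 11, "w7": 60}
flagged = unusual_periods(weekly)
```

Comparing like with like matters here: counts should be grouped by the same kind of activity before any threshold is applied.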

Page 57: Georgetown lecture 2012 6 2 full


Spread of Information

Page 58: Georgetown lecture 2012 6 2 full

Emotive Tone

Whistle-blower Scenario

Page 59: Georgetown lecture 2012 6 2 full

“Call Me” Events

Sequence Viewer used for analytics-driven review

Page 60: Georgetown lecture 2012 6 2 full


Search Risks

Failure to find responsive documents

Failure to recognize responsive documents

Failure to recognize privileged documents

Inconsistent treatment of documents (e.g., duplicates)

Failure to complete project in a timely manner

Sophisticated Tools
– Understand What They Do and Don’t Do Well
– Inform Yourself, Speak to References, Consultants

Page 61: Georgetown lecture 2012 6 2 full

Transparency of Process

Discussing Review Protocols
– Provide transparent, defensible, sophisticated search based on document content
– Clustering, Ontologies, Analytics, and yes, sometimes Keywords too

Develop search methodologies for each case
– Use technology experts in consultation with case / legal experts

Results verifiable by Quality Control
– Defensible sampling

Page 62: Georgetown lecture 2012 6 2 full


Thank you!

Sonya L. Sigler
Vice President, Product Strategy
SFL Data
415-321-8385
[email protected] www.sfldata.com

Page 63: Georgetown lecture 2012 6 2 full


Review Protocol

≠ Agreeing to Search Terms

Data Culling (upfront or backend)

Search Methodologies - Continuum
– Keyword Positive List
– Ontologies
– Clustering
– Technology Enhanced Review
– Relationship Analysis

Quality Control Process & Procedures

Privilege Review, Sensitivities

Production Format & Timing

Page 64: Georgetown lecture 2012 6 2 full

Search

The Courts are Finally Starting to Catch up to Technology

Making more aggressive rulings:
– Forcing attorneys to live with the results of bad searches
– Sanctioning those who screw up, even if no allegation of fraud
– Demanding repeatable, demonstrable process – using terms like “quality assurance”

Page 65: Georgetown lecture 2012 6 2 full


Search Under Scrutiny

Facciola’s Opinions - United States v. O’Keefe

“for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than [other] search terms … is truly to go where angels fear to tread.”

He has also suggested that litigants take a good look at more advanced search methodologies, including the use of computational linguistics and technology assisted review

Page 66: Georgetown lecture 2012 6 2 full


Reasonableness of Search Methods

Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008).

"Common sense suggests that even a properly designed and executed keyword search may prove to be over-inclusive or under-inclusive...the only prudent way to test the reliability of the keyword search is to perform some appropriate sampling."

“Selection of the appropriate search and information retrieval technique requires careful advance planning by persons qualified to design effective search methodology. The implementation of the methodology selected should be tested for quality assurance; and the party selecting the methodology must be prepared to explain the rationale for the method chosen to the court, demonstrate that it is appropriate for the task, and show that it was properly implemented.”
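The sampling the court describes produces an observed rate (for example, the share of sampled "non-hit" documents that turn out to be responsive), and that rate is only an estimate. A confidence interval makes the uncertainty explicit. The sketch below uses the Wilson score interval, a standard statistical formula chosen here as an assumption; the opinion does not prescribe any particular method, and the function name is hypothetical.

```python
import math


def wilson_interval(hits: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion observed in a random sample.

    hits is the number of sampled documents with the property of interest;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = hits / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Even a sample with zero responsive documents yields an upper bound above zero, which is the kind of statement a party can explain to a court.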

Page 67: Georgetown lecture 2012 6 2 full

From Pre-Discovery to Production Completeness

Henry v. Quicken Loans --> 26(f) consulting
– Lawyers agreed to keyword lists and process
– Ran own (unsanctioned) searches with expert
– Told to live with bad results, and pay for it

Qualcomm --> Smell Test; Dig Deeper
– In-house counsel (Qualcomm) v. Outside Counsel (Day Casebeer)
– Sanctions, Attorney-Client Privilege Problems
– Associate found docs and was told they weren’t relevant; found out the hard way that those and 230,000 other pages were relevant

Judge Rader’s Protocol in TX for Patent cases
– 5 custodians
– 5 search terms (can you say over broad…)

Page 68: Georgetown lecture 2012 6 2 full

Under-inclusive - Missing in Action

Missing abbreviations / acronyms / clippings:
– incentive stock option but not ISO
– Board of Directors but not BOD
– 1998 plan but not 98 plan

Missing inflectional variants:
– grant but not grants, granted, granting

Missing spellings or common misspellings:
– gray but not grey
– privileged but not priviliged, priviledged, privilidged, priveliged, privelidged, priveledged, …

Page 69: Georgetown lecture 2012 6 2 full

Missing in Action II

Missing syntactic variants:

board of directors meeting

but not
– meeting of the board of directors
– BOD meeting
– board meeting
– BOD mtg
– board mtg
– directors’ meeting
– directors’ mtg
– mtg of the BOD
– mtg of the directors
– BOD meetings
– board meetings
– BOD mtgs
– board mtgs
– directors’ meetings
– directors’ mtgs
– mtgs of the BOD
– mtgs of the directors

Page 70: Georgetown lecture 2012 6 2 full

Missing in Action III

Missing synonyms / paraphrases:

hire date but not start date

approved by Smith

but not
– Smith’s approval
– the approval of Smith
– Smith’s ok
– Smith’s go-ahead
– Smith’s goahead
– the go-ahead from Smith
– the goahead from Smith
– the nod from Smith
– Smith’s signature
– Smith’s sign-off
– the sign-off of Smith
– the signoff of Smith

Page 71: Georgetown lecture 2012 6 2 full

Missing in Action IV

As a keyword item, the address

101 E. Bergen Ave., Temple, CA 90200

does not match any of:
– 101 East Bergen Avenue
– the Bergen site
– the Temple location
– our 90200 outlet

Page 72: Georgetown lecture 2012 6 2 full

Over-inclusive - Unwanted Extras

Options

Target: Sheila was granted 100,000 options at $10

Match: What are our options for lunch?

Match in a signature line:

Amanda Wacz
Acme Stock Options Administrator

Destroy

Target: destroy evidence

Match in a disclaimer: The information in this email, and any attachments, may contain confidential and/or privileged information and is intended solely for the use of the named recipient(s). Any disclosure or dissemination in whatever form, by anyone other than the recipient is strictly prohibited. If you have received this transmission in error, please contact the sender and destroy this message and any attachments. Thank you.
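One way to avoid hits like the "destroy" match in a disclaimer is to cut known boilerplate off the message body before searching. The sketch below assumes a hand-maintained list of phrases that open a disclaimer; the marker list, function names, and sample messages are hypothetical, and real disclaimers would need a broader list.

```python
import re

# Hypothetical phrases that mark the start of a boilerplate disclaimer.
DISCLAIMER_MARKERS = (
    "the information in this email",
    "this message is confidential",
)


def strip_disclaimer(body: str) -> str:
    """Drop everything from the earliest disclaimer marker to the end of the message."""
    cut = len(body)
    for marker in DISCLAIMER_MARKERS:
        found = re.search(re.escape(marker), body, flags=re.IGNORECASE)
        if found:
            cut = min(cut, found.start())
    return body[:cut]


def keyword_hit(keyword: str, body: str) -> bool:
    """Whole-word, case-insensitive keyword match against the message without its disclaimer."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, strip_disclaimer(body), flags=re.IGNORECASE) is not None
```

The same idea applies to signature blocks, such as the "Stock Options Administrator" title that triggers the "options" keyword.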

Page 73: Georgetown lecture 2012 6 2 full


Unwanted Extras II

alter*

Target: alter, alters, altered, altering

Matches: alternate, alternative, alternation, altercate, altercation, alterably, …

grant

Target: stock option grant

Matches names: Grant Woods, Howard Grant
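The alter* problem can be seen by comparing a prefix wildcard with an explicit list of the intended variants. This is a minimal sketch with hypothetical function names and a made-up word list.

```python
def wildcard_hits(pattern: str, words: list[str]) -> list[str]:
    """Words matched by a trailing-wildcard pattern such as 'alter*'."""
    prefix = pattern.rstrip("*").lower()
    return [w for w in words if w.lower().startswith(prefix)]


def variant_hits(variants: set[str], words: list[str]) -> list[str]:
    """Words matched by an explicit list of intended variants."""
    allowed = {v.lower() for v in variants}
    return [w for w in words if w.lower() in allowed]


words = ["altered", "alternative", "altercation", "alters"]
broad = wildcard_hits("alter*", words)
narrow = variant_hits({"alter", "alters", "altered", "altering"}, words)
```

The wildcard returns all four words, while the explicit list returns only the two intended ones. The explicit list cannot fix the "grant" versus "Grant Woods" problem, which needs context rather than spelling.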

Page 74: Georgetown lecture 2012 6 2 full


Tuning an Ontology

Linguists briefed as reviewers

Linguists read the data

Linguists study complaint and other relevant documents

Linguists analyze the search index

Legal Team provides input, feedback

Page 75: Georgetown lecture 2012 6 2 full

A Simple Linguistic Ontology

ROYALTY CONCEPT
– Royalty
– Commission
– Honorarium
– Usage Fee
– Slice of the Pie

Page 76: Georgetown lecture 2012 6 2 full

A Simple Pricing Concept

PRICING CONCEPT
– Purchase Order
– PO
– Dollar amount
– Invoice

Page 77: Georgetown lecture 2012 6 2 full

Adding Subjective Content

PRICING CONCEPT
– Purchase Order
– PO
– Dollar amount
– Invoice
– Cylinder
– Canister
– Bottle

Page 78: Georgetown lecture 2012 6 2 full

Ontology Usage

Identifying Misspellings, Slang, Nicknames, etc.

Variant Generation – help the user find what he meant (names, words, suggestions)
– Buy*: Buying, Buys, Bought, etc.
– Kenneth Lay, Ken Lay, klay, kenneth.lay

View variations in context to choose topics

Document segmentation – text blocks, signatures

Finding Words in Context, Frequency
– at serious risk of losing: 25
– are certain risks inherent in: 16

Page 79: Georgetown lecture 2012 6 2 full


Identifying misspellings, slang, etc

1. Match the index against electronic dictionary.

2. From the remaining material (not in dictionary), remove any items that are merely numbers.

3. Find (in the ontologies) any words that are similar to what remains.

4. Add the similar words to the ontology

This increases coverage (i.e., ensures that we retrieve documents that otherwise would have been missed)
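The four steps above can be sketched with Python's standard `difflib` module. The similarity measure and the 0.8 cutoff are assumptions (the slide does not say how "similar" is computed), and the function name and sample data are hypothetical.

```python
import difflib


def find_ontology_additions(index_terms, dictionary, ontology, cutoff=0.8):
    """Map unknown index terms to the ontology word they most resemble.

    index_terms: words found in the search index
    dictionary:  set of correctly spelled words
    ontology:    list of words already in the ontology
    """
    # Step 1: keep only index terms that are not in the dictionary.
    unknown = [t.lower() for t in index_terms if t.lower() not in dictionary]
    # Step 2: drop items that are merely numbers.
    unknown = [t for t in unknown if not t.replace(",", "").replace(".", "").isdigit()]
    additions = {}
    for term in unknown:
        # Step 3: look for an ontology word that is similar to the unknown term.
        close = difflib.get_close_matches(term, ontology, n=1, cutoff=cutoff)
        if close:
            additions[term] = close[0]
    # Step 4 is left to the caller: add the keys of `additions` to the ontology.
    return additions


found = find_ontology_additions(
    ["privileged", "priviliged", "priviledged", "2567", "grant", "xyzzy"],
    dictionary={"privileged", "grant"},
    ontology=["privileged", "royalty"],
)
```

Here both misspellings map back to "privileged", the number is discarded, and the unrelated token is left alone.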

Page 80: Georgetown lecture 2012 6 2 full

Variant Generation

Help the user search for what he meant

Take names, numbers, and other entities for which the user wants to search

Automatically generate likely synonyms

Page 81: Georgetown lecture 2012 6 2 full


Variant Generation

Show the context of these variations, so the user can evaluate them.

Page 82: Georgetown lecture 2012 6 2 full

Document Segmentation

Examples of signatures

Jean-Louis Koenig
President GGDA Region
MegaCorp International SA
Rue de Concours 2280
Bern, Switzerland

Robert Guilliam
Product Regulatory Affairs & Compliance
MegaCorp International
Neuchatel
Switzerland
Tél. +41 (31) 125 2366

Alberto Goreman
Manager Printing & Packaging, Eastern Region
+57 3 451 7195, [email protected]

Page 83: Georgetown lecture 2012 6 2 full

Finding words in context

Phrase | Total Instances
risks alienating some | 37
at serious risk of losing | 25
are certain risks inherent in | 16
are at risk of running | 15
it be risking anything by | 15
difference a risk o why | 14
and the risks inherent in | 12
without assuming any risk | 8
we could risk losing next | 7
avoid transferring risk to the | 5
requires taking risks and the | 4
can t risk not living | 3
and unknown risks and uncertainties | 2
a potential risk that was | 2
avoid transfering risk to the | 2

This increases coverage AND precision
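A phrase-frequency table like the one above can be produced by counting the words that surround each occurrence of a term. The sketch below assumes a fixed window of two words on each side and prefix matching on the stem; the slide does not state how its phrases were extracted, and the function name is hypothetical.

```python
import re
from collections import Counter


def phrases_in_context(texts, stem, before=2, after=2):
    """Count the word windows surrounding every token that starts with `stem`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok.startswith(stem):
                window = tokens[max(0, i - before): i + after + 1]
                counts[" ".join(window)] += 1
    return counts


table = phrases_in_context(
    [
        "We are at serious risk of losing the account.",
        "They are at serious risk of losing funding.",
        "without assuming any risk",
    ],
    stem="risk",
)
```

Sorting the counter with `table.most_common()` gives the phrases in descending frequency, which a reviewer can scan to decide which usages belong in the ontology.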

Page 84: Georgetown lecture 2012 6 2 full


Multi-Lingual Issues

Does language matter?
– Lucerne
– Luzerne
– Lucerna

These places were all the same city

Name of city not necessarily expressed in the same language as rest of document

In Europe, many email threads and documents are mixed language, and must be properly categorized as such

Page 85: Georgetown lecture 2012 6 2 full

Automated Ontology Expansion Tools

Currently implemented expansion modules:

Spelling variants:
color >> colour, defense >> defence, labeled >> labelled

Lemmatization (recovering uninflected form):
walking >> walk, ate >> eat

Morphological variants:
eat >> eats, eating, eaten, ate
hablar >> hablo, hablas, habla, hablan, habláis, hablamos

Number expansion:
$2.5B >> two point five billion dollars
2,567 >> two thousand five hundred sixty seven
13 >> 13th, thirteenth

Name variants:
Elizabeth Van der Beek >> “Liz Van der Beek”, “Liz Vander Beek”, “Van der Beek, Elizabeth”, “Beth Vanderbeek”, etc.

Email variants (mined from alias clusters file):
Elizabeth Van der Beek >> evanderbeek, liz.vanderbeek, vanderbeekl, emvanderbeek, etc.

Abbreviations:
administrative project meeting >> admin project meeting, admin project mtg, admin proj mtg, etc.
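As an illustration of the name and email expansion modules, the sketch below generates a few of the listed variants from a first name, a last name, and supplied nicknames. It is a hypothetical simplification: the slide says email variants are mined from an alias clusters file, whereas this example only builds them from fixed patterns, and it does not cover every variant on the slide.

```python
def name_variants(first: str, last: str, nicknames: tuple[str, ...] = ()) -> set[str]:
    """Combine first-name forms with spaced and run-together forms of the last name."""
    compact = last.replace(" ", "")  # "Van der Beek" -> "VanderBeek"
    last_forms = {last, compact, compact.capitalize()}
    variants = set()
    for given in (first, *nicknames):
        for family in last_forms:
            variants.add(f"{given} {family}")
            variants.add(f"{family}, {given}")
    return variants


def email_local_parts(first: str, last: str, nicknames: tuple[str, ...] = ()) -> set[str]:
    """Guess common mailbox names built from initials and the run-together last name."""
    compact = last.replace(" ", "").lower()
    parts = set()
    for given in (first, *nicknames):
        given = given.lower()
        parts.update({f"{given[0]}{compact}", f"{given}.{compact}", f"{compact}{given[0]}"})
    return parts


names = name_variants("Elizabeth", "Van der Beek", nicknames=("Liz", "Beth"))
mailboxes = email_local_parts("Elizabeth", "Van der Beek", nicknames=("Liz", "Beth"))
```

Generated variants are candidates only; viewing them in context, as the earlier Variant Generation slides suggest, is what confirms they refer to the same person.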