Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment

Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou
Research & Development
Thomson Legal & Regulatory – West Group
St. Paul, Minnesota 55123 USA
{Jack.Conrad,Peter.Jackson}@WestGroup.com


TRANSCRIPT

Page 1: Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou Research & Development

Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment

Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou
Research & Development
Thomson Legal & Regulatory – West Group
St. Paul, Minnesota 55123 USA
{Jack.Conrad,Peter.Jackson}@WestGroup.com

Page 2

20-23 Aug. 2002, 28th International VLDB '02 — J. Conrad

Growth of Online Databases: Westlaw and Westnews

[Chart: number of databases (0 to 16,000) over time, plotted for Westlaw and Westnews]

Page 3

Outline

• Terminology
• Overview
• Research Contributions (Novelty of Investigation)
• Corpora Statistics
• Experimental Set-up
  - Phase 1: Actual Physical Resources
  - Phase 2: Acquired Logical Resources
• Performance Evaluation
• Conclusions
• Future Work

Background to vocabulary, including that used in the title. An overview of our operational environment and overall problem space. Aspects of the problem that haven't been explored before, especially with respect to scale & production systems. We'll look at the data sets used, namely those listed for the next item. We'll compare the effectiveness of each approach on each data set. I'll share what conclusions we're able to draw and discuss new directions this work may be taking.

Page 4

Terminology

Database Selection
• Given O(10K) DBs composed of textual documents, we need to effectively & efficiently help users narrow their information search

Actual Physical Resources
• There exist O(1K) underlying physical DBs that can be leveraged to reduce the dimensionality of the problem
• We have access to the complete term distributions associated with these DBs

Acquired Logical Resources
• We can re-architect the underlying DBs along domain- and user-centric content-types (e.g., Region, Topic, Doc-type, etc.)
• We can then profile those DBs using random or query-based sampling

Database selection helps users hone in on the most relevant materials available in the system. The physical DBs are organized around internal criteria such as publication year, h/w system, etc. We can characterize the "logical" DBs using different sampling techniques. We wanted to convince ourselves that we could first get reasonable results at this level.
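The two profiling routes just mentioned, random and query-based sampling, can be sketched in a few lines. This is a toy illustration, not the production profiler: the document set, the substring "search" standing in for query execution, and all function names are assumptions.

```python
import random
from collections import Counter

def random_sample_profile(collection, n_docs=500):
    """Profile a DB by random sampling: draw up to n_docs documents
    and accumulate their term frequencies into a profile."""
    sample = random.sample(collection, min(n_docs, len(collection)))
    profile = Counter()
    for doc in sample:
        profile.update(doc.split())
    return profile

def query_based_profile(collection, probe_queries, n_docs=500):
    """Profile a DB by query-based sampling: run probe queries and
    profile the documents they retrieve, up to n_docs documents."""
    seen, profile = set(), Counter()
    for q in probe_queries:
        # toy 'search engine': substring match stands in for a real query run
        for i, doc in enumerate(collection):
            if q in doc and i not in seen:
                seen.add(i)
                profile.update(doc.split())
                if len(seen) >= n_docs:
                    return profile
    return profile
```

Either routine yields a term-frequency profile that the scoring models on the following slides can consume.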

Page 5

Overview: Operational Environment

• Over 15,000 databases, each consisting of 1,000s of docs
• Over one million U.S. attorneys
• Thousands of other users in the UK, Canada, Australia, …
• O(100K) queries submitted to the Westlaw system each day

Motivations for Re-architecting the System
• Showcasing 1000s of DBs is typically a competitive advantage
• A segment of today's users prefers global search environments
• Simplifies the activity of narrowing scope for online research
  - User- & domain-centric rather than hardware- or maintenance-centric
  - Primarily concentrating on areas of law and business
• Toolkit approach to DBs and DB Selection tools
  - Diverse mechanisms for focusing on relevant information

This is an overview of Westlaw's operational environment. Several hundred thousand queries are submitted each day. Each mechanism is optimized for a particular level of granularity. We currently require our users to submit a DB ID.

Page 6

Contributions of Research

• Represent O(10,000) DBs
• DBs can contain O(100,000) documents
• Collection sizes vary by several orders of magnitude
• Documents can appear in more than one DB
• DBs are cumulatively in the TB, not GB, range
• Docs represent a real, not simulated, domain
• Implemented in an actual production environment

The work reported here involves between 2 and 3 TB.

Page 7

Westlaw Architectural Issues: Physical vs. "Logical" Databases

[Diagram: O(1000) physical DBs (Case_Law 2002/03, Statutes 2002/04, WestNews 2002/05, Analytical 2002/06, Regulatory 2002/07) re-architected into O(100) logical DBs segmented by Jurisdiction (Fed., State, Local, Int'l), Legal Practice Area, and Doc-Type; note the order-of-magnitude difference]

Traditionally, data for the Westlaw system were physically stored in silos dictated by internal considerations, that is, those that facilitated storage and maintenance (publication year, aggregate content type, or source), rather than by categories of data that made sense to system users in the legal domain, categories such as legal jurisdiction (region), legal practice area, or document-type (e.g., congressional legislation, treatises, jury verdicts, etc.). Re-architecting the WL repository along such lines to achieve our logical data sets was our primary objective. The three columns labeled in red represent the three primary bases for segmentation; the rows labeled in blue are the residual sub-groupings resulting from this strategy.

Page 8

Corpora Statistics

Collection Information       Physical Databases (Phase 1)   Logical Databases (Phase 2)
Number of Collections        1000                           128
Collections Profiled         100                            128
Standard Docs / Profile      All                            500 / 1000
Average Docs / Collection    298,935                        378,468
Average Tokens / Profile     97,299                         22,296 / 47,450

For the physical databases, each document participates in the profile, so a profile is basically the entire dictionary. For the logical databases, profiles were built via sampling; Callan found that 300 docs sufficed. The collections profiled represent roughly 40% of WL in Phase 1 and roughly 90% of WL in Phase 2, and the sampled profiles are roughly 25% & 50% of the complete dictionary.

Page 9

Alternative Scoring Models

Scoring: CORI 1-2-3
• tf-idf based, representing df-icf
• absent terms given a default belief probability
Engine: WIN (Bayesian Inference Network)
Data: Collection Profiles
• Complete term distributions (Phase 1)
• Random & query-based sampled term distributions (Phase 2)

Scoring: Language Model
• occurrence based, via df + cf
• smoothing techniques used on absent terms
Engine: Statistical (term / concept probabilities)
Data: Collection Profiles
• Complete term distributions (Phase 1)
• Random & query-based sampled term distributions (Phase 2)

Page 10

tf * idf Scoring — Cori_Net3

  T = d_t + (1 - d_t) * df / (df + K)
  I = log( (|C| + 0.5) / cf ) / log( |C| + 1.0 )
  p(w_i | c_j) = d_b + (1 - d_b) * T * I

• Similar to Cori_Net2 but normalized w/o layered variables

The belief p(wi|cj) in collection cj due to observing term wi is determined by db + (1 - db) * T * I, where db is the minimum belief component when term wi occurs in collection cj. I is the collection-retrieval equivalent of normalized inverse document frequency (idf). Typically this tf-type expression is normalized by df_max, but here we introduce K, which has been inspired by experiments in document retrieval. Our K is different from anything Callan or others have used; they have a set of parameters that are successively wrapped around each other …
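The Cori_Net3 belief computation above can be written out directly. A minimal sketch: the default-belief values d_t = d_b = 0.4 and the choice of K here are illustrative assumptions, not values taken from the experiments.

```python
import math

def cori_belief(df, cf, num_collections, K, d_t=0.4, d_b=0.4):
    """Belief p(w_i|c_j) that collection c_j satisfies query term w_i,
    following the Cori_Net3 form on the slide.
    df: number of docs in c_j containing w_i
    cf: number of collections containing w_i
    d_t, d_b: default/minimum belief components (assumed values)."""
    T = d_t + (1 - d_t) * df / (df + K)   # tf-like component, normalized by K
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)  # icf, the idf analogue
    return d_b + (1 - d_b) * T * I
```

The belief grows with df (the term is common inside the collection) and shrinks with cf (the term appears in many collections), mirroring tf-idf at the collection level.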

Page 11

Language Modeling

  P_sum(w|d) = λ * P_doc(w|d) + (1 - λ) * P_db(w)
  P(Q|d) = Π_{i=1..m} P(w_i|d)

• Weighted Sum Approach (Additive Model)
• Query Treated as a Sequence of Terms (Independent Events)

The mixing weight λ is of course between 0 and 1. An LM based only on a profile doc may face sparse-data problems when the probability of a word w given a profile 'doc' is 0 (an unobserved event), so it may be useful to extend the original document model with a db model. An additive model can help by leveraging extra evidence from the complete collection of profiles: by summing in the contribution of a word at the db level, we can mitigate the uncertainty associated with sparse data in the non-additive model. We treat the query as a sequence of terms, with each term viewed as a separate event and the query representing the joint event (this permits duplicate terms and phrasal expressions).
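The additive model above can be sketched as a small scoring function, computed in log space. The mixing weight lam = 0.5, the argument names, and the toy counts in the test are assumptions for illustration.

```python
import math

def lm_collection_score(query_terms, profile_tf, profile_len, db_tf, db_len, lam=0.5):
    """Additive (linearly smoothed) language-model score of a collection
    profile 'document' d for query Q, per the slide:
      P(w|d) = lam * P_doc(w|d) + (1 - lam) * P_db(w)
      P(Q|d) = product over the query's term sequence of P(w|d)."""
    log_p = 0.0
    for w in query_terms:
        p_doc = profile_tf.get(w, 0) / profile_len   # may be 0: the sparse-data case
        p_db = db_tf.get(w, 0) / db_len              # db-level back-off evidence
        p = lam * p_doc + (1 - lam) * p_db
        if p == 0.0:
            return float("-inf")   # term unseen even at the db level
        log_p += math.log(p)
    return log_p
```

A term absent from the profile but present at the db level still contributes a nonzero probability, which is exactly the smoothing the notes describe.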

Page 12

Test Queries and Relevance Judgments

Actual user submissions to the DBS application
• Phase 1 (Physical Collections): 250 queries, mean length 8.0 terms
• Phase 2 (Logical Collections): 100 queries, mean length 8.8 terms

Complete Relevance Judgments
• Provided by domain experts before the experiments were run
• Followed training exercises to establish consistency
• Mean positive relevance judgments per query:
  - Phase 1 (Physical Collections): 17.0
  - Phase 2 (Logical Collections): 9.1

Why did we use a different query set for Phase 2? We wanted queries that were less general, more specific, with fewer positive relevance judgments per query.

Page 13

Retrieval Experiments

Database-level Test Parameters:
• 100 physical DBs vs. 128 logical DBs
• For logical DB profiles: query-based vs. random sampling
  - phrasal concepts vs. terms only
  - stemming vs. no stemming
  - scaling vs. none (i.e., global frequency reduction)
  - minimum term frequency thresholds

Performance Metrics:
• Standard precision at 11-point recall
• Precision at N-database cut-offs

It's important to point out that our initial experiments were at the database level. Some of the variables we examined are indicated here: queries with phrasal concepts versus terms only, and stemmed terms versus unstemmed terms. The scaling option was inspired by speech recognition experiments (noise). We'll see some examples of these next.
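The first performance metric listed, precision at 11-point recall, can be computed per query as follows. This is the standard interpolated formulation, sketched for clarity rather than the evaluation code actually used.

```python
def eleven_point_precision(ranked_dbs, relevant):
    """Interpolated precision at the 11 standard recall points
    (0.0, 0.1, ..., 1.0) for a single query: precision at recall r is
    the maximum precision observed at any recall >= r."""
    observed = []   # (recall, precision) at each relevant hit in the ranking
    hits = 0
    for rank, db in enumerate(ranked_dbs, start=1):
        if db in relevant:
            hits += 1
            observed.append((hits / len(relevant), hits / rank))
    points = []
    for r in (x / 10 for x in range(11)):
        later = [p for rec, p in observed if rec >= r]
        points.append(max(later) if later else 0.0)
    return points
```

Averaging these 11-point vectors over the query set yields the recall-precision curves shown on the next slides.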

Page 14

DBS -- Phase 1: 100 Physical Collections
Cori_Net2 vs. LM (250 Queries)

[Chart: Precision (percent) vs. Recall (percent) [11-point], plotted for Cori2_all and LM_all]

This essentially represents the best from both methods for this phase. We see LM clearly outperforms CORI by more than 10% at the first recall points. Performance is averaged over 250 queries. This result is consistent with recent results in the document retrieval domain.

Page 15

DBS -- Phase 2: 128 Logical Collections
Cori_Net2 vs. LM (100 Queries)

[Chart: Precision (percent) vs. Recall (percent) [11-point], plotted for Baseline_11pt, Cori2_0.6_300_stem, and LM_Rand_500_1]

When we move to the logical collections, we see a reversal in this relative performance. We include the baseline in this case because it is relatively close to that of the two techniques. The average precision of the two may be similar, but CORI is significantly better than the other LM results here (Rand_1000 and QBS 500+1000).

Page 16

DBS -- Phase 2: 128 Logical Collections
Enhanced Cori_Net3 (100 Queries)

[Chart: Precision (percent) vs. Recall (percent) [11-point], plotted for baseline_11pt, Cori3_400_1.0, Cori3_400_1.0_Lex, and Cori3_400_1.0_Lex+]

This is the final plot to be exhibited. Here we explore a special post-process lexical analysis of queries for jurisdictionally relevant content; i.e., when no such context is found, jurisdictionally biased collections are down-weighted. For results marked Lex, the process is applied only to queries with no jurisdictional clues. For results marked Lex+, we apply the reranking to all queries, but leave the DBs that match the lexical clues in their original ranks.
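The Lex idea described in these notes can be sketched as a simple re-ranking pass. The clue list, the biased-DB set, and the penalty factor below are all illustrative assumptions, not the production values.

```python
def lexical_rerank(scored_dbs, query, juris_clues, juris_biased, penalty=0.5):
    """Post-process lexical analysis (Lex variant sketch): if the query
    contains no jurisdictional clue, down-weight jurisdictionally biased
    collections and re-sort; otherwise leave the ranking untouched."""
    has_clue = any(clue in query.lower() for clue in juris_clues)
    if has_clue:
        return scored_dbs   # the query supplies its own jurisdictional context
    rescored = [(db, score * penalty) if db in juris_biased else (db, score)
                for db, score in scored_dbs]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The Lex+ variant would apply the reranking to all queries while pinning the clue-matching DBs at their original ranks.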

Page 17

Performance Evaluation

WIN using CORI scoring
• Works better for Logical collections than Physical collections
• Best results from randomly sampled DBs

Language Modeling with basic smoothing
• Performs best for Physical collections; less well for Logical
• Top results from randomly sampled DBs

Jurisdictional Lexical Analysis contributes more than 10% to average precision

Precision at initial recall point:

               CORI    LM
Physical DBs   70%     80%
Logical DBs    85+%    70%

These results don't agree with Callan's, but he was operating in a non-cooperating environment. And as we saw, when we add our post-process lexical analysis, precision increases by over 10% at the top recall points.

Page 18

Document-level Relevance

Relevance Category     Quantity   Percentage   % Relevant (Cumulative)
On Point               1415       56.60%       56.60%
Relevant               439        17.56%       74.16%
Marginally Relevant    199        7.96%        82.12%
Not Relevant           447        17.88%       ———
Combined               2500       100.00%      82.12%

We took 25% of our Phase 2 queries and ran them against the top 5 CORI-ranked DBs, then evaluated the top 20 documents (2,500 docs total); this is what resulted. The "On Point" category surpasses the next three categories combined.

Page 19

Conclusions

• WIN using CORI scoring is more effective than the current LM for environments that harness database profiling via sampling
• Language Modeling is more sensitive to sparse-data issues
• Post-process Lexical Analysis contributes significantly to performance
• Random-sampling profile creation outperforms query-based sampling in the WL environment

Page 20

Future Work

• Document Clustering: basis for new categories of databases
• Language Modeling: harness robust smoothing techniques; measure their contribution to logical DB performance
• Actual document-level relevance: expand the set of relevance judgments; assess doc scores based on both DB + doc beliefs
• Bi-modal User Analysis: complete automation vs. user interaction in DBS

Clustering may show promise for domains in which we know much less about the pre-existing doc structure. For LM, we would be competing with high performance thanks to CORI; smoothing candidates include simple, linear, smallest binomial, finite element, and b-spline.

Page 21

Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment

Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou
Research & Development
Thomson Legal & Regulatory – West Group
St. Paul, Minnesota 55123 USA
{Jack.Conrad,Peter.Jackson}@WestGroup.com

Page 22

Related Work

• L. Gravano, et al., Stanford (VLDB 1995)
  - Presented the GlOSS system to assist in the DB selection task
  - Used 'Goodness' as a measure of effectiveness
• J. French, et al., U. Virginia (SIGIR 1998)
  - Came up with metrics to evaluate DB selection systems
  - Began to compare the effectiveness of different methods
• J. Callan, et al., UMass (SIGIR 95+99, CIKM 2000)
  - Developed the Collection Retrieval Inference Net (CORI)
  - Showed CORI was more effective than GlOSS, CVV, and others

Page 23

Background

Exponential growth of data sets on Web and in commercial enterprises

Limited means of narrowing scope of searches to relevant databases

Application challenges in large domain-specific operational environments

Need effective approaches that scale and deliver in focused production systems