Search for Approximate Matches in Large Databases Eugene Fink Jaime Carbonell Aaron Goldstein Philip Hayes

Post on 25-Feb-2016

Page 1: Search for Approximate Matches in Large Databases

Search for Approximate Matches in Large Databases

Eugene Fink, Jaime Carbonell, Aaron Goldstein, Philip Hayes

Page 2: Search for Approximate Matches in Large Databases

Motivation

Fast identification of approximate matches in large sets of records.

Applications:
• Medical databases
• Customer records
• National security

Page 3: Search for Approximate Matches in Large Databases

Outline

• Records and queries

• Search for matches

• Experimental results

Page 4: Search for Approximate Matches in Large Databases

Table of records

We specify a table of records by a list of attributes.

Example
We can describe patients in a hospital by their sex, age, and diagnosis.

Page 5: Search for Approximate Matches in Large Databases

Records and queries
A record includes a specific value for each attribute.
A query may include lists of values and numeric ranges.

Example

Record
Sex: female
Age: 30
Dx: asthma

Query
Sex: male, female
Age: 20..40
Dx: asthma, flu

Page 6: Search for Approximate Matches in Large Databases

Query types
A point query includes a specific value for each attribute.
A region query includes lists of values or numeric ranges.

Example

Point query
Sex: female
Age: 30
Dx: asthma

Region query
Sex: male, female
Age: 20..40
Dx: asthma, flu

Page 7: Search for Approximate Matches in Large Databases

Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query.

[Figure: a record and a query region plotted against the Age, Sex, and Dx axes]
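The exact-match rule can be sketched in a few lines of Python. The representation here is an assumption for illustration, not the authors' implementation: a record is a dict, a list of values is a set, and a numeric range is a (low, high) tuple with inclusive ends.

```python
# Hypothetical representation (not from the slides): a record maps each
# attribute to a value; a query maps each attribute to either a set of
# allowed values or an inclusive (low, high) numeric range.

def value_matches(value, constraint):
    """Return True if value lies in a set of values or an inclusive range."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return low <= value <= high
    return value in constraint

def is_exact_match(record, query):
    """A record matches exactly if every value falls in the query's range."""
    return all(value_matches(record[attr], c) for attr, c in query.items())

# The record and region query from the example slide.
record = {"sex": "female", "age": 30, "dx": "asthma"}
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
print(is_exact_match(record, query))
```

A point query is the special case in which every set holds one value and every range has equal ends.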

Page 8: Search for Approximate Matches in Large Databases

Approximate matches
A record is an approximate match for a query if it is “close” to the query region.

[Figure: a record and a query region plotted against the Age, Sex, and Dx axes]
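The slides do not say how closeness is measured, so the following is only one possible distance, chosen for illustration: a numeric value contributes its gap to the nearest end of the range, a categorical value contributes 0 if it is listed and 1 otherwise, and the contributions are summed. The function names are hypothetical.

```python
# Assumed distance function (the slides leave it unspecified).
# Ranges are inclusive (low, high) tuples; value lists are sets.

def attribute_distance(value, constraint):
    """Distance from one value to one attribute constraint."""
    if isinstance(constraint, tuple):
        low, high = constraint
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0
    return 0 if value in constraint else 1

def distance_to_region(record, query):
    """Sum of per-attribute distances; zero means an exact match."""
    return sum(attribute_distance(record[a], c) for a, c in query.items())

query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
near = {"sex": "female", "age": 45, "dx": "ulcer"}
print(distance_to_region(near, query))  # age gap 5 + dx mismatch 1 = 6
```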

Page 9: Search for Approximate Matches in Large Databases

Approximate queries
An approximate query includes:

• Point or region

• Distance function

• Number of matches

• Distance limit
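The four components listed above can be bundled into one object. The sketch below is an illustration under stated assumptions: the class name, field names, and the gap-based distance are hypothetical, and the query is answered by a brute-force scan rather than by the indexed search the talk describes.

```python
import heapq
from dataclasses import dataclass
from typing import Callable

@dataclass
class ApproximateQuery:
    region: dict            # point or region: sets of values, (low, high) ranges
    distance: Callable      # distance function: (record, region) -> number
    max_matches: int        # number of matches to return
    distance_limit: float   # ignore records farther than this

def gap_distance(record, region):
    """Assumed distance: numeric gap to the range, 0/1 for listed values."""
    total = 0
    for attr, constraint in region.items():
        value = record[attr]
        if isinstance(constraint, tuple):
            low, high = constraint
            total += max(low - value, 0, value - high)
        else:
            total += 0 if value in constraint else 1
    return total

def linear_scan(records, q):
    """Return up to q.max_matches closest records within q.distance_limit."""
    scored = ((q.distance(r, q.region), i) for i, r in enumerate(records))
    within = [(d, i) for d, i in scored if d <= q.distance_limit]
    return [(d, records[i]) for d, i in heapq.nsmallest(q.max_matches, within)]

records = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
]
query = ApproximateQuery({"age": (20, 40), "dx": {"asthma"}}, gap_distance, 2, 5)
result = linear_scan(records, query)
```

The scan touches every record, which is exactly the cost an index is meant to avoid.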

Page 10: Search for Approximate Matches in Large Databases

Outline

• Records and queries

• Search for matches

• Experimental results

Page 11: Search for Approximate Matches in Large Databases

Indexing structure

[Figure: tree of the six example records, branching first on sex (male, female), then on age (30, 40, 50), then on diagnosis (asthma, flu, fracture, ulcer)]

Example records:
• male, 30, asthma
• female, 30, asthma
• male, 40, flu
• female, 50, flu
• female, 30, ulcer
• female, 30, fracture

• Maintain a PATRICIA tree of records
• Group nodes into fixed-size disk blocks
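To make the shape of the index concrete, here is a minimal sketch that builds a tree over the six example records, branching on sex, then age, then diagnosis. It is a plain trie of nested dicts and a simplification: it does not perform the path compression of a real PATRICIA tree, and it does not group nodes into disk blocks. All names are hypothetical.

```python
# Simplified stand-in for the index: one tree level per attribute.
ATTRIBUTES = ("sex", "age", "diagnosis")

def insert(tree, record):
    """Walk or create one branch per attribute value; store the record at the leaf."""
    node = tree
    for value in record[:-1]:
        node = node.setdefault(value, {})
    node.setdefault(record[-1], []).append(record)

def build_tree(records):
    tree = {}
    for record in records:
        insert(tree, record)
    return tree

records = [
    ("male", 30, "asthma"),
    ("female", 30, "asthma"),
    ("male", 40, "flu"),
    ("female", 50, "flu"),
    ("female", 30, "ulcer"),
    ("female", 30, "fracture"),
]
tree = build_tree(records)
print(sorted(tree["female"][30]))  # diagnoses stored under female, age 30
```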

Page 12: Search for Approximate Matches in Large Databases

Search for matches

[Figure: the tree of example records, branching on sex, then age, then diagnosis]

• Depth-first search for exact matches
• Best-first search for approximate matches
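Both search strategies can be sketched on a small in-memory tree. This is an illustration, not the authors' code: the tree is a hand-written nested dict of the example records, the distance is an assumed sum of per-attribute gaps, and the best-first search orders partial paths by the distance accumulated so far, which is a lower bound on the distance of any record beneath them.

```python
import heapq
from itertools import count

ATTRIBUTES = ("sex", "age", "diagnosis")

# Hand-built tree of the six example records (hypothetical layout).
TREE = {
    "male": {30: {"asthma": [("male", 30, "asthma")]},
             40: {"flu": [("male", 40, "flu")]}},
    "female": {30: {"asthma": [("female", 30, "asthma")],
                    "fracture": [("female", 30, "fracture")],
                    "ulcer": [("female", 30, "ulcer")]},
               50: {"flu": [("female", 50, "flu")]}},
}

def gap(value, constraint):
    """Assumed distance: numeric gap to a (low, high) range, 0/1 for a set."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return max(low - value, 0, value - high)
    return 0 if value in constraint else 1

def exact_matches(node, query, depth=0):
    """Depth-first search: descend only into branches that satisfy the query."""
    if depth == len(ATTRIBUTES):
        yield from node
        return
    constraint = query[ATTRIBUTES[depth]]
    for value, child in node.items():
        if gap(value, constraint) == 0:
            yield from exact_matches(child, query, depth + 1)

def approximate_matches(tree, query, max_matches, distance_limit):
    """Best-first search: always expand the node with the smallest distance so far."""
    tie = count()  # unique tie-breaker so the heap never compares dict nodes
    frontier = [(0, next(tie), 0, tree)]
    results = []
    while frontier and len(results) < max_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(ATTRIBUTES):
            results.extend((dist, record) for record in node)
            continue
        constraint = query[ATTRIBUTES[depth]]
        for value, child in node.items():
            d = dist + gap(value, constraint)
            if d <= distance_limit:  # prune branches beyond the limit
                heapq.heappush(frontier, (d, next(tie), depth + 1, child))
    return results[:max_matches]

query = {"sex": {"female"}, "age": (20, 40), "diagnosis": {"asthma", "flu"}}
exact = list(exact_matches(TREE, query))
approx = approximate_matches(TREE, query, max_matches=3, distance_limit=2)
```

Because the accumulated distance never decreases along a path, records leave the queue in order of increasing distance, so the search can stop as soon as it has collected the requested number of matches.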

Page 13: Search for Approximate Matches in Large Databases

Outline

• Records and queries

• Search for matches

• Experimental results

Page 14: Search for Approximate Matches in Large Databases

Performance

Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002:
• Twenty-one attributes
• 1.6 million records

Use of a Pentium computer:
• 2.4 GHz CPU
• 1 Gbyte memory
• 400 MHz bus

Page 15: Search for Approximate Matches in Large Databases

Variables

Control variables:

• Number of records

• Memory size

• Query type

Measurements:

• Retrieval time

Page 16: Search for Approximate Matches in Large Databases

Small memory
• Number of records: 100 to 1,672,016
• Memory size: 4 MByte

[Figure: retrieval time (msec, log scale from 1 to 100) against number of records (10^2 to 10^6) for range queries, approximate queries, and exact queries, with the available memory marked; curve annotations: lg n, n^0.15, lg n, n^0.5]

Page 17: Search for Approximate Matches in Large Databases

Large memory
• Number of records: 1,672,016
• Memory size: 64 to 1,024 MByte

[Figure: retrieval time (msec, log scale from 1 to 10,000) against memory size (64, 128, 256, 512, and 1,024 MBytes) for range queries, approximate queries, and exact queries]

Page 18: Search for Approximate Matches in Large Databases

Summary
• Retrieval time grows as a fractional power (about 0.5) of database size
• If we extrapolate this growth rate, retrieval times are reasonable for very large databases

Page 19: Search for Approximate Matches in Large Databases

Summary
• Retrieval time grows as a fractional power (about 0.5) of database size
• If we extrapolate this growth rate, retrieval times are reasonable for very large databases:

Number of records (n)    n^0.5 time (seconds)
1,000,000                 0.05
100,000,000               0.50
10,000,000,000            5.00
1,000,000,000,000        50.00
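The extrapolated times follow from scaling the first table row by the square root of the growth in database size. The sketch below reproduces that arithmetic; the function name and parameters are hypothetical, and the base point (0.05 seconds at 1,000,000 records) is taken from the table.

```python
def extrapolated_time(n, base_n=1_000_000, base_seconds=0.05, exponent=0.5):
    """Scale a measured time by (n / base_n) ** exponent."""
    return base_seconds * (n / base_n) ** exponent

for n in (10**6, 10**8, 10**10, 10**12):
    print(f"{n:>17,}  {extrapolated_time(n):6.2f}")
```

Each hundredfold increase in the number of records multiplies the estimated time by ten.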