Search for Approximate Matches in Large Databases
Eugene Fink, Jaime Carbonell,
Aaron Goldstein, Philip Hayes
Motivation
Fast identification of approximate matches in large sets of records.
Applications:
• Medical databases
• Customer records
• National security
Table of records
We specify a table of records by a list of attributes.
Example
We can describe patients in a hospital by their sex, age, and diagnosis.
Records and queries
A record includes a specific value for each attribute.
A query may include lists of values and numeric ranges.
Example
Record
Sex: female
Age: 30
Dx: asthma
Query
Sex: male, female
Age: 20..40
Dx: asthma, flu
Query types
A point query includes a specific value for each attribute.
A region query includes lists of values or numeric ranges.
Example
Point query
Sex: female
Age: 30
Dx: asthma
Region query
Sex: male, female
Age: 20..40
Dx: asthma, flu
Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query.
[Figure: a record and a query region plotted along the Sex, Age, and Dx axes]
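The exact-match test can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the function name `matches_exactly`, and the reading of `20..40` as an inclusive range are assumptions, not details given on the slides.

```python
def matches_exactly(record, query):
    """Return True if every record value satisfies the query's constraint.

    Assumed encoding: a numeric range is a (low, high) tuple, inclusive;
    a list of allowed values is a set.
    """
    for attribute, value in record.items():
        allowed = query[attribute]
        if isinstance(allowed, tuple):      # numeric range
            low, high = allowed
            if not (low <= value <= high):
                return False
        elif value not in allowed:          # list of allowed values
            return False
    return True


record = {"sex": "female", "age": 30, "dx": "asthma"}
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
print(matches_exactly(record, query))
```

With the example record and region query from the earlier slides, the check succeeds, since each of the three values lies within the respective constraint.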
Approximate matches
A record is an approximate match for a query if it is “close” to the query region.
[Figure: a record and a query region plotted along the Sex, Age, and Dx axes]
Approximate queries
An approximate query includes:
• Point or region
• Distance function
• Number of matches
• Distance limit
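The four parts of an approximate query can be illustrated with a brute-force scan. The slides do not specify a distance function, so the one below is an assumption chosen for simplicity: a numeric attribute contributes its gap to the nearest end of the range, a categorical attribute contributes 1 when its value is not in the list, and the contributions are summed. The names `distance` and `approximate_matches` are hypothetical.

```python
import heapq


def distance(record, query):
    """Assumed distance from a record to a query region (0 means exact match)."""
    total = 0.0
    for attribute, value in record.items():
        allowed = query[attribute]
        if isinstance(allowed, tuple):      # inclusive numeric range
            low, high = allowed
            total += max(low - value, 0, value - high)
        elif value not in allowed:          # value outside the allowed list
            total += 1.0
    return total


def approximate_matches(records, query, max_matches, distance_limit):
    """Return up to max_matches closest records within distance_limit."""
    scored = ((distance(r, query), i) for i, r in enumerate(records))
    within = [(d, i) for d, i in scored if d <= distance_limit]
    return [(d, records[i]) for d, i in heapq.nsmallest(max_matches, within)]


records = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "fracture"},
]
query = {"sex": {"female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
result = approximate_matches(records, query, max_matches=3, distance_limit=5)
for d, r in result:
    print(d, r)
```

A linear scan like this touches every record, which is what an index is meant to avoid; it is shown only to make the query parameters concrete.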
Indexing structure
[Figure: example tree for six records: (male, 30, asthma), (female, 30, asthma), (male, 40, flu), (female, 50, flu), (female, 30, ulcer), (female, 30, fracture). The root branches on sex (male, female), the next level on age, and the lowest level on diagnosis.]
• Maintain a PATRICIA tree of records
• Group nodes into fixed-size disk blocks
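As a rough illustration of the tree shape, the sketch below builds a plain attribute-by-attribute trie from nested dictionaries. It is a simplification, not a PATRICIA tree: it does no path compression and does not group nodes into disk blocks. The attribute order (sex, age, diagnosis) and the name `build_trie` are assumptions.

```python
ATTRIBUTES = ("sex", "age", "dx")  # assumed branching order, root first


def build_trie(records):
    """Nest dicts by attribute value; the last level maps to lists of records."""
    root = {}
    for record in records:
        node = root
        for attribute in ATTRIBUTES[:-1]:
            node = node.setdefault(record[attribute], {})
        node.setdefault(record[ATTRIBUTES[-1]], []).append(record)
    return root


records = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "fracture"},
]
trie = build_trie(records)
print(sorted(trie))                  # values at the root level
print(sorted(trie["female"][30]))    # diagnoses of 30-year-old female patients
```

Records that share a prefix of attribute values share the path from the root, so a query that rules out a value at some level can skip the whole subtree below it.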
Search for matches
[Figure: the example tree from the indexing-structure slide, with six records under nodes that branch on sex, then age, then diagnosis]
• Depth-first search for exact matches
• Best-first search for approximate matches
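The two search strategies can be sketched on a small in-memory trie. Everything here is an assumption made for illustration: the nested-dictionary trie (no PATRICIA compression, no disk blocks), the additive distance function, and all function names. Because the assumed distance is a sum of non-negative per-attribute terms, the distance accumulated along a partial path is a lower bound for every record below it, which is what lets best-first search return records in order of increasing distance.

```python
import heapq
from itertools import count

ATTRIBUTES = ("sex", "age", "dx")  # assumed branching order, root first


def build_trie(rows):
    """Nested dicts keyed by attribute value; a leaf holds the full row."""
    root = {}
    for row in rows:
        node = root
        for value in row[:-1]:
            node = node.setdefault(value, {})
        node[row[-1]] = row  # sketch only: duplicate rows would overwrite
    return root


def value_distance(value, allowed):
    """Assumed distance from one value to one query constraint."""
    if isinstance(allowed, tuple):           # inclusive numeric range
        low, high = allowed
        return max(low - value, 0, value - high)
    return 0 if value in allowed else 1      # set of allowed values


def exact_matches(node, query, depth=0):
    """Depth-first search: descend only into branches at distance zero."""
    if depth == len(ATTRIBUTES):
        yield node
        return
    allowed = query[ATTRIBUTES[depth]]
    for value, child in node.items():
        if value_distance(value, allowed) == 0:
            yield from exact_matches(child, query, depth + 1)


def approximate_matches(root, query, max_matches, distance_limit):
    """Best-first search: always expand the path with the least distance."""
    tie = count()                            # keeps the heap from comparing nodes
    frontier = [(0, next(tie), 0, root)]
    found = []
    while frontier and len(found) < max_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(ATTRIBUTES):         # a leaf: the closest remaining record
            found.append((dist, node))
            continue
        allowed = query[ATTRIBUTES[depth]]
        for value, child in node.items():
            d = dist + value_distance(value, allowed)
            if d <= distance_limit:          # prune branches beyond the limit
                heapq.heappush(frontier, (d, next(tie), depth + 1, child))
    return found


rows = [
    ("male", 30, "asthma"), ("female", 30, "asthma"), ("male", 40, "flu"),
    ("female", 50, "flu"), ("female", 30, "ulcer"), ("female", 30, "fracture"),
]
trie = build_trie(rows)
query = {"sex": {"female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
exact = list(exact_matches(trie, query))
approx = approximate_matches(trie, query, max_matches=3, distance_limit=5)
print(exact)
print(approx)
```

In this example the branch for age 50 is never expanded, because its accumulated distance of 10 already exceeds the limit of 5.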
Performance
Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002:
• Twenty-one attributes
• 1.6 million records
Use of a Pentium computer:
• 2.4 GHz CPU
• 1 GByte memory
• 400 MHz bus
Variables
Control variables:
• Number of records
• Memory size
• Query type
Measurements:
• Retrieval time
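Small memory
• Number of records: 100 to 1,672,016
• Memory size: 4 MByte
[Plot: retrieval time (msec, log scale from 1 to 100) versus number of records (log scale from 10^2 to 10^6), with curves for range queries, approximate queries, and exact queries and a marker for the available memory. Growth-rate annotations on the curves: lg n, n^0.15, lg n, n^0.5.]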
Large memory
• Number of records: 1,672,016
• Memory size: 64 to 1,024 MByte
[Plot: retrieval time (msec, log scale from 1 to 10,000) versus memory size (64, 128, 256, 512, and 1,024 MBytes), with curves for range queries, approximate queries, and exact queries.]
Summary
• Retrieval time grows as a fractional power (about 0.5) of the database size
• If we extrapolate this growth rate, retrieval times are reasonable for very large databases
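
The extrapolation can be made concrete with a small calculation. If retrieval time scales as n^0.5, a 100-fold increase in the number of records multiplies the time by 10. The function name and the 10 msec baseline below are hypothetical values chosen for the arithmetic, not measurements from the experiments.

```python
def extrapolate_time(measured_msec, measured_records, target_records, exponent=0.5):
    """Scale a measured retrieval time assuming time grows as n ** exponent."""
    return measured_msec * (target_records / measured_records) ** exponent


# Hypothetical baseline: 10 msec at one million records.
estimate = extrapolate_time(10.0, 1_000_000, 100_000_000)
print(estimate)
```

Such an estimate holds only as far as the observed growth rate continues beyond the measured range.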