![Page 1: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/1.jpg)
Search for Approximate Matches in Large Databases
Eugene Fink, Jaime Carbonell, Aaron Goldstein, Philip Hayes
![Page 2: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/2.jpg)
Motivation
Fast identification of approximate matches in large sets of records.
Applications:
• Medical databases
• Customer records
• National security
![Page 3: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/3.jpg)
Outline
• Records and queries
• Search for matches
• Experimental results
![Page 4: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/4.jpg)
Table of records
We specify a table of records by a list of attributes.
Example
We can describe patients in a hospital by their sex, age, and diagnosis.
![Page 5: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/5.jpg)
Records and queries
A record includes a specific value for each attribute.
A query may include lists of values and numeric ranges.
Example
Record
• Sex: female
• Age: 30
• Dx: asthma
Query
• Sex: male, female
• Age: 20..40
• Dx: asthma, flu
![Page 6: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/6.jpg)
Query types
A point query includes a specific value for each attribute.
A region query includes lists of values or numeric ranges.
Example
Point query
• Sex: female
• Age: 30
• Dx: asthma
Region query
• Sex: male, female
• Age: 20..40
• Dx: asthma, flu
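The two query types can be written down as plain data. The sketch below is one possible encoding, not the authors' representation: the `Range` class, the dictionary layout, and the helper `is_point_query` are all hypothetical names introduced for illustration. Value lists are Python sets and numeric ranges are inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Inclusive numeric range, such as Age: 20..40 (hypothetical helper)."""
    low: float
    high: float


def is_point_query(query):
    """Return True if every attribute is constrained to one specific value."""
    for constraint in query.values():
        if isinstance(constraint, Range):
            if constraint.low != constraint.high:
                return False
        elif len(constraint) != 1:
            return False
    return True


# The example queries from the slide, in this assumed encoding.
point_query = {"sex": {"female"}, "age": Range(30, 30), "dx": {"asthma"}}
region_query = {
    "sex": {"male", "female"},
    "age": Range(20, 40),
    "dx": {"asthma", "flu"},
}
```

Under this encoding a point query is simply a region query whose lists have one element and whose ranges have equal endpoints.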
![Page 7: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/7.jpg)
Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query.
[Figure: a record and a query region plotted on Age, Sex, and Dx axes.]
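The exact-match definition translates into a short membership test. This is a minimal sketch under assumed conventions (none of them from the slides): a record is a dictionary of attribute values, a value list is a set, and a numeric range is an inclusive `(low, high)` tuple. The function name `matches_exactly` is hypothetical.

```python
def matches_exactly(record, query):
    """Return True if every record value lies in the query's respective range."""
    for attribute, allowed in query.items():
        value = record[attribute]
        if isinstance(allowed, tuple):
            # Numeric range, inclusive on both ends.
            low, high = allowed
            if not (low <= value <= high):
                return False
        elif value not in allowed:
            # List of admissible values.
            return False
    return True


query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
record = {"sex": "female", "age": 30, "dx": "asthma"}
too_old = {"sex": "female", "age": 50, "dx": "flu"}
```

Here `matches_exactly(record, query)` holds, while `too_old` fails on the age range alone.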
![Page 8: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/8.jpg)
Approximate matches
A record is an approximate match for a query if it is “close” to the query region.
[Figure: a record and a query region plotted on Age, Sex, and Dx axes.]
![Page 9: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/9.jpg)
Approximate queries
An approximate query includes:
• Point or region
• Distance function
• Number of matches
• Distance limit
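The four components of an approximate query can be sketched as the arguments of one function. The slides do not define the distance function, so the one below is an assumption chosen for illustration: a numeric value contributes its gap to the range, a categorical value contributes 0 if listed and 1 otherwise, and the contributions are summed. All names are hypothetical, and the linear scan stands in for the indexed search described later.

```python
import heapq


def attribute_distance(value, allowed):
    """Distance from one value to its constraint (assumed metric)."""
    if isinstance(allowed, tuple):
        low, high = allowed
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0
    return 0 if value in allowed else 1


def distance(record, query):
    """Sum of per-attribute distances from the record to the query region."""
    return sum(attribute_distance(record[a], allowed)
               for a, allowed in query.items())


def approximate_matches(records, query, max_matches, distance_limit):
    """Return up to max_matches (distance, record) pairs within the limit."""
    scored = [(distance(r, query), i) for i, r in enumerate(records)]
    within = [pair for pair in scored if pair[0] <= distance_limit]
    nearest = heapq.nsmallest(max_matches, within)
    return [(d, records[i]) for d, i in nearest]


records = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
]
query = {"sex": {"female"}, "age": (20, 25), "dx": {"asthma"}}
closest = approximate_matches(records, query, max_matches=2, distance_limit=10)
```

With this assumed metric, the female 30-year-old asthma patient is the nearest record, five years outside the age range.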
![Page 10: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/10.jpg)
Outline
• Records and queries
• Search for matches
• Experimental results
![Page 11: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/11.jpg)
Indexing structure
• Maintain a PATRICIA tree of records
• Group nodes into fixed-size disk blocks
[Figure: tree of six example records, branching on sex (male, female), then age, then diagnosis.]
Example records:
• male, 30, asthma
• female, 30, asthma
• male, 40, flu
• female, 50, flu
• female, 30, ulcer
• female, 30, fracture
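The shape of the example tree can be reproduced with nested dictionaries. This is a deliberately simplified sketch, not the structure from the talk: it builds a plain one-level-per-attribute tree, without the path compression that makes a tree a PATRICIA tree and without the grouping of nodes into disk blocks. The names `ATTRIBUTES` and `build_index` are hypothetical.

```python
ATTRIBUTES = ("sex", "age", "dx")  # branching order used in the example tree


def build_index(records):
    """Build a nested-dict tree: sex -> age -> diagnosis -> list of records."""
    root = {}
    for record in records:
        node = root
        for attribute in ATTRIBUTES[:-1]:
            node = node.setdefault(record[attribute], {})
        # The last attribute maps to the list of records sharing all values.
        node.setdefault(record[ATTRIBUTES[-1]], []).append(record)
    return root


RECORDS = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "fracture"},
]
index = build_index(RECORDS)
```

For the six example records, the `female` branch splits into ages 30 and 50, and the age-30 node holds three diagnoses.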
![Page 12: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/12.jpg)
Search for matches
• Depth-first search for exact matches
• Best-first search for approximate matches
[Figure: the example tree from the Indexing structure slide.]
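Both search strategies can be sketched on a simplified tree. The code below assumes a nested-dict index (sex, then age, then diagnosis) rather than a real PATRICIA tree, and an additive distance in which a number contributes its gap to the range and a category contributes 0 or 1. These choices, and every function name, are illustrative assumptions rather than the authors' implementation.

```python
import heapq
from itertools import count

ATTRIBUTES = ("sex", "age", "dx")


def build_index(records):
    """Nested dicts, one level per attribute; leaves are lists of records."""
    root = {}
    for record in records:
        node = root
        for attribute in ATTRIBUTES[:-1]:
            node = node.setdefault(record[attribute], {})
        node.setdefault(record[ATTRIBUTES[-1]], []).append(record)
    return root


def attribute_distance(value, allowed):
    """Assumed metric: gap to a (low, high) range, or 0/1 for a value list."""
    if isinstance(allowed, tuple):
        low, high = allowed
        return max(low - value, value - high, 0)
    return 0 if value in allowed else 1


def exact_matches(node, query, depth=0):
    """Depth-first search that descends only into branches inside the query."""
    if depth == len(ATTRIBUTES):
        return list(node)
    found = []
    allowed = query[ATTRIBUTES[depth]]
    for value, child in node.items():
        if attribute_distance(value, allowed) == 0:
            found.extend(exact_matches(child, query, depth + 1))
    return found


def best_first_matches(root, query, max_matches, distance_limit):
    """Best-first search: always expand the node with the smallest distance."""
    tie = count()  # unique tie-breaker so the heap never compares nodes
    frontier = [(0, next(tie), 0, root)]
    found = []
    while frontier and len(found) < max_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(ATTRIBUTES):
            found.extend((dist, record) for record in node)
            continue
        allowed = query[ATTRIBUTES[depth]]
        for value, child in node.items():
            child_dist = dist + attribute_distance(value, allowed)
            if child_dist <= distance_limit:  # prune branches beyond the limit
                heapq.heappush(frontier,
                               (child_dist, next(tie), depth + 1, child))
    return found[:max_matches]


RECORDS = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "fracture"},
]
index = build_index(RECORDS)
region = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
exact = exact_matches(index, region)
near = best_first_matches(index, {"sex": {"female"}, "age": (20, 25),
                                  "dx": {"asthma"}}, 2, 10)
```

Because the assumed distance only grows as the search descends, the distance accumulated at an inner node is a lower bound for every record below it, so records leave the priority queue in order of increasing distance and the search can stop as soon as it has enough matches.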
![Page 13: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/13.jpg)
Outline
• Records and queries
• Search for matches
• Experimental results
![Page 14: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/14.jpg)
Performance
Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002:
• Twenty-one attributes
• 1.6 million records
Use of a Pentium computer:
• 2.4 GHz CPU
• 1 Gbyte memory
• 400 MHz bus
![Page 15: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/15.jpg)
Variables
Control variables:
• Number of records
• Memory size
• Query type
Measurements:
• Retrieval time
![Page 16: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/16.jpg)
Small memory
• Number of records: 100 to 1,672,016
• Memory size: 4 MByte
[Figure: log-scale plot of retrieval time (msec, 1 to 100) against number of records (10^2 to 10^6), with curves for range queries, approximate queries, and exact queries, a marker for the available memory, and growth-rate labels lg n, n^0.15, lg n, and n^0.5.]
![Page 17: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/17.jpg)
Large memory
• Number of records: 1,672,016
• Memory size: 64 to 1,024 MByte
[Figure: log-scale plot of retrieval time (msec, 1 to 10,000) against memory size (64 to 1,024 MBytes), with curves for range queries, approximate queries, and exact queries.]
![Page 18: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/18.jpg)
Summary
• Retrieval time grows as a fractional power (about 0.5) of database size
• If we extrapolate this growth rate, retrieval times are reasonable for very large databases
![Page 19: Search for Approximate Matches in Large Databases](https://reader034.vdocuments.site/reader034/viewer/2022052308/568167e4550346895ddd4d19/html5/thumbnails/19.jpg)
Summary
Extrapolated retrieval times, assuming growth proportional to n^0.5:

| Number of records (n) | n^0.5 time (seconds) |
|---|---|
| 1,000,000 | 0.05 |
| 100,000,000 | 0.50 |
| 10,000,000,000 | 5.00 |
| 1,000,000,000,000 | 50.00 |
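The extrapolated times follow from scaling a single baseline by the square root of the growth in database size. The sketch below takes the first row (0.05 seconds for 1,000,000 records) as the baseline and assumes the exponent stays at 0.5 beyond the measured range. The function name and its parameters are hypothetical.

```python
def extrapolated_time(n, base_n=1_000_000, base_seconds=0.05, exponent=0.5):
    """Scale the baseline time by (n / base_n) ** exponent."""
    return base_seconds * (n / base_n) ** exponent


# Database sizes from the summary slide.
sizes = [1_000_000, 100_000_000, 10_000_000_000, 1_000_000_000_000]
times = [extrapolated_time(n) for n in sizes]
```

Each hundredfold increase in the number of records multiplies the estimated time by ten, which reproduces the other three rows.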