Will Data Mining Change the Functions of DBMS?
Jiawei Han
DAIS (Data And Information Systems) Lab
University of Illinois at Urbana-Champaign

Will DM Be Integrated with DB Functions?
DM: Already a functional component of DBMS
  Microsoft SQL Server: Analysis Manager
  IBM DB2: Intelligent Miner
  Oracle: Data Mining Package
But will DM be “intruding” into DBMS, i.e., be integrated with essential DBMS functions?
  Indexing
  Data integration
  Data cleaning
  Query processing

Indexing by Data Mining
Indexing graphs? The number of subgraphs is exponential!
  Chemical informatics/bioinformatics …
  Discriminative frequent graph patterns (SIGMOD’04)
Indexing subsequences?
  Shopping sequences, DNA/protein sequences (SDM’05)
When is discriminative frequent pattern indexing useful?
  Complex objects, big (object) queries
[Figure: sample database of graphs (a), (b), (c) and a query graph]
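
The filter-then-verify idea behind pattern-based indexing can be illustrated with a minimal sketch. It is not the SIGMOD’04 or SDM’05 method: it uses plain substrings of short strings instead of graph patterns, and the feature list is hypothetical (assumed to have been mined beforehand). The names `build_index` and `query` are illustrative only.

```python
from collections import defaultdict

def build_index(db, features):
    """Map each feature (a short substring) to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(db):
        for feature in features:
            if feature in seq:
                index[feature].add(seq_id)
    return index

def query(db, index, features, q):
    """Prune candidates with the index, then verify survivors against the full query."""
    candidates = set(range(len(db)))
    for feature in features:
        if feature in q:
            # Any sequence containing q must also contain every feature of q.
            candidates &= index.get(feature, set())
    # The index only prunes; the final containment test decides the answer.
    answers = sorted(i for i in candidates if q in db[i])
    return answers, len(candidates)

# Toy data: four short DNA-like strings and three hypothetical pre-mined features.
db = ["ACGTAC", "TTGACA", "ACGGAC", "GACGTT"]
features = ["ACG", "CGT", "GAC"]
index = build_index(db, features)
answers, n_candidates = query(db, index, features, "ACGT")
print(answers, n_candidates)
```

On this toy data the index narrows four sequences to two candidates before verification. The larger the query, the more features it contains and the more pruning the index can do, which is one reading of why big (object) queries benefit.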

Data Cleaning by Data Mining
Load messy data into a structured database?
  Inconsistent data: age = “1946”?
  Field misalignments
  Glitches of data: completely messed-up inputs
  Missing/unmatched delimiters: XML, HTML data
  Big fields: BLOB, CLOB, multimedia and text
Data mining
  Data cleaning by distribution/outlier analysis
  Dependency/correlation analysis
  Schema-directed cleaning or schema “discovery”
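
As a rough illustration of cleaning by outlier analysis, the sketch below flags an age of 1946 (most likely a birth year entered in the wrong field) using a modified z-score based on the median and the median absolute deviation. The scaling constant 0.6745 and the cutoff 3.5 are a common convention rather than anything stated in the slides, and `flag_outliers` is a hypothetical helper name.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score (median/MAD based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: more than half the values equal the median.
        return [v for v in values if v != med]
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Toy "age" column with one value that looks like a birth year.
ages = [34, 41, 29, 1946, 57, 38, 45]
suspects = flag_outliers(ages)
print(suspects)
```

Median-based statistics are used here because a single extreme value such as 1946 would inflate a mean and standard deviation enough to hide itself.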

Data Integration by Data Mining
Linking and mining across multiple data relations
  CrossMine (classification across multiple data relations: ICDE’04)
Search across heterogeneous databases
  Object identification/merge, reference reconciliation (Alon’s group)
Mining across heterogeneous DBs
Personalizing data from heterogeneous sources
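
Object identification across sources can be sketched, very naively, as pairing records whose name tokens overlap enough. This is only a stand-in for the reference reconciliation work cited above, which is not described in the slides; the Jaccard measure, the 0.5 threshold, and the sample records are all assumptions made for illustration.

```python
import re

def tokens(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity between two names."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def reconcile(source_a, source_b, threshold=0.5):
    """Pair up records from two sources that likely refer to the same object."""
    return [(a, b) for a in source_a for b in source_b
            if jaccard(a, b) >= threshold]

# Hypothetical records from two differently formatted sources.
source_a = ["Smith, John A.", "Garcia, Maria"]
source_b = ["John Smith", "maria garcia", "John Doe"]
matches = reconcile(source_a, source_b)
print(matches)
```

A realistic system would also use attribute values and links between records; comparing every pair, as done here, scales quadratically and is only workable for small inputs.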

Query Processing by Data Mining
Query plan refinement based on query execution history
Better query planning by investigating additional data statistics
  Current optimizer: key/foreign key, cardinality, # distinct values
  Additional information:
    Strong dependency/correlation
    Histogram, dense vs. sparse regions, etc.
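
Why strong dependency/correlation matters to a planner can be seen in a small sketch. It assumes a simplified optimizer that estimates a conjunctive predicate by multiplying per-column selectivities (an independence assumption), and compares that with the true joint selectivity on a toy table. The table contents and the `selectivity` helper are hypothetical.

```python
def selectivity(rows, predicate):
    """Fraction of rows satisfying the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

# Toy (make, model) table: model almost determines make, so the columns are correlated.
rows = ([("Honda", "Accord")] * 4 + [("Honda", "Civic")] * 1
        + [("Toyota", "Camry")] * 3 + [("Toyota", "Corolla")] * 2)

sel_make = selectivity(rows, lambda r: r[0] == "Honda")
sel_model = selectivity(rows, lambda r: r[1] == "Accord")

# Estimate under the independence assumption: multiply per-column selectivities.
independent_estimate = sel_make * sel_model

# True selectivity of the conjunction, measured on the data itself.
actual = selectivity(rows, lambda r: r[0] == "Honda" and r[1] == "Accord")

print(independent_estimate * len(rows), actual * len(rows))
```

On this table the independence estimate predicts 2 matching rows while 4 actually match. A mined dependency between model and make would let the planner correct that kind of error.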

Conclusions
DBers have been “invading” DM and have made great contributions
It is time to consider that DM may invade DBMS to enhance its functionality
General philosophy
  Invisible data mining
    Google is doing this successfully for page ranking
    Can we do it to enhance DBMS?
  You can do better if you know your data better!