unit-4 enhanced data models for advanced applications

Generalized Model for Active Databases and Oracle Triggers

Triggers are executed when a specified condition occurs during insert/delete/update

Triggers are action that fire automatically based on these conditions

Triggers follow an Event-condition-action (ECA) model • Event: Database modification E.g., insert, delete, update,

• Condition: Any true/false expression Optional: If no condition is specified then condition is

always true• Action: Sequence of SQL statements that will be automatically

executed

When a new employees is added to a department, modify the Total_sal of the Department to include the new employees salary;

CREATE TRIGGER Total_sal1AFTER INSERT ON EmployeeFOR EACH ROWWHEN (NEW.Dno is NOT NULL)

UPDATE DEPARTMENTSET Total_sal = Total_sal + NEW. SalaryWHERE Dno = NEW.Dno;

From the previous example we can find that:• Logically this means that we will CREATE a TRIGGER, let us call the

trigger Total_sal1

This trigger will execute AFTER INSERT ON Employee table (EVENT)

It will do the following FOR EACH ROW

WHEN NEW.Dno is NOT NULL (CONDITION)

The trigger will UPDATE DEPARTMENT (ACTION) By SETting the new Total_sal to be the sum of old Total_sal and NEW. Salary WHERE the Dno matches the NEW.Dno;

CREATE TRIGGER <name>• Creates a trigger if one does not exist

ALTER TRIGGER <name>• Alters a trigger if one does exist

Works in both cases, whether a trigger exists or not

Conditions can be one of the following: AFTER

• Executes after the event BEFORE

• Executes before the event INSTEAD OF

• Executes instead of the event Note that event does not execute in this case INSTEAD OF triggers are used for modifying views

Triggers can be • Row-level

FOR EACH ROW specifies a row-level trigger Executed separately for each affected row NEW and OLD keywords can be used here

• Statement-level Default (when FOR EACH ROW is not specified) Execute once for the SQL statement,

Any true/false condition to control whether a trigger is activated or not• Absence of condition means that the trigger will

always execute for the even (True)• Otherwise, condition is evaluated

before the event for BEFORE trigger after the event for AFTER trigger

Action can be• One SQL statement• A sequence of SQL statements enclosed between a

BEGIN and an END Action specifies the relevant modifications

Design and Implementation Issues for Active Databases

An active database allows users to make the following changes to triggers (rules)• Activate• Deactivate• Drop

An event can be considered in 3 ways: Immediate consideration:

• Part of the same transaction and can be one of the following depending on the situation Before After Instead of

Deferred consideration:• Condition is evaluated at the end of the transaction

Detached consideration:• Condition is evaluated in a separate transaction

Potential Applications for Active Databases Notification

• Automatic notification when certain condition occurs• Enforcing integrity constraints

Triggers are smarter and more powerful than constraints• Maintenance of derived data

Automatically update derived data and avoid anomalies due to redundancy E.g., trigger to update the Total_sal in the earlier example

Triggers in SQL-99 Can alias variables inside the REFERENCING

clause

Trigger examples in SQL-99

Time Representation, Calendars, and Time Dimensions Time is considered ordered sequence of points in some

granularity• Use the term choronon instead of point to describe minimum

granularity

A calendar organizes time into different time units for convenience.• Accommodates various calendars

Gregorian (western), Chinese, Islamic, Etc. Point events

• Single time point event E.g., bank deposit

• Series of point events can form a time series data Duration events

• Associated with specific time period Time period is represented by start time and end time

Transaction time• The time when the information from a certain

transaction becomes valid

Bitemporal database• Databases dealing with two time dimensions

Incorporating Time in Relational Databases Using Tuple Versioning

Add to every tuple • Valid start time • Valid end time

Incorporating Time in Object-Oriented Databases Using Attribute Versioning

A single complex object stores all temporal changes of the object

Time varying attribute• An attribute that changes over time

E.g., age

Non-Time varying attribute• An attribute that does not changes over time

E.g., date of birth

Declarative Language• Language to specify rules

Inference Engine (Deduction Mechanism)• Can deduce new facts by interpreting the rules• Related to logic programming

Prolog language (Prolog => Programming in logic) Uses backward chaining to evaluate

Top-down application of the rules

Specification consists of:• Facts

Similar to relation specification without the necessity of including attribute names

• Rules Similar to relational views (virtual relations that are not

stored)

Predicate has• a name• a fixed number of arguments

Convention: Constants are numeric or character strings Variables start with upper case letters

• E.g., SUPERVISE(Supervisor, Supervisee) States that Supervisor SUPERVISE(s) Supervisee

Rule• Is of the form head :- body

where “:-” is read as “if and only if”• E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)• E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

Query• Involves a predicate symbol followed by some variable

arguments to answer the question• E.g., SUPERIOR(james,Y)?• E.g., SUBORDINATE(james,X)?

(a) Prolog notation (b) Supervisory tree

• Program is built from atomic formulas Literals of the form p(a1, a2, … an) where

p predicate name n is the number of arguments

Built-in predicates are included E.g., <, <=, =, etc.

• A literal is either An atomic formula An atomic formula preceded by not

A formula can have quantifiers• Universal• Existential

In clausal form, a formula must be transformed into another formula with the following characteristics• All variables are universally quantified• Formula is made of a number of clauses where each

clause is made up of literals connected by logical ORs only

• Clauses themselves are connected by logical ANDs only.

Any formula can be converted into a clausal form• A specialized case of clausal form are horn clauses

that can contain no more than one positive literal Datalog program are made up of horn clauses

There are two main alternatives for interpreting rules:• Proof-theoretic• Model-theoretic

Proof-theoretic• Facts and rules are axioms • Ground axioms contain no variables• Rules are deductive axioms• Deductive axioms can be used to construct new facts

from existing facts• This process is known as theorem proving

Model-theoretic• Given a finite or infinite domain of constant values, we

assign the predicate every combination of values as arguments

• If this is done for every predicated, it is called interpretation

Model• An interpretation for a specific set of rules

Model-theoretic proofs • Whenever a particular substitution to the variables in

the rules is applied, if all the predicated are true under the interpretation, the predicate at the head of the rule must also be true

Minimal model• Cannot change any fact from true to false and still get a

model for these rules

A program is safe if it generates a finite set of facts Two main methods of defining truth values• Fact-defined predicates (or relations)

Listing all combination of values that make a predicate true• Rule-defined predicates (or views)

Head (LHS) of one or more Datalog rules, for example:

Many operations of relational algebra can be defined in the for of Datalog rules that defined the result of applying these operations on database relations (fact predicates)• Relational queries and views can be easily specified in

Datalog

Define an inference mechanism based on relational database query processing concepts

• Example: predicate dependencies

42

Knowledge databases

43

KDD: A Definition

106-1012 bytes:we never see the wholedata set, so will put it in the memory of computers

What is the knowledge?How to represent and use it?

Then run Data Mining algorithms

KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

44

We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.

Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.

Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.

Data, Information, Knowledge

Knowledge can be considered data at a high level of abstraction and generalization.

45

From Data to KnowledgeFrom Data to Knowledge From Data to KnowledgeFrom Data to Knowledge

...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS

12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA

15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA

16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 　 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, 　 VIRUS...

Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes

Numerical attribute categorical attribute missing values class labels

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15THEN Prediction = VIRUS [87,5%]

[confidence, predictive accuracy]

46

People gathered and stored so much data because they think some valuable assetsare implicitly coded within it.

Raw data is rarely of direct benefit.Its true value depends on the ability to extract information useful for decision support.

Impractical Manual Data Analysis

knowledge base

inference engine

How to acquire knowledge for knowledge-based systems remains as the main difficult and crucial problem. ?

Tradition: via knowledge engineersNew trend: via automatic programs

Data Rich Knowledge Poor

47

Volume

Value

EDP

MIS

DSS

Benefits of Knowledge Discovery

Generate

Rapid Response

Disseminate

EDP: Electronic Data ProcessingMIS: Management Information Systems

DSS: Decision Support Systems

48

Lecture 1: Overview of KDD

1. What is KDD and Why ?

2. The KDD Process

4. Data Mining Methods

3. KDD Applications

5. Challenges for KDD

49

The KDD processThe non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data - Fayyad, Platetsky-Shapiro, Smyth (1996)

non-trivial process

Multiple process

valid Justified patterns/models

novel Previously unknown

useful Can be used

understandableby human and machine

50

The Knowledge Discovery The Knowledge Discovery ProcessProcess The Knowledge Discovery The Knowledge Discovery ProcessProcess

KDD is inherentlyinteractive and

iterative

a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations

1

2

3

4

5

Understand the domain and Define problems

Collect and Preprocess Data

Data MiningExtract Patterns/Models

Interpret and Evaluate discovered knowledge

Putting the results in practical use

51

The KDD Process

Data organized by function

Create/selecttarget database

Select samplingtechnique and

sample data

Supply missing values

Normalizevalues

Select DM task (s)

Transform todifferent

representation

Eliminatenoisy data

Transformvalues

Select DM method (s)

Create derivedattributes

Extract knowledge

Find importantattributes &

value ranges

Test knowledge

Refine knowledge

Query & report generationAggregation & sequencesAdvanced methods

Data warehousing 1

2

3 4

5

52

Main Contributing Areas of KDDMain Contributing Areas of KDD Main Contributing Areas of KDDMain Contributing Areas of KDD

DatabasesStore, access, search, update data (deduction)

StatisticsInfer info from data (deduction & induction, mainly numeric data)

Machine LearningComputer algorithms that

improve automatically through experience (mainly induction,

symbolic data)

KDD

[data warehouses:integrated data]

[OLAP: On-Line Analytical Processing]

53

Potential Potential ApplicationsApplications Potential Potential ApplicationsApplications

Business information

- Marketing and sales data analysis- Investment analysis- Loan approval- Fraud detection- etc.

Manufacturing information

- Controlling and scheduling- Network management- Experiment result analysis- etc.

Scientific information- Sky survey cataloging- Biosequence Databases- Geosciences: Quakefinder- etc.

Personal information

54

KDD: Opportunity and KDD: Opportunity and Challenges Challenges

KDD: Opportunity and KDD: Opportunity and Challenges Challenges

Data RichKnowledge Poor(the resource)

Enabling Technology(Interactive MIS, OLAP, parallel computing, Web, etc.)

Competitive Pressure

Data Mining TechnologyMature

KDD

55

KDD workshops: since 1989.Inter. Conferences: KDD (USA), first in 1995;PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997.ML’04/PKDD’04 (in Pisa, Italy)

Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.

JAPAN: FGCS Project (logic programming and reasoning).

“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Standford Univ.

KDD: A New and Fast Growing Area

56

Primary Tasks of Data Primary Tasks of Data MiningMining Primary Tasks of Data Primary Tasks of Data MiningMining

Classification

Deviation andchange detection

?

Summarization

Clustering

Dependency

Modeling

Regression

finding the descriptionof several predefined classes and classify a data item into one of them.

maps a data item to a real-valued prediction variable.

identifying a finite set of categories or clusters to describe

the data.

finding a compact description

for a subset of data

finding a model which describes

significant dependencies between variables.

discovering the most significant changes in the data

57

Data General patterns

Examples

Cancerous Cell Data

Classification“What factors determine cancerous cells?”

Classification Algorithm

MiningAlgorithm

- Rule Induction- Decision tree- Neural Network

58

If Color = light and Tails = 1 and Nuclei = 2Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2Then Cancerous Cell (certainty = 87%)

Classification: Rule Induction“What factors determine a cell is cancerous?”

59

Color = dark

Color = light

healthy

Classification: Decision Trees

#nuclei=1

#nuclei=2

#nuclei=1

#nuclei=2

#tails=1 #tails=2

cancerous

cancerous healthy

healthy

#tails=1 #tails=2

cancerous

60

Healthy

Cancerous

“What factors determine a cell is cancerous?”

Classification: Neural Networks

Color = dark

# nuclei = 1

…

# tails = 2

End of Unit-iv

unit-4 enhanced data models for advanced applications

Documents