2. data warehouse 3. data mining - kku web hosting base 2. data warehouse 3. data mining ... unit of...


Upload: lamtuyen

Post on 28-Apr-2018



1. Data Base: for business operation
2. Data Warehouse: for business analysis
3. Data Mining: for business discovery

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making

OLTP vs. OLAP

|  | OLTP | OLAP |
|---|---|---|
| users | clerk, IT professional | knowledge worker |
| function | day to day operations | decision support |
| DB design | application-oriented | subject-oriented |
| data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated |
| usage | repetitive | ad-hoc |
| access | read/write; index/hash on prim. key | lots of scans |
| unit of work | short, simple transaction | complex query |
| # records accessed | tens | millions |
| # users | thousands | hundreds |
| DB size | 100MB-GB | 100GB-TB |
| metric | transaction throughput | query throughput, response time |
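The "access" and "unit of work" rows can be made concrete with a small sketch. It uses Python's built-in `sqlite3` module; the `orders` table, its columns, and its rows are hypothetical and were invented for this illustration. A real warehouse would normally live in a separate system from the operational database.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, year, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "North", 2016, 100.0),
        (2, "North", 2017, 150.0),
        (3, "South", 2016, 80.0),
        (4, "South", 2017, 120.0),
    ],
)

# OLTP-style unit of work: a short read/write transaction that touches
# one record located through the primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 10 WHERE order_id = ?", (3,))

# OLAP-style unit of work: a read-only query that scans every record
# and returns a summary per region.
olap_rows = conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()

print(olap_rows)
```

The update touches a single row through its key, while the summary query has to read the whole table. That difference is why the two workloads are measured with different metrics.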


Knowledge Discovery

Discovering useful patterns

What is a Data Warehouse?

Common definitions of a Data Warehouse

• A decision support database that is maintained separately from the organization’s operational database

– Supports information processing by providing a solid platform of consolidated, historical data for analysis.

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)

Data Warehouse Implementation Road Map

Extract and transform

ETL Importance

Extraction

Ensure data is
1. Relevant
2. Useful
3. Quality
4. Accurate
5. Accessible

Transformation

1. Anomalies exist in operational data - inconsistent development approaches
2. Eliminates anomalies
• Cleans
• Standardizes
• Presents subject oriented data

Online Analytical Processing: OLAP

• extends the capabilities of query and reporting
• enables users to view the data in complex relationships (multi-dimensions)
• provides drill down and roll up
• is able to slice and dice
• What-if analysis
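The transformation step (cleaning and standardizing data that arrives with inconsistent conventions) can be sketched as follows. This is a minimal illustration, not a real ETL tool: the two source extracts, the field names, the gender code mapping, and the date formats are all assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical extracts from two operational systems that were developed
# with inconsistent conventions (the "anomalies" on the slide).
source_a = [{"cust": " Alice ", "gender": "F", "joined": "2017-03-01"}]
source_b = [{"cust": "BOB", "gender": "male", "joined": "15/04/2017"}]

GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def standardize_date(text):
    """Return an ISO date string, trying each known source format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(record):
    """Clean and standardize one extracted record."""
    return {
        "customer": record["cust"].strip().title(),
        "gender": GENDER_CODES[record["gender"].strip().lower()],
        "joined": standardize_date(record["joined"]),
    }


warehouse_rows = [transform(r) for r in source_a + source_b]
print(warehouse_rows)
```

After the transform, both sources share one name format, one gender code set, and one date format, so they can be loaded into the same subject-oriented table.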


Data Warehouse Architecture

Figure 1. Basic Architecture
Figure 2. Architecture with a Staging Area

Business Intelligence

Data Warehouse
Data Transformation Services

Fact constellations

Star Schema

Snowflake
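The slides only name the three schema types, so the following is a hedged sketch of the simplest one, a star schema: one fact table surrounded by dimension tables. It uses Python's built-in `sqlite3`; the table names (`fact_sales`, `dim_time`, `dim_item`), columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member of each dimension.
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
    -- Fact table: foreign keys to each dimension plus a numeric measure.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        item_id INTEGER REFERENCES dim_item(item_id),
        units_sold INTEGER
    );
    INSERT INTO dim_time VALUES (1, 2017, 'Q1'), (2, 2017, 'Q2');
    INSERT INTO dim_item VALUES (1, 'Laptop', 'Computer'), (2, 'Phone', 'Mobile');
    INSERT INTO fact_sales VALUES (1, 1, 5), (1, 2, 7), (2, 1, 3), (2, 2, 4);
    """
)

# A typical star join: the fact table joined to its dimensions,
# with the measure aggregated by dimension attributes.
star_rows = conn.execute(
    """
    SELECT t.quarter, i.category, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    JOIN dim_item AS i ON i.item_id = f.item_id
    GROUP BY t.quarter, i.category
    ORDER BY t.quarter, i.category
    """
).fetchall()
print(star_rows)
```

A snowflake schema would normalize a dimension further, for example by moving `category` into its own table referenced from `dim_item`. A fact constellation would add a second fact table that shares these dimension tables.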


Typical OLAP Operations

• Roll up (drill-up): summarize data

– by climbing up hierarchy or by dimension reduction

• Drill down (roll down): reverse of roll-up

– from higher level summary to lower level summary or detailed data, or introducing new dimensions

• Slice and dice: project and select

• Pivot (rotate):

– reorient the cube, visualization, 3D to series of 2D planes

• Other operations

– drill across: involving (across) more than one fact table

– drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
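The core operations above can be sketched on a tiny in-memory cube. This is an illustrative sketch in plain Python, not how an OLAP server is implemented; the cube contents, the dimension names in `DIMS`, and the helper names (`roll_up`, `dice`, `slice_`, `pivot`) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical cube cells: (year, quarter, city, item) -> units sold.
cube = {
    (2017, "Q1", "Bangkok", "Laptop"): 5,
    (2017, "Q1", "Khon Kaen", "Laptop"): 2,
    (2017, "Q1", "Bangkok", "Phone"): 7,
    (2017, "Q2", "Bangkok", "Laptop"): 3,
    (2017, "Q2", "Khon Kaen", "Phone"): 4,
}
DIMS = ("year", "quarter", "city", "item")


def roll_up(cells, keep):
    """Summarize by dimension reduction: keep only the named dimensions."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def dice(cells, **conditions):
    """Select the cells whose dimension values are in the allowed sets."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    }


def slice_(cells, dim, value):
    """Slice: a dice with a single value on a single dimension."""
    return dice(cells, **{dim: {value}})


def pivot(table):
    """Rotate a two-dimensional table by swapping its two axes."""
    return {(b, a): v for (a, b), v in table.items()}


by_quarter_item = roll_up(cube, keep=("quarter", "item"))
by_quarter = roll_up(cube, keep=("quarter",))
q1_only = slice_(cube, "quarter", "Q1")
bangkok_laptops = dice(cube, city={"Bangkok"}, item={"Laptop"})
by_item_quarter = pivot(by_quarter_item)
```

Going from `by_quarter` back to `by_quarter_item` (reintroducing the item dimension) corresponds to a drill down. Drill across and drill through need more than one fact table or access to the underlying relational tables, so they are not shown here.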



Ten Common Mistakes

1. Starting with wrong sponsors
   a. Data Warehousing Manager
   b. Executive sponsor with great deal of money
   c. Project “driver”
      a. Has already earned the respect of the other executives
      b. Has healthy skepticism about technology
      c. Is decisive but flexible

2. Setting unrealistic expectations that can’t be met
   a. Data warehousing has two phases:
      Selling Phase: persuade people
      Struggle Phase: meet the expectation
   b. Frustrates executives at the moment of truth

3. Promoting wrong value of their Data Warehouse
   a. engaging in politically-naïve behavior
      a. help managers make better decisions
      b. lose potential supporters

4. Loading Data Warehouse with unnecessary information
   a. sends a list of tables and data elements to the end user along with request
   b. gets back long lists of unnecessary information
   c. slows responsiveness and increases the data warehouse storage requirements

5. Data Warehouse Database Design vs. Transactional Database Design
   a. Transaction processing:
      - a programmer develops a query that will be used many times
      - usually contains only the basic data
   b. Data warehousing:
      - an end-user develops the query and may use it only one time
      - expects to find aggregates (sums, averages, trends, time-series information) already calculated for them and ready for immediate display

6. Data Warehousing Manager: Technology-oriented rather than User-oriented
   a. user-hostile project manager puts entire project in danger of being scrapped
   b. Data Warehousing is a service business and not a storage business.
   c. Don’t make clients angry!!!

7. Too much emphasis on traditional internal record-oriented data
   a. senior executives see data warehouses as irrelevant
   b. consider including images, graphics, audio or video, etc.

8. Delivering data with overlapping and confusing definitions
   a. Finance manager: sales means net of revenue less returns
   b. Distribution people: sales means what needs to be delivered
   c. Sales person: sales means amount committed by clients

9. Performance, Capacity, and Scalability
   a. within 4 months, purchase at least one additional processor equal to or larger than the current computer.
   b. budget for additional hardware
   c. budget for unforeseen difficulties
   d. network overloads are very common

10. Believing that once the Data Warehouse is up and running, your problems are finished
   a. data warehousing project team needs to maintain high energy over long periods of time.
   b. Data warehousing is a journey, not a destination

Data Mining

The CRISP-DM process cycle

The data mining process cycle consists of six main steps:

1. Understand the business, in order to identify business opportunities or the problems the business is facing.
2. Understand the data and the data sources, in order to define the scope of the data that will be analyzed to solve the problem.
3. Transform the raw data into a form that can actually be used in the business.
4. Apply data mining techniques to the data to find all relationships and patterns.
5. Evaluate the performance of the model: measure how well the chosen data mining technique performs, based on its results, which can be checked in several ways.
6. Put the evaluated model into actual practice in the business.

Data Mining Process

1. Problem formulation
2. Data Selection
3. Data Cleaning
4. Data Transformation
5. Data Mining
6. Result evaluation and Visualization


Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:

• Intrinsic DQ: Accuracy, objectivity, believability, and reputation.
• Accessibility DQ: Accessibility and access security.
• Contextual DQ: Relevancy, value added, timeliness, completeness, amount of data.
• Representation DQ: Interpretability, ease of understanding, concise representation, consistent representation.

Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
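Two of the tasks above, normalization and discretization, can be sketched in a few lines of Python. The sample values are the five numbers used in the slides' data transformation figure. The slides do not name a normalization method, so decimal scaling is an assumption here, chosen because it reproduces the figure's output. Equal-width binning is likewise just one possible discretization, and both function names are made up for this example.

```python
def decimal_scaling(values):
    """Normalize by dividing by the smallest power of 10 that makes
    every absolute value strictly less than 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width,
    returning a bin index (0 .. n_bins - 1) for each value."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


sample = [-5, 32, 100, 59, 45]
scaled = decimal_scaling(sample)
bins = equal_width_bins(sample, 3)
print(scaled)
print(bins)
```

With a maximum absolute value of 100, decimal scaling divides by 1000, so every scaled value lies strictly between -1 and 1.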

Major Tasks in Data Preprocessing

[Figure: forms of data preprocessing]
• Data cleaning
• Data integration
• Data transformation: -5, 32, 100, 59, 45 → -0.005, 0.032, 0.100, 0.059, 0.045
• Data reduction: attributes A1, A2, A3, ..., A226 reduced to A1, A2, ..., A105; transactions T1, T2, ..., T2000 reduced to T1, T2, ..., T459

Data cleaning tasks
o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct inconsistent data
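The cleaning tasks can be sketched with Python's standard `statistics` module. The slides do not prescribe specific techniques, so each choice here is an assumption: missing values are filled with the mean, outliers are flagged with the 1.5 × IQR rule, and noise is smoothed by bin means. The input numbers are hypothetical.

```python
from statistics import mean, median


def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    spread = q3 - q1
    return [v for v in values if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed


filled = fill_missing([20, None, 22, 24])
outliers = iqr_outliers([20, 22, 24, 21, 23, 95])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
```

Correcting inconsistent data usually depends on domain rules (for example, reconciling codes across sources), so it is not covered by this sketch.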

Data Mining Strategies

• Predictive or Supervised Modeling: Classification, Prediction, Estimation/Regression
• Descriptive or Unsupervised Modeling: Associations, Clustering

Supervised

Goal: find the pattern of people who cheat on their taxes.

[Figure: supervised classification workflow]
• Training dataset (columns: ID, Refund, ..., Income, Cheat; Cheat is known as Y or N) → Learning Classifier → Model
• Testing dataset (columns: ID, Refund, ..., Income, Cheat, Predicted; Predicted starts as "??") → Model → Predicted Class (Yes/No)
• New Case (columns: ID, Refund, ..., Income, Predicted; Predicted starts as "??") → Model → Predicted Class
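The workflow in the figure (learn a model from the training dataset, check it on the testing dataset, then apply it to a new case) can be sketched as follows. The records are hypothetical and not real tax data. The "learner" is a deliberately tiny rule learner invented for this illustration, not a real classification algorithm such as a decision tree, and it assumes the training data contains at least one cheating record with no refund.

```python
# Hypothetical records: (refund, income in thousands, cheat). Not real data.
training = [
    ("Y", 125, "N"), ("N", 100, "N"), ("N", 70, "N"), ("Y", 120, "N"),
    ("N", 95, "Y"), ("N", 60, "N"), ("N", 85, "Y"), ("N", 90, "Y"),
]
testing = [("N", 88, "Y"), ("Y", 110, "N"), ("N", 65, "N")]


def learn(records):
    """Learn a very small model: the income range observed for cheaters
    who received no refund in the training data."""
    cheat_incomes = [inc for refund, inc, cheat in records
                     if cheat == "Y" and refund == "N"]
    return {"low": min(cheat_incomes), "high": max(cheat_incomes)}


def predict(model, refund, income):
    """Apply the learned model to one case and return the predicted class."""
    in_range = model["low"] <= income <= model["high"]
    return "Y" if refund == "N" and in_range else "N"


# Training dataset -> learning -> model
model = learn(training)

# Testing dataset -> model -> predicted class, compared with the known class
predicted = [predict(model, refund, income) for refund, income, _ in testing]
accuracy = sum(p == actual for p, (_, _, actual) in zip(predicted, testing)) / len(testing)

# New case -> model -> predicted class
new_case = predict(model, "N", 92)
```

Keeping the testing records out of `learn` is the important design point: accuracy measured on data the model has already seen would overstate how well it predicts new cases.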