![Page 1: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/1.jpg)
Technologies of the future
S. Sudarshan
Dept. of Computer Science & Engg.
IIT Bombay
![Page 2: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/2.jpg)
Where is the IT industry heading?
• Internet technologies
– E-Commerce
– Web databases, XML, etc.
• Data Warehousing
• Data mining
![Page 3: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/3.jpg)
What is common amongst them?
• Data intensive applications
![Page 4: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/4.jpg)
Specific Features
• E-Commerce - guaranteed security of information
• Web applications - heterogeneous sources of data
![Page 5: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/5.jpg)
Specific features
• Data warehouses - data analysis
• Data mining - identify unknown patterns
Massive data
![Page 6: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/6.jpg)
What should a database system provide?
• storage and retrieval of data
• a user interface
– querying interface
– database administration
– reporting interface
• protection of data against failures and malicious access
![Page 7: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/7.jpg)
More database system features
• data consistency and integrity
• efficient execution of tasks
![Page 8: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/8.jpg)
Components of a traditional database system
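[Figure: components of a traditional database system: User Interface; Query Processor and Query Optimizer; Storage Manager, Buffer Manager, Recovery, and Transaction Manager; Data]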
![Page 9: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/9.jpg)
What is Query Optimization?
• Select candidate from Parties, Participants where party_name = ‘BJP’ and Parties.candidate = Participants.candidate
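[Figure: Query Evaluation Plan for this query, combining a projection on candidate, the join Parties.candidate = Participants.candidate, and the selection party_name = ‘BJP’ over the relations Parties and Participants]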
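The query on this slide can be tried against a toy database. What follows is a minimal sketch using Python's built-in sqlite3 module. The table layouts, the constituency column, and the sample rows are assumptions made for illustration, since the slide does not give a schema. The selected column is written as Parties.candidate because both tables have a candidate column.

```python
import sqlite3

def bjp_candidates():
    """Run the slide's query on a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Assumed layouts; the slide only names the tables and two columns.
        CREATE TABLE Parties (party_name TEXT, candidate TEXT);
        CREATE TABLE Participants (candidate TEXT, constituency TEXT);
        INSERT INTO Parties VALUES ('BJP', 'A'), ('BJP', 'B'), ('INC', 'C');
        INSERT INTO Participants VALUES ('A', 'X'), ('C', 'Y');
    """)
    rows = conn.execute("""
        SELECT Parties.candidate
        FROM Parties, Participants
        WHERE party_name = 'BJP'
          AND Parties.candidate = Participants.candidate
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(bjp_candidates())
```

The database system, not the user, decides in which order the selection and the join are evaluated; that choice is what the evaluation plan records.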
![Page 10: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/10.jpg)
Query Optimization
• Alternative Plans
• Optimal Plan
– All possible alternatives
• Transformations
• Heuristics
– Selects before joins
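The "selects before joins" heuristic can be illustrated by evaluating the same query in two ways. This is a rough sketch, assuming party_name is a column of Parties and using a naive nested-loop join; the sample rows and the function names are hypothetical. Both plans return the same answer, but the second compares fewer pairs of rows.

```python
def join_then_select(parties, participants):
    """Plan 1: join every Parties row with Participants, then filter on party_name."""
    pairs_compared = len(parties) * len(participants)
    joined = [p for p in parties for q in participants
              if p["candidate"] == q["candidate"]]
    result = [p["candidate"] for p in joined if p["party_name"] == "BJP"]
    return result, pairs_compared

def select_then_join(parties, participants):
    """Plan 2: filter Parties on party_name first, then join the smaller input."""
    selected = [p for p in parties if p["party_name"] == "BJP"]
    pairs_compared = len(selected) * len(participants)
    result = [p["candidate"] for p in selected for q in participants
              if p["candidate"] == q["candidate"]]
    return result, pairs_compared

# Hypothetical sample data
PARTIES = [
    {"party_name": "BJP", "candidate": "A"},
    {"party_name": "BJP", "candidate": "B"},
    {"party_name": "INC", "candidate": "C"},
    {"party_name": "INC", "candidate": "D"},
]
PARTICIPANTS = [{"candidate": "A"}, {"candidate": "C"}, {"candidate": "D"}]

print(join_then_select(PARTIES, PARTICIPANTS))
print(select_then_join(PARTIES, PARTICIPANTS))
```

Counting compared pairs is only a stand-in for a real cost model, which would also consider indexes, disk access, and join algorithms.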
![Page 11: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/11.jpg)
Optimizers
• System R
– Join order selection: find best join order
– A1 ⋈ A2 ⋈ A3 ⋈ .. ⋈ An
– Left deep join trees
• Volcano Extensible Query Optimizer Generator
– Bushy trees
[Figure: join tree over relations Ai and Ak]
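To see why the shape of the join tree matters, it helps to count the alternatives. The sketch below uses the standard counting formulas (they are not given on the slide): n! left-deep orders for n relations, and (2(n-1))!/(n-1)! join orders when bushy trees are also allowed. The function names are made up for this example.

```python
from math import factorial

def left_deep_orders(n):
    """Number of left-deep join orders for n relations: n!."""
    return factorial(n)

def all_join_orders(n):
    """Number of join orders over n relations when bushy trees are allowed:
    (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 10):
    print(n, left_deep_orders(n), all_join_orders(n))
```

Restricting the search to left-deep trees shrinks the space considerably, which is one reason an optimizer may accept a possibly worse plan in exchange for a faster search.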
![Page 12: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/12.jpg)
Advances in Query Optimization
• Multi-Query Optimization
– Finding common sub-expressions
• Approximate query answering
![Page 13: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/13.jpg)
Caching of Query Results
• Store results of earlier queries
• Motivation
– speed up access to remote data
• also reduces monetary cost if access is charged for
– interactive querying often results in related queries
• results of one query can speed up processing of another
– caching can be at the client side, in middleware, or even in the database server itself
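The simplest form of this idea is a client-side cache keyed by the query text. The sketch below is an illustration only: the class name, the fetch callback, and the remote_calls counter are hypothetical, and it reuses a result only when exactly the same query is repeated. Answering a related query from a cached result, as the slide suggests, needs more machinery, and so does invalidating entries when the underlying data changes.

```python
class QueryResultCache:
    """Stores results of earlier queries so repeats avoid a remote call."""

    def __init__(self, fetch):
        self._fetch = fetch      # function that runs a query remotely (assumed)
        self._results = {}       # query text -> cached result
        self.remote_calls = 0    # how many times the remote source was used

    def query(self, text):
        key = text.strip()
        if key not in self._results:
            self.remote_calls += 1
            self._results[key] = self._fetch(text)
        return self._results[key]


def fake_remote(text):
    # Stand-in for a slow or paid remote data source.
    return [text.upper()]

cache = QueryResultCache(fake_remote)
cache.query("select 1")
cache.query("select 1")
print(cache.remote_calls)
```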
![Page 14: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/14.jpg)
What is Transaction Processing?
• A transaction is a unit of program execution that accesses and possibly updates various data items
• Atomicity
• Consistency
• Isolation
• Durability
• Concurrency Control (Locking)
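Atomicity can be demonstrated with a funds transfer: either both updates happen or neither does. This is a minimal sketch using Python's sqlite3 module, where a connection used as a context manager commits on success and rolls back if an exception is raised. The accounts table and the helper names are assumptions made for the example.

```python
import sqlite3

def make_bank():
    """Create a small in-memory database with two accounts (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
    conn.commit()
    return conn

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

conn = make_bank()
print(transfer(conn, "A", "B", 30), balances(conn))
print(transfer(conn, "A", "B", 500), balances(conn))
```

In the failed transfer the debit has already been executed when the check fails, so the unchanged balances afterwards show that the partial update was undone.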
![Page 15: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/15.jpg)
What is OLTP?
• Traditional RDBMS are used for OLTP
• On-Line Transaction Processing
– used for daily processing
– detailed, up to date data
– read/update a few records
– isolation, recovery and integrity are critical
![Page 16: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/16.jpg)
What is OLAP?
• OLAP is used for decision support
• On-Line Analytical Processing
– Summarized historical data
– mainly read-only operations
– used in data warehouses
![Page 17: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/17.jpg)
Data, Data everywhere, yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
![Page 18: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/18.jpg)
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
![Page 19: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/19.jpg)
Why Data Warehousing?
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?
![Page 20: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/20.jpg)
Decision Support
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements
![Page 21: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/21.jpg)
What are the users saying...
• Data should be integrated across the enterprise
• Summary data has real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
![Page 22: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/22.jpg)
Data Warehousing -- It is a process
• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database
![Page 23: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/23.jpg)
OLTP vs Data Warehouse
• OLTP
– Application Oriented
– Used to run business
– Clerical User
– Detailed data
– Current, up to date
– Isolated Data
– Repetitive access by small transactions
– Read/Update access
• Warehouse (DSS)
– Subject Oriented
– Used to analyze business
– Manager/Analyst
– Summarized and refined
– Snapshot data
– Integrated Data
– Ad-hoc access using large queries
– Mostly read access (batch update)
![Page 24: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/24.jpg)
Data Warehouse Architecture
[Figure: data warehouse architecture with the components Relational Databases, Legacy Data, Purchased Data; Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]
![Page 25: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/25.jpg)
Querying Data Warehouses
• SQL Extensions
• Multidimensional modeling of data
– OLAP
![Page 26: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/26.jpg)
SQL Extensions
• Extended family of aggregate functions
– rank (top 10 customers)
– percentile (top 30% of customers)
– median, mode
– Object Relational Systems allow addition of new aggregate functions
• Reporting features
– running total, cumulative totals
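What these extensions compute can be shown outside SQL. The following is a small Python sketch of ranking (top n customers), median, and running total; the customer names and figures are invented, and the function names are illustrative rather than taken from any SQL dialect.

```python
from itertools import accumulate
from statistics import median

def top_n(sales_by_customer, n):
    """Rank customers by sales and keep the n highest."""
    ranked = sorted(sales_by_customer, key=sales_by_customer.get, reverse=True)
    return ranked[:n]

def running_total(amounts):
    """Cumulative sum of the amounts, in the order given."""
    return list(accumulate(amounts))

# Hypothetical sales figures
SALES = {"Asha": 500, "Ravi": 300, "Meena": 800, "John": 300}

print(top_n(SALES, 2))
print(median(SALES.values()))
print(running_total(list(SALES.values())))
```

Expressing such computations in plain SQL without these extensions is awkward, which is why they are provided as built-in aggregates and reporting features.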
![Page 27: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/27.jpg)
OLAP
• Nature of OLAP Analysis
– Aggregation -- (total sales, percent-to-total)
– Comparison -- Budget vs. Expenses
– Ranking -- Top 10, quartile analysis
– Access to detailed and aggregate data
– Complex criteria specification
– Visualization
– Need interactive response to aggregate queries
![Page 28: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/28.jpg)
Multi-dimensional Data
• Measure - sales (actual, plan, variance)
• Dimensions: Product, Region, Time
• Hierarchical summarization paths
– Product: Industry -> Category -> Product
– Region: Country -> Region -> City -> Office
– Time: Year -> Quarter -> Month / Week -> Day
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N), and Month (1 to 7)]
![Page 29: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/29.jpg)
Conceptual Model for OLAP
• Numeric measures to be analyzed
– e.g. Sales (Rs), sales (volume), budget, revenue, inventory
• Dimensions
– other attributes of data, define the space
– e.g., store, product, date-of-sale
– hierarchies on dimensions
• e.g. branch -> city -> state
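A hierarchy on a dimension lets a measure be summarized level by level (a "roll-up"). The sketch below rolls sales up along branch -> city -> state. The branch names, the mappings, and the sales figures are hypothetical, and roll_up is just an illustrative helper.

```python
from collections import defaultdict

# Hypothetical hierarchy: branch -> city -> state
BRANCH_TO_CITY = {"Powai": "Mumbai", "Andheri": "Mumbai",
                  "Kothrud": "Pune", "Indiranagar": "Bengaluru"}
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra",
                 "Bengaluru": "Karnataka"}

def roll_up(totals, parent_of):
    """Aggregate a measure from one hierarchy level to the next level up."""
    rolled = defaultdict(int)
    for member, value in totals.items():
        rolled[parent_of[member]] += value
    return dict(rolled)

# Hypothetical measure: sales per branch
BRANCH_SALES = {"Powai": 10, "Andheri": 20, "Kothrud": 5, "Indiranagar": 7}

city_sales = roll_up(BRANCH_SALES, BRANCH_TO_CITY)
state_sales = roll_up(city_sales, CITY_TO_STATE)
print(city_sales)
print(state_sales)
```

Going in the other direction, from state totals back to the branch figures, is usually called drill-down and requires the detailed data to be kept.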
![Page 30: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/30.jpg)
Strengths of OLAP
• It is a powerful visualization tool
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
![Page 31: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/31.jpg)
Data Mining
• Decision making process
• Extract unknown information
• More than just analysis of data
![Page 32: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/32.jpg)
Why Data Mining
• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
– Identify likely responders to sales promotions
• Fraud detection
– Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
• Customer relationship management:
– Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information
![Page 33: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/33.jpg)
Data mining
• Process of semi-automatically analyzing large databases to find interesting and useful patterns
• Overlaps with machine learning, statistics, artificial intelligence and databases but
– more scalable in number of features and instances
– more automated to handle heterogeneous data
![Page 34: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/34.jpg)
Some basic operations
• Predictive:
– Regression
– Classification
• Descriptive:
– Clustering / similarity matching
– Association rules and variants
– Deviation detection
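As one concrete case among the descriptive operations, an association rule such as "bread => milk" is usually judged by its support and confidence. The sketch below computes both using the standard definitions; the market-basket data is invented and the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)

# Hypothetical market-basket data
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(BASKETS, {"bread", "milk"}))
print(confidence(BASKETS, {"bread"}, {"milk"}))
```

A mining tool would search over many candidate rules and report those above chosen support and confidence thresholds; this sketch only scores a single rule.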
![Page 35: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/35.jpg)
Application Areas
Industry - Application
• Finance - Credit card analysis
• Insurance - Claims, fraud analysis
• Telecommunication - Call record analysis
• Transport - Logistics management
• Consumer goods - Promotion analysis
• Data service providers - Value added data
• Utilities - Power usage analysis
![Page 36: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/36.jpg)
Data Mining in Use
• The US Government uses Data Mining to track fraud
• A Supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross Selling
• Target Marketing
• Holding on to Good Customers
• Weeding out Bad Customers
![Page 37: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/37.jpg)
Why Now?
• Data is being produced
• Data is being warehoused
• The computing power is available
• The computing power is affordable
• The competitive pressures are strong
• Commercial products are available
![Page 38: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/38.jpg)
Data Mining works with Warehouse Data
• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
![Page 39: Technologies of the future S. Sudarshan Dept. of Computer Science & Engg. IIT Bombay](https://reader035.vdocuments.site/reader035/viewer/2022070415/5697bfc51a28abf838ca6b81/html5/thumbnails/39.jpg)
Mining market
• Around 20 to 30 mining tool vendors
• Major players:
– Clementine,
– IBM’s Intelligent Miner,
– SGI’s MineSet,
– SAS’s Enterprise Miner.
• All offer pretty much the same set of tools
• Many embedded products: fraud detection, electronic commerce applications