mike carey information systems group computer science department uc irvine

Database Systems:

A Vertical Slice of Computer Science … or …

It’s All About the Data!

Mike Carey, Information Systems Group, Computer Science Department, UC Irvine

©2003 BEA Systems, Inc.

Wait … Who Is This Guy?

Carnegie-Mellon University, 1975-80: B.S. and M.S. Student, EE/ECE

UC Berkeley, 1980-83: Ph.D. Student, CS

University of Wisconsin, 1983-95: Assistant/Associate/Full Professor, CS

IBM, 1995-2000: Industrial Researcher & Software R&D Manager

Propel Software, 2000-01: Startup Company Fellow/CTO/VP of Software

BEA Systems, Inc., 2001-08 (acquired by Oracle): Industrial Software Architect & Sr. Engineering Director

And now I’m here…

Trivia tidbit: Here’s a photo of my first (ever) CS TA

Plan For Today’s Talk

Okay, so just what is a database system?

Based on lecture notes from the UW-Madison database curriculum, as immortalized in Database Management Systems (Ramakrishnan & Gehrke, a.k.a. “the Cow book”)

The database field is a vertical slice of all of CS!

You’ll see what I mean (and why)…

What’s exciting in “database systems” today?

UCI Information Systems Group (ISG) and beyond!

What is a Database System?

So what’s a database?

A very large, integrated collection of data

Usually a model of a real-world enterprise or a history of real-world events

Entities (e.g., students, courses, Facebook users, …)

Relationships (e.g., Susan is taking CS 234, Susan is a friend of Lynn, Mike filed a grade change for Lynn, …)

What’s a database management system (DBMS)?

A software system designed to store, manage, and provide access to one or more such databases

Evolution of DBMS

Files

Manual coding

Byte streams

Majority of application development effort goes towards building and then maintaining data access logic

CODASYL/IMS (Early DBMS Technologies)

Records and pointers

Large, carefully tuned data access programs that have dependencies on physical access paths, indexes, etc.

Relational (Relational DB Systems)

Declarative approach

Tables + views bring “data independence”

Details left to system

Designed to simplify data-centric application development

[Figure: timeline from Files to CODASYL/IMS to Relational, with “New Data ???” arriving at each stage]

Why Use a DBMS?

Data independence

Efficient (and automatic) data access

Reduced application development time

Data integrity and security

Uniform data administration

Concurrent access and recovery from crashes
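To make the "data integrity" and "recovery" benefits concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Students table, the CHECK constraint on gpa, and the sample rows are assumptions chosen for illustration; they are not taken from the slides. The point is that the DBMS, not the application, rejects the bad row and undoes the whole transaction.

```python
import sqlite3

def run_demo():
    """Insert two rows in one transaction; the second violates a constraint."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: the CHECK constraint is an illustrative integrity rule.
    conn.execute(
        "CREATE TABLE Students ("
        " sid TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " gpa REAL CHECK (gpa BETWEEN 0.0 AND 4.0))"
    )
    conn.execute("INSERT INTO Students VALUES ('s1', 'Susan', 3.8)")
    conn.commit()

    rejected = False
    try:
        # 'with conn' commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute("INSERT INTO Students VALUES ('s2', 'Lynn', 3.5)")
            conn.execute("INSERT INTO Students VALUES ('s3', 'Mike', 7.0)")
    except sqlite3.IntegrityError:
        rejected = True

    names = [row[0] for row in conn.execute("SELECT name FROM Students ORDER BY sid")]
    conn.close()
    return rejected, names

if __name__ == "__main__":
    print(run_demo())
```

Because both inserts belong to one transaction, the valid row for Lynn is rolled back together with the invalid one, so the table is left exactly as it was before the transaction started.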

Why Study Databases?

Shift from computation to information

At the “low end”: explosion of the web (a mess!)

At the “high end”: scientific applications

Datasets increasing in diversity and volume

Digital libraries, interactive video, social media, genomic data, big science data, …

... need for DBMS exploding!

DBMS field encompasses most of CS

OS, languages, theory, AI, multimedia, logic, …

Data Models

A data model is a collection of concepts for describing data (to one another or to a DBMS)

A schema is a description of a particular collection of data, using a given data model

The relational model is the most widely used data model today

Relation – basically a table with rows and (named) columns

Schema – describes the tables and their columns

Levels of Abstraction

Many views of one conceptual (logical) schema and an underlying physical schema

Views describe how different users or groups see the data

Conceptual schema defines the logical structure of the database

Physical schema describes the files and indexes used “under the covers”

[Figure: View 1, View 2, and View 3 sit on top of the Conceptual Schema, which sits on top of the Physical Schema. Side labels: Lies!, Logical Model, On-Disk Data Structures, Bits]

Example: University DB

Conceptual schema:

Students(sid: string, name: string, login: string, age: integer, gpa: real)

Courses(cid: string, cname: string, credits: integer)

Enrolled(sid: string, cid: string, grade: string)

Physical schema:

Relations each stored as unordered files

Have indexes on first and third columns of Students

External schema (a.k.a. view):

CourseInfo(cid: string, cname: string, enrollment: integer)
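The three schema levels for the university example can be sketched in runnable form with Python's built-in sqlite3 module. Several things here are assumptions rather than content from the slides: the index names, the sample rows, the course identifiers, and above all the body of the CourseInfo view, since the slides give only its signature. SQLite also decides file layout itself, so "unordered files" is not expressed directly.

```python
import sqlite3

SCHEMA = """
-- Conceptual schema (SQLite type names stand in for string/integer/real).
CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- Physical schema: indexes on the first and third columns of Students.
CREATE INDEX students_sid   ON Students(sid);
CREATE INDEX students_login ON Students(login);

-- External schema: this view body is an assumed definition of CourseInfo.
CREATE VIEW CourseInfo AS
    SELECT c.cid, c.cname, COUNT(*) AS enrollment
    FROM Enrolled e, Courses c
    WHERE e.cid = c.cid
    GROUP BY c.cid, c.cname;
"""

def build_university_db():
    """Create the schema in memory and load a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
        [("s1", "Susan", "susan", 20, 3.8), ("s2", "Lynn", "lynn", 21, 3.5)],
    )
    conn.executemany(
        "INSERT INTO Courses VALUES (?, ?, ?)",
        [("c1", "Computer Game Design", 4), ("c2", "Database Systems", 4)],
    )
    conn.executemany(
        "INSERT INTO Enrolled VALUES (?, ?, ?)",
        [("s1", "c1", "A"), ("s2", "c1", "B"), ("s1", "c2", "A")],
    )
    conn.commit()
    return conn

if __name__ == "__main__":
    db = build_university_db()
    for row in db.execute("SELECT cid, cname, enrollment FROM CourseInfo ORDER BY cid"):
        print(row)
```

Note that the enrollment column is never stored anywhere; the view computes it from Enrolled each time it is queried.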

Data Independence

Applications are insulated from how data is actually structured and stored!

Logical data independence: Protection from changes in the logical structure of data

Physical data independence: Protection from changes in the physical structure of data

One of the most important benefits of using a DBMS!

Allows changes to be made w/o application rewrites
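Physical data independence can be observed directly in a small experiment. The sketch below uses Python's built-in sqlite3 module; the table contents and the index name enrolled_cid are hypothetical. It runs the same query text before and after an index is added: the answer does not change, only the access plan the system picks.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT)")
    conn.executemany(
        "INSERT INTO Enrolled VALUES (?, ?, ?)",
        [("s1", "c1", "A"), ("s2", "c1", "B"), ("s1", "c2", "A")],
    )
    # The application's query text never changes.
    sql = "SELECT COUNT(*) FROM Enrolled WHERE cid = ?"

    count_before = conn.execute(sql, ("c1",)).fetchone()[0]
    plan_before = query_plan(conn, sql, ("c1",))

    # Physical schema change: add an index (name is hypothetical).
    conn.execute("CREATE INDEX enrolled_cid ON Enrolled(cid)")

    count_after = conn.execute(sql, ("c1",)).fetchone()[0]
    plan_after = query_plan(conn, sql, ("c1",))
    conn.close()
    return count_before, count_after, plan_before, plan_after

if __name__ == "__main__":
    for item in demo():
        print(item)
```

The exact plan wording varies between SQLite versions, so the sketch only relies on the index name appearing in the second plan and not in the first.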

Example: University DB (cont.)

User query (in SQL, against the external schema):

SELECT c.cid, c.enrollment FROM CourseInfo c WHERE c.cname = 'Computer Game Design'

Equivalent query (against the conceptual schema):

SELECT c.cid, count(*) FROM Enrolled e, Courses c WHERE e.cid = c.cid AND c.cname = 'Computer Game Design' GROUP BY c.cid

Under the hood (against the physical schema):

Access Courses – use index on cname to find associated cid

Access Enrolled – use index on cid to count the enrollments
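As a quick check that the view-level query and the table-level query really return the same answer, here is a sketch using Python's built-in sqlite3 module. The CourseInfo view definition, the course identifiers, and the sample rows are assumptions for illustration, and COUNT(*) is used as the portable way to count the joined rows.

```python
import sqlite3

SETUP = """
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
-- Assumed view definition; the slides give only the view's signature.
CREATE VIEW CourseInfo AS
    SELECT c.cid, c.cname, COUNT(*) AS enrollment
    FROM Enrolled e, Courses c
    WHERE e.cid = c.cid
    GROUP BY c.cid, c.cname;
"""

VIEW_QUERY = (
    "SELECT c.cid, c.enrollment FROM CourseInfo c WHERE c.cname = ?"
)
BASE_QUERY = (
    "SELECT c.cid, COUNT(*) FROM Enrolled e, Courses c "
    "WHERE e.cid = c.cid AND c.cname = ? GROUP BY c.cid"
)

def run_both(cname="Computer Game Design"):
    """Run the external-schema query and the conceptual-schema query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    conn.executemany(
        "INSERT INTO Courses VALUES (?, ?, ?)",
        [("c1", "Computer Game Design", 4), ("c2", "Database Systems", 4)],
    )
    conn.executemany(
        "INSERT INTO Enrolled VALUES (?, ?, ?)",
        [("s1", "c1", "A"), ("s2", "c1", "B"), ("s1", "c2", "A")],
    )
    conn.commit()
    view_rows = conn.execute(VIEW_QUERY, (cname,)).fetchall()
    base_rows = conn.execute(BASE_QUERY, (cname,)).fetchall()
    conn.close()
    return view_rows, base_rows

if __name__ == "__main__":
    print(run_both())
```

The user writes only the first query; rewriting it into the second form and then choosing indexes is the system's job.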

Architecture of a DBMS

A typical DBMS has a layered architecture

The figure doesn’t show the concurrency control and recovery components

This is one of several possible architectures; each actual system has its own variations

[Figure: layered architecture, top to bottom]

Queries

Query Optimization and Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

DB

Note: These layers must consider concurrency control and recovery

DB Field is a Vertical Slice of CS

“I like programming languages and compilers”

Consider high-level, declarative languages like SQL

“I like low-level operating systems issues”

DBMSs manage records, memory, locks, logs, …

“I really want to work on distributed systems”

Distributed and parallel database systems are ripe with distributed algorithms and systems issues (!)

“Data structure and algorithm design is really cool”

Database indexes are data structures on disk (or flash)

(And so on!)

What’s Exciting in DB Land Today?

The Web is full of database challenges (“Big Data”!)

A box for keywords only goes so far…

▪ How can I query the web, e.g., “Find me 5-string Fender bass guitars for sale in the $1000-1500 price range”

Click streams and social networks generate lots of data

▪ How can I query and analyze all that data (e.g., to act on it)?

Ubiquitous computing is data-rich, too

Build, deploy, and use location-based data services

Query and aggregate streams of sensor or video data

There’s data everywhere, and of all shapes and sizes

How do we integrate it, e.g., for rapid crisis response?

And when we do, how do we ensure privacy/security?

Ex: DB Challenges at Facebook

Data store for low-latency, high-traffic Web sites

Only have a few hundred milliseconds to generate an entire page

Data heavily cached outside the DBMS today, which is “far from ideal”

Data systems for offline/batch-oriented processing

I mentioned this before: clickstream analysis, graph analysis, etc.

Potentially interested in faster, approximate answers

Would like to do this in real time as well, as data arrives

Hardware trends (always) present new opportunities

Flash storage, for example

Multicore CPUs (nobody uses them very well yet)

Cool open source work at Facebook related to DBs

Hive: Open source SQL on top of Hadoop

Cassandra: Large-scale distributed storage for semistructured data

AsterixDB System (From UCI) (http://asterixdb.ics.uci.edu/)

ASTERIX Goal: To ingest, digest, persist, index, manage, query, analyze, and publish massive quantities of semi-structured information…

(ADM = ASTERIX Data Model, AQL = ASTERIX Query Language)

[Figure: a cluster of nodes, each with CPU(s), main memory, disk, and ADM data, connected by a hi-speed interconnect. Inputs: data loads & feeds from external sources (XML, JSON, …) and AQL queries & scripting requests and programs. Output: data publishing to external sources and apps.]

http://asterixdb.ics.uci.edu/documentation/index.html



Summary

A DBMS is for storing and querying big datasets

Benefits of using a DBMS are many: enables rapid development of new applications (“what, not how”), recovers after crashes, supports (safe) concurrent access, helps maintain data integrity and security, …

Levels of schema abstraction → data independence

DB research is a vertical slice of all of CS (“for data”)

Big Data experts are in high industrial demand!

Data is what it’s all about today! So, consider taking our three classes: CS 122A/B/C.

Questions?
