cse 232a graduate database systems2009: bachelors in cse from iit madras 2009—16: ms and phd in cs...
TRANSCRIPT
CSE 232A Graduate Database Systems
1
Fall 2019
Arun Kumar
2009: Bachelors in CSE from IIT Madras
2009—16: MS and PhD in CS from UW-Madison
PhD thesis area: Data systems for ML workloads
2016-: Asst. Prof. at UC San Diego CSE 2019-: + Asst. Prof. at UC San Diego HDSI
Summers: 110F!
Winters: —40F!
Ahh! :)
About Myself
2
3
What is this course about? Why take it?
1. IBM’s Watson beats humans in Jeapordy!
5
How did Watson achieve that?
6
Watson devoured LOTS of data!
7
2. “Structured” data with Google search results
8
How does Google know that?
9
Google also devours LOTS of data!
3. Amazon’s “spot-on” recommendations
10
11
How does Amazon know that?
12
You guessed it! LOTS and LOTS of data!
Analysis
13
And innumerable “traditional” applications
14
15
Large-scale data management systems are the cornerstone of many digital applications, both modern and traditional
16
The Age of “Big Data”/“Data Science”
17
Data data everywhere, All the wallets did shrink! Data data everywhere, Nor any moment to think?
18
CSE 232A will get you thinking about the fundamentals of database systems
1. How are large structured datasets stored and organized?
2. How are “queries” handled? 3. How to make the system faster? 4. Deeper and more recent issues
19
What this course is NOT about
❖ NOT a course on basics of relational algebra or SQL ❖ Take CSE 132A instead (pre-requisite for 232A!)
❖ NOT a course on how to use an RDBMS for database-backed applications (triggers, physical tuning, etc.) ❖ Take CSE 132B instead
❖ NOT a course on distributed systems or transactions ❖ Take CSE 223B instead
http://cseweb.ucsd.edu/classes/fa19/cse232-a
20
And now for the (boring) logistics …
21
Prerequisites
❖ CSE 132A (or equivalent) is essential ❖ CSE 120 is also helpful ❖ For all other cases, email the instructor with
proper justification. A waiver can be considered.
http://cseweb.ucsd.edu/classes/fa19/cse232-a
22
Course Administrivia
❖ Lectures: MWF 3:00-3:50pm, Ledden Auditorium Attending ALL lectures is mandatory! ❖ Instructor: Arun Kumar; [email protected] Office hours: Wed 4-5pm, 3218 CSE (EBU3b) ❖ TAs: Nikos Koulouris, Kaiqi Yao, Aman Achpal, and
Allen Ordookhanians ❖ Piazza: TBD
http://cseweb.ucsd.edu/classes/fa19/cse232-a
23
Grading
❖ Midterm Exam 1: 20% Date: Friday, October 25; in-class ❖ Midterm Exam 2: 20% Date: Monday, November 25; in-class ❖ Final Exam: 60% (cumulative)
Date: Friday, December 13, 3-6pm; Room TBD
❖ (Optional/Extra Credit) 6 Paper Reviews: 3%
http://cseweb.ucsd.edu/classes/fa19/cse232-a
24
Course Outline
1. How are large structured datasets stored? Storage and file layout
2. How are “queries” handled? Indexing, sorting, relational operator implementations, and query processing
3. How to make the system faster? Query optimization and parallelism
4. Recent issues and trends Data systems for ML, Data integration and cleaning, semi-structured data, ML for RDBMSs
http://cseweb.ucsd.edu/classes/fa19/cse232-a
25
The primary focus will be the relational data model and Relational Database Management Systems (RDBMS)
26
Relational model in a nutshell
Basically, Relation:Table :: Pilot:Driver (okay, a bit more)
The model formalizes “operations” to manipulate relations
RatingID Rating Date UserID MovieID1 3.5 08/27/15 23294 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …
27
Relational model in a nutshell
Invented by E. F. Codd in 1970s at IBM Research in CA
28
Relational DBMS in a nutshell
A software system to implement the relational model, i.e., enable users to manage data stored as relations
29
Relational DBMS in a nutshell
First RDBMSs: System R (IBM) and Ingres (Berkeley) in 1970s
A rare photo of the original System R manual
Mike Stonebraker won the Turing Award in 2015!
30
Relational DBMS in a nutshell
RDBMS software is now a USD 40+ billions/year industry! Numerous open source RDBMSs also popular
People still start companies about what are basically RDBMSs!
31
Course Textbook
Prescribed: “Database Management Systems” 3rd Edition Raghu Ramakrishnan and Johannes Gehrke
Optional: “Database Systems: The Complete Book” by Garcia-Molina, Widom, and Ullman “Big Data Integration” by Dong and Srivastava
Aka The “Cow Book” Which cow are you?
32
Tentative Course Schedule
Week Topic
0-1 Introduction; Data Storage (Disks; Files; New Hardware)
2-3 Indexing (B+ Tree; Hash Indexes; Learned Indexes); Sorting
3-4 Relational Operator Implementations; Query Processing
4 Midterm Exam 1 on Friday, 10/25
5-6 Query Optimization; Materialized Views
6-7 Parallel RDBMSs; Dataflow Systems; Cloud RDBMSs
7-8 Data Systems for ML Workloads
9 Midterm Exam 2 on Monday, 11/25
9-10 Data Integration and Cleaning; Semi-Structured Data
10 More ML for RDBMSs; Recap
11 Final Exam on Friday, 12/13
33
General Dos and Do NOTs
Do: ❖ Raise your hand before asking questions during lectures ❖ Discuss papers in the reading list with peers and others ❖ Participate in class discussions; and on Piazza, if you like ❖ Use “CSE232A:” as subject prefix for all emails to me/TAs Do NOT: ❖ Plagiarize or share content for your paper reviews ❖ Harass, cut off, or be disrespectful to others in class ❖ Use email as primary communication mechanism for
doubts/questions instead of Office Hours ❖ Record or quote the instructor’s anecdotes out of class! ☺
34
Questions?
35
Example Relational DB: Netflix!
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 20… … … … …
UserID Name Age JoinDate79 Alice 23 01/10/13… … …
MovieID Name ReleaseDate Director20 Inception 07/13/2010 Christopher Nolan… … …
Ratings
Users
Movies
36
Recap: Relational Model
37
Relational Model: Basic Terms
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …
What is a Relation?
A glorified table!
What are Attributes?
These things
What are Domains?
The mathematical “domains” for the attributes
Integers Real …
What is Arity?Ratings
Number of attributes
38
Relational Model: Basic Terms
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …
What are Tuples?What is Cardinality?
These thingsNumber of tuples
Ratings
39
Referring to “tuples”: Two notations
1. Without using attribute names (positional/sequence) 2. Using attribute names (named/set)
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …
Ratings (R)
t[1] = 3.5t.NumStars = 3.5
40
Relational Model: Basic Terms
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …
What is Schema?
The relation name, and the name and logical descriptions of the attributes (including domains)
Aka “metadata”
Ratings
41
Relational Model: Basic Terms
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …
What is an Instance?
A given relation populated with a set of tuples (loose analogy: schema:instance::type:value in PL)
Instance 1
RatingID NumStars Timestamp UserID MovieID3292 1.5 06/27/14 794 10
294122 4.0 07/10/14 232 32974423 0.5 03/08/14 8451 846
… … … … …
Instance 2
Ratings
42
Relational Model: Basic Terms
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 20… … … … …
What is a Relational Database?
UserID Name Age JoinDate79 Alice 23 01/10/13… … …
MovieID Name ReleaseDate Director20 Inception 07/13/2010 Christopher Nolan… … …
A collection of relations; similarly, schema vs. instance
Ratings
Users
Movies
43
“Write Operations” on a Relation
❖ Insert Add tuples to a relation ❖ Delete Remove tuples from a relation (typically based on
“predicate” matches, e.g., “NumStars <= 4.5” ❖ Modify Logically, deletes + inserts, but typically implemented
as in-place updates to a relation instance
44
“Read Operations” on a Relation
❖ “Select” Select all tuples from Ratings with “UserID == 19” ❖ “Project” Select only Director attribute from Movies ❖ “Aggregate” Select Average of all NumStars in Ratings
And a few more formal “algebraic” operations …
45
Recap: Relational Algebra
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
46
Select
47
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
Ratings (R)
Example: Get all ratings with 4.0 or more stars
“Selection condition/predicate”Select
“Operator”
Select
48
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
R
RatingID NumStars Timestamp UserID MovieID2 4.0 07/20/15 80 204 4.5 03/05/14 80 16
R’
Schema preservedSubset of tuples
(satisfying selection condition)
Complex Selection Conditions
49
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
R
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 20
R’
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
50
Project
51
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
Ratings (R)
Example: Get all MovieIDs
“Projection list”
Project
52
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
R
MovieID2016
R’ Schema reduced (to projection list)
Tuple values “deduplicated” (slightly different semantics in SQL)
Composition of Relational Ops
53
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
R
Example: Get UserID and NumStars of ratings with less than 3 stars
UserID NumStars79 2.5
R’RelOp(Relation) gives a Relation
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
54
Rename
55
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
Ratings (R)
Example: Rename Timestamp to RateDate
⇢C(2�>RateDate)(R)
⇢RatingID,NumStars,RateDate,UserID,MovieID(R)
Rename
56
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
R
R’
Instance preserved Schema modified
RatingID NumStars RateDate UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
⇢C(2�>RateDate)(R)
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
57
Cross Product
58
UserID Name Age JoinDate79 Alice 23 01/10/1380 Bob 41 05/10/13
MovieID Name ReleaseYear Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron
Users (U)
Movies (M)
❖ Cartesian product (construct all pairs of tuples across tables) ❖ Schema of output “concatenates” the input schemas ❖ Be careful with attribute name conflicts! Use Rename op
Cross Product
59
UserID Name Age JoinDate79 Alice 23 01/10/1380 Bob 41 05/10/13
MovieID Name ReleaseYear Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron
Users (U)
Movies (M)
UserID U.Name Age JoinDate MovieID M.Name ReleaseYear
Director
79 Alice 23 01/10/13 20 Inception 2010 Christopher Nolan
79 Alice 23 01/10/13 16 Avatar 2009 Jim Cameron80 Bob 41 05/10/13 20 Inception 2010 Christopher
Nolan80 Bob 41 05/10/13 16 Avatar 2009 Jim Cameron
⇢C(1�>U .Name(U)⇥ ⇢C(1�>M .Name(M)
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
60
Union
61
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16
R1
RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
R2Union of sets of tuples (instances)
Inputs must have identical schema: “Union-compatible”
Union
62
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
63
Set Difference
64
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16
R1
RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
R2
Set difference of sets of tuples (instances) Inputs must be “Union-compatible”
Set Difference
65
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16
R1
RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
R2
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 20
R’ = R1 – R2
Basic Relational Operations
Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference
66
Derived and Other Relational Ops
Set Operation: Intersection Join Group By Aggregate
67
Intersection
68
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16
R1
RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
R2
Set intersection of sets of tuples (instances) Inputs must be “Union-compatible”
Intersection
69
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16
R1
RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
R2
RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 16
Derived and Other Relational Ops
Set Operation: Intersection Join Group By Aggregate
70
Join
71
❖ Equivalent Select on Cross Product, but “bypasses” full X
❖ Perhaps the most intensively studied Rel Op! ❖ Several “types” of Joins:
Natural Join and Equi-Join Condition Join (aka Theta Join) Semi-Join, Inner Join, Outer Join, Anti-Join, etc.
R ./JoinCondition M
�JoinCondition(R⇥M)
Natural Join
72
MovieID Name ReleaseYear Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron
Movies (M)
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
Ratings (R)
R ./ M
Natural Join
73
RatingID
NumStars
Timestamp
UserID
MovieID
Name ReleaseYear
Director
1 3.5 08/27/15 79 20 Inception 2010 Christopher Nolan2 4.0 07/20/15 80 20 Inception 2010 Christopher Nolan3 2.5 08/02/14 79 16 Avatar 2009 Jim Cameron4 4.5 03/05/14 80 16 Avatar 2009 Jim Cameron
❖ “Join attributes”: attributes that determine matching tuples Have same name in both inputs! “MovieID” in R and M
❖ Implicit equality condition on join attributes If > 1 pair, implicit logical “and” of all equality terms ❖ Output schema concatenates input schemas But join attributes appear only once in output (Project)
R ./ M
Equi-Join
74
❖ Generalization of the Natural Join
❖ Attribute names of “join attributes” need not be the same ❖ EqualityCondition is a general boolean expression (logical
“and”, and/or “or”) of terms with equality predicates only ❖ Join attributes from both R and M in output (no Project)
Perhaps the most important and common type of Join! Lots of R&D on efficient implementations!
R ./EqualityCondition M
Equi-Join: Example
75
J K10 x20 y30 x
R1(J,K)P Qx 4y 9x 8
R2(P,Q) T(J,K,P,Q)J K P Q10 x x 420 y y 930 x x 810 x x 830 x x 4
T(J,K, P,Q) = R1 ./K=P R2
(Primary) Key-Foreign Key Join
76
❖ A special kind of equi-join ❖ One of the join attributes is the (Primary) Key of an input
relation; the other is a Foreign Key in the other relation
RatingID NumStars Timestamp UID MovieID
UserID Name Age JoinDate
Ratings
Users
Also a common and important (sub) type of join with even more specialized efficient implementations!
Ratings ./UID=UserID Users
Condition Join (aka “Theta” Join)
77
❖ Generalization of the Equi-Join
R ./JoinCondition M
Condition Join: Example
78
J K10 x20 y30 x
R1(J,K)P Qx 4y 9x 12
R2(P,Q) T(J,K,P,Q)J K P Q10 x x 420 y x 420 y y 930 x x 430 x y 930 x x 12
Perhaps the most difficult type of Join to implement efficiently!
T(J,K, P,Q) = R1 ./J/2>Q R2
Join Expressions
79
Can compose many joins into a single complex expression
RatingID NumStars Timestamp UserID MovieID
UserID UName Age JoinDate
MovieID Name ReleaseYear Director
Ratings
UsersMovies
Q. What do we get as the output?
UserID
UName
Age
JoinDate
RatingID
NumStars
Timestamp
MovieID
Name ReleaseYEar
Director
AllStuff
80
Taxonomy of Joins
All kinds of joins
Inner joins
Outer joins Semi joins Anti joins
Theta joinsEqui joins
Natural joins
Key-Foreign Key Joins“Snowflake” joins
“Star” joins
Derived and Other Relational Ops
Set Operation: Intersection Join Group By Aggregate
81
Group By Aggregate
82
❖ NOT a part of relational algebra, but “Extended RA”! ❖ Useful for “analytics” queries that aggregate numerical data
RatingID NumStars Timestamp UserID MovieIDRatings
What is the average rating for each movie? How many movies has each user rated?
❖ Standard 5 numerical aggregations supported in SQL:
Count, Sum, Average, Maximum, and Minimum Extra: Median, Mode, Variance, Standard Deviation, etc.
Group By Aggregate
83
“Grouping Attributes” (Subset of R’s attributes)
A numerical attribute in R“Aggregate Function”
(SUM, COUNT, AVG, MAX, MIN)
❖ Output schema will have X and an extra numerical attribute (result of the aggregate function)
❖ Can list multiple aggregate functions in the same operation
�X,Agg(Y )(R)
Group By Aggregate
84
RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16
R
MovieID AVG(NumStars)20 3.7516 3.5
R’
What is the average rating for each movie?
�MovieID,AV G(NumStars)(R)
Group By Aggregate
85
AVG(Age)25.75
U’
What is the average age of the users?
UserID Name Age JoinDate79 Alice 23 01/10/1380 Bob 41 05/10/13
123 Carol 19 08/09/14420 Dan 20 03/01/15
Users (U)
The set of Grouping Attributes can be empty too!
�AVG(Age)(U)
Derived and Other Relational Ops
Set Operation: Intersection Join Group By Aggregate
86
87
Recap: SQL
Basic Form of an SQL Query
88
SELECT [DISTINCT] target-list FROM relation-list [WHERE condition]
List of attributes to projectOptional
List of relations (possibly with “aliases”)
Selection/join condition (optional)
What does it mean logically?
89
SELECT [DISTINCT] target-list FROM relation-list [WHERE condition]
1. Cross-product of relations in relation-list 2. If condition given, apply it to filter out tuples 3. Remove attributes not present in target-list 4. If DISTINCT given, deduplicate tuples in result
The above is only a logical interpretation. It is NOT a “plan” an RDBMS would use in general to run an SQL query!
90
Example SQL Query
SELECT M.Name FROM Movies M WHERE M.Year = 2013
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Example: Get the names of movies released in 2013
⇡Name(�Y ear=2013(M))
91
Example SQL Query
SELECT M.Name FROM Movies M WHERE M.Year = 2013
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
NameGravity
Blue Jasmine
92
Example SQL Query
SELECT * FROM Movies M WHERE M.Year = 2013
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
MovieID Name Year Director53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
93
Example SQL Query
SELECT M.Name FROM Movies M WHERE M.Year <> 2013
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Example: Get the names of movies from years other than 2013
⇡Name(�Y ear 6=2013(M))
94
Example SQL Query
SELECT M.Year FROM Movies M
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Example: For which years do we have movie data?⇡Y ear(M)
95
Example SQL Query
SELECT M.Year FROM Movies M
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Year2010200920132013
SQL allows repetitions of tuples in a relation! Not the same semantics as RA’s Project
Called “bag semantics” vs. RA’s set semantics
96
DISTINCT in SQL
SELECT DISTINCT M.Year FROM Movies M
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Year201020092013
DISTINCT needed to achieve set semantics of RA’s Project in SQL
97
Aliases in SQL
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
SELECT M.Name FROM Movies M WHERE M.Year = 2013
Why bother with the alias? Not needed here!
SELECT Name FROM Movies WHERE Year = 2013
98
Aliases in SQL – Useful for Joins!
Movies (M)MovieID Name Year DirectorID
SELECT M.Name FROM Movies M, Directors D WHERE D.Name = “Jim Cameron” AND M.DirectorID = D.DID
Aliases help disambiguate attributes with the same name from multiple relations (or even a self-join!)
Example: Get names of movies directed by “Jim Cameron”
Directors (D)DID Name Age
99
More SQL Examples
Example: Get names of movies released in 2013 by Woody Allen or some other director 50 years or older
SELECT M.Name FROM Movies M, Directors D WHERE (D.Name = “Woody Allen” OR D.Age >= 50) AND M.Year = 2013 AND M.DirectorID = D.DID
Movies (M)MovieID Name Year DirectorID
Directors (D)DID Name Age
100
LIKE in SQL
SELECT DISTINCT M.Director FROM Movies M WHERE M.Name LIKE “Blue%”
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
Example: Get the directors of movies that start with “Blue”
“%” matches any number of characters; “_” matches one
101
ORDER BY in SQL
SELECT M.Name FROM Movies M WHERE M.Year = 2013
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
NameGravity
Blue Jasmine
ORDER BY M.NameName
Blue JasmineGravity
Useful for data readability Ordering defined by domain semantics Can specify DESC; multiple attributes
102
LIMIT in SQL
SELECT M.Name FROM Movies M WHERE M.Year >= 2010
MovieID Name Year Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen
ORDER BY M.Year
YearInceptionGravity
Blue Jasmine
Also useful for data readability Prevents “flooding” of screen with data Be wary of using it without ORDER BY!
LIMIT 2
103
Aggregate Functions in SQL
SELECT COUNT(*) FROM Movies M WHERE M.Year > 2010
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron91 Interstellar 2014 Christopher Nolan
How many movies came out after 2010?
104
Aggregate Functions in SQL
SELECT COUNT(*) FROM Movies M WHERE M.Year > 2010
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron91 Interstellar 2014 Christopher Nolan
How many movies came out after 2010?
COUNT(*)2
105
5 Native Aggregate Functions in SQL
❖ COUNT ([DISTINCT] attribute) ❖ AVG ([DISTINCT] attribute) ❖ SUM ([DISTINCT] attribute) ❖ MAX (attribute) ❖ MIN (attribute)
106
Aggregate Functions in SQL
SELECT COUNT(DISTINCT M.Director) FROM Movies M
Movies (M)MovieID Name Year Director
20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron91 Interstellar 2014 Christopher Nolan
How many directors do we have?
Aggregate Functions in SQL
107
RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42
Ratings (R)
SELECT R.MovieID, MAX(R.Stars) FROM Ratings R
Which MovieID(s) have the highest rating?
Other attributes NOT allowed in the target-list as such!
Aggregate Functions in SQL
108
RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42
Ratings (R)
SELECT DISTINCT R.MovieID FROM Ratings R WHERE R.Stars = (SELECT MAX(R2.Stars) FROM Ratings R2)
Which MovieID(s) have the highest rating?
Group By Aggregate in SQL
109
SELECT [DISTINCT] target-list FROM relation-list [WHERE condition] GROUP BY grouping-list HAVING group-condition
X
Condition on each group in aggregate
target-list must be in this form: X’, Agg(Y)
Subset of X
�X,Agg(Y )(R)
Group By Aggregate in SQL
110
RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42
Ratings (R)
What is the average rating for each movie?
SELECT R.MovieID, AVG(R.Stars) AS AvgRating FROM Ratings R GROUP BY R.MovieID
�MovieID,AVG(Stars)(R)
Group By Aggregate in SQL
111
RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42
Ratings (R)
SELECT R.MovieID, AVG(R.Stars) AS AvgRating FROM Ratings R GROUP BY R.MovieID
MovieID AvgRating20 4.042 4.053 2.5
One tuple in output per unique value of R.MovieID (aka “group”)
112
Q1. Which of the following is not a basic operation in relational algebra?
ABCD
�[⇢./
Review Questions
113
Q2. What type of join is the following NOT an example of?
RID NumStars RateDate UID MIDUID UName Age JoinDateMID MName Year Director
RUM
ABCD
R ./ U
Inner join
Outer join
Natural join
Theta join
Review Questions
114
Q3. How many movies did “Jim Cameron” direct?
�COUNT(⇤)(�Director=“Jim Cameron”(M))
�ReleaseYear ,COUNT(⇤)(�Director=“Jim Cameron”(M))
�COUNT(⇤)(�Director=“Jim Cameron”(M ./ R))
�Director=“Jim Cameron”(�COUNT(⇤)(M))A
B
C
D
RID NumStars RateDate UID MIDUID UName Age JoinDateMID MName Year Director
RUM
Review Questions
115
Q4. Which of the following attributes is a primary key?
RID NumStars RateDate UID MIDUID UName Age JoinDateMID MName Year Director
RUM
ABCD
U.UID
R.UID
R.MID
M.Year
Review Questions
116
Q5. Which of the following SQL features do not have a counterpart operation in extended relational algebra?
ABCD
SELECT DISTINCT
WHERE
LIMIT
GROUP BY
Review Questions
117
Q6. What is the cardinality of this query’s output?
AB
CD
1
2
3
4
RID NumStars RateDate UID MID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20
SELECT COUNT(DISTINCT MID) FROM R
R
Review Questions
118
Q7. Get the year and director of all movies from the last decade that have the term “Avengers” in their title.
ASELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “%Avengers%”
MID MName Year DirectorM
BSELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “%Avengers”
CSELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “Avengers%”
DSELECT Year, Director from M WHERE Year >= 2008 AND MName = “Avengers”
Review Questions
119
RatingID Stars UserID MovieIDRatings (R)
Q8. Write an SQL query to get the number of 5-star ratings for each movie directed by “Christopher Nolan”
SELECT R.MovieID, COUNT(R.Stars) AS NumHighRatings FROM Ratings R, Movies M WHERE M.Director = “Christopher Nolan” AND R.MovieID = M.MovieID
AND R.Stars = 5 GROUP BY R.MovieID
Movies (M) MovieID Name Year Director
Review Questions
120
Advanced SQL Operations (Optional)
UNION in SQL
121
Get the IDs of users that have rated a movie directed by “Ang Lee” or a movie that released in 2013
RID Stars UID MID
MID Name Year Director
Ratings (R)
Movies (M)
SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” UNION SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013
Union-compatible!
Semantics of UNION in SQL
122
MID Name Year Director20 Inception 2010 Christopher Nolan42 Life of Pi 2012 Ang Lee53 Gravity 2013 Alfonso Cuaron
RID Stars UID MID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42 R M
SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” UNION SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013
UID79
UID79123
Semantics of UNION in SQL
123
UNIONUID79123
UID79
UID79123
UNION implicitly deduplicates tuples (unlike SELECT)!
Q. How to retain duplicates with UNION?
UNION ALLUID79123
UID79
UID7979123
INTERSECT in SQL
124
MID Name Year Director20 Inception 2010 Christopher Nolan42 Life of Pi 2012 Ang Lee53 Gravity 2013 Alfonso Cuaron
RID Stars UID MID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42 R M
SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” INTERSECT SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013
UID79
UID79123
UID79
INTERSECT
EXCEPT (Set Difference) in SQL
125
MID Name Year Director20 Inception 2010 Christopher Nolan42 Life of Pi 2012 Ang Lee53 Gravity 2013 Alfonso Cuaron
RID Stars UID MID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42 R M
SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” EXCEPT SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013
UID79
UID79123 UID
123
EXCEPT
The Contentious Bag Semantics!
126
UID79797912312380
UID7912312392
UNION ALL UID797979791231231231238092
Add the number of repetitions
The Contentious Bag Semantics!
127
UID79797912312380
UID7912312392
EXCEPT ALL UID797980Subtract the number
of repetitions
The Contentious Bag Semantics!
128
UID79797912312380
UID7912312392
INTERSECT ALL UID79123123Minimum of the
number of repetitions