knowledge discovery and data mining 1 (vo)...
TRANSCRIPT
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Knowledge Discovery and Data Mining 1 (VO)(706.701)
Denis Helic
ISDS, TU Graz
October 4, 2018
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 1 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Lecturer
Name: Denis HelicOffice: ISDS, Petersgasse 116, Room 026
Office hours: Tuesday 12:00-13:00Phone: +43-316/873-30610email: [email protected]
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 2 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Lecturer
Name: Roman KernOffice: Know-Center, Inffeldgasse 13, 6th Floor, Room 072
Office hours: By appointmentPhone: +43-316/873-30860email: [email protected]
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 3 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Language
Lectures in EnglishCommunication in German/EnglishExamination: German/English
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 4 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Outline
1 Welcome and Introduction
2 Course Organization
3 Motivation
4 Course Overview
5 Course Highlights
6 Practical Part: KDDM1 KU
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 5 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Teaching Data Science @ ISDS
Introduction to Knowledge Technologies (new title: Introduction toAI & Data Science)Data Analysis Courses:
Knowledge Discovery and Data Mining 1 (Basics and theory)Knowledge Discovery and Data Mining 2 (Applications)Visual Analytics
Analysis of Web Systems & Data:Web Science (Basics)Network Science (Theory and applications)Complexity Science
Infrastructure:Databases (new Prof. and title: Data Management)New course: Architecture of Database SystemsNew course: Architecture of Large-Scale Machine Learning Systems
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 6 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Course context
Knowledge Discovery and Data Mining 1 (VO) (706.701)Obligatory course Master Software Development and Business (1stSemester)Obligatory elective course in subject catalog “KnowledgeTechnologies” (Computer Science)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 7 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Course context
Knowledge Discovery and Data Mining 1 (KU) (706.702)Free course (but submitted for inclusion in the catalogs)You will get 1 ECTS now, 1.5 ECTS when it is included in the catalogAn add-on for the theoretical partHighly suggested
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 8 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Goals of the course
The overall goal of KDDM and related courses is to learn how todiscover patterns and models in data. We aim to discover patternsthat are:
i Valid: hold for new data with high probabilityii Useful: we can base further actions on themiii Unexpected: non-obviousiv Understandable: humans can interpret them
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 9 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Goals of the course: patterns example
1854 Broad Street cholera outbreakExtracting clusters of cholera outbreak in the city of London in 1854. Thecases clusterd around some intersections of roads in London. These hadcontaminated water wells.
http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 10 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Goals of the course: patterns example
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 11 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Goals of the course
Specific goals of this course are to learn about two out of three basicelements of that discovery:
i Tools: advanced mathematical tools from probability theory,linear algebra, information theory, and statistical inference
ii Infrastructure: models of computation for large data (handled in othercourses)
iii Process: steps that are needed to discover patternsI assume here that you already know
i How to program and develop softwareii Mathematical basics from probability theory and linear algebra
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 12 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Welcome and Introduction
Goals of the course
Student goals: to pass the examinationBonus goal for all: to have fun!
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 13 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Course Calendar
04.10.2018: Course organization, Introduction and Motivation (Denis)11.10.2018: Preprocessing (Roman)18.10.2018: Feature Extraction (Roman)25.10.2018: Feature Engineering (Roman)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 14 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Course Calendar
30.10.2018: Data Matrices (Denis)08.11.2018: Review of Linear Algebra (Denis)15.11.2018: Partial Exam 1 / Project presentations (KU)22.11.2018: Principal Component Analysis (Denis)29.11.2018: Singular Value Decomposition (Denis)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 15 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Course Calendar
06.12.2018: Recommender Systems: Matrix Factorization (Denis)13.12.2018: Topic Modeling and Non-negative Matrix Factorization(Denis)10.01.2019: Classification (Denis)17.01.2019: Clustering (Roman)24.01.2019: Evaluation (Denis)31.01.2019: Partial Exam 2 / Project presentations (KU) / FinalExamination
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 16 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Course Logistics
Course website:http://kti.tugraz.at/staff/denis/courses/kddm1Slides are (will be) available on the course websiteAdditional readings, references, links, etc. also on the websiteWe expect that you have basic knowledge in probability theory andlinear algebraTo freshen the knowledge you should solve these problems!This problem is not graded!As a side note: we also expect that you know how to program(relevant for the practical part)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 17 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Grading
Two partial examinationsFinal examination in JanuaryTwo additional examination dates in winter semesterThree examination dates in summer semesterExamination material: lectures/slides/further readingsIn class we will discuss sample examination questions
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 18 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Partial Examinations
2 written examinationsIn the beginning of a lecture: max 45 minutesEach partial examination 2 questionsDifficulty adjusted to solve both problems in approx. 30 minutesMax 20 points for each questionTotal points: 80
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 19 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Examination
Written examination 90 minutes4 questionsDifficulty adjusted to solve all four in approx. 60 minutesMax 20 points for each questionTotal points: 80
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 20 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Important!
If you take partial examinations it counts as one examination attemptIn other words: if you are negative at partial examinations you will getthe negative grade in the TUGOnline
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 21 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Grading
0-40 points: 541-50 points: 451-60 points: 361-70 points: 271-80 points: 1
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 22 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Important!
If you take partial examinations it counts as one examinationattemptIn other words: if you are negative at partial examinations you will getthe negative grade in the TUGOnlineBad case scenario I: if you get 0 points at the first partialexamination you are negative!In that case you can take final examination in JanuaryAll together this will count as two attempts
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 23 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Important!
Bad case scenario II: if you are negative after the second partialexamination you need to take the exam in summer termAll together this will count as two attemptsTherefore: take partial examination only if you follow the lecture andlearn in parallelThat is also the advantage: you will learn as you go and noteverything at once
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 24 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
KU Organization
KU organization today after this lectureTwo presentations: after partial examinations
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 25 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Organization
Questions?
Raise them now (+1 +1)Ask after the lecture (+1)Visit me in the office hours (+1)Send me an e-mail (±1)As a side note: you should(!) interrupt me immediately (+1 +1 +1)and ask any question you might have during the lecture
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 26 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Motivation
How much information is being produced?
Study at Berkley: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/The World Wide Web contains about 170 terabytes of information onits surfaceThis is seventeen times the size of the Library of Congress printcollectionsInstant messaging generates five billion messages a dayThis is 274 terabytes a year
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 27 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Motivation
How much information is being produced?
Email generates about 400,000 terabytes of new information eachyear worldwideP2P file exchange on the Internet is growing rapidlyThat was 2003, what do we have today?In 1993: 100.000G transferred over the Internet per year
In 2008: 100.000G transferred over the Internet in a second
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 28 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Motivation
How much information is being produced?
Email generates about 400,000 terabytes of new information eachyear worldwideP2P file exchange on the Internet is growing rapidlyThat was 2003, what do we have today?In 1993: 100.000G transferred over the Internet per yearIn 2008: 100.000G transferred over the Internet in a second
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 28 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Motivation
How much information is being produced?
Data, data everywhere:http://www.economist.com/node/15557443
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 29 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Motivation
How much information is being produced?
We are producing more data than we are able to storeWe need to extract and describe useful dataUseful data ≪ all dataWe can store useful dataWe can also try to predict future data: store only prediction model
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 30 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Examples
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 31 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Examples
Amazon product recommendationsi Valid: holds for new data with high probabilityii Useful: users can find and explore new productsiii Unexpected: non-obvious and non-trivialiv Understandable: related articles, etc.
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 32 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Examples
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 33 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Examples
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 34 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Examples
Twitter earthquake and typhoon predictioni Valid: holds for new data with high probabilityii Useful: can save livesiii Unexpected: non-obvious and non-trivialiv Understandable: trajectories of typhoons, positions of earthquakes
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 35 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Knowledge discovery vs. data mining
Knowledge discovery refers to the entire process, of which knowledgeis the end-productIt is iterative and interactiveData mining refers to a specific step in this processIt is the step consisting of applying data analysis and discoveryalgorithms that produce a particular enumeration of patterns overdataAdditional steps are necessary to ensure that the process producesuseful knowledge
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 36 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Steps in the knowledge discovery process
1 Developing an understanding of the application domain and therelevant prior knowledge and identifying the goal of the KDD processfrom the customers viewpoint
2 Creating a target data set: selecting a data set or focusing on asubset of variables or data samples on which discovery is to beperformed
3 Data cleaning and preprocessing: basic operations such as theremoval of noise. If appropriate collecting the necessary informationto model or account for noise, deciding on strategies for handlingmissing data fields, accounting for time sequence information andknown changes
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 37 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Steps in the knowledge discovery process
4 Data reduction and projection: finding useful features to representthe data depending on the goal of the task. Using dimensionalityreduction or transformation methods to reduce the effective numberof variables under consideration or to find invariant representations forthe data
5 Matching the goals of the KDD process step to a particular datamining method e.g. summarization, classification, regression,clustering, etc
6 Choosing the data mining algorithms: selecting methods to be usedfor searching for patterns in the data. This includes deciding whichmodels and parameters may be appropriate e.g. models forcategorical data are different than models on vectors over the reals.Matching a particular data mining method with the overall criteria ofthe KDD process e.g. the enduser may be more interested inunderstanding the model than its predictive capabilities
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 38 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Steps in the knowledge discovery process
7 Data mining searching for patterns of interest in a particularrepresentational form or a set of such representations, classificationrules or trees, regression, clustering and so forth. The user cansignificantly aid the data mining method by correctly performing thepreceding steps
8 Interpreting mined patterns: possibly return to any of the steps forfurther iteration. This step can also involve visualization of theextracted patterns, models or visualization of the data given theextracted models
9 Consolidating discovered knowledge: incorporating this knowledgeinto another system for further action or simply documenting it andreporting it to interested parties. This also includes checking for andresolving potential conflicts with previously believed or extractedknowledge
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 39 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Steps in the knowledge discovery process
Reading!Knowledge Discovery and Data Mining: Towards a Unifying Framework(1996) Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 40 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Steps in the knowledge discovery process
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 41 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
1 Collect training data (e.g. crawl, clean, preprocess)2 Represent examples (e.g. decide which features, how to weight them,
etc.)3 Distance measure (e.g. what is close vs. what is not close)4 Measure the goodness (e.g. objective function)5 Select an approach (e.g. optimization method)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 42 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day Sunny Normal Humidity Strong Wind Temp. (°C) Play TennisDay1 Yes No Yes 12 NoDay2 No No No 18 NoDay3 Yes Yes No 21 YesDay4 Yes Yes No 28 YesDay5 Yes Yes No 19 ?
Table: Should I play tennis on Day5?
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 43 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
DefinitionMinkowski distance of order 𝑝 between two vectors 𝐱 and 𝐲 from ℝ𝑛 isgiven by:
𝑑𝑝(𝑥, 𝑦) = (𝑛
∑𝑖=1
|𝑥𝑖 − 𝑦𝑖|𝑝)1/𝑝
(1)
What is 𝑑2(𝑥, 𝑦)?
Euclidean distanceWhat is 𝑑1(𝑥, 𝑦)? Manhattan distance
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 44 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
DefinitionMinkowski distance of order 𝑝 between two vectors 𝐱 and 𝐲 from ℝ𝑛 isgiven by:
𝑑𝑝(𝑥, 𝑦) = (𝑛
∑𝑖=1
|𝑥𝑖 − 𝑦𝑖|𝑝)1/𝑝
(1)
What is 𝑑2(𝑥, 𝑦)? Euclidean distanceWhat is 𝑑1(𝑥, 𝑦)?
Manhattan distance
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 44 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
DefinitionMinkowski distance of order 𝑝 between two vectors 𝐱 and 𝐲 from ℝ𝑛 isgiven by:
𝑑𝑝(𝑥, 𝑦) = (𝑛
∑𝑖=1
|𝑥𝑖 − 𝑦𝑖|𝑝)1/𝑝
(1)
What is 𝑑2(𝑥, 𝑦)? Euclidean distanceWhat is 𝑑1(𝑥, 𝑦)? Manhattan distance
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 44 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day1 Day2 Day3 Day4 Day5Day1 0. 6.164414 9.11043358 16.0623784 7.14142843Day2 6.164414 0. 3.31662479 10.09950494 1.73205081Day3 9.11043358 3.31662479 0. 7. 2.Day4 16.0623784 10.09950494 7. 0. 9.Day5 7.14142843 1.73205081 2. 9. 0.
Table: Euclidean distances between days
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 45 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day1 Day2 Day3 Day4 Day5Day1 0. 8. 11. 18. 9.Day2 8. 0. 5. 12. 3.Day3 11. 5. 0. 7. 2.Day4 18. 12. 7. 0. 9.Day5 9. 3. 2. 9. 0.
Table: Manhattan distances between days
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 46 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day1 Day2 Day3 Day4 Day5Day1 0. 3.72174037 3.66840663 4.47507277 3.5008579Day2 3.72174037 0. 3.27940612 3.76436189 3.23329618Day3 3.66840663 3.27940612 0. 1.35622245 0.38749213Day4 4.47507277 3.76436189 1.35622245 0. 1.74371458Day5 3.5008579 3.23329618 0.38749213 1.74371458 0.
Table: Euclidean distances between days (standardized features)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 47 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day Sunny Normal Humidity Strong Wind Temp. (°C) Temp (°F) Play TennisDay1 Yes No Yes 12 53.6 NoDay2 No No No 18 64.4 NoDay3 Yes Yes No 21 69.8 YesDay4 Yes Yes No 28 82.4 YesDay5 Yes Yes No 19 66.2 ?
Table: Should I play tennis on Day5?
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 48 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day1 Day2 Day3 Day4 Day5Day1 0. 18.8 27.2 46.8 21.6Day2 18.8 0. 10.4 30. 4.8Day3 27.2 10.4 0. 19.6 5.6Day4 46.8 30. 19.6 0. 25.2Day5 21.6 4.8 5.6 25.2 0.
Table: Manhattan distances between days using Fahrenheit and Celsius
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 49 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Quick example of what it all means
Day Sunny Normal Humidity Strong Wind Temp. (°C) Temp (°F) Play TennisDay1 Yes No Yes 12 53.6 NoDay2 No No No 18 64.4 NoDay3 Yes Yes No 21 69.8 YesDay4 Yes Yes No 28 82.4 YesDay5 Yes Yes No 19 66.2 ?
Table: Should I play tennis on Day5?
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 50 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Big picture: KDDM
Probability Theory Linear Algebra Map-Reduce
Mathematical Tools Infrastructure
Knowledge Discovery Process
Information Theory Statistical Inference
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 51 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Teaching Data Science @ ISDS and KDDM1
KDDM1: Basics, theory and KDD process until step 8 (novisualization, no interpretation)KDDM2: Implementation and practice of the theory from KDDM1Visual Analytics: KDD process steps 8 and 9 (interpretation andvisualization)Network Science: graph and networks mining
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 52 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Data mining and other fields
Data mining overlaps withi Databases: Large-scale data, simple queriesii Machine learning: Small data, complex models, model parametersiii Statistics: Theory, predictive models, no algorithms
Data mining: Algorithms, simple and predictive models, large-scaledata
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 53 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Overview
Data mining and other fields
Statistics Machine Learning
Databases
Data Mining
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 54 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Highlights
KDDM1: some thoughts
Big data, data science, …Many buzzwordsBut in the end:
i You need to know how to program and how to develop software systemsii You need to understand the math behind data analysis: linear algebra,
probability, statistics
If you have solid knowledge in (i) and (ii) you are in the top 5% ofdevelopers in the field ;)For the years to come you will be earning a lot of money
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 55 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Course Highlights
KDDM1: some highlights
Recommender systems: decompose the matrix and find out what yourusers like :)PCA, SVD: reduce the dimensionality of the dataNNMF: analyze relationships between set of documents with linearalgebraTopic models: analyze relationships between set of documents withprobability theoryBayesian inference
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 56 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Practical Part: KDDM1 KU
Practical part: KDDM1 KU
Why should I be interested?To consolidate and reinforce your (theoretical) knowledge with practicalhands-on experienceHelps a lot with the partial and final examinationsGood preparation for KDDM2If interested: possibility to develop a topic for project or MSc Thesis
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 57 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Practical Part: KDDM1 KU
Practical part: KDDM1 KU
You have to:Form small groups of up to four studentsDecide on an interesting practical or research questionDecide which data you need for that questionCrawl/download the data and work on your projectGive two presentations (in English) on the progress and your finalresultsEngage with the class and discuss the results of other groups
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 58 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Practical Part: KDDM1 KU
Practical part: KDDM1 KU
Possible topics:Reasons of success of crowd funding campaigns?Movie/Game/Music RecommenderSentiment analysis of user posts on Reddit/StackOverflow, etc.Controversial topics in social media (clustering, prediction, etc.)Hate speech in social media (clustering, prediction, etc.)
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 59 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Practical Part: KDDM1 KU
Practical part: KDDM1 KU
Presentations take place on 15.11.2018 and 31.01.2019 after thepartial/final examinationsFor the first presentation prepare three slides (5 min strict):
First slide: Research/Practical QuestionSecond slide: DatasetThird slide: Experimental Design
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 60 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Practical Part: KDDM1 KU
Practical part: KDDM1 KU
For the second presentation prepare five slides (10 min strict):First slide: MotivationSecond slide: MethodologyThird slide: Experimental SetupFourth slide: ResultsFifth slide: Discussion
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 61 / 62
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
...
.
Practical Part: KDDM1 KU
Practical part: KDDM1 KU
GradingPresentation (time restrictions will be taken into account)Results
Denis Helic (ISDS, TU Graz) KDDM1 October 4, 2018 62 / 62