Knowledge Discovery and Data Mining 1 (KU)
Simon Walk
IICM, TU Graz
October 22, 2015
Simon Walk (IICM) KDDM1 October 22, 2015 1 / 11
KDDM 1 (KU) - Introduction
Introduction
Simon Walk
Institute for Information Systems & Computer Media
Inffeldgasse 16c/I
Office: D.2.07
E-Mail: [email protected]
Research Interests:
Knowledge & Data Mining
Social Network Analysis
Semantic Web & Ontologies
Dynamical Systems & Complex Networks
Machine Learning
Course Context & Goals
Why should you be interested in KDDM1 (KU)?
To consolidate and reinforce your (theoretical) knowledge obtained in KDDM1 (VO) with practical “hands-on” experience.
Helps a LOT for the final exam!
Good preparation for KDDM2!
“Feel” like a data scientist!
If interested: Continue with Master Project or Master’s Thesis
KDDM 1 (KU) - Organization
Course Organization
You have to
1. form small groups of up to two students.
2. choose one of two practical assignments.
3. work on your chosen assignment.
4. give two presentations (in English) on the progress and results of your assignment.
After forming a group, send one e-mail to [email protected] and include the names and student IDs (Matrikelnummern) of the group members. All e-mails have to include [KDDM1] in the subject!
KDDM 1 (KU) - Project Descriptions
Project 1 - Crawling, Cleaning and Clustering
Objective: Group (semantically) similar pages of a website according to their most relevant terms!
Write a web-crawler to collect pages/documents that contain text.
Strip all markup and unwanted content (e.g., HTML, JavaScript, etc.) from the crawled pages.
Calculate similarities between the pages (e.g., by calculating similarities between the TF-IDF vectors of the pages).
Group similar pages (e.g., by using a clustering algorithm, such as k-means).
Hint: Python, scikit-learn [1], SciPy [2] and NumPy [3] already provide you with most of the functionality required to solve this task!
[1] http://scikit-learn.org/stable/
[2] http://www.scipy.org
[3] http://www.numpy.org
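The cleaning, similarity and clustering steps can be sketched roughly as follows. This is only a minimal illustration, not the expected solution: the four toy pages, the `TextExtractor` helper and the choice of two clusters are assumptions made for the example, and a real project would feed in the crawled pages instead.

```python
from html.parser import HTMLParser

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the plain text of an HTML document (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Toy stand-ins for crawled pages (assumption: real input comes from the crawler).
pages = [
    "<html><body><h1>Football</h1><p>The team won the football match.</p></body></html>",
    "<html><body><p>A football team scored in the match.</p><script>var x = 1;</script></body></html>",
    "<html><body><p>Python code for data mining and clustering.</p></body></html>",
    "<html><body><style>p {color: red}</style><p>Clustering data with Python code.</p></body></html>",
]

texts = [html_to_text(page) for page in pages]

# One TF-IDF vector per page; English stop words are dropped.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Pairwise cosine similarities between the TF-IDF vectors.
similarities = cosine_similarity(tfidf)

# Group the pages; the number of clusters is a free choice here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
```

On real data the number of clusters is not known in advance, so it is worth trying several values and inspecting the top TF-IDF terms per cluster.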
A word of warning: Be careful when crawling websites! Don’t hammer the servers, or you risk getting banned!
Either select smaller websites for crawling (complete crawl) or choose an appropriate sampling strategy for selecting the pages to analyze!
Rule of thumb: Your dataset should consist of at least 1,000 pages!
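One possible way to keep a crawler polite is to stay on a single host, cap the number of pages and pause between requests. The sketch below assumes a caller-supplied `fetch` function (hypothetical) that maps a URL to HTML text or `None`; the `crawl` and `LinkExtractor` names, the one-second delay and the in-memory `fake_site` are illustrative choices, not course requirements.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=1000, delay_seconds=1.0):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) must return the HTML as a string, or None on failure.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        time.sleep(delay_seconds)  # pause between requests
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# In-memory stand-in for a website, so the sketch runs without network access.
fake_site = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="http://other.example/x">ext</a>',
    "http://example.org/b": "<p>no links</p>",
}
pages = crawl("http://example.org/", fake_site.get, max_pages=10, delay_seconds=0)
```

A real `fetch` would perform the HTTP request and should also respect the site's robots.txt, for example via the standard library's `urllib.robotparser`.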
Project 2 - Movie Recommender
Objective: Recommend similar movies to users, using matrix factorization!
Crawl or download [4] a movie-ratings dataset.
Create/Extract the required utility matrix and minimize noise (e.g., subtract averages).
Perform UV Decomposition to obtain $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{d \times m}$ with $d = 2$ or $d = 3$.
Plot and interpret findings.
Hint: Python, scikit-learn, SciPy and NumPy already provide you with many of the functions and tools required to solve this task!
[4] We suggest using MovieLens 100k: http://grouplens.org/datasets/movielens/
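As a rough illustration of the normalization and UV decomposition steps, the sketch below fits U and V to the observed entries of a small utility matrix by plain gradient descent. The toy ratings, the helper names (`normalize`, `uv_decompose`, `rmse`), the learning rate and the number of steps are all assumptions for the example; other optimization schemes are equally valid.

```python
import numpy as np


def normalize(ratings):
    """Subtract each user's mean over the observed (non-NaN) ratings."""
    user_means = np.nanmean(ratings, axis=1, keepdims=True)
    return ratings - user_means, user_means


def uv_decompose(matrix, d=2, steps=2000, learning_rate=0.01, seed=0):
    """Fit U (n x d) and V (d x m) to the observed entries of matrix."""
    rng = np.random.default_rng(seed)
    n, m = matrix.shape
    observed = ~np.isnan(matrix)
    target = np.where(observed, matrix, 0.0)
    U = rng.normal(scale=0.1, size=(n, d))
    V = rng.normal(scale=0.1, size=(d, m))
    for _ in range(steps):
        # Error only on observed entries; missing entries contribute nothing.
        error = np.where(observed, U @ V - target, 0.0)
        grad_U = error @ V.T
        grad_V = U.T @ error
        U -= learning_rate * grad_U
        V -= learning_rate * grad_V
    return U, V


def rmse(matrix, U, V):
    """Root-mean-square error of U @ V on the observed entries."""
    observed = ~np.isnan(matrix)
    diff = (U @ V - matrix)[observed]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy utility matrix: rows are users, columns are movies, NaN = not rated.
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])

centered, user_means = normalize(ratings)
U, V = uv_decompose(centered, d=2)

# Predicted ratings, including the previously missing entries.
predicted = U @ V + user_means

# Error of always predicting the user's mean, for comparison.
baseline = float(np.sqrt(np.nanmean(centered ** 2)))
```

With d = 2 the rows of U (users) and the columns of V (movies) can be drawn directly as points in a scatter plot, which is one way to approach the plotting step.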
Project Presentations
Presentations will take place after Partial Exams 2 and 3, on 03.12.2015 and 21.01.2016.
For 03.12.2015 prepare a 5-minute presentation (strict) with 3 slides:
First slide: Dataset
Second slide: Experimental Setup
Third slide: Preliminary results
For 21.01.2016 prepare a 10-minute presentation (strict) with 5 slides:
First slide: Introduction/Motivation
Second slide: Methodology
Third slide: Experimental setup
Fourth slide: Results
Fifth slide: Discussion
Send the slides as PDF to [email protected] by 02.12.2015 23:59 for presentation 1 and by 20.01.2016 23:59 for presentation 2.
Subject of the e-mail must include [KDDM1].
Note that presentations that take longer than 5 or 10 minutes, respectively, will be interrupted and stopped!
Grading for the KU depends on your presentation and your results!