Data Science for Advanced Dummies

Upload: saurav-chakravorty

Post on 14-Dec-2014

Category:

Data & Analytics


DESCRIPTION

An overview of data science

TRANSCRIPT

  • 1. Data Science for Advanced Dummies
  • 2. Introduction to Big Data: What is Big Data? What makes data "Big Data"?
  • 3. Big Data Definition: There is no single standard definition. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
  • 4. Characteristics of Big Data: 1-Scale (Volume). Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 zettabytes. Data volume is increasing exponentially (an exponential increase in collected/generated data).
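
As a quick check on the figures in slide 4, the arithmetic can be sketched as follows. The constant-yearly-growth model and the variable names are assumptions made for illustration; the slide states only the two endpoints.

```python
# Figures quoted on slide 4: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall multiple between the two endpoints (the slide's "44x").
growth_factor = end_zb / start_zb

# Assumption: growth compounds at a constant yearly rate.
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied yearly growth: {annual_rate:.0%}")
```

The ratio works out to 43.75, which the slide rounds to 44x. Under the constant-rate assumption this corresponds to roughly 41% growth per year.
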
  • 5. Characteristics of Big Data: 2-Complexity (Variety). Various formats, types, and structures: text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc. Static data vs. streaming data. A single application can be generating/collecting many types of data. To extract knowledge, all these types of data need to be linked together.
  • 6. Characteristics of Big Data: 3-Speed (Velocity). Data is being generated fast and needs to be processed fast. Online data analytics: late decisions mean missed opportunities. Examples: E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you. Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires immediate reaction.
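
The healthcare example in slide 6 amounts to checking each measurement as it arrives rather than after the fact. A minimal sketch, assuming a simple fixed-range rule; the function name `flag_abnormal`, the thresholds, and the heart-rate readings are all hypothetical and not from the slides:

```python
def flag_abnormal(readings, low, high):
    """Yield (index, value) for each reading outside [low, high], as it arrives."""
    for i, value in enumerate(readings):
        if not (low <= value <= high):
            yield i, value

# Hypothetical heart-rate stream in beats per minute.
heart_rate_stream = iter([72, 75, 140, 70, 38])

alerts = []
for index, bpm in flag_abnormal(heart_rate_stream, low=50, high=120):
    alerts.append((index, bpm))
    print(f"reading {index}: {bpm} bpm is outside the expected range")
```

Because the function is a generator, it never needs the whole data set in memory and can react to a reading before the next one arrives, which is the point of the velocity characteristic.
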
  • 7. Big Data: 3Vs
  • 8. Some Make it 4Vs
  • 9. Who's Generating Big Data? Social media and networks (all of us are generating data); scientific instruments (collecting all sorts of data); mobile devices (tracking all objects all the time); sensor technology and networks (measuring all kinds of data). Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
  • 10. What Technology Do We Have for Big Data?
  • 11. (slide with no text)
  • 12. Which Movie Do You Like? Designing a movie recommendation system
  • 13. Can you describe the movie you would like?
  • 14. Recommender Systems. Movie problem: find movies similar to my taste. Movies have many features: Western, Clint Eastwood, Tarantino, 90s, ... A viewer has preferences over these features: likes Westerns; hates ... movies (content-based filtering). Netflix Prize (from Wikipedia, the free encyclopedia): The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest. The competition was held by Netflix, an online DVD-rental service, and was open to anyone not connected with Netflix (current and former employees, agents, close relatives of Netflix employees, etc.) or a resident of Cuba, Iran, Syria, North Korea, Burma or Sudan.[1] On 21 September 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%.[2]
  • 15. A Highly Simple Solution. Describe each movie by feature values (Comedy, Action, Blockbuster, ..., Is Tom Cruise the Lead?) and score it with a weighted sum: Saurav's Score = 0.2*Comedy + 0.1*Action + 10*Blockbuster + ... + (-0.9)*Tom Cruise. First table, feature values: 6 5 0 1 and 7 8 1 0; Saurav: 2, 8. Second table, feature values: 2 8 0 0; Saurav: 7.
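
The weighted-sum score in slide 15 can be sketched as below. This is an illustration under stated assumptions, not the presenter's implementation: only the four weights visible on the slide are used (the slide elides further terms, so the results need not match the ratings shown in its tables), and the grouping of the table values into two movies is a guess from the flattened transcript. The names `WEIGHTS`, `score`, `movie_a`, and `movie_b` are hypothetical.

```python
# Weights as written on slide 15. The slide elides additional terms,
# so only the four visible features are scored here.
WEIGHTS = {
    "comedy": 0.2,
    "action": 0.1,
    "blockbuster": 10.0,
    "tom_cruise_lead": -0.9,
}

def score(movie, weights=WEIGHTS):
    """Return the weighted sum of a movie's feature values."""
    return sum(weights[name] * movie[name] for name in weights)

# Assumption: feature rows as read from the slide's flattened table.
movie_a = {"comedy": 6, "action": 5, "blockbuster": 0, "tom_cruise_lead": 1}
movie_b = {"comedy": 7, "action": 8, "blockbuster": 1, "tom_cruise_lead": 0}

# Recommend the highest-scoring movie first.
ranked = sorted(
    [("movie_a", movie_a), ("movie_b", movie_b)],
    key=lambda item: score(item[1]),
    reverse=True,
)
for name, movie in ranked:
    print(name, round(score(movie), 2))
```

With these weights the blockbuster flag dominates, so `movie_b` ranks first. A real system would learn the weights from past ratings rather than set them by hand.
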
  • 16. Quiz #1: Is Google search a recommender system?
  • 17. Supervised Learning: Design an accurate vending machine. This is a classification problem. This line is called the decision boundary, or separating hyperplane.
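
One way to see how a separating hyperplane can be learned from labeled examples is the perceptron, sketched below. The slides do not say which algorithm the vending machine example uses, so the perceptron is an assumption chosen for brevity, and the two-feature data set is made up.

```python
def predict(w, b, x):
    """Return class 1 if x lies on the positive side of the hyperplane, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(points, labels, epochs=1000, lr=1.0):
    """Learn weights w and bias b so that w.x + b = 0 separates the two classes."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            error = y - predict(w, b, x)  # 0 if correct, +1 or -1 if wrong
            if error:
                # Nudge the hyperplane toward classifying x correctly.
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b

# Hypothetical, linearly separable data: two measurements per item.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_perceptron(points, labels)
print("weights:", w, "bias:", b)
print("prediction for (4.5, 4.5):", predict(w, b, (4.5, 4.5)))
```

The learned `w` and `b` define the decision boundary: points where `w.x + b` is positive fall in class 1, all others in class 0. The perceptron only converges when the classes are linearly separable, as they are in this toy data.
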
  • 18. Quiz #2: Give an example where you think supervised learning is used. Hint: spam vs. ham in emails.
  • 19. Some Common Supervised Algorithms. Classification: Decision Trees, Random Forest, Support Vector Machine, Neural Network, Logistic Regression. Regression: Linear Regression, Non-linear Regression, Logistic Regression. Association Rule Learning: Arules. Even Sequence Analysis.
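
Of the algorithms listed in slide 19, linear regression is the simplest to write out. The sketch below fits a line by ordinary least squares; the age and height numbers are made up for illustration and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: ages in years (feature), heights in cm (label).
ages = [2, 4, 6, 8]
heights_cm = [86, 100, 114, 128]

slope, intercept = fit_line(ages, heights_cm)
predicted = intercept + slope * 5
print(f"height = {intercept:.1f} + {slope:.1f} * age")
print(f"predicted height at age 5: {predicted:.1f} cm")
```

The output is a continuous number rather than a class, which is what distinguishes regression from classification.
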
  • 20. In Action: Handwriting Recognition System (classification). Input? Output? Values shown on the slide, labeled "Features" and "Labels": 200 200 10 200 200 8 180 200 20 6
  • 21. Note the similarity: classification algorithms try to separate items into classes.
  • 22. Demo
  • 23. Quiz #3: Are driverless cars a learning problem? What are the features? What is the label?
  • 24. Unsupervised Learning
  • 25. Flowers (image: tetramerous flower of Ludwigia octovalvis showing petals and sepals)
        Sepal length  Sepal width  Petal length  Petal width
        5.1           3.5          1.4           0.2
        4.9           3.0          1.4           0.2
        4.7           3.2          1.3           0.2
        4.6           3.1          1.5           0.2
        5.0           3.6          1.4           0.2
        5.4           3.9          1.7           0.4
        4.6           3.4          1.4           0.3
        5.0           3.4          1.5           0.2
        4.4           2.9          1.4           0.2
        4.9           3.1          1.5           0.1
        5.4           3.7          1.5           0.2
  • 26. Clustering. Cluster: a collection of data objects that are similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in other groups. Cluster analysis: finding similarities between data according to the characteristics underlying the data, and grouping similar data objects into clusters. Unsupervised learning: there are no predefined classes for a training data set. Two general tasks: identifying the natural number of clusters, and properly grouping objects into sensible clusters.
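
The grouping task described in slide 26 can be sketched with k-means, one common clustering algorithm. The slides do not name a specific algorithm, so k-means is an assumption here. The small-petal rows are (petal length, petal width) pairs from the table in slide 25; the large-petal rows and the choice of starting centroids are made up for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its points, repeating until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave a centroid in place if it has no points
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids

# (petal length, petal width) pairs.
small_petals = [(1.4, 0.2), (1.4, 0.2), (1.3, 0.2), (1.5, 0.2), (1.7, 0.4)]  # slide 25
large_petals = [(4.5, 1.5), (4.7, 1.4), (5.0, 1.7)]  # hypothetical values
flowers = small_petals + large_petals

# Assumption: start from the first and last points as initial centroids.
assignments, centroids = kmeans(flowers, [flowers[0], flowers[-1]])
print("cluster of each flower:", assignments)
```

No labels are supplied anywhere; the two groups emerge only from distances between measurements. Choosing the number of clusters, the first of the two tasks the slide names, is left to the caller in this sketch.
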
  • 27. Plot
  • 28. Quiz #4: How many types (species) of flowers are there?
  • 29. Can you see 3 species?
  • 30. Examples of Unsupervised Learning: Clustering, Dimensionality Reduction, Feature Extraction, Self-Organizing Maps.
  • 31. Quiz #5: Which of the below are supervised and which are unsupervised? (a) Take a collection of 1000 essays written on the US economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow "similar" or "related". (b) Examine a large collection of emails that are known to be spam, to discover if there are sub-types of spam mail. (c) Given historical data of children's ages and heights, predict children's height as a function of their age. (d) Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals). (e) Given a set of news articles from many different news websites, find out what the main topics covered are. Separately: suppose you are working on weather prediction, and you would like to predict whether or not it will be raining at 5pm tomorrow. You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?
  • 32. Where is Big Data?
  • 33. Let's start from (Big) Data. How do you design this system? How do you pay for this? How do you trust someone to do it right? How expensive will such a system be? I need data: good, reusable, high-quality data. Otherwise all the smarts are wasted.
  • 34. Here comes BIG Data to help: Image, Audio, Learning, HUGE data sets.
  • 35. Thank you!