data science at scale back up - thecads.org · in the final capstone project, devel- ... learn the...
About This Specialization
Learn scalable data management, evaluate big data technologies, and design effective visualizations.
This Specialization covers intermediate topics in data science. You will gain hands-on experience with scalable SQL and NoSQL data management solutions, data mining algorithms, and practical statistical and machine learning concepts. You will also learn to visualize data and communicate results, and you’ll explore legal and ethical issues that arise in working with big data. In the final Capstone Project, developed in partnership with the digital internship platform Coursolve, you’ll apply your new skills to a real-world data science project.
5 courses: Follow the suggested order or choose your own
Projects: Follow the suggested order or choose your own
Certificates: Follow the suggested order or choose your own
Data Manipulation at Scale: Systems and Algorithms
Commitment: 4 weeks of study, 6-8 hours/week
Subtitles: English, Spanish, Chinese (Simplified)
About the Course
Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making: we are drowning in data. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.
In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are coming on the horizon. Cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays will be covered.
You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project. At the end of this course, you will be able to:
Learning Goals:
1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, MapReduce, and other data flow models.
3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics.
4. Evaluate key-value stores and NoSQL systems, and describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.
5. “Think” in MapReduce to effectively write algorithms for systems including Hadoop and Spark. You will understand their limitations, design details, their relationship to databases, and their associated ecosystem of algorithms, extensions, and languages. Write programs in Spark.
6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams.
Week 1: Data Science Context and Concepts
Understand the terminology and recurring principles associated with data science, and understand the structure of data science projects and emerging methodologies to approach them. Why does this emerging field exist? How does it relate to other fields? How does this course distinguish itself? What do data science projects look like, and how should they be approached? What are some examples of data science projects?
Video · Appetite Whetting: Politics
Video · Appetite Whetting: Extreme Weather
Video · Appetite Whetting: Digital Humanities
Video · Appetite Whetting: Bibliometrics
Video · Appetite Whetting: Food, Music, Public Health
Video · Appetite Whetting: Public Health cont'd, Earthquakes, Legal
Video · Characterizing Data Science
Video · Characterizing Data Science, cont'd
Video · Distinguishing Data Science from Related Topics
Video · Four Dimensions of Data Science
Video · Tools vs. Abstractions
Video · Desktop Scale vs. Cloud Scale
Video · Hackers vs. Analysts
Video · Structs vs. Stats
Video · Structs vs. Stats cont'd
Video · A Fourth Paradigm of Science
Video · Data-Intensive Science Examples
Video · Big Data and the 3 Vs
Video · Big Data Definitions
Video · Big Data Sources
Reading · Supplementary: Three-Course Reading List
Reading · Supplementary: Resources for Learning Python
Video · Course Logistics
Reading · Supplementary: Class Virtual Machine
Reading · Supplementary: Github Instructions
Video · Twitter Assignment: Getting Started
Programming Assignment · Twitter Sentiment Analysis
Week 2: Relational Databases and the Relational Algebra
Relational databases are the workhorse of large-scale data management. Although originally motivated by problems in enterprise operations, they have proven remarkably capable for analytics as well. Most importantly, the principles underlying relational databases are universal in managing, manipulating, and analyzing data at scale. Even as the landscape of large-scale data systems has expanded dramatically in the last decade, relational models and languages have remained a unifying concept. For working with large-scale data, there is no more important programming model to learn.
Video · Data Models, Terminology
Video · From Data Models to Databases
Video · Pre-Relational Databases
Video · Motivating Relational Databases
Video · Relational Databases: Key Ideas
Video · Algebraic Optimization Overview
Video · Relational Algebra Overview
Video · Relational Algebra Operators: Union, Difference, Selection
Video · Relational Algebra Operators: Projection, Cross Product
Video · Relational Algebra Operators: Cross Product cont'd, Join
Video · Relational Algebra Operators: Outer Join
Video · Relational Algebra Operators: Theta-Join
Video · From SQL to RA
Video · Thinking in RA: Logical Query Plans
Video · Practical SQL: Binning Timeseries
Video · Practical SQL: Genomic Intervals
Video · User-Defined Functions
Video · Support for User-Defined Functions
Video · Optimization: Physical Query Plans
Video · Optimization: Choosing Physical Plans
Video · Declarative Languages
Video · Declarative Languages: More Examples
Video · Views: Logical Data Independence
Video · Indexes
Programming Assignment · SQL for Data Science Assignment
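As a rough illustration of the relational algebra operators named in this week's videos, the sketch below implements selection, projection, and an equi-join over relations modeled as lists of dictionaries. The representation, the function names, and the sample data are all assumptions made for this example; they are not taken from the course materials.

```python
# Relations are modeled as lists of dicts (a hypothetical toy representation).

def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def join(left, right, attribute):
    """Equi-join on one shared attribute, using a hash index on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[attribute], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[attribute], [])]

# Hypothetical example data.
authors = [
    {"author_id": 1, "name": "Ada"},
    {"author_id": 2, "name": "Grace"},
]
papers = [
    {"author_id": 1, "title": "Notes", "year": 1843},
    {"author_id": 2, "title": "Compilers", "year": 1952},
    {"author_id": 2, "title": "COBOL", "year": 1959},
]

# Roughly: SELECT DISTINCT name FROM authors JOIN papers USING (author_id)
#          WHERE year > 1900
recent = select(join(authors, papers, "author_id"), lambda row: row["year"] > 1900)
names = project(recent, ["name"])
```

Composing the three functions mirrors a logical query plan: the join runs first, the selection filters its output, and the projection trims the result to the requested attribute.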
Week 3: MapReduce and Parallel Dataflow Programming
The MapReduce programming model (as distinct from its implementations) was proposed as a simplifying abstraction for parallel manipulation of massive datasets, and remains an important concept to know when using and evaluating modern big data platforms.
Video · What Does Scalable Mean?
Video · A Sketch of Algorithmic Complexity
Video · A Sketch of Data-Parallel Algorithms
Video · "Pleasingly Parallel" Algorithms
Video · More General Distributed Algorithms
Video · MapReduce Abstraction
Video · MapReduce Data Model
Video · Map and Reduce Functions
Video · MapReduce Simple Example
Video · MapReduce Simple Example cont'd
Video · MapReduce Example: Word Length Histogram
Video · MapReduce Examples: Inverted Index, Join
Video · Relational Join: Map Phase
Video · Relational Join: Reduce Phase
Video · Simple Social Network Analysis: Counting Friends
Video · Matrix Multiply Overview
Video · Matrix Multiply Illustrated
Video · Shared Nothing Computing
Video · MapReduce Implementation
Video · MapReduce Phases
Video · A Design Space for Large-Scale Data Systems
Video · Parallel and Distributed Query Processing
Video · Teradata Example, MR Extensions
Video · RDBMS vs. MapReduce: Features
Video · RDBMS vs. Hadoop: Grep
Video · RDBMS vs. Hadoop: Select, Aggregate, Join
Programming Assignment · Thinking in MapReduce
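To make the MapReduce programming model concrete, here is a minimal single-process sketch of word counting. The `run_mapreduce` driver is a hypothetical stand-in for the shuffle phase of a real framework such as Hadoop; it only imitates the model in memory and is not how any production system is implemented.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy driver: apply the mapper, group values by key, apply the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # in-memory "shuffle"
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output

# Hypothetical input: (document id, document text) pairs.
documents = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
word_counts = dict(run_mapreduce(documents, map_fn, reduce_fn))
```

Because the mapper looks at one record at a time and the reducer looks at one key at a time, both phases could in principle be spread across many machines without changing either function.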
Week 4: NoSQL: Systems and Concepts
NoSQL systems are purely about scale rather than analytics, and are arguably less relevant for the practicing data scientist. However, they occupy an important place in many practical big data platform architectures, and data scientists need to understand their limitations and strengths to use them effectively.
Week 5: Graph Analytics
Graph-structured data are increasingly common in data science contexts due to their ubiquity in modeling the communication between entities: people (social networks), computers (Internet communication), cities and countries (transportation networks), or corporations (financial transactions). Learn the common algorithms for extracting information from graph data and how to scale them up.
Video · Graph Overview
Video · Structural Analysis
Video · Degree Histograms, Structure of the Web
Video · Connectivity and Centrality
Video · PageRank
Video · PageRank in more Detail
Video · Traversal Tasks: Spanning Trees and Circuits
Video · Traversal Tasks: Maximum Flow
Video · Pattern Matching
Video · Querying Edge Tables
Video · Relational Algebra and Datalog for Graphs
Video · Querying Hybrid Graph/Relational Data
Video · Graph Query Example: NSA
Video · Graph Query Example: Recursion
Video · Evaluation of Recursive Programs
Video · Recursive Queries in MapReduce
Video · The End-Game Problem
Video · Representation: Edge Table, Adjacency List
Video · Representation: Adjacency Matrix
Video · PageRank in MapReduce
Video · PageRank in Pregel
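One of the algorithms listed above, PageRank, can be sketched as a simple power iteration over an edge table. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without out-links, and the sample graph are all assumptions chosen for this illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical edge table: a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
```

Each iteration touches every edge once, which is why the computation maps naturally onto an edge-table representation.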
Practical Predictive Analytics: Models and Methods
Commitment: 4 weeks of study, 6-8 hours/week
Subtitles: English
About the Course
Statistical experiment design and analytics are at the heart of data science. In this course you will design statistical experiments and analyze the results using modern methods. You will also explore the common pitfalls in interpreting statistical arguments, especially those associated with big data. Collectively, this course will help you internalize a core set of practical and effective machine learning methods and concepts, and apply them to solve some real world problems.
Learning Goals: After completing this course, you will be able to:
1. Design effective experiments and analyze the results.
2. Use resampling methods to make clear and bulletproof statistical arguments without invoking esoteric notation.
3. Explain and apply a core set of classification methods of increasing complexity (rules, trees, random forests), and associated optimization methods (gradient descent and variants).
4. Explain and apply a set of unsupervised learning concepts and methods.
5. Describe the common idioms of large-scale graph analytics, including structural query, traversals and recursive queries, PageRank, and community detection.
Week 1: Practical Statistical Inference
Learn the basics of statistical inference, comparing classical methods with resampling methods that allow you to use a simple program to make a rigorous statistical argument. Motivate your study with current topics at the foundations of science: publication bias and reproducibility.
Video · Appetite Whetting: Bad Science
Video · Hypothesis Testing
Video · Significance Tests and P-Values
Video · Example: Difference of Means
Video · Deriving the Sampling Distribution
Video · Shuffle Test for Significance
Video · Comparing Classical and Resampling Methods
Video · Bootstrap
Video · Resampling Caveats
Video · Outliers and Rank Transformation
Video · Example: Chi-Squared Test
Video · Bad Science Revisited: Publication Bias
Video · Effect Size
Video · Meta-analysis
Video · Fraud and Benford's Law
Video · Intuition for Benford's Law
Video · Benford's Law Explained Visually
Video · Multiple Hypothesis Testing: Bonferroni and Sidak Corrections
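The idea that a simple program can make a statistical argument can be illustrated with a shuffle (permutation) test for a difference of means. This is a minimal sketch: the function name, the trial count, the fixed seed, and the sample measurements are assumptions for the example, not details from the course.

```python
import random

def mean(values):
    return sum(values) / len(values)

def shuffle_test(group_a, group_b, trials=2000, seed=0):
    """Estimate a two-sided p-value for the difference of means by shuffling labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# Hypothetical measurements for a control and a treatment group.
control = [1, 2, 3, 4, 5]
treatment = [101, 102, 103, 104, 105]
p_value = shuffle_test(control, treatment)
```

The p-value is the fraction of random relabelings that produce a difference at least as large as the observed one, so no sampling distribution has to be derived by hand.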
Week 2: Supervised Learning
Follow a tour through the important methods, algorithms, and techniques in machine learning. You will learn how these methods build upon each other and can be combined into practical algorithms that perform well on a variety of tasks. Learn how to evaluate machine learning methods and the pitfalls to avoid.
Video · Statistics vs. Machine Learning
Video · Simple Examples
Video · Structure of a Machine Learning Problem
Video · Classification with Simple Rules
Video · Learning Rules
Video · Rules: Sequential Covering
Video · Rules Recap
Video · From Rules to Trees
Video · Entropy
Video · Measuring Entropy
Video · Using Information Gain to Build Trees
Video · Building Trees: ID3 Algorithm
Video · Building Trees: C4.5 Algorithm
Video · Rules and Trees Recap
Video · Overfitting
Video · Evaluation: Leave One Out Cross Validation
Video · Evaluation: Accuracy and ROC Curves
Video · Bootstrap Revisited
Video · Ensembles, Bagging, Boosting
Video · Boosting Walkthrough
Video · Random Forests
Video · Random Forests: Variable Importance
Video · Summary: Trees and Forests
Video · Nearest Neighbor
Video · Nearest Neighbor: Similarity Functions
Video · Nearest Neighbor: Curse of Dimensionality
Reading · R Assignment: Classification of Ocean Microbes
Quiz · R Assignment: Classification of Ocean Microbes
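The entropy and information-gain topics listed above reduce to a short calculation. The sketch below assumes training rows are dictionaries of categorical attributes with a parallel list of class labels; the attribute names and the data are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical training data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]
gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
```

A tree builder in the style of ID3 would choose the attribute with the highest gain at each node, which here would be `outlook`.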
Week 3: Optimization
You will learn how to optimize a cost function using gradient descent, including popular variants that use randomization and parallelization to improve performance. You will gain an intuition for popular methods used in practice and see how similar they are fundamentally.
Video · Optimization by Gradient Descent
Video · Gradient Descent Visually
Video · Gradient Descent in Detail
Video · Gradient Descent: Questions to Consider
Video · Intuition for Logistic Regression
Video · Intuition for Support Vector Machines
Video · Support Vector Machine Example
Video · Intuition for Regularization
Video · Intuition for LASSO and Ridge Regression
Video · Stochastic and Batched Gradient Descent
Video · Parallelizing Gradient Descent
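As a small worked example of optimizing a cost function by gradient descent, the sketch below fits a line y = w*x + b by minimizing mean squared error with full-batch updates. The learning rate, the step count, and the data are assumptions picked so that this toy problem converges; they are not recommendations from the course.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = gradient_descent(xs, ys)
```

A stochastic variant would compute the gradient from one randomly chosen point (or a small batch) per step instead of the whole dataset, trading noisier steps for cheaper ones.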
Week 4: Unsupervised Learning
A brief tour of selected unsupervised learning methods and an opportunity to apply techniques in practice on a real world problem.
Video · Introduction to Unsupervised Learning
Video · K-means
Video · DBSCAN
Video · DBSCAN Variable Density and Parallel Algorithms
Peer Review · Kaggle Competition Peer Review
Communicating Data Science Results
Subtitles: English
About the Course
Important note: The second assignment in this course covers the topic of Graph Analysis in the Cloud, in which you will use Elastic MapReduce and the Pig language to perform graph analysis over a moderately large dataset, about 600GB. In order to complete this assignment, you will need to make use of Amazon Web Services (AWS). Amazon has generously offered to provide up to $50 in free AWS credit to each learner in this course to allow you to complete the assignment. Further details regarding the process of receiving this credit are available in the welcome message for the course, as well as in the assignment itself. Please note that Amazon, University of Washington, and Coursera cannot reimburse you for any charges if you exhaust your credit.
While we believe that this assignment contributes an excellent learning experience in this course, we understand that some learners may be unable or unwilling to use AWS. We are unable to issue Course Certificates for learners who do not complete the assignment that requires use of AWS. As such, you should not pay for a Course Certificate in Communicating Data Science Results if you are unable or unwilling to use AWS, as you will not be able to successfully complete the course without doing so.
Making predictions is not enough! Effective data scientists know how to explain and interpret their results, and communicate findings accurately to stakeholders to inform business decisions. Visualization is the field of research in computer science that studies effective communication of quantitative results by linking perception, cognition, and algorithms to exploit the enormous bandwidth of the human visual cortex. In this course you will learn to recognize, design, and use effective visualizations.
Just because you can make a prediction and convince others to act on it doesn’t mean you should. In this course you will explore the ethical considerations around big data and how these considerations are beginning to influence policy and practice. You will learn the foundational limitations of using technology to protect privacy and the codes of conduct emerging to guide the behavior of data scientists. You will also learn the importance of reproducibility in data science and how the commercial cloud can help support reproducible research even for experiments involving massive datasets, complex computational infrastructures, or both.
Learning Goals: After completing this course, you will be able to:
1. Design and critique visualizations.
2. Explain the state of the art in privacy, ethics, and governance around big data and data science.
3. Use cloud computing to analyze large datasets in a reproducible way.
Week 1: Visualization
Statistical inferences from large, heterogeneous, and noisy datasets are useless if you can't communicate them to your colleagues, your customers, your management and other stakeholders. Learn the fundamental concepts behind information visualization, an increasingly critical field of research and increasingly important skillset for data scientists. This module is taught by Cecilia Aragon, faculty in the Human Centered Design and Engineering Department.
Video · 01 Introduction: What and Why
Video · 02 Introduction: Motivating Examples
Video · 03 Data Types: Definitions
Video · 04 Mapping Data Types to Visual Attributes
Video · 05 Data Types Exercise
Video · 06 Data Types and Visual Mappings Exercises
Video · 07 Data Dimensions
Video · 08 Effective Visual Encoding
Video · 09 Effective Visual Encoding Exercise
Video · 10 Design Criteria for Visual Encoding
Video · 11 The Eye is not a Camera
Video · 12 Preattentive Processing
Video · 13 Estimating Magnitude
Video · 14 Evaluating Visualizations
Peer Review · Crime Analytics: Visualization of Incident Reports
Week 2: Privacy and Ethics
Big Data has become closely linked to issues of privacy and ethics: As the limits on what we *can* do with data continue to evaporate, the question of what we *should* do with data becomes paramount. Motivated in the context of case studies, you will learn the core principles of codes of conduct for data science and statistical analysis. You will learn the limits of current theory on protecting privacy while still permitting useful statistical analysis.
Video · Motivation: Barrow Alcohol Study
Video · Barrow Study Problems
Video · Reifying Ethics: Codes of Conduct
Video · ASA Code of Conduct: Responsibilities to Stakeholders
Video · Other Codes of Conduct
Video · Examples of Codified Rules: HIPAA
Video · Privacy Guarantees: First Attempts
Video · Examples of Privacy Leaks
Video · Formalizing the Privacy Problem
Video · Differential Privacy Defined
Video · Global Sensitivity
Video · Laplacian Noise
Video · Adding Laplacian Noise and Proving Differential Privacy
Video · Weaknesses of Differential Privacy
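The Laplacian-noise topic above can be illustrated with a counting query. This is only a sketch of the standard Laplace mechanism under stated assumptions: a count has global sensitivity 1, the noise scale is sensitivity divided by epsilon, and the function names and data are hypothetical. It is not production-grade privacy code.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy the predicate."""
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data: ages of survey respondents.
ages = [23, 35, 41, 29, 52, 67, 38, 45]
rng = random.Random(42)  # seeded only so the sketch is repeatable
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
```

Smaller values of epsilon mean a larger noise scale, so stronger privacy comes at the cost of a less accurate answer.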
Week 3: Reproducibility and Cloud Computing
Science is facing a credibility crisis due to unreliable reproducibility, and as research becomes increasingly computational, the problem seems to be paradoxically getting worse. But reproducibility is not just for academics: Data scientists who cannot share, explain, and defend their methods for others to build on are dangerous. In this module, you will explore the importance of reproducible research and how cloud computing is offering new mechanisms for sharing code, data, environments, and even costs that are critical for practical reproducibility.
Video · Reproducibility and Data Science
Video · Reproducibility Gold Standard
Video · Anecdote: The Ocean Appliance
Video · Code + Data + Environment
Video · Cloud Computing Introduction
Video · Cloud Computing History
Video · Code + Data + Environment + Platform
Video · Cloud Computing for Reproducible Research
Video · Advantages of Virtualization for Reproducibility
Video · Complex Virtualization Scenarios
Video · Shared Laboratories
Video · Economies of Scale
Video · Provisioning for Peak Load
Video · Elasticity and Price Reductions
Video · Server Costs vs. Power Costs
Video · Reproducibility for Big Data
Video · Counter-Arguments and Summary
Practice Quiz · AWS Credit Opt-in Consent Form
Programming Assignment · Graph Analysis in the Cloud
Data Science at Scale - Capstone Project
Upcoming Session: Jan 15
Commitment: 6 weeks of study, 3-4 hours/week
Subtitles: English
About the Course
In the capstone, students will engage in a real-world project requiring them to apply skills from the entire data science pipeline: preparing, organizing, and transforming data, constructing a model, and evaluating results. Through a collaboration with Coursolve, each Capstone project is associated with partner stakeholders who have a vested interest in your results and are eager to deploy them in practice. These projects will not be straightforward and the outcome is not prescribed -- you will need to tolerate ambiguity and negative results! But we believe the experience will be rewarding and will better prepare you for data science projects in practice.
Week 1: Project A: Blight Fight
In this project, you will build a model to predict when a building is likely to be condemned. The data is real, the problem is real, and the impact is real.
Reading · Get the Data
Reading · Understand the Domain
Other · Milestone: Discuss the Problem and Approaches
Week 2: Derive a list of buildings
You are given sets of incidents with location information; you need to use some assumptions to group these incidents by location to identify specific buildings.
Reading · Milestone: Create a list of "buildings" from a list of geo-located incidents
Practice Peer Review · Reflecting on defining "buildings"
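One possible way to turn geo-located incidents into "buildings" is to snap coordinates to a coarse grid and treat each occupied cell as one building. The sketch below assumes incidents are dictionaries with hypothetical `id`, `lat`, and `lon` fields, and that rounding to four decimal places (cells of roughly 10 m) is an acceptable grouping rule. Both are illustrative assumptions, not the capstone's prescribed method.

```python
from collections import defaultdict

def group_incidents(incidents, precision=4):
    """Group incident ids by their (lat, lon) rounded to `precision` decimals."""
    buildings = defaultdict(list)
    for incident in incidents:
        key = (round(incident["lat"], precision), round(incident["lon"], precision))
        buildings[key].append(incident["id"])
    return dict(buildings)

# Hypothetical incidents: "a" and "b" are a few meters apart, "c" is far away.
incidents = [
    {"id": "a", "lat": 42.33141, "lon": -83.04581},
    {"id": "b", "lat": 42.33143, "lon": -83.04583},
    {"id": "c", "lat": 42.34000, "lon": -83.05000},
]
buildings = group_incidents(incidents)
```

Grid rounding is cheap but can split one building across a cell boundary; a distance-based clustering method would be a natural refinement if that becomes a problem.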
Week 3: Construct a training dataset
Construct a training set by associating each of your buildings with a ground truth label derived from the permit data.
Reading · Milestone: Derive labels for each building
Practice Peer Review · Reflecting on the labeling scheme
Week 4: Train and evaluate a simple model
Use a trivial feature set to train and evaluate a simple model.
Reading · Milestone: Train a Simple Model
Practice Peer Review · Reflecting on a trivial initial model
Week 5: Feature Engineering
Derive additional features and retrain to improve the efficacy of your model.
Reading · Milestone: Adding more features
Practice Peer Review · Reflection on your proposed features.
Week 6: Final Report
Enter your final report for grading.
Peer Review · Final Report