
Systems for Data Science Marco Serafini COMPSCI 532 Lecture 1


Page 1: Systems for Data Science - Marco Serafini

Systems for Data Science
Marco Serafini

COMPSCI 532, Lecture 1

Page 2:

2

Course Structure
• Fundamentals you need to know about systems
  • Caching, virtual memory, concurrency, etc.
• Review of several "big data" systems
  • Learn how they work
  • Principles of systems design: why systems are designed that way
• Hands-on experience
• No electronic devices during classes (not even in airplane mode)

Page 3:

3

Course Assignments
• Reading research papers
• 2-3 projects
  • Coding assignments
• Midterm + final exam

http://marcoserafini.github.io/teaching/systems-for-data-science/fall19/

Page 4:

4

Course Grades
• Midterm exam: 20%
• Final exam: 30%
• Projects: 50%

Page 5:

5

Questions
• Teaching Assistant
  • Nathan Ng <[email protected]>
  • Office hours: Tuesday 4:30-5:30 PM @ CS 207
• Piazza website
  • https://piazza.com/umass/fall2019/compsci532/home
  • Ask questions there rather than emailing me or Nathan
• Credit if you are active
  • Well-thought-out questions and answers: be curious (but don't just show off)
  • I will never penalize you for saying or asking something wrong

Page 6:

6

Projects
• Groups of two people
• See course website for details
• High-level discussions with other colleagues: OK
  • "What are the requirements of the project?"
• Low-level discussions with other colleagues: OK
  • "How do threads work in Java?"
• Mid-level discussions: not OK
  • "How do we design a solution for the project?"
• Project delivery includes a short oral exam

Page 7:

What are “systems for data science”?

Page 8:

Systems + Data Science
• Data science research
  • New algorithms
  • New applications of existing algorithms
  • Validation: take a small representative dataset, show accuracy
• Systems research
  • Run these algorithms efficiently
  • Scale them to larger datasets
  • End-to-end pipelines
  • Applications of ML to system design and software engineering (seminar next Spring!)
  • Validation: build a prototype that others can use
• These are ends of a spectrum

Page 9:

Overview
• What types of systems will we target?
  • Storage systems
  • Data processing systems
  • Cloud analytics
  • Systems for machine learning
• Goal: hide the complexity of the underlying hardware
  • Parallelism: multi-core, distributed systems
  • Fault tolerance: hardware fails
• Focus on scalable systems
  • Scale to large datasets
  • Scale to computationally complex problems

Page 10:

Transactional vs. Analytical Systems
• Transactional data management system
  • Real-time response
  • Concurrency
  • Updates
• Analytical data management system
  • Non-real-time responses
  • No concurrency
  • Read-only
• These are ends of a spectrum

Page 11:

Example: Search Engine
• Crawlers: download the Web
• Hadoop file system (HDFS): store the Web
• MapReduce: run massively parallel indexing
• Key-value store: store the index
• Front-end
  • Serve client requests
  • Ranking → this is the actual data science
• Q: Scalability issues?
• Q: Which components are transactional / analytical?
• Q: Where are storage / data processing / cloud / ML involved?

Page 12:

Design goals

Page 13:

13

Ease of Use
• Good APIs / abstractions are key in a system
• High-level API
  • Easier to use, better productivity, safer code
  • It makes some implementation choices for you
  • These choices are based on assumptions about the use cases
  • Are these choices really what you need?
• Low-level API
  • Harder to use, lower productivity, less safe code
  • More flexible
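To make the trade-off concrete, here is a small illustrative sketch (not from the slides) that sums a list two ways in Python: a high-level executor API that makes the chunking and scheduling choices for us, and a low-level thread API where we manage workers and shared state ourselves.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def high_level_sum(data, workers=4):
    # High-level API: the executor schedules the chunks for us.
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(sum, chunks))

def low_level_sum(data, workers=4):
    # Low-level API: we create threads and guard shared state explicitly.
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # explicit coordination: more flexible, easier to get wrong
            total += partial

    threads = [threading.Thread(target=work, args=(data[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Both return the same result; the high-level version is shorter and harder to get wrong, while the low-level version exposes the threads and locks in case we need scheduling decisions the executor would otherwise make for us.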

Page 14:

14

Scalability

[Figure: speedup vs. parallelism; the ideal curve is linear, the real curve flattens out]

• Ideal world
  • Linear scalability
• Reality
  • Bottlenecks
  • For example: a central coordinator
• When do we stop scaling?
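The flattening curve can be quantified with Amdahl's law (my addition, not on the slide): if a fraction s of the work is inherently serial, e.g. a central coordinator, the speedup is capped at 1/s no matter how many workers we add.

```python
def amdahl_speedup(serial_fraction, workers):
    # Amdahl's law: the serial fraction is never sped up,
    # so it bounds the achievable speedup at 1 / serial_fraction.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# With 5% serial work (e.g., a central coordinator):
#   10 workers  -> ~6.9x speedup
#   100 workers -> ~16.8x speedup
#   the cap is 1 / 0.05 = 20x, however many workers we add
```

This is one way to answer "when do we stop scaling?": once the marginal speedup per added worker drops below what the worker costs.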

Page 15:

15

Latency vs. Throughput

[Figure: latency vs. throughput as load grows from 1x to 100x requests; latency stays flat until throughput reaches its maximum, then rises sharply]

• Pipe metaphor
  • The system is a pipe
  • Requests are small marbles
• Low load
  • Minimal latency
• Increased load (e.g., 2x)
  • Higher throughput
  • Latency stable
• High load
  • Saturation: no more throughput
  • Latency skyrockets
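One standard way to model this hockey-stick latency curve (my choice of model, not from the slide) is an M/M/1 queue: latency stays near the bare service time at low load and diverges as the arrival rate approaches the service rate.

```python
def mm1_latency(arrival_rate, service_rate):
    # Mean time in an M/M/1 system: 1 / (mu - lambda).
    # Near saturation (arrival_rate -> service_rate) latency blows up.
    if arrival_rate >= service_rate:
        raise ValueError("system is saturated: latency is unbounded")
    return 1.0 / (service_rate - arrival_rate)

# service_rate = 100 req/s  =>  bare service time 10 ms
# at 10 req/s load: ~11 ms latency
# at 90 req/s load: 100 ms
# at 99 req/s load: 1 second
```

Real systems are not M/M/1 queues, but the qualitative shape, flat latency until the pipe fills, then a sharp rise, matches the figure.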

Page 16:

16

Fault Tolerance
• Assume that each machine in your system crashes every month
• If you run Python scripts on your laptop, that's fine
• But imagine you run a cluster
  • 10 nodes = a crash every 3 days
  • 100 nodes = a crash every seven hours
  • 1000 nodes = a crash every 50 minutes
• Some computations run for more than one hour
  • Cannot simply restart when something goes wrong
  • Even when restarting, we need to keep metadata safe
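The crash intervals above follow from treating node failures as independent: with n nodes, failures arrive roughly n times as often as on a single node. A quick sanity check, assuming a 30-day month:

```python
def cluster_time_between_crashes_hours(node_mtbf_hours, n_nodes):
    # With n independent nodes, the cluster sees failures
    # roughly n times as often as a single node does.
    return node_mtbf_hours / n_nodes

MONTH_HOURS = 30 * 24  # 720 hours

# 10 nodes   -> 72 hours, i.e. 3 days
# 100 nodes  -> 7.2 hours, i.e. about seven hours
# 1000 nodes -> 0.72 hours, i.e. ~43 minutes (roughly the "50 minutes" quoted above)
```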

Page 17:

17

Why do we need parallelism?

Page 18:

Maximum Clock Rate is Stagnating

Source: https://queue.acm.org/detail.cfm?id=2181798

Two major "laws" are collapsing
• Moore's law
• Dennard scaling

Page 19:

Moore's Law
• "The density of transistors in an integrated circuit doubles every two years." Smaller transistors → signals propagate faster

So far so good, but the trend is slowing down and it won't last much longer (Intel's prediction: until 2021, unless new technologies arise) [1]

[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

[Figure: transistor counts over time; note the exponential axis]

Page 20:

Dennard Scaling
• "Reducing transistor size does not increase power density → power consumption is proportional to chip area"
• Stopped holding around 2006
  • Assumptions break when a physical system is close to its limits
• The post-Dennard-scaling world of today
  • Huge cooling and power consumption issues
  • If clock frequency trends had continued, today a CPU would have the power density of a nuclear reactor

Page 21:

Heat Dissipation Problem
• Large datacenters consume energy like large cities
• Cooling is the main cost factor

[Photos: Google datacenter @ Columbia River valley (2006); Facebook datacenter @ Luleå (2015)]

Page 22:

Where is Luleå?

Page 23:

Single-Core Solutions
• Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g., Intel's Turbo Boost
  • Only works under low load
• Use part of the chip for coprocessors (e.g., graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload

Page 24:

Multi-Core Processors

[Figure: several processor chips, each containing multiple cores, attached via sockets to the motherboard and sharing main memory]

Page 25:

Multi-Core Processors
• Idea: scale computational power linearly
  • Instead of a single 5 GHz core, 2 × 2.5 GHz cores
• Scale heat dissipation linearly
  • k cores have ~k times the heat dissipation of a single core
  • Increasing the frequency of a single core by k times increases heat dissipation superlinearly
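The superlinear claim follows from the usual dynamic-power model P ≈ C·V²·f, where the supply voltage V must itself rise roughly with frequency f, so per-core power grows roughly as f³. A rough sketch (the cubic exponent is a simplifying assumption, not an exact law):

```python
def relative_power(freq_ratio, cores=1):
    # Dynamic power ~ C * V^2 * f; with V scaling roughly with f,
    # per-core power grows ~ freq_ratio ** 3. Cores add power linearly.
    return cores * freq_ratio ** 3

# Relative to one 2.5 GHz core:
one_fast = relative_power(2.0)           # one 5 GHz core: ~8x the power
two_slow = relative_power(1.0, cores=2)  # two 2.5 GHz cores: 2x the power
```

Under this model, two 2.5 GHz cores deliver the same aggregate instruction rate as one 5 GHz core at roughly a quarter of the power, which is the whole argument for multi-core.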

Page 26:

How to Leverage Multi-Cores
• Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
• E.g., PCs run many parallel background apps
  • OS, music, antivirus, web browser, …
• Parallelizing a single app is not trivial
• Embarrassingly parallel tasks
  • Can be run by multiple threads
  • No coordination needed
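A minimal sketch of an embarrassingly parallel task in Python: each input is processed independently, so the worker pool needs no coordination beyond scattering inputs and gathering results.

```python
from multiprocessing import Pool

def square(x):
    # Each task depends only on its own input: no shared state,
    # no locks, no communication between workers.
    return x * x

def parallel_squares(n, workers=4):
    with Pool(processes=workers) as pool:
        return pool.map(square, range(n))

if __name__ == "__main__":
    print(parallel_squares(8))
```

Because the tasks share nothing, adding cores speeds this up almost linearly; once tasks need to exchange data, coordination costs start eating into that speedup.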

Page 27:

Memory Bandwidth Bottleneck
• Cores compete for the same main memory bus
• Solution: caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention

Page 28:

SIMD Processors
• Single Instruction Multiple Data (SIMD) processors
• Examples
  • Graphics Processing Units (GPUs)
  • Intel Phi coprocessors
• Q: Possible SIMD snippets

for i in [0, n-1] do
    v[i] = v[i] * pi

for i in [0, n-1] do
    if v[i] < 0.01 then
        v[i] = 0
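The two snippets map directly onto vectorized array operations. In Python, NumPy expresses them as whole-array operations that the underlying library can execute with SIMD instructions (an illustrative equivalent, not from the slides):

```python
import numpy as np

# Snippet 1: v[i] = v[i] * pi for every i, as one vectorized multiply.
v = np.array([1.0, 2.0, 3.0])
v = v * np.pi

# Snippet 2: the per-element conditional becomes a boolean mask.
w = np.array([0.005, 0.02, 0.5])
w[w < 0.01] = 0.0  # zeroes only the elements below the threshold
```

Note that the mask replaces the branch: this matters on SIMD hardware, where all lanes execute the same instruction and per-lane branching is expensive.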

Page 29:

Other Approaches
• SIMD
  • Single Instruction Multiple Data
  • A massive number of simpler cores
• FPGAs
  • Reconfigurable hardware that can be specialized for a specific task

Page 30:

Automatic Parallelization?
• The holy grail of the multi-processor era
• Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms

Page 31:

Homework

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page

Computer Science Department, Stanford University, Stanford, CA 94305, USA

[email protected] and [email protected]

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Keywords World Wide Web, Search Engines, Information Retrieval, PageRank, Google

1. Introduction

(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search