Downloading a Billion Files in Python
A case study in multi-threading, multi-processing, and asyncio
James Saryerwinnie
@jsaryer
Our Task
There is a remote server that stores files
The files can be accessed through a REST API
Our task is to download all the files on the remote server to our client machine
Our Task (the details)
What client machine will this run on?
We have one machine we can use, 16 cores, 64GB memory
What about the network between the client and server?
Our client machine is on the same network as the service with remote files
How many files are on the remote server?
Approximately one billion files, 100 bytes per file
When do you need this done?
Please have this done as soon as possible
File Server REST API
GET /list
GET /list?next-marker={token}
{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}
{"FileNames": ["file1", "file2", ...]} (last page: no NextMarker)
GET /get/{filename}
(File blob content)
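The pagination contract above can be sketched as a small generator. This is only an illustration under stated assumptions: `fetch_page` is a hypothetical callable that returns the parsed JSON for a given marker (in a real client it would wrap an HTTP GET against `/list`), and `FAKE_PAGES` is made-up data standing in for the server so the sketch runs offline.

```python
def iter_filenames(fetch_page):
    """Yield every filename, following NextMarker until the last page."""
    marker = None  # None means "first page": plain GET /list
    while True:
        page = fetch_page(marker)
        yield from page['FileNames']
        # The last page carries no NextMarker, which ends the loop.
        marker = page.get('NextMarker')
        if marker is None:
            break


# Hypothetical three-page listing that stands in for the server.
FAKE_PAGES = {
    None: {'FileNames': ['file1', 'file2'], 'NextMarker': 'a'},
    'a': {'FileNames': ['file3'], 'NextMarker': 'b'},
    'b': {'FileNames': ['file4']},
}


def fake_fetch_page(marker):
    # Stand-in for GET /list (marker is None) or GET /list?next-marker=...
    return FAKE_PAGES[marker]


all_names = list(iter_filenames(fake_fetch_page))
print(all_names)
```

Each page can only be requested once the previous page has returned its marker, so the listing step is inherently sequential.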
Caveats
This is a simplified case study.
The results shown here don't necessarily generalize.
Not an apples-to-apples comparison; each approach does things slightly differently
Always profile and test for yourself
Sometimes concrete examples can be helpful
Synchronous Version
Simplest thing that could possibly work.
import json
import os

import requests


def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'

    # Fetch the first page of filenames.
    response = requests.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        # Download every file on the current page, one at a time.
        for filename in content['FileNames']:
            remote_url = f'{get_url}/{filename}'
            download_file(remote_url, os.path.join(outdir, filename))
        # The last page has no NextMarker.
        if 'NextMarker' not in content:
            break
        response = requests.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)
def download_file(remote_url, local_filename):
    response = requests.get(remote_url)
    response.raise_for_status()
    with open(local_filename, 'wb') as f:
        f.write(response.content)
Synchronous Results
One request: 0.003 seconds
One billion requests: 3,000,000 seconds
833.3 hours
34.7 days
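The projection above is plain arithmetic: multiply the measured time per request by the number of requests, then convert units. A minimal sketch, where the function name `estimate_duration` is only illustrative:

```python
def estimate_duration(seconds_per_request, num_requests):
    """Project total run time from a measured per-request time."""
    total_seconds = seconds_per_request * num_requests
    hours = total_seconds / 3600
    days = hours / 24
    return total_seconds, hours, days


# Figures from the synchronous run: 0.003 s per request, one billion files.
total_seconds, hours, days = estimate_duration(0.003, 1_000_000_000)
print(f'{total_seconds:,.0f} seconds, {hours:.1f} hours, {days:.1f} days')
```

The same function can be fed the per-request time of any other approach to compare projected totals.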
Multithreading
List Files can't be parallelized, but Get File can be.
One thread calls List Files and puts the filenames on a queue.Queue.
Worker threads (WorkerThread-1, WorkerThread-2, WorkerThread-3) pull filenames off the queue and download the files.
Workers put their results on a results queue.
A result thread prints progress, tracks overall results, failures, etc.
import queue
import threading


def download_files(host, port, outdir, num_threads):
    # ... same constants as before ...

    # Bounded queues keep the lister from running far ahead of the workers.
    work_queue = queue.Queue(MAX_SIZE)
    result_queue = queue.Queue(MAX_SIZE)

    threads = []
    for i in range(num_threads):
        t = threading.Thread(
            target=worker_thread, args=(work_queue, result_queue))
        t.start()
        threads.append(t)
    result_thread = threading.Thread(target=result_poller,
                                     args=(result_queue,))
    result_thread.start()
    threads.append(result_thread)

    # ...

    response = requests.get(list_url)
    # ... continuing inside download_files ...
    response = requests.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        for filename in content['FileNames']:
            remote_url = f'{get_url}/{filename}'
            outfile = os.path.join(outdir, filename)
            # Hand the download to a worker instead of doing it here.
            work_queue.put((remote_url, outfile))
        # The last page has no NextMarker.
        if 'NextMarker' not in content:
            break
        response = requests.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)
def worker_thread(work_queue, result_queue):
    while True:
        work = work_queue.get()
        if work is _SHUTDOWN:
            return
        remote_url, outfile = work
        download_file(remote_url, outfile)
        result_queue.put(_SUCCESS)
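The slides do not show how `_SHUTDOWN` and `_SUCCESS` are defined, what `result_poller` does, or how the threads are stopped. The sketch below fills those gaps with assumptions: both markers are plain sentinel objects, one shutdown sentinel is queued per worker, and the result thread only counts successes. `run_pipeline`, `handle_item`, and `fake_download` are hypothetical names, and the fake download replaces the HTTP call so the sketch runs offline.

```python
import queue
import threading

_SHUTDOWN = object()  # assumed sentinel: tells a thread to exit
_SUCCESS = object()   # assumed marker: one item finished


def worker_thread(work_queue, result_queue, handle_item):
    while True:
        work = work_queue.get()
        if work is _SHUTDOWN:
            return
        handle_item(work)
        result_queue.put(_SUCCESS)


def result_poller(result_queue, counts):
    # Assumed behavior: count successes until told to stop.
    while True:
        result = result_queue.get()
        if result is _SHUTDOWN:
            return
        counts['success'] += 1


def run_pipeline(items, handle_item, num_threads=4, max_size=100):
    work_queue = queue.Queue(max_size)
    result_queue = queue.Queue(max_size)
    counts = {'success': 0}

    workers = [
        threading.Thread(target=worker_thread,
                         args=(work_queue, result_queue, handle_item))
        for _ in range(num_threads)
    ]
    poller = threading.Thread(target=result_poller,
                              args=(result_queue, counts))
    for t in workers + [poller]:
        t.start()

    for item in items:
        work_queue.put(item)       # blocks while the queue is full
    for _ in workers:
        work_queue.put(_SHUTDOWN)  # one sentinel per worker
    for t in workers:
        t.join()
    result_queue.put(_SHUTDOWN)    # every result is already queued
    poller.join()
    return counts['success']


seen = []
seen_lock = threading.Lock()


def fake_download(item):
    # Stand-in for download_file: just record the item.
    with seen_lock:
        seen.append(item)


completed = run_pipeline(range(1000), fake_download)
print(completed)
```

Queuing the sentinels after the last work item lets each worker drain its share of the queue before exiting, and stopping the result thread only after the workers have joined ensures no result is dropped.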
Multithreaded Results - 10 threads
One request: 0.0036 seconds
One billion requests: 3,600,000 seconds
1000.0 hours
41.6 days
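The totals on the results slides follow from multiplying the measured per-request time by one billion requests. A small sketch of that arithmetic (the function name `estimate` is only for illustration):

```python
def estimate(seconds_per_request, total_requests=1_000_000_000):
    """Return (total seconds, hours, days) for a given per-request time."""
    total_seconds = seconds_per_request * total_requests
    hours = total_seconds / 3600
    days = hours / 24
    return total_seconds, hours, days


# Per-request time measured with 10 threads, taken from the slide.
total_seconds, hours, days = estimate(0.0036)
print(f"{total_seconds:,.0f} seconds, {hours:.1f} hours, {days:.1f} days")
```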
![Page 70: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/70.jpg)
Multithreaded Results - 100 threads
One request: 0.0042 seconds
One billion requests: 4,200,000 seconds
1166.67 hours
48.6 days
![Page 71: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/71.jpg)
Why?
Not necessarily IO bound due to low latency and small file size
GIL contention, overhead of passing data through queues
![Page 72: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/72.jpg)
Things to keep in mind
The real code is more complicated: Ctrl-C handling, graceful shutdown, etc.
Debugging is much harder and non-deterministic
The more you stray from stdlib abstractions, the more likely you are to encounter race conditions
Can't use concurrent.futures Executor.map() because of the large number of files; it consumes its input iterable up front
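One way to keep the executor abstraction without loading every filename up front is to cap the number of in-flight futures. This is a minimal sketch under stated assumptions, not the approach used in the talk: `bounded_map` and `process` are hypothetical names, and results come back in completion order rather than input order.

```python
from concurrent import futures
import itertools


def process(item):
    """Hypothetical stand-in for downloading one file."""
    return item * 2


def bounded_map(executor, fn, iterable, max_pending=100):
    """Yield results of fn(item) while keeping at most max_pending futures alive."""
    iterator = iter(iterable)
    # Prime the pool with the first max_pending items.
    pending = {executor.submit(fn, item)
               for item in itertools.islice(iterator, max_pending)}
    while pending:
        done, pending = futures.wait(
            pending, return_when=futures.FIRST_COMPLETED)
        for future in done:
            yield future.result()
        # Top up: submit one new item for every future that finished.
        for item in itertools.islice(iterator, len(done)):
            pending.add(executor.submit(fn, item))


if __name__ == "__main__":
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        total = sum(bounded_map(executor, process, range(1000), max_pending=10))
        print(total)
```

Because the input is only pulled as slots free up, the iterable can be a generator over a very large listing.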
![Page 73: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/73.jpg)
Multiprocessing
![Page 74: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/74.jpg)
Our Task (the details)
What client machine will this run on?
We have one machine we can use: 16 cores, 64 GB memory
What about the network between the client and server?
Our client machine is on the same network as the service with remote files
How many files are on the remote server?
Approximately one billion files, 100 bytes per file
When do you need this done?
Please have this done as soon as possible
![Page 75: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/75.jpg)
Multiprocessing
WorkerProcess-1
WorkerProcess-2
WorkerProcess-3
Download one page at a time in parallel across multiple processes
![Page 79: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/79.jpg)
    from concurrent import futures

    def download_files(host, port, outdir):
        hostname = f'http://{host}:{port}'
        list_url = f'{hostname}/list'
        all_pages = iter_all_pages(list_url)
        downloader = Downloader(host, port, outdir)
        with futures.ProcessPoolExecutor() as executor:
            for page in all_pages:
                future_to_filename = {}
                # Start parallel downloads
                for filename in page:
                    future = executor.submit(downloader.download, filename)
                    future_to_filename[future] = filename
                # Wait for downloads to finish
                for future in futures.as_completed(future_to_filename):
                    future.result()
![Page 83: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/83.jpg)
    def iter_all_pages(list_url):
        session = requests.Session()
        response = session.get(list_url)
        response.raise_for_status()
        content = json.loads(response.content)
        while True:
            yield content['FileNames']
            if 'NextFile' not in content:
                break
            response = session.get(
                f'{list_url}?next-marker={content["NextFile"]}')
            response.raise_for_status()
            content = json.loads(response.content)
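The pagination loop can be tried without a server by swapping the HTTP call for an in-memory function. This is a sketch only: `fake_list` is a hypothetical stand-in for `GET /list`, and the idea that `NextFile` names the first file of the following page is an assumption made for the fake data, not something the slides state.

```python
def fake_list(marker=None):
    """Hypothetical stand-in for GET /list?next-marker=...; serves three pages."""
    pages = {
        None: {"FileNames": ["a", "b"], "NextFile": "c"},
        "c": {"FileNames": ["c", "d"], "NextFile": "e"},
        "e": {"FileNames": ["e"]},  # last page: no NextFile key
    }
    return pages[marker]


def iter_all_pages(fetch_page):
    """Yield one list of filenames per page until no NextFile marker remains."""
    content = fetch_page()
    while True:
        yield content["FileNames"]
        if "NextFile" not in content:
            break
        content = fetch_page(content["NextFile"])


for page in iter_all_pages(fake_list):
    print(page)
```

Since this is a generator, only one page of filenames is held in memory at a time, which matters when the full listing is around a billion entries.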
![Page 84: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/84.jpg)
    class Downloader:
        # ...

        def download(self, filename):
            remote_url = f'{self.get_url}/{filename}'
            response = self.session.get(remote_url)
            response.raise_for_status()
            outfile = os.path.join(self.outdir, filename)
            with open(outfile, 'wb') as f:
                f.write(response.content)
![Page 88: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/88.jpg)
Multiprocessing Results - 16 processes
One request: 0.00032 seconds
One billion requests: 320,000 seconds
88.88 hours
3.7 days
![Page 89: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/89.jpg)
Things to keep in mind
Speed improvements due to truly running in parallel
Debugging is much harder, non-deterministic, pdb doesn't work out of the box
IPC overhead between processes is higher than between threads
Tradeoff between running everything in parallel vs. in parallel chunks (here, one page at a time)
![Page 90: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/90.jpg)
Asyncio
![Page 112: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/112.jpg)
Asyncio
Create an asyncio.Task for each file. This schedules the download to start right away, without waiting for earlier downloads to finish.
Move on to the next page and start creating tasks.
Meanwhile tasks from the first page will finish downloading their file.
All in a single process
All in a single thread
Switch tasks when waiting for IO
Should keep CPU busy
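The flow described on this slide can be sketched with `asyncio.sleep` standing in for network IO. The names `fake_download` and `download_pages` are hypothetical, and this version places no limit on the number of tasks, which a billion-file run would need.

```python
import asyncio


async def fake_download(name, delay=0.01):
    """Hypothetical stand-in for one HTTP GET; sleeping models waiting on IO."""
    await asyncio.sleep(delay)
    return name


async def download_pages(pages):
    tasks = []
    for page in pages:
        for name in page:
            # create_task() schedules the coroutine on the running event loop.
            tasks.append(asyncio.create_task(fake_download(name)))
        # Yield control once per page so already-created tasks can make progress.
        await asyncio.sleep(0)
    # gather() returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(download_pages([["a", "b"], ["c", "d"]])))
```

Everything runs in one thread of one process; the event loop switches to another task whenever the current one awaits IO.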
![Page 113: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/113.jpg)
    import asyncio
    from aiohttp import ClientSession
    import uvloop

    async def download_files(host, port, outdir):
        hostname = f'http://{host}:{port}'
        list_url = f'{hostname}/list'
        get_url = f'{hostname}/get'
        semaphore = asyncio.Semaphore(MAX_CONCURRENT)
        task_queue = asyncio.Queue(MAX_SIZE)
        asyncio.create_task(results_worker(task_queue))
        async with ClientSession() as session:
            async for filename in iter_all_files(session, list_url):
                remote_url = f'{get_url}/{filename}'
                task = asyncio.create_task(
                    download_file(session, semaphore, remote_url,
                                  os.path.join(outdir, filename))
                )
                await task_queue.put(task)
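The semaphore and the bounded queue in `download_files` serve different purposes: the semaphore caps how many requests are in flight, while the queue caps how many tasks exist at once. The sketch below illustrates both without a network. The values of `MAX_CONCURRENT` and `MAX_SIZE`, the `results_worker` body, and the `stats` bookkeeping are all assumptions for illustration; the slides do not show them.

```python
import asyncio

MAX_CONCURRENT = 5  # assumed value, not from the slides
MAX_SIZE = 20       # assumed value, not from the slides


async def fake_download(semaphore, name, stats):
    """Hypothetical download; records the peak number of concurrent requests."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.001)  # models waiting on the network
        stats["active"] -= 1
    return name


async def results_worker(task_queue, results):
    """Drain the queue, awaiting each task so exceptions are not lost."""
    while True:
        task = await task_queue.get()
        results.append(await task)
        task_queue.task_done()


async def download_all(names):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    task_queue = asyncio.Queue(MAX_SIZE)
    stats = {"active": 0, "peak": 0}
    results = []
    worker = asyncio.create_task(results_worker(task_queue, results))
    for name in names:
        task = asyncio.create_task(fake_download(semaphore, name, stats))
        # put() waits while the queue is full, which throttles task creation.
        await task_queue.put(task)
    await task_queue.join()  # wait until every queued task has been consumed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return results, stats["peak"]


if __name__ == "__main__":
    names = [str(i) for i in range(100)]
    results, peak = asyncio.run(download_all(names))
    print(len(results), peak)
```

Without the bounded queue, the producer loop would create a task per file as fast as the listing arrives, and memory use would grow with the number of files.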
![Page 118: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/118.jpg)
    async def iter_all_files(session, list_url):
        async with session.get(list_url) as response:
            if response.status != 200:
                raise RuntimeError(f"Bad status code: {response.status}")
            content = json.loads(await response.read())
        while True:
            for filename in content['FileNames']:
                yield filename
            if 'NextFile' not in content:
                return
            next_page_url = f'{list_url}?next-marker={content["NextFile"]}'
            async with session.get(next_page_url) as response:
                if response.status != 200:
                    raise RuntimeError(f"Bad status code: {response.status}")
                content = json.loads(await response.read())
![Page 120: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/120.jpg)
    async def download_file(session, semaphore, remote_url, local_filename):
        async with semaphore:
            async with session.get(remote_url) as response:
                contents = await response.read()
                # Sync version.
                with open(local_filename, 'wb') as f:
                    f.write(contents)
                return local_filename
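The "Sync version" comment marks the file write as a blocking call that runs on the event loop thread. One possible alternative, shown here as an assumption rather than the speaker's method, is to hand the write to a worker thread with `asyncio.to_thread` (Python 3.9+). The helper names are hypothetical, and the slides do not measure whether this helps for 100-byte files.

```python
import asyncio
import os
import tempfile


def write_file(path, contents):
    """Plain blocking write; runs in a worker thread, not on the event loop."""
    with open(path, "wb") as f:
        f.write(contents)
    return path


async def save(path, contents):
    # to_thread() runs the blocking function in the default thread pool
    # and suspends this coroutine until it finishes.
    return await asyncio.to_thread(write_file, path, contents)


async def save_many(outdir, files):
    """Write each name -> bytes entry of files into outdir concurrently."""
    tasks = [
        asyncio.create_task(save(os.path.join(outdir, name), data))
        for name, data in files.items()
    ]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outdir:
        paths = asyncio.run(save_many(outdir, {"a": b"x" * 100, "b": b"y" * 100}))
        print([os.path.basename(p) for p in paths])
```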
![Page 124: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/124.jpg)
Asyncio Results
One request: 0.00056 seconds
One billion requests: 560,000 seconds
155.55 hours
6.48 days
![Page 125: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/125.jpg)
Summary

| Approach | Single request time (s) | Days |
|---|---|---|
| Synchronous | 0.003 | 34.7 |
| Multithread | 0.0036 | 41.6 |
| Multiprocess | 0.00032 | 3.7 |
| Asyncio | 0.00056 | 6.5 |
![Page 127: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/127.jpg)
Asyncio and Multiprocessing
and Multithreading
![Page 128: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/128.jpg)
WorkerProcess-1
![Page 129: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/129.jpg)
WorkerProcess-1
Thread-2
Thread-1
![Page 130: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/130.jpg)
WorkerProcess-1
EventLoop
Thread-2
Thread-1
Queue
![Page 131: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/131.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue.]
![Page 132: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/132.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue. Token shown: foo.]
The Input/Output queues contain pagination tokens (here, "foo").
![Page 134: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/134.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue. Token shown: foo.]
The main thread of the worker process is a bridge to the event loop, which runs on a separate thread. It sends the pagination token to the async Queue.
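The bridge described on this slide could be sketched roughly as follows. This is a minimal illustration rather than the speaker's code: `start_event_loop_thread`, `consume_tokens`, and `bridge_demo` are hypothetical names, and a `None` sentinel is assumed as the shutdown signal. The queue is created on the loop's own thread, and tokens are handed over with `loop.call_soon_threadsafe`, because `asyncio.Queue` is not thread-safe.

```python
import asyncio
import threading


def start_event_loop_thread():
    """Start an event loop on a background thread and return (loop, thread)."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


async def make_queue():
    """Create the asyncio.Queue while running on the event loop's thread."""
    return asyncio.Queue()


async def consume_tokens(queue, seen):
    """Runs on the event loop: read tokens until the None sentinel arrives."""
    while True:
        token = await queue.get()
        if token is None:
            break
        seen.append(token)  # a real worker would make the List call here


def bridge_demo(tokens):
    """Main-thread side: forward each token to the loop's queue."""
    loop, thread = start_event_loop_thread()
    queue = asyncio.run_coroutine_threadsafe(make_queue(), loop).result(timeout=5)
    seen = []
    consumer = asyncio.run_coroutine_threadsafe(consume_tokens(queue, seen), loop)
    for token in list(tokens) + [None]:
        # Schedule the put on the loop's thread instead of calling it directly.
        loop.call_soon_threadsafe(queue.put_nowait, token)
    consumer.result(timeout=5)  # wait until the consumer has seen the sentinel
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=5)
    loop.close()
    return seen
```

Calling `bridge_demo(["foo", "bar"])` returns the tokens in the order they were sent, since callbacks scheduled with `call_soon_threadsafe` run in FIFO order.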
![Page 136: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/136.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue. Token shown: foo.]
The event loop makes the List call with the provided pagination token "foo".
{"FileNames": [...], "NextMarker": "bar"}
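As a rough sketch of how such a paginated List call might be consumed, the example below walks pages by following `NextMarker`. The `FAKE_PAGES` table and the `list_files` and `walk_pages` functions are hypothetical stand-ins for the real REST client. The sketch also assumes that the final page carries no `NextMarker`, which this slide does not show.

```python
import asyncio

# Hypothetical in-memory stand-in for the server's List API.
FAKE_PAGES = {
    None: {"FileNames": ["file-1", "file-2"], "NextMarker": "foo"},
    "foo": {"FileNames": ["file-3", "file-4"], "NextMarker": "bar"},
    "bar": {"FileNames": ["file-5"]},  # assumed: last page has no NextMarker
}


async def list_files(marker=None):
    """Pretend to call the List endpoint with a pagination token."""
    await asyncio.sleep(0)  # yield to the loop, standing in for a network wait
    return FAKE_PAGES[marker]


async def walk_pages(start_marker=None):
    """Follow NextMarker from page to page, collecting file names."""
    marker, names = start_marker, []
    while True:
        page = await list_files(marker)
        names.extend(page["FileNames"])
        marker = page.get("NextMarker")
        if marker is None:
            return names
```

In the design from the talk, a worker would not loop over pages itself. It would hand each `NextMarker` back to the main process so another worker can pick it up.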
![Page 137: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/137.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue. Token shown: bar.]
The next pagination token, "bar", eventually makes its way back to the main process.
![Page 139: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/139.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue. Token shown: bar.]
While another process goes through the same steps, WorkerProcess-1 is downloading 1000 files using asyncio.
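Downloading one page of files concurrently on the event loop might look like the sketch below. `download_file` is a hypothetical placeholder that only sleeps, so no real HTTP client is involved, and `download_page` and `run_page` are illustrative names.

```python
import asyncio


async def download_file(name):
    """Hypothetical placeholder for one HTTP GET plus a write to disk."""
    await asyncio.sleep(0.001)  # simulated network wait
    return name, len(name)


async def download_page(file_names):
    """Start every download in the page concurrently and wait for all of them."""
    return await asyncio.gather(*(download_file(name) for name in file_names))


def run_page(count=1000):
    """Simulate downloading one page of `count` files."""
    names = [f"file-{i}" for i in range(count)]
    return asyncio.run(download_page(names))
```

`asyncio.gather` returns results in the order the coroutines were passed in, so the output lines up with the page's file names.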
![Page 143: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/143.jpg)
[Diagram: a Main process with an Input Queue and an Output Queue; WorkerProcess-1 and WorkerProcess-2, each containing Thread-1, Thread-2, an EventLoop, and a Queue. Token shown: bar.]
1. We get to leverage all our cores.
2. We download individual files efficiently with asyncio.
3. Minimal IPC overhead: only pagination tokens are passed across processes, one per thousand files.
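The token hand-off between the main process and its workers could be sketched as below. This is a simplified illustration under several assumptions: `FAKE_PAGES`, `list_page`, `STOP`, and `run_pipeline` are hypothetical, the per-worker thread and event loop are left out, and a third queue (`done_queue`) is added purely so the sketch can report what each worker handled. Only tokens and small tuples cross process boundaries.

```python
import multiprocessing as mp

# Hypothetical stand-in for the List API: token -> (file names, next token).
FAKE_PAGES = {
    "start": (["a", "b"], "foo"),
    "foo": (["c", "d"], "bar"),
    "bar": (["e"], None),
}
STOP = "__stop__"  # sentinel telling a worker to exit


def list_page(token):
    """Pretend List call returning (file_names, next_token)."""
    return FAKE_PAGES[token]


def worker(input_queue, output_queue, done_queue):
    """Take a token, list that page, hand the next token back, then 'download'."""
    while True:
        token = input_queue.get()
        if token == STOP:
            break
        file_names, next_token = list_page(token)
        # Return the next token first so another worker can start on it.
        output_queue.put(next_token)
        # Placeholder for downloading this page's files with asyncio.
        done_queue.put((token, len(file_names)))


def run_pipeline(num_workers=2):
    """Main process: shuttle tokens from the output queue to the input queue."""
    input_queue, output_queue, done_queue = mp.Queue(), mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=worker, args=(input_queue, output_queue, done_queue))
        for _ in range(num_workers)
    ]
    for process in workers:
        process.start()
    input_queue.put("start")
    pages = 1
    while True:
        next_token = output_queue.get()
        if next_token is None:  # no more pages to list
            break
        input_queue.put(next_token)
        pages += 1
    for _ in workers:
        input_queue.put(STOP)
    results = sorted(done_queue.get() for _ in range(pages))
    for process in workers:
        process.join()
    return results
```

On platforms that start processes by spawning rather than forking, `run_pipeline()` should be called under an `if __name__ == "__main__":` guard.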
![Page 147: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/147.jpg)
Combo Results
One request: 0.0000303 seconds
One billion requests: 30,300 seconds (8.42 hours)
![Page 148: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/148.jpg)
Summary

| Approach | Single Request Time (s) | Days |
| --- | --- | --- |
| Synchronous | 0.003 | 34.7 |
| Multithread | 0.0036 | 41.6 |
| Multiprocess | 0.00032 | 3.7 |
| Asyncio | 0.00056 | 6.5 |
| Combo | 0.0000303 | 0.35 |
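The Days column follows from multiplying each per-request time by one billion requests and dividing by 86,400 seconds per day. A small sketch of that arithmetic is below, with illustrative names (`PER_REQUEST_SECONDS`, `days_for`). The slide's 41.6 for the multithreaded case appears to be truncated rather than rounded, since the computed value is about 41.67.

```python
SECONDS_PER_DAY = 86_400
TOTAL_REQUESTS = 1_000_000_000

# Per-request times in seconds, taken from the summary table.
PER_REQUEST_SECONDS = {
    "Synchronous": 0.003,
    "Multithread": 0.0036,
    "Multiprocess": 0.00032,
    "Asyncio": 0.00056,
    "Combo": 0.0000303,
}


def days_for(per_request_seconds, total_requests=TOTAL_REQUESTS):
    """Estimated wall-clock days to issue every request at this rate."""
    return per_request_seconds * total_requests / SECONDS_PER_DAY


estimates = {name: days_for(t) for name, t in PER_REQUEST_SECONDS.items()}

for name, days in estimates.items():
    print(f"{name:<12} {days:6.2f} days")
```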
![Page 149: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/149.jpg)
Lessons Learned
Tradeoff between simplicity and speed
Multiple orders of magnitude difference based on approach used
Need to have max bounds when using queueing or any task scheduling
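One way to apply the last lesson is to give the queue a maximum size, so the producer waits when consumers fall behind. The sketch below is illustrative only: `run_bounded` and its parameters are hypothetical, and the "work" is a no-op. It also records the largest queue size observed, to show that the bound holds.

```python
import asyncio


async def run_bounded(num_items=100, maxsize=10, num_workers=3):
    """Push items through a bounded queue; return (items processed, peak size)."""
    queue = asyncio.Queue(maxsize=maxsize)
    processed = []
    peak = 0

    async def worker():
        while True:
            item = await queue.get()
            await asyncio.sleep(0)  # stand-in for real work
            processed.append(item)
            queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    for item in range(num_items):
        # put() suspends the producer whenever the queue is already full.
        await queue.put(item)
        peak = max(peak, queue.qsize())
    await queue.join()  # wait until every item has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return len(processed), peak
```

With an unbounded queue, a fast producer could enqueue every pending item before any is consumed. The `maxsize` argument caps how many items sit in memory at once.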
![Page 150: Downloading a Billion Files in Python · Our task is to download all the files on the remote server to our client machine. Our Task (the details) Our Task (the details) What client](https://reader035.vdocuments.site/reader035/viewer/2022071413/610b9f2388542827e85e0a60/html5/thumbnails/150.jpg)
Thanks!
James Saryerwinnie @jsaryer