
Concurrent Requests with Python3

Intro

Pulling data from websites is often the first step of a data-analytic process.

The number of data resources required for an analysis influences how long this process takes. A few resources, of course, require little time to gather. But gathering data from 1000 resources (i.e. making 1000 API calls) could take a substantial amount of time. If the resources must be gathered on a repeating basis, the problem is compounded.

People new to Python might be uncertain how to make this process faster; here’s a demonstration and comparison of some approaches!

We start with a list of resources:

subs = [
    'politics', 'canada', 'funny', 'news', 'gifs', 'python',
    'worldnews', 'aww', 'movies', 'books', 'space', 'creepy',
]
endpoints = ['https://reddit.com/r/%s/top.json?t=day&limit=10' % s for s in subs]

Blocking

With the requests library, we can pull the data from each resource into a list, as shown below.

Note, the resources are downloaded sequentially. The total time is approximately:

time_per_resource * number_of_resources

import requests

%%timeit
done_blocking = [requests.get(u) for u in endpoints]

1 loop, best of 3: 8.46 s per loop

Parallel

Parallel methods split the acquisition of resources across workers. Workers can be threads or processes, and are accessed through the Executor classes in the concurrent.futures module. Users can ignore most of these details: requests_futures provides the same API as requests, with a parallel implementation underneath.
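For the curious, here is a minimal sketch of the same idea using concurrent.futures directly with plain requests; this is roughly what requests_futures wraps (the max_workers value here is arbitrary):

import requests
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() schedules each download on a worker thread and
    # returns a Future immediately, without blocking.
    futures = [executor.submit(requests.get, u) for u in endpoints]
    # result() blocks until that particular download has finished.
    responses = [f.result() for f in futures]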

Each worker handles its tasks sequentially. *If the number of workers (threads or processes) is close to the number of tasks, the process takes a roughly fixed time regardless of the number of tasks; specifically, it takes approximately the time of the longest task*:

from requests_futures.sessions import FuturesSession
from concurrent.futures import wait

session = FuturesSession(max_workers=len(endpoints))

%%timeit
futures = [session.get(u) for u in endpoints]
done, incomplete = wait(futures)

1 loop, best of 3: 189 ms per loop

*More generally, the process takes roughly the time to run number_of_tasks / number_of_workers tasks in sequence*:

session = FuturesSession(max_workers=2)

%%timeit
futures = [session.get(u) for u in endpoints]
done, incomplete = wait(futures)

1 loop, best of 3: 1.1 s per loop

Asyncio

A third method is asynchronous. In this case, nothing is guaranteed to happen in sequence. Tasks must have entry/exit points where the worker (i.e. the main thread) can leave them and work on something else. Here, the web request constitutes such a point: once the first web request has been started, the main thread works on something else, i.e. starting the next web request.
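As a toy illustration (hypothetical, not part of the timings in this post), each await marks such a point, where the event loop can switch to another task while one is waiting:

import asyncio

async def pretend_fetch(name, delay):
    print(name, 'started')
    # The task exits here; the loop is free to run other tasks
    # until the sleep (standing in for network i/o) completes.
    await asyncio.sleep(delay)
    print(name, 'finished')

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(pretend_fetch('a', 1), pretend_fetch('b', 1))
)
# Total time is about 1 s, not 2 s, because the waits overlap.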

I have a hard time coming up with an expression for the duration of the asynchronous case. I suppose it’s something like:

time_not_waiting + max(time_for_task_i - time_task_i_started)

import asyncio
import aiohttp
import json

loop = asyncio.get_event_loop()
client = aiohttp.ClientSession(loop=loop)

async def get_json(client, url):
    async with client.get(url) as response:
        return await response.read()

%%timeit
result = loop.run_until_complete(
    asyncio.gather(
        *[get_json(client, e) for e in endpoints]
    )
)

1 loop, best of 3: 741 ms per loop

When to use which?

There are a few ways to look at this. The key for me is that, in terms of simplicity, sequential > parallel > asynchronous. That’s my a priori preference.

For a few tasks, use sequential.

With a large number of tasks that cannot be meaningfully entered/exited (i.e. they are not waiting for input/output), use parallel. A good example is running an operation on the rows of a data set that is already in memory, as sketched below.
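A minimal sketch of that CPU-bound case; score is a made-up stand-in for real per-row work:

from concurrent.futures import ProcessPoolExecutor

def score(row):
    # stand-in for an expensive per-row computation
    return sum(x * x for x in row)

rows = [range(1000)] * 10000

if __name__ == '__main__':
    # processes, not threads, so the computation isn't
    # serialized by the GIL
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(score, rows))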

With a large number of tasks which are usually waiting for input/output, use asynchronous. Web requests are exactly this kind of task, so asynchronous fits the bill here.

“Large” depends on how long a task takes and on your time sensitivity.

Bonus

Asynchronous parallel would be fascinating and useful for a very large number of I/O-heavy tasks; if you have any idea how to achieve this, do share! One possibility is sketched below.
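One way this might look (a sketch under assumptions, not something tested in this post): shard the URLs across processes, and run a separate asyncio event loop inside each process. The names fetch_all, worker, and fetch_sharded are made up for illustration.

import asyncio
import aiohttp
from concurrent.futures import ProcessPoolExecutor

async def fetch_all(urls):
    # one client session per process, shared by its requests
    async with aiohttp.ClientSession() as client:
        async def fetch(url):
            async with client.get(url) as response:
                return await response.read()
        return await asyncio.gather(*[fetch(u) for u in urls])

def worker(urls):
    # each process gets its own event loop
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(fetch_all(urls))
    finally:
        loop.close()

def fetch_sharded(urls, n_procs=4):
    # round-robin the urls into one chunk per process
    chunks = [urls[i::n_procs] for i in range(n_procs)]
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        parts = list(executor.map(worker, chunks))
    return [r for part in parts for r in part]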

