Recently, I've been working on a project that scrapes Reddit looking for links to products on Amazon. The idea is that there's valuable information in what people are linking to and talking about online, and a natural starting point is looking for links to Amazon products on Reddit. The result of that work turned into Product Mentions.
To build this (and I can talk more about it later), I have two parts: a basic Rails app that displays the products and where they're talked about, and a Python app that does the scraping and also displays the scraping logs for me using Flask. I thought of combining the two at first, but decided it was easier in both regards to keep them separate. The scraper populates the database, and the Rails app displays what's in there. I hosted the Rails app on Heroku, and after some poking around, decided to run the Python scraper on Heroku as well (for now at least!).
Also, if at this point you're thinking to yourself, "why the hell is he using an overpriced web app hosting service like Heroku when there are so many other options available?" you're probably half right. But in terms of ease of getting started, Heroku was by far the easiest PaaS to get this churning. This setup is really simple, especially compared to some of the other PaaS options out there that require more configuration. You can definitely look for different options if you're doing a more complete web crawl, but this will work for a lot of purposes.
So what I'm going to describe here today is how I went about running the scrapers on Heroku as background jobs, using clock and worker processes. I'll also talk a little about what's going on, so it makes a little more sense than those copy-paste tutorials I see a lot (though that type of tutorial from Heroku's docs is what I used here, so I can't trash them too badly!).
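Before getting into the files themselves, it helps to know where they end up: on Heroku, the clock and worker processes get declared in the app's Procfile. I'm not reproducing my exact Procfile here, but for this kind of setup it looks roughly like the sketch below (the Flask log viewer would get its own web entry, which I'm leaving out):

clock: python clock.py
worker: python worker.py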
worker.py

The first file you're going to need here is a worker file, which will perform whatever function it sees coming off a queue. For ease, I'll name this file worker.py. It connects to Redis, waits for a job to be put on the queue, and then runs whatever it sees. First, we need rq, the library that deals with Redis in the background (all of this is assuming you're in a virtualenv).
$ pip install rq
$ pip freeze > requirements.txt

This is the only external library you're going to need for a functioning worker.py file, as specified by the nice Heroku doc. The file below imports the required objects from rq, connects to Redis using either an environment variable (which would be set in a production / Heroku environment) or a local default, creates a worker, and then calls work. So in the end, running python worker.py will just sit there waiting for jobs to run, in this case, scraping Reddit. We also have 'high', 'default', and 'low' queue names, so the worker knows which jobs to run first, but we aren't going to need that here.
import os
import redis
from rq import Worker, Queue, Connection

listen = ['high', 'default', 'low']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(map(Queue, listen))
        worker.work()

clock.py

Now that we have the worker set up, here's the clock.py file that I'm using to do the scraping. It imports the conn variable from worker.py and uses that to make sure we're connected to the same Redis queue. We also import the functions that use the scrapers from run.py, and in this file create functions that will enqueue them. Then we use apscheduler to schedule when we want to call these functions, and start the scheduler. If we run python clock.py, the scheduler will run in perpetuity (hopefully) and will call the correct code on the intervals we defined.
For this, we're going to need to pip install apscheduler, and then again save it in requirements.txt for use on Heroku later, and frankly, for anywhere else you might want to use this code.
$ pip install apscheduler
$ pip freeze > requirements.txt

In the code below, I look for threads with amazon.com links every 30 minutes (since this is searchable, and there aren't more than a few per half hour anyway), and look for new comments every minute using the /comments endpoint, where there are about 400-500 new comments per minute. In terms of hits on Reddit, all these scrapers combined make something like 5 requests per minute, so it's definitely not going to overload their servers, considering I probably make more requests than that per minute just browsing.
from apscheduler.schedulers.blocking import BlockingScheduler
from rq import Queue
from worker import conn
from run import run_gather_threads, run_gather_comments

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

sched = BlockingScheduler()
q = Queue(connection=conn)

def gather_threads():
    q.enqueue(run_gather_threads)

def gather_comments():
    q.enqueue(run_gather_comments)

sched.add_job(gather_comments)  # enqueue right away once
sched.add_job(gather_comments, 'interval', minutes=1)

sched.add_job(gather_threads)  # enqueue right away once
sched.add_job(gather_threads, 'interval', minutes=30)

sched.start()

The actual code that does the work (the functions run_gather_comments and run_gather_threads) can live anywhere, even in the clock.py file itself. But since all these files have nice, short, one-word names, I put that code in run.py, which loads the scraper object from a different file, to keep all the logic separate. I won't go over the scraping here since that's a different topic.
Just so everything is complete, here's the run.py file.
# run.py
from scrapers.reddit_scraper import RedditScraper

rs = RedditScraper()

def run_gather_threads():
    # code that goes to reddit and gets thread info
    rs.gather_threads()
    print "Gathering Threads"

def run_gather_comments():
    # code that goes to reddit and gets comment info
    print "Gathering Comments"
    rs.gather_comments()

Running Locally

Now with these files set up, open up three tabs in your terminal, all within the correct folder, and all making sure you have the correct pip libraries installed, either globally or within your virtual environment.
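As a quick sketch of that setup, it's just activating a virtualenv and installing from the requirements.txt we've been building up with pip freeze. The name pm matches the prompt in my output below, but yours can be whatever you like:

$ virtualenv pm
$ source pm/bin/activate
(pm) $ pip install -r requirements.txt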
In the first tab, run redis-server, which you can install with Homebrew if you haven't yet. In the second tab, run python worker.py, which should output something like the following and then hang, waiting for work:
(pm)MacBook-Pro:product_mentions_scraper jackschultz$ python worker.py
12:05:12 RQ worker u'rq:worker:MacBook-Pro.91002' started, version 0.7.0
12:05:12 Cleaning registries for queue: high
12:05:12 Cleaning registries for queue: default
12:05:12 Cleaning registries for queue: low
12:05:12
12:05:12 *** Listening on high, default, low...

And then in the last tab, run python clock.py, which will look something like the following:
(pm)MacBook-Pro:product_mentions_scraper jackschultz$ python clock.py
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts