Recently, I've been working on a project that scrapes Reddit looking for links to products on Amazon. The idea is that there's valuable information in what people are linking to and talking about online, and a natural starting point is looking for links to Amazon products on Reddit. The result of that work turned into Product Mentions.
To build this (and I can talk more about it later), I have two parts: a basic Rails app that displays the products and where they're talked about, and a Python app that does the scraping and also displays the scraping logs for me using Flask. I thought of combining the two at first, but decided it was easier in both regards to keep them separate. The scraper populates the database, and the Rails app displays what's in there. I hosted the Rails app on Heroku, and after some poking around, decided to run the Python scraper on Heroku as well (for now at least!).
Also, if at this point you're thinking to yourself, "why the hell is he using an overpriced web app hosting service like Heroku when there are so many other options available?" you're probably half right. But in terms of ease of getting started, Heroku was by far the easiest PaaS to get this churning. This setup is really simple, especially compared to some of the other PaaS options out there that require more configuration. You can definitely look for different options if you're doing a more complete web crawl, but this will work for a lot of purposes.
So what I'm going to describe here today is how I went about running the scrapers on Heroku as background jobs, using clock and worker processes. I'll also talk a little about what's going on, so it makes a little more sense than those copy-paste tutorials I see a lot (though that type of tutorial from Heroku's docs is what I used here, so I can't trash them too badly!).
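Before getting into the files themselves, it helps to know where they end up: on Heroku, the clock and worker processes get declared in the app's Procfile. I'm not reproducing my exact Procfile here, but for this kind of setup it looks roughly like the sketch below (the Flask log viewer would get its own web entry, which I'm leaving out):

clock: python clock.py
worker: python worker.py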
worker.py

The first file you're going to need here is a worker file, which will perform whatever function it sees coming off a queue. For ease, I'll name this file worker.py. It connects to Redis, waits for a job to be put on the queue, and then runs whatever it sees. First, we need rq, the library that deals with Redis in the background (all of this is assuming you're in a virtualenv).
$ pip install rq
$ pip freeze > requirements.txt

This is the only external library you're going to need for a functioning worker.py file, as specified by the nice Heroku doc. The file below imports the required objects from rq, connects to Redis using either an environment variable (which would be set in a production / Heroku environment) or a local default, creates a worker, and then calls work. So in the end, running python worker.py will just sit there waiting for jobs to run, in this case, scraping Reddit. We also have 'high', 'default', and 'low' queue names, so the worker knows which jobs to run first, but we aren't going to need that here.
import os
import redis
from rq import Worker, Queue, Connection

listen = ['high', 'default', 'low']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(map(Queue, listen))
        worker.work()

clock.py

Now that we have the worker set up, here's the clock.py file that I'm using to do the scraping. It imports the conn variable from worker.py and uses that to make sure we're connected to the same Redis queue. We also import the functions that use the scrapers from run.py, and in this file create functions that will enqueue them. Then we use apscheduler to schedule when we want to call these functions, and start the scheduler. If we run python clock.py, the scheduler will run in perpetuity (hopefully) and will call the correct code on the intervals we defined.
For this, we're going to need to pip install apscheduler, and then again save it in requirements.txt for use on Heroku later, and frankly, for anywhere else you might want to use this code.
$ pip install apscheduler
$ pip freeze > requirements.txt

In the code below, I look for threads with amazon.com links every 30 minutes (since this is searchable, and there aren't more than a few per half hour anyway), and look for new comments every minute using the /comments endpoint, where there are about 400-500 new comments per minute. In terms of hits on Reddit, all these scrapers combined make something like 5 requests per minute, so it's definitely not going to overload their servers, considering I probably make more requests than that per minute just browsing.
from apscheduler.schedulers.blocking import BlockingScheduler
from rq import Queue
from worker import conn
from run import run_gather_threads, run_gather_comments

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

sched = BlockingScheduler()
q = Queue(connection=conn)

def gather_threads():
    q.enqueue(run_gather_threads)

def gather_comments():
    q.enqueue(run_gather_comments)

sched.add_job(gather_comments)  # enqueue right away once
sched.add_job(gather_comments, 'interval', minutes=1)

sched.add_job(gather_threads)  # enqueue right away once
sched.add_job(gather_threads, 'interval', minutes=30)

sched.start()

The actual code that does the work (the functions run_gather_comments and run_gather_threads) can live anywhere, even in the clock.py file itself. But since all these files have nice, short, one-word names, I put that code in run.py, which loads the scraper object from a different file, to keep all the logic separate. I won't go over the scraping here since that's a different topic.
Just so everything is complete, here's the run.py file.
# run.py
from scrapers.reddit_scraper import RedditScraper

rs = RedditScraper()

def run_gather_threads():
    # code that goes to reddit and gets thread info
    rs.gather_threads()
    print "Gathering Threads"

def run_gather_comments():
    # code that goes to reddit and gets comment info
    print "Gathering Comments"
    rs.gather_comments()

Running Locally

Now with these files set up, open up three tabs in your terminal, all within the correct folder, and all making sure you have the correct pip libraries installed, either globally or within your virtual environment.
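As a quick sketch of that setup, it's just activating a virtualenv and installing from the requirements.txt we've been building up with pip freeze. The name pm matches the prompt in my output below, but yours can be whatever you like:

$ virtualenv pm
$ source pm/bin/activate
(pm) $ pip install -r requirements.txt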
In the first tab, run redis-server, which you can install with Homebrew if you haven't yet. In the second tab, run python worker.py, which should output something like the following and then hang, waiting for work:
(pm)MacBook-Pro:product_mentions_scraper jackschultz$ python worker.py
12:05:12 RQ worker u'rq:worker:MacBook-Pro.91002' started, version 0.7.0
12:05:12 Cleaning registries for queue: high
12:05:12 Cleaning registries for queue: default
12:05:12 Cleaning registries for queue: low
12:05:12
12:05:12 *** Listening on high, default, low...

And then in the last tab, run python clock.py, which will look something like the following:
(pm)MacBook-Pro:product_mentions_scraper jackschultz$ python clock.py
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts
INFO:apscheduler.scheduler:Adding job tentatively -- it will be properly scheduled when the scheduler starts