
Hubba’s Product Recommendations Data Pipeline
Building a recommendation engine with AWS Data Pipeline, Elastic MapReduce and Spark
From Google’s advertisements to Amazon’s product suggestions, recommendation engines are everywhere. As users of smart internet services, we’ve become accustomed to seeing things we like. This blog post is an overview of how we built a product recommendation engine for Hubba. I’ll start with an explanation of the different types of recommenders and how we went about the selection process. Then I’ll cover our AWS solution before diving into some implementation details.
What’s a recommender?
There are three main types of recommendation engines used today:
- Content-based filtering
- Collaborative filtering
- A hybrid of content and collaborative (hybrid recommender)

Content-based recommenders use discrete properties of an item, such as its tags. If a user views products tagged “dogs”, “pets”, and “chow”, the recommender may suggest other pet food products. Collaborative filtering recommenders use a user’s past actions to predict which items that user is likely to prefer. Take for example two users: the first views items 1 and 3, while the second views items 2 and 3. The first user would be recommended item 2, and the second user would be recommended item 1. A hybrid recommender combines the two approaches, typically with more sophisticated techniques.
We decided to go with collaborative filtering (CF) due to the quality and quantity of our data. The ability to tag products with keywords had only recently been implemented, so we didn’t have enough content to learn from. But we had been tracking user actions for at least a year and knew that data to be reliable. Also, we didn’t have to worry about users who had artificially spammed their products with tags.
CF does have its weaknesses, though. CF recommenders suffer from the “cold-start problem”: they can’t recommend a new item until some user has already interacted with it. We were confident, however, that this problem would be ameliorated over time as more users interacted with more products.
An implementation of a CF recommender is the alternating least squares (ALS) algorithm. ALS takes a large matrix of user activities and products, then factors it into two smaller matrices called latent factors: one for users and another for products. These factors describe the initial large matrix but with less data. I won’t go into the math behind ALS, but if you’re interested, this paper by Hu et al. describes it in detail. If you’re familiar with singular value decomposition, ALS is another approach that accomplishes the same goal. It’s less accurate but faster, especially when powered by Spark.
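To make this concrete, here’s a minimal sketch of training an implicit-feedback ALS model with the PySpark API. The column names, S3 path, and hyperparameters are illustrative assumptions, not our actual values:

```python
# A minimal sketch of training an implicit-feedback ALS model in PySpark.
# Column names, the S3 path, and hyperparameters are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("product-recommender").getOrCreate()

# User activity snapshot exported from Redshift to S3 as a CSV.
events = spark.read.csv(
    "s3://hubba-recommender/snapshots/user-activity.csv",
    header=True,
    inferSchema=True,
)

als = ALS(
    rank=10,                   # number of latent factors per user/product
    maxIter=10,
    regParam=0.1,
    implicitPrefs=True,        # treat counts as implicit feedback (Hu et al.)
    userCol="user_id",
    itemCol="product_id",
    ratingCol="view_count",
    coldStartStrategy="drop",  # skip users/items unseen during training
)
model = als.fit(events)

# Top 10 product recommendations per user.
recs = model.recommendForAllUsers(10)
```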

M here is the matrix of users and products, U is the matrix of users, and V is the matrix of products. With ALS, the product of U and V would give an approximation of M. Illustration source: http://www.cs.cmu.edu/~yuxiangw/research.html

Our recommender solution
We had a couple of requirements while planning the architecture for our recommendation engine.
1. We needed to be able to keep a snapshot of all the data that went into generating our models.
2. The compute solution needed to be robust enough to scale with growing data, whether in compute power, memory capacity, or both.
3. The compute solution needed to be cost-effective.
4. The compute solution needed a good monitoring tool for maintaining efficient distribution of load.
5. The models generated by each run of the pipeline needed to be accessible afterwards.
6. We needed to store user recommendations somewhere for quick retrieval through an API.

I’ll address these six points in order.
AWS is good at transferring data between its services, and moving data from Redshift to S3 is no exception. AWS Data Pipeline allowed us to run a SELECT query on a schedule and save the output as a CSV in S3, with a file name specific to that particular run of the pipeline. We could then feed the data into the downstream model generation activities. Data input snapshots ― check.
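As an illustration, a downstream Spark activity could read a run-specific snapshot like this (the bucket and naming convention are hypothetical, and `spark` is the session from the ALS sketch above):

```python
# Illustrative: each run reads its own dated snapshot from S3. In Data
# Pipeline, a value like this can come from the #{@scheduledStartTime}
# expression; the bucket and naming scheme here are hypothetical.
run_date = "2017-08-01"
snapshot = spark.read.csv(
    "s3://hubba-recommender/snapshots/user-activity-%s.csv" % run_date,
    header=True,
    inferSchema=True,
)
```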
AWS Elastic MapReduce (EMR) provided us with a large catalogue of managed instances for compute. We could choose from memory- or storage-optimized instances and scale up or down as our needs changed. Then, through Data Pipeline, we could fire up clusters when a pipeline started and install monitoring software and Spark on the fly. We could run bootstrapping tasks after launch, and shut down when the jobs finished. We could also bid on spot instances, which decreased operating costs even further. Scalable, cost-effective, and monitored compute were what we needed and what we got with EMR.
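For a sense of what that cluster configuration involves, here’s a sketch of launching a comparable transient cluster with boto3. We drove this through Data Pipeline rather than code like this, and every instance type, count, and bid price below is an illustrative assumption:

```python
# A sketch of a transient EMR cluster with Spark, monitoring, spot core
# nodes, and auto-termination. All values are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is illustrative

response = emr.run_job_flow(
    Name="recommender-model-build",
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Spark"}, {"Name": "Ganglia"}],  # Ganglia for monitoring
    Instances={
        "InstanceGroups": [
            {
                "InstanceRole": "MASTER",
                "InstanceType": "m4.large",
                "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {
                "InstanceRole": "CORE",
                "InstanceType": "r4.xlarge",   # memory-optimized for ALS
                "InstanceCount": 4,
                "Market": "SPOT",
                "BidPrice": "0.10",            # spot bidding cuts costs
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when jobs finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```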
EMR can use S3 as a native data store, so it’s easy to transfer files back and forth between S3 and EMR. As soon as we generated a model in EMR, we saved it directly to S3 for later retrieval; it never touched EMR’s local file system. Generated model snapshots ― check.
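Continuing the ALS sketch from earlier, persisting the model is a one-liner (the bucket and key layout are hypothetical):

```python
# Save the trained ALS model straight to S3 from the EMR cluster; a
# run-specific prefix keeps a snapshot of every model the pipeline produces.
model.save("s3://hubba-recommender/models/run-2017-08-01/als-model")
```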
For storing and accessing the generated predictions for a particular user, we decided to go with DynamoDB, AWS’ managed NoSQL database. It’s super easy to use since there are only two knobs we needed to control ― the read and write capacity units, which set how much data the database can read and write per second. DynamoDB also worked well with AWS API Gateway, which gave us a managed API over the DynamoDB data with little to no technical upkeep. And because both our servers and our APIs live on AWS, we could lock down this API so it’s only accessible from our production EC2 instances.
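Here’s a sketch of loading recommendations into DynamoDB with boto3; the table name, key schema, and sample data are hypothetical:

```python
# Write per-user recommendations to DynamoDB. batch_writer() buffers and
# retries puts, which helps stay within the table's write capacity units.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("user_recommendations")  # hypothetical table

# recs_by_user would be collected from the Spark job's output.
recs_by_user = {"user-123": ["prod-9", "prod-42", "prod-7"]}

with table.batch_writer() as batch:
    for user_id, product_ids in recs_by_user.items():
        batch.put_item(Item={"user_id": user_id, "recommended": product_ids})
```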
The solution leaned heavily on AWS, but it came with integration shortcuts that saved us a great deal of time. I’ll now walk you through a big-picture overview of how we got to a production-ready recommendation engine.
Start with data
Most of my data preprocessing and model generation happens in an IPython Notebook. This lets me document my steps as I go and keep track of why I made certain decisions. It’s also much easier for someone else to read through the process in something close to sentence form than to read through all of the pipeline code.
I preface every notebook with the question that the notebook seeks to answer, followed by a description of my approach and the rationale for approaching the question that way. You don’t always have to do this, but it helps me focus on the problem at hand. It also re-contextualizes me when I haven’t opened the notebook in a while.

The motivating question should be concise. I use the PySpark API to talk to Spark.