Looking Back at 2016

We started 2016 with an eye on blowing 2015 out of the water. Mission accomplished.

Together with our users, we have crawled more in 2016 than the rest of Scrapinghub’s history combined: a whopping 43.7 billion web pages, resulting in 70.3 billion scraped records! Great work everyone!

In what follows, we’ll give you a whirlwind tour of what we’ve been up to in 2016, along with a quick peek at what you can expect in 2017.

Platform

Scrapy Cloud

It’s been another high growth year for Scrapy Cloud as we wrap up with over 4,700 active customer projects (a 450% increase from last year).

We proudly announced our biggest platform upgrade to date this year with the launch of Scrapy Cloud 2.0. Alongside technical improvements like Docker support and Python 3 support, this upgrade introduced an improved pricing model that is less expensive for you and lets you better customize Scrapy Cloud to your resource needs.

As we move into 2017, we will continue to focus on usability and on expanding the technical capabilities of our platform such as:

Support for non-Scrapy spiders: We’re well aware that there are alternatives to Scrapy in the wild. If you’ve got a crawler created using another framework, be it in Python or another language, you’ll soon be able to run it in our cloud-based platform.

GitHub integration: You’ll soon be able to sign up and easily deploy from GitHub. We’ll support automatic deploys shortly after: push an update to your GitHub repo and it will automatically be reflected within Scrapinghub.

Heads up, Scrapy Cloud is in for a massive change this year, so stay tuned!

Crawlera

Crawlera is our other flagship product, and it helps your crawls continue uninterrupted. We were thrilled to launch our Crawlera Dashboard this year, which gives you the ability to visualize how you are using the product, to examine what sites and specific URLs you are targeting, and to manage multiple accounts.

Portia

The main goal of Portia is to lower the barrier of entry to web data extraction and to increase the democratization of data (it’s open source!).

Portia got a lot of love this year with the beta release of Portia 2.0. This 2.0 update includes new features like simple extraction of repeated data, loading start URLs from a feed, the option to download Portia projects as Python code, and the use of CSS selectors to extract specific data.

Next year we’re going to be bringing a host of new features that will make Portia an even more valuable tool for developers and non-developers alike.

Data Science

While we had been engaging in data science activities since our earliest days, 2016 saw us formalize a Data Science team proper. We’re continuing to push the envelope for machine learning data extraction, so get pumped for some really exciting developments in 2017!

Open Source

Scrapinghub has been as committed to open source as ever in 2016. Running Scrapinghub relies on a lot of open source software so we do our best to pay it forward by providing high quality and useful software to the world. Since nearly 40 open source projects maintained by Scrapinghub staff saw new releases this year, we’ll just give you the key highlights.

Scrapy

Scrapy is the best-known project that we maintain, and 2016 saw our first Python 3 compatible version (version 1.1) back in May (running it on Windows is still a challenge, we know, but bear with us). We just released version 1.3.0 and Scrapy is now the 11th most starred Python project on GitHub! 2017 should see it get many new features to keep it the best tool you have (we think) to tackle any web scraping project. So keep sending it some GitHub star love and feature requests!

Splash

Splash, our headless browser with an HTTP interface, hit a major milestone a few weeks ago with the addition of the long-awaited web scraping helpers: CSS selectors, form filling, interacting with DOM nodes… This 2.3 release came after a steady series of improvements and a successful Google Summer of Code (GSoC) project this summer by our student Michael Manukyan.

Dateparser

Our “little” library to help with dates in natural language got a bit of attention on GitHub when it was picked up as a dependency for Kenneth Reitz’s latest project, Maya. We’re quite proud of this little library :). Keep the bug reports coming if you find any, and if you can, please help us support yet more languages.

Frontera

Frontera is the framework we built to allow you to implement distributed crawlers in Python. It provides scaling primitives and crawl frontier capabilities.

2016 brought us 11 releases including support for Python 3! A huge thank you to Preetwinder Bath, one of our GSoC students, who helped us to improve test coverage and made sure that all of the parts of Frontera support Python 3.

Google Summer of Code

As in 2014 and 2015, Scrapinghub participated in GSoC 2016 under the Python Software Foundation umbrella. We had four students complete their projects, and two of them got their contributions merged into the respective code bases (see the “Frontera” and “Splash” sections above). Another completed project was related to Scrapy performance improvements and is still under discussion with the maintainers before we can integrate it. The last one is a standalone set of helpers to use Scrapy with other programming languages. To our students, Aron, Preet, Michael, and Avishkar, thank you all very much for your contributions!

Conferences

Conferences are always a great opportunity to learn new skills, showcase our projects, and, of course, hang out with our clients, users, and coworkers. As a remote staff, we don’t have the opportunity to meet each other in person often, so tech conferences are always a great way to strengthen ties. The traveling Scrapinghubbers thoroughly enjoyed sharing their knowledge and web scraping experiences through presentations, tutorials, and workshops.

