This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
A screencast version of this post is available here: https://youtu.be/KGlhU9kSfVk
Summary

Copy-pasting the following commands gives you a Dask cluster on EC2.
```
pip install dec2

dec2 up --keyname YOUR-AWS-KEY-NAME
        --keypair ~/.ssh/YOUR-AWS-KEY-FILE.pem
        --count 9     # Provision nine nodes
        --nprocs 8    # Use eight separate worker processes per node

dec2 ssh              # SSH into head node
ipython               # Start IPython console on head node
```
```python
from distributed import Executor, s3, progress

e = Executor('127.0.0.1:8786')

df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()
```
You will have to use your own AWS credentials, but you'll get fast, distributed Pandas access to the NYC Taxi data across a cluster, loaded from S3.
Motivation

Reducing barriers to entry enables curious play.
Curiosity drives us to play with new tools. We love the idea that previously difficult tasks will suddenly become easy, expanding our abilities and opening up a range of newly solvable problems.
However, as our problems grow more complex, our tools grow more cumbersome and setup costs increase. This cost stops us from playing around, which is a shame, because play is good both for the education of the user and for the development of the tool. Tool makers who want feedback are strongly incentivized to decrease setup costs, especially for the play case.
In February we introduced dask.distributed, a lightweight distributed computing framework for Python. We focused on processing data with high level abstractions like dataframes and arrays in the following blogposts:
- Analyze GitHub JSON record data in S3
- Use Dask DataFrames on CSV data in HDFS
- Process NetCDF data with Dask arrays on a traditional cluster

Today we present a simple setup script to launch dask.distributed on EC2, enabling any user with AWS credentials to repeat these experiments easily.
dec2

Devops tooling and EC2 to the rescue.
DEC2 does the following:
- Provisions nodes on EC2 using your AWS credentials
- Installs Anaconda on those nodes
- Deploys a dask.distributed Scheduler on the head node and Workers on the rest of the nodes
- Helps you to SSH into the head node or connect from your local machine

```
$ pip install dec2
$ dec2 up --help
Usage: dec2 up [OPTIONS]

Options:
  --keyname TEXT                 Keyname on EC2 console [required]
  --keypair PATH                 Path to the keypair that matches the keyname [required]
  --name TEXT                    Tag name on EC2
  --region-name TEXT             AWS region [default: us-east-1]
  --ami TEXT                     EC2 AMI [default: ami-d05e75b8]
  --username TEXT                User to SSH to the AMI [default: ubuntu]
  --type TEXT                    EC2 Instance Type [default: m3.2xlarge]
  --count INTEGER                Number of nodes [default: 4]
  --security-group TEXT          Security Group Name [default: dec2-default]
  --volume-type TEXT             Root volume type [default: gp2]
  --volume-size INTEGER          Root volume size (GB) [default: 500]
  --file PATH                    File to save the metadata [default: cluster.yaml]
  --provision / --no-provision   Provision salt on the nodes [default: True]
  --dask / --no-dask             Install Dask.Distributed in the cluster [default: True]
  --nprocs INTEGER               Number of processes per worker [default: 1]
  -h, --help                     Show this message and exit.
```

Note: dec2 was largely built by Daniel Rodriguez.
Run

As an example we use dec2 to create a new cluster of nine nodes. Each worker will run with eight processes, rather than using threads.
```
dec2 up --keyname my-key-name
        --keypair ~/.ssh/my-key-file.pem
        --count 9     # Provision nine nodes
        --nprocs 8    # Use eight separate worker processes per node
```
Connect

We SSH into the head node and start playing in an IPython terminal:
```
localmachine:~$ dec2 ssh          # SSH into head node
ec2-machine:~$ ipython            # Start IPython console
```
```python
In [1]: from distributed import Executor, s3, progress

In [2]: e = Executor('127.0.0.1:8786')

In [3]: e
Out[3]: <Executor: scheduler=127.0.0.1:8786 workers=64 threads=64>
```

Notebooks

Alternatively we set up a globally visible Jupyter notebook server:
```
localmachine:~$ dec2 dask-distributed address     # Get location of head node
Scheduler Address: XXX:XXX:XXX:XXX:8786
localmachine:~$ dec2 ssh                          # SSH into head node
ec2-machine:~$ jupyter notebook --ip="*"          # Start Jupyter Server
```
Then navigate to http://XXX:XXX:XXX:XXX:8888 from your local browser. Note: this method is not secure; see the Jupyter docs for a better solution.
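Whether from the console or a notebook, the Executor follows the familiar submit/map pattern of `concurrent.futures` (in distributed, `e.submit` and `e.map` return futures whose results you later gather). As a minimal local sketch of that pattern, here is the same shape of code using only the standard library's `ThreadPoolExecutor` as a stand-in for a real cluster:

```python
from concurrent.futures import ThreadPoolExecutor

# Local stand-in: distributed's Executor mirrors this submit/map style,
# except its futures point at data living on the cluster's workers.
def square(x):
    return x ** 2

with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(square, 3)             # one task -> a future
    results = list(pool.map(square, range(5)))  # many tasks -> results

print(future.result())  # 9
print(results)          # [0, 1, 4, 9, 16]
```

On the cluster you would use `e` in place of `pool`; one difference to keep in mind is that distributed's `map` hands back futures (to be gathered explicitly) rather than yielding results directly, so that intermediate data can stay on the workers.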
Public datasets

We repeat the experiments from our earlier blogpost on the NYC Taxi data.
```python
df = s3.read_csv('dask-data/nyc-taxi/2015', lazy=False)
progress(df)
df.head()
```
```
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance  pickup_longitude  pickup_latitude  RateCodeID store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1           1.59        -73.993896        40.750111           1                  N         -73.974785         40.750618             1         12.0    1.0      0.5        3.25             0                    0.3         17.05
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1           3.30        -74.001648        40.724243           1                  N         -73.994415         40.759109             1         14.5    0.5      0.5        2.00             0                    0.3         17.80
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1           1.80        -73.963341        40.802788           1                  N         -73.951820         40.824413             2          9.5    0.5      0.5        0.00             0                    0.3         10.80
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1           0.50        -74.009087        40.713818           1                  N         -74.004326         40.719986             2          3.5    0.5      0.5        0.00             0                    0.3          4.80
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1           3.00        -73.971176        40.762428           1                  N         -74.004181         40.742653             2         15.0    0.5      0.5        0.00             0                    0.3         16.30
```

Acknowledgments

The dec2 startup script is largely the work of Daniel Rodriguez. Daniel usually works on Anaconda Cluster, a proprietary product for cluster management that does things similar to dec2, but much more maturely.
DEC2 was inspired by the excellent spark-ec2 setup script, which is how most Spark users, myself included, were first able to try out the library. The spark-ec2 script empowered many new users to try out a distributed system for the first time.
The S3 work was largely done by Hussain Sultan (Capital One) and Martin Durant (Continuum).
What didn't work

- Originally we automatically started an IPython console on the local machine and connected it to the remote scheduler. This felt slick, but was error prone due to mismatches between the user's environment and the remote cluster's environment.
- It's tricky to replicate functionality that's present in a proprietary and profitable product, Anaconda Cluster. Fortunately, Continuum management has been quite supportive.
- There aren't many people in the data science community who know Salt, the system that backs dec2. I expect maintenance to be a bit tricky moving forward, especially during periods when Daniel and other developers are occupied.