
Using Airflow to manage your DevOps ETLs


In this article we describe the use of Apache's Airflow project to manage ETL (Extract, Transform, Load) processes in a Business Intelligence Analytics environment. First we will cover the history of Airflow, some context around its uses, and why it is fast becoming an important tool in the DevOps pipeline for managing the extraction, transformation, and loading of data from large-scale data warehouses. Second, we will provide a practical guide for its integration into a DevOps environment. Finally, we will provide some installation instructions and describe some pitfalls to watch for when deploying Airflow.

Key Airflow concepts

Airflow was developed by engineers at Airbnb to provide a standardized way to handle multiple ETL processes around an Enterprise Data Warehouse system. At the time of this article, it is undergoing incubation with the Apache Software Foundation. Airflow is written in Python but is language agnostic, and it builds on familiar technologies such as Jinja templating and, optionally, RabbitMQ as the message broker for distributed workers.

The idea behind Airflow is that the user creates DAGs, or Directed Acyclic Graphs, which are really just a visual representation of how each of the things you are asking your ETL to do relates to the others. Airflow also acts as a job scheduler and allows the developer or ops manager to check the status of several tasks through the web interface. It is highly scalable, elegant, and dynamically capable of providing a layer of abstraction around multiple possible environments. Finally, it provides several built-in connectors, which means your developers don't need to spend time writing connection code for various databases.

Use Case:

Let’s use a hypothetical to demonstrate: let’s say that you want to create an ETL that would:

Extract data from a Hadoop cluster utilizing Hive
Transform the data
Email it to a user

Breaking each of these steps into tasks, the visual representation of your DAG might look something like this:


[Image: visual representation of the example DAG, with the extract, transform, and email tasks linked in order]
From the developer perspective

So let’s go through each step from the developer perspective:

The developer's job is now to write these DAGs, which define each of these tasks in code. Utilizing Airflow for this means that the developer avoids the need to write new connectors, and the systems engineer gets an easy-to-follow, standardized road map of what the code is doing in each step and where to look if there are issues.

The first thing we want to do is create our DAG, which we will do by importing the DAG object from the Airflow library and entering some parameters:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'fpombeiro',
    'depends_on_past': True,
    'start_date': datetime(2017, 2, 1),
    'email': ['fpombeiro@airflow.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('read_hive', default_args=default_args)
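In a real deployment you would usually also tell the scheduler how often to run the DAG. A minimal sketch, assuming a daily cadence (the '@daily' value here is an illustrative choice, not something prescribed by this example):

# Sketch: schedule_interval controls how often the scheduler triggers the DAG.
dag = DAG(
    'read_hive',
    default_args=default_args,
    schedule_interval='@daily',  # cron expressions such as '0 6 * * *' also work
)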

My first job, as a developer, is to connect to the Hive database. Fortunately I can find a pre-existing HiveServer2Hook() in the standard Airflow library by looking up my "connections" in the web UI:


[Image: the Connections list in the Airflow web UI]

Utilizing the Airflow webserver UI (found at localhost:8080 when running locally) I can go in and add my own connection arguments with a point-and-click interface. Now that I have added these, it's simply a matter of:

from airflow.hooks import HiveServer2Hook

…and my connection is right there, pre-written, and re-usable by any other developers who are also working on the project.
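If you save the connection under its own name in the UI, the hook can reference it explicitly. A minimal sketch, where 'my_hive_conn' is a hypothetical connection ID created through the web UI:

# 'my_hive_conn' is a hypothetical connection ID; if omitted, the hook
# falls back to the default 'hiveserver2_default' connection.
hive_hook = HiveServer2Hook(hiveserver2_conn_id='my_hive_conn')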

Now that the connection is good, let’s create an OPERATOR to call some code and do the work!

from airflow.operators.python_operator import PythonOperator

def do_work():
    hiveserver = HiveServer2Hook()
    hql = "SELECT COUNT(*) FROM foo.bar"
    row_count = hiveserver.get_records(hql, schema='foo')
    return row_count[0][0]

# The callable's return value is pushed to XCom automatically.
callHook = PythonOperator(
    task_id='get_data',
    python_callable=do_work,
    dag=dag
)

So we have taken a piece of work, the Python function do_work, and placed it in a PythonOperator under the task_id 'get_data' (which basically means "use Python to run this task"). There are all sorts of these operators, which is what allows Airflow to be language agnostic.
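To illustrate that language-agnostic point, the same DAG could just as easily include a task that shells out to another tool. A minimal sketch using the built-in BashOperator (the task and command here are made up for illustration):

from airflow.operators.bash_operator import BashOperator

# Hypothetical extra task: run a shell command as part of the same DAG.
archive = BashOperator(
    task_id='archive_raw_data',
    bash_command='echo "archiving raw extract..."',
    dag=dag
)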

For our second task, let's write a simple Python function that multiplies the data by 15:

def multiply_count(**context):
    value = context['task_instance'].xcom_pull(task_ids='get_data')
    return value * 15

transform = PythonOperator(
    task_id='transform_data',
    python_callable=multiply_count,
    provide_context=True,
    dag=dag
)

Okay, so we now know that we want to run task one ('get_data') and then run task two ('transform_data').

One quick note: XCom is a mechanism available in Airflow to pass data between two tasks. It is simple to use: you "push" data from one task (xcom_push) and "pull" it from another (xcom_pull). The data is stored in a key-value store and is accessed via the task_id (as you can see above).
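Here is a minimal, self-contained sketch of the push/pull pattern on its own; the task IDs and key name are invented for illustration:

def producer(**context):
    # Explicitly push a value under a named key.
    context['task_instance'].xcom_push(key='row_count', value=42)

def consumer(**context):
    # Pull the value back by task_id and key.
    print(context['task_instance'].xcom_pull(task_ids='produce', key='row_count'))

push_task = PythonOperator(task_id='produce', python_callable=producer,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id='consume', python_callable=consumer,
                           provide_context=True, dag=dag)
pull_task.set_upstream(push_task)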

Finally, let's add our last step: emailing out this number:

from airflow.operators.email_operator import EmailOperator

EMAIL_CONTENT = """
<b>Copied:</b> <h1>{{ task_instance.xcom_pull(task_ids='transform_data') }}</h1><br>
"""

email = EmailOperator(
    task_id='email_report',
    to='please@dontreply.com',
    subject='Airflow processing report',
    html_content=EMAIL_CONTENT,
    dag=dag
)

What we are doing here is wrapping our output in h1 HTML tags inside an EMAIL_CONTENT constant (which is rendered as a Jinja template, so it can pull the transformed value straight from XCom) and then creating an operator to email our data out (all of the SMTP details are contained in the Airflow environment, so your developers will not need to re-create them each time).

Our final step is to create an orderly method to run these tasks by declaring their basic dependencies. In the example below, we are saying "run callHook, then run transform, and finally email out the results":

transform.set_upstream(callHook)
email.set_upstream(transform)
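Airflow also provides a shorthand for the same wiring using the bitshift operators, which reads left to right in execution order; the line below is equivalent to the two set_upstream calls above:

# Equivalent dependency declaration using the >> shorthand.
callHook >> transform >> email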

This is a fairly straightforward example. Some of the things that have to go on "behind the scenes" include setting up the connections, variables, and sub-DAGs.
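As a small taste of that behind-the-scenes work, Airflow exposes a key-value Variable store (editable in the web UI) that keeps environment-specific settings out of the DAG code. A minimal sketch, where the variable name and default are invented for illustration:

from airflow.models import Variable

# 'target_schema' is a hypothetical variable set through the web UI;
# default_var is used if it has not been defined yet.
target_schema = Variable.get('target_schema', default_var='foo')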

Hopefully this has helped you see how useful Airflow can be in managing your ETL processes.

