Data newsletter

Good Morning Data #8 ☕

JacobJustCoding
6 min read · Oct 21, 2022

Today’s Good Morning Data will focus on Apache Airflow, its use cases, use at scale and growing popularity.

Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚

Photo by Harshil Gudka on Unsplash

[Benji Lampel, Tal Gluck] Advanced Data Quality Use Cases with Airflow and Great Expectations

What I really like about Airflow is the fact that it is an open-source project and a perfect example of how such initiatives can become a powerful tool commonly used within companies.

This video raises an interesting topic: Great Expectations, an open-source library for data quality. With it, you can literally define expectations about your data in the form of rules, test the data against those rules, and then take action based on the results.
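As a rough illustration (not taken from the talk), this is what such a check could look like inside a DAG, assuming the airflow-provider-great-expectations package and an existing Great Expectations project with a checkpoint; the paths and the checkpoint name are hypothetical:

```python
# Minimal sketch: run a Great Expectations checkpoint as an Airflow task and
# fail the DAG run if the data does not meet the defined expectations.
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="data_quality_demo",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate_orders = GreatExpectationsOperator(
        task_id="validate_orders",
        data_context_root_dir="/opt/airflow/great_expectations",  # assumed GE project location
        checkpoint_name="orders_checkpoint",  # hypothetical checkpoint holding the expectations
        fail_task_on_validation_failure=True,  # stop the pipeline when expectations are not met
    )
```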

[Hilmi Yildirim, Stefan Haase] Accelerate testing in Apache Airflow through DAG versioning

Zalando’s Performance Marketing department tackled a problem with their marketing ROI (return on investment) pipeline test setup.

They have two active environments: test and production. For the Marketing ROI pipeline, each environment can only hold one version of the Airflow DAG, so when multiple features are developed simultaneously, the single test environment has to be shared. Conflicts often arise because features cannot be tested in isolation. Alternatively, features can be tested in sequence, but this introduces latency. To solve this problem, they built a setup that allows a flexible number of Airflow environments.

They have identified two areas:

  • pipeline area — all the DAGs required by the Marketing ROI pipeline. What is also worth mentioning is the ability to have multiple test environments on the same server.
  • data area — the collection of Spark/Hive databases.

To enable the testing of several DAGs relating to the same pipeline simultaneously, a new Airflow environment is created for the pipeline so that testing can take place in isolation.

The process is as follows:

  1. when a pull request is opened, a new environment is created on the existing test server, which causes multiple environments to exist on the same Airflow server. It takes less than a minute to create a new environment.
  2. environments are automatically deleted when the corresponding pull request is closed.

But here a question may arise: how is this implemented when Airflow has no built-in concept of separate environments?

This is achieved through:

  • deploying the Airflow code as a zip file,
  • using the correct Jinja paths,
  • renaming DAG IDs.
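The details of Zalando's implementation are in the talk, but the DAG-ID renaming idea can be sketched roughly like this: derive a suffix from an environment identifier (here a hypothetical ENVIRONMENT_ID variable injected when the zipped code is deployed for a pull request) and append it to the DAG ID, so several copies of the same pipeline can coexist on one Airflow instance:

```python
# Rough sketch of per-environment DAG IDs; ENVIRONMENT_ID is a hypothetical
# variable set during deployment of the zipped DAG code.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

ENV_SUFFIX = os.environ.get("ENVIRONMENT_ID", "prod")  # e.g. "pr-1234" on a test environment

with DAG(
    dag_id=f"marketing_roi_{ENV_SUFFIX}",  # the same file yields a distinct DAG per environment
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    compute_roi = BashOperator(
        task_id="compute_roi",
        bash_command="echo computing ROI for this environment",
    )
```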

[Apache Airflow, John Jackson] Managing Apache Airflow at Scale

The first question to ask yourself is what running Apache Airflow at scale actually means.

It boils down to the following:

  • running multiple tasks simultaneously, both within a single DAG and across multiple DAGs running in parallel,
  • offloading work to containers,
  • complex relations between tasks, which add complexity because Airflow must work out the appropriate execution order while still running independent tasks in parallel,
  • management of multi-level DAG access control in a multi-tenant environment,
  • logging, monitoring, alerting,
  • splitting workloads between environments,
  • distributing DAGs between environments.

Secondly, one of the most important steps is understanding what aspects affect the ability to scale.

Mainly, it needs to be remembered that:

  • DAGs are parsed continuously, whether they are active or not,
  • the scheduler analyzes DAG objects to decide which tasks should be queued next, so it can become very busy in some cases,
  • the capabilities of the Airflow environment need to be considered in the context of the operations it performs.

There are also essential considerations for configuration options, as they have a definite impact on the performance of the entire Airflow setup. Depending on the use case for Airflow, its configuration is different.
Examples:

  • some environments will require scanning the DAG folder every 5 minutes, while others do not, because new files are added more or less frequently,
  • the same applies to scanning DAG files and searching for changes in them,
  • the number of processes scanning for changes can vary,
  • timeouts need to be taken into account.
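For orientation, these are some of the relevant knobs in airflow.cfg; the values below are illustrative defaults for Airflow 2.x, not recommendations:

```ini
[scheduler]
# how often (seconds) to scan the DAG folder for new files
dag_dir_list_interval = 300
# minimum interval (seconds) between re-parsing the same DAG file for changes
min_file_process_interval = 30
# number of processes parsing DAG files in parallel
parsing_processes = 2

[core]
# how long (seconds) importing a single DAG file may take before timing out
dagbag_import_timeout = 30
# how long (seconds) the DAG file processor may spend on one file
dag_file_processor_timeout = 50
```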

Scaling workloads is mainly about how to squeeze as much as possible out of Airflow. It can be achieved using the following:

  • the ability to offload work from Airflow to containers,
  • dedicated services for ETL, because it should be remembered that Airflow is a sophisticated orchestrator, and using it to implement the ETL processes themselves is not recommended,
  • the use of pools to limit the parallelism of tasks,
  • the use of priority weights to determine a task's priority in the executor's queue,
  • configuring the number of tasks executed simultaneously within a DAG and across the whole system,
  • the use of deferrable operators, so that a task which knows it must wait releases its worker's resources until it can proceed.
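A small sketch of the last few points, assuming Airflow 2.2+ and a pool named "api_calls" that has already been created via the UI or CLI (both the pool and DAG names are made up):

```python
# Sketch only: a pool limits how many of these tasks run at once, priority_weight
# orders tasks in the executor queue, and an async (deferrable) sensor frees its
# worker slot while it waits.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="scaling_workloads_demo",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_tasks=8,  # cap on concurrently running tasks within this DAG
) as dag:
    wait_for_window = TimeDeltaSensorAsync(
        task_id="wait_for_window",
        delta=timedelta(minutes=30),  # deferrable: releases the worker while waiting
    )

    call_external_api = BashOperator(
        task_id="call_external_api",
        bash_command="echo calling external API",
        pool="api_calls",     # at most <pool slots> such tasks run in parallel
        priority_weight=10,   # scheduled ahead of lower-weight tasks in the queue
    )

    wait_for_window >> call_external_api
```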

Scaling DAGs is, for instance, about the concept of dynamic DAGs and DAG factories. One thing to understand at the outset is that a Python file in the DAG folder is not a DAG itself. A DAG is a Python object generated from that file. So, based on one file, you can generate many DAG objects, or none at all, and within this file you can use whatever logic or configuration you like to decide which DAG objects get generated.
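A bare-bones illustration of that idea: a single Python file that generates several DAG objects from a config list and registers them in the module's global namespace. The config and DAG names below are made up:

```python
# Minimal DAG-factory sketch: one file produces one DAG object per entry in a
# (made-up) config list. Airflow picks up every DAG object found in globals().
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

PIPELINES = ["orders", "payments", "refunds"]  # could come from YAML, a database, an API, ...

def build_dag(name: str) -> DAG:
    with DAG(
        dag_id=f"ingest_{name}",
        start_date=datetime(2022, 10, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="ingest",
            bash_command=f"echo ingesting {name}",
        )
    return dag

# register each generated DAG so the scheduler can find it
for name in PIPELINES:
    globals()[f"ingest_{name}"] = build_dag(name)
```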

There were also some critical points concerning scaling across different environments. With multiple environments for multiple teams running DAGs at different times, there is a risk that productivity will decrease due to the complexity of managing those environments.
Therefore, it is necessary to think about how to work efficiently with multiple environments. To this end, several aspects need to be taken care of:

  • creation — speed up and organize the process of creating new environments (Terraform, Kubernetes, Docker Compose, etc.),
  • logging — dump all information about processes into one central place, where you can query this data for analysis and conclusions about the environment (Datadog, S3, Prometheus, etc.),
  • monitoring — supervise many key metrics in one place with dashboards and determine what is happening with DAGs and tasks running across multiple environments (Grafana, CloudWatch),
  • alerting — if, based on the monitored logs and metrics, you can determine what levels/values indicate malfunctioning, you can start alerting on them (callbacks, Prometheus) — see the callback sketch below.
https://www.youtube.com/watch?v=JsV04lsH8_U
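As a tiny illustration of callback-based alerting, a failure callback receives the task context and can forward an alert to whatever channel you use; the notification logic below is a placeholder:

```python
# Sketch of callback-based alerting: Airflow calls on_failure_callback with the
# task context whenever a task fails. Replace the print with a real alert
# (Slack, PagerDuty, a Prometheus pushgateway, ...).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_team(context):
    ti = context["task_instance"]
    print(f"ALERT: {ti.dag_id}.{ti.task_id} failed at {context['ts']}")

with DAG(
    dag_id="alerting_demo",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_team},
) as dag:
    flaky_task = BashOperator(task_id="flaky_task", bash_command="exit 1")
```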

[Shopify, Megan Parker] Lessons Learned From Running Apache Airflow at Scale

Shopify's use of Airflow has changed dramatically over the past two years. They run as many as 10,000 DAGs in their largest environment.

Along the way, they have encountered several challenges in scaling Airflow usage, such as:

  • fast access to the Python files defining DAGs,
  • the increasing amount of metadata can put a heavy load on the database, especially in Web UI loading times and even more during Airflow upgrades, when migrations can take hours,
  • the ability to map DAG ownership to teams and users,
  • checking each DAG against a policy derived from a manifest file (see the cluster policy sketch after this list),
  • ensuring an even load across many DAGs is very difficult,
  • using Airflow mechanisms to manage resource contention.
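The manifest-based check can be sketched with Airflow's cluster policy hook: a dag_policy function in airflow_local_settings.py that rejects DAGs not declared in a manifest. The manifest path and structure below are made up; this is not Shopify's actual code:

```python
# airflow_local_settings.py (picked up from $AIRFLOW_HOME/config or PYTHONPATH).
# Sketch of a cluster policy: reject DAGs that are not declared in a manifest.
import json

from airflow.exceptions import AirflowClusterPolicyViolation
from airflow.models import DAG

MANIFEST_PATH = "/opt/airflow/config/dag_manifest.json"  # hypothetical manifest file

def dag_policy(dag: DAG) -> None:
    with open(MANIFEST_PATH) as f:
        manifest = json.load(f)  # e.g. {"dag_ids": ["ingest_orders", ...]}
    if dag.dag_id not in manifest.get("dag_ids", []):
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} is not registered in the manifest"
        )
```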

[Yifei Sun] Secure Apache Airflow Using Custom Security Manager

Square uses Apache Airflow to manage ETL jobs in a multi-tenant environment, which means different teams share the same Airflow cluster to run their DAGs.

The company mainly wants an auth strategy that distinguishes between users of the system, controls permissions at the DAG level, and lets users log in to the Airflow console automatically, since they have already been verified earlier by a proxy.

Based on these requirements, Square demonstrated a way to leverage a custom security manager to secure Airflow on their side.
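The mechanics roughly correspond to Airflow's webserver_config.py: enable remote-user auth (the proxy sets the authenticated user) and point SECURITY_MANAGER_CLASS at a subclass of AirflowSecurityManager. The sketch below is generic, not Square's implementation:

```python
# webserver_config.py, generic sketch: trust the user identity set by an
# upstream auth proxy and plug in a custom security manager to control what
# that user may see and do (e.g. per-DAG permissions).
from airflow.www.security import AirflowSecurityManager
from flask_appbuilder.security.manager import AUTH_REMOTE_USER

# users are authenticated upstream; FAB reads the REMOTE_USER value set by the proxy
AUTH_TYPE = AUTH_REMOTE_USER

class ProxyAuthSecurityManager(AirflowSecurityManager):
    """Place to map proxied users to roles and DAG-level permissions."""
    # e.g. override user registration or sync roles from an internal service
    pass

SECURITY_MANAGER_CLASS = ProxyAuthSecurityManager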

[John Thomas, Ewa Tatarczak] Airflow Survey 2022

A survey is always an excellent way to gather information from those using the tool. This one presents various statistics on usage, deployments, community and contributions, and the future of Airflow, compared against previous years. It seems that Airflow is experiencing solid, and well-deserved, growth in popularity on the market.

Dev Quote ✍️

Code is like humor. When you have to explain it, it’s bad. — Cory House

Dev Meme 😂

https://www.reddit.com/r/ProgrammerHumor/comments/tcomip/at_the_end_of_the_day/


JacobJustCoding

Data engineering and Blockchain Enthusiast. Love coffee and the world of technology. https://www.linkedin.com/in/jakub-dabkowski/