A Practical Guide to Modern Airflow



Image by Author

 

Airflow was created to tame the complexity of managing multiple pipelines and workflows. Before Airflow, many organizations relied on cron jobs, custom scripts, and other ad hoc tooling to process the large volumes of data generated by millions of users. These solutions were hard to maintain, inflexible, and lacked visibility: there was no easy way to see the status of running workflows, spot failure points, or debug errors.

Apache Airflow, as it is popularly known today, was started as Airflow by Maxime Beauchemin at Airbnb in October 2014. It has been open source from the outset, and in June 2015 it was officially published under the Airbnb GitHub organization. In March 2016, the project joined the Apache Software Foundation incubation program and has since been known as Apache Airflow.

The full list of project contributors is available on the project's GitHub repository.

Many data professionals (data engineers, machine learning engineers) and major companies, such as Airbnb and Netflix, use Apache Airflow daily. That is why this article will show you how to install and use it.

 

Prerequisites

 
A good working knowledge of the Python programming language is needed to get the most out of this article, since both the code snippets and the Airflow framework itself are written in Python. This article will familiarize you with the Apache Airflow platform, show you how to install it, and walk you through a few simple tasks.

 

What is Apache Airflow?

 
The Apache Airflow official documentation defines Apache Airflow as “an open-source platform for developing, scheduling, and monitoring batch-oriented workflows”.

The platform’s Python framework allows users to build workflows that connect with virtually any technology. Airflow can be deployed as a single unit on your laptop or scaled out across a distributed system to support workflows as large as you need.

At the core of Airflow’s design is its programmatic nature: workflows are represented as Python code.

 

Key Components in Apache Airflow

 

1. DAG

A DAG (Directed Acyclic Graph) is the collection of tasks you intend to run, organized in a way that reflects their relationships and dependencies. It represents the workflow as a graph structure in which each task is a node and the edges are the dependencies between tasks.

“Directed” means that tasks are executed in a defined order, and “Acyclic” means there can be no circular dependencies, so a workflow can never loop back on itself. DAGs are written as Python scripts and placed in Airflow’s DAG_FOLDER.
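
Here is a minimal sketch of what that looks like in code (assuming Airflow 2.4 or newer, which accepts the schedule argument and ships the no-op EmptyOperator; the dag_id and task_ids are arbitrary names chosen for illustration). Each task is a node, and the >> arrows are the directed edges between them:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Three placeholder tasks (nodes) whose dependency arrows (edges)
# form a directed graph with no cycles
with DAG(dag_id="example_graph", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # extract must finish before transform, which must finish before load
    extract >> transform >> load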

 

2. Tasks

Tasks are the individual units of work performed within a DAG. Examples include running an SQL query, reading from a database, and so on.

 

3. Operators

Operators are the building blocks used to create tasks within a DAG. Each operator defines the kind of work a task performs; this could be sending an email, executing a bash script, and so on. While the DAG orders the tasks, the operator determines what each task actually does. Some of the common operators in Airflow are the BashOperator (for executing bash commands), the EmailOperator (for sending emails), and the PythonOperator (for calling an arbitrary Python function), as in the sketch below.
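
As a quick illustration, this sketch (assuming Airflow 2.4 or newer; the dag_id, task_ids, and greet function are made up for this example) pairs a BashOperator with a PythonOperator inside one DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    print("Hello from a Python callable")


with DAG(dag_id="operator_examples", start_date=datetime(2025, 1, 1), schedule=None) as dag:
    # BashOperator: this task runs a shell command
    print_date = BashOperator(task_id="print_date", bash_command="date")

    # PythonOperator: this task calls an arbitrary Python function
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)

    print_date >> say_hello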

 

4. Scheduling

Scheduling in Airflow is handled by the scheduler. It monitors all DAGs and their tasks and triggers task instances once their dependencies (the upstream tasks that must complete first) are met. The scheduler runs continuously behind the scenes, inspecting active tasks to decide whether they can be triggered.
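
For example, a DAG like the sketch below (a hypothetical daily_report DAG, assuming Airflow 2.4 or newer where the schedule argument is accepted) is picked up by the scheduler, which creates a new run every day at midnight once the start date has passed:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# "@daily" asks the scheduler for one run per day at midnight;
# catchup=False tells it not to backfill runs for past dates
with DAG(
    dag_id="daily_report",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    build_report = BashOperator(task_id="build_report", bash_command="echo building report")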

 

5. XComs

XCom is short for “cross-communication.” XComs let tasks exchange small pieces of data with one another. Each XCom consists of a key, a value, and a timestamp, along with the task and DAG that created it.
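
The easiest way to see XComs in action is the TaskFlow API, where a task's return value is automatically pushed as an XCom and pulled by any downstream task that uses it. A minimal sketch (the dag_id and task names are made up for this example):

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="xcom_demo", start_date=datetime(2025, 1, 1), schedule=None) as dag:

    @task()
    def produce():
        # The return value is pushed to XCom under the default key "return_value"
        return 42

    @task()
    def consume(value):
        # Airflow pulls the upstream task's XCom and passes it in as an argument
        print(f"received {value} via XCom")

    consume(produce())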

 

6. Hooks

A hook is an abstraction layer, or interface, to an external platform or data store, such as a database or a cloud service. It lets tasks connect to these systems without having to handle authentication or the low-level communication details themselves.
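
For instance, the sketch below uses the PostgresHook from the Postgres provider (this assumes the apache-airflow-providers-postgres package is installed and that "my_postgres" is a connection you have configured under Admin -> Connections; the users table is likewise hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

with DAG(dag_id="hook_demo", start_date=datetime(2025, 1, 1), schedule=None) as dag:

    @task()
    def count_rows():
        # The hook reads the host and credentials from the Airflow connection
        # "my_postgres", so the task never handles authentication itself
        hook = PostgresHook(postgres_conn_id="my_postgres")
        result = hook.get_first("SELECT COUNT(*) FROM users")
        print(result)

    count_rows()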

 

7. Web UI

The web UI provides a clean interface for visually monitoring and troubleshooting data pipelines. See the image below:
 

Photo from Apache Airflow Documentation

 

 

A Guide on How to Run Apache Airflow on Your Machine

 
Setting up Apache Airflow on your machine typically involves setting up the Airflow environment, initializing the database, and starting the Airflow webserver and scheduler.

Follow the steps below:

Step 1: Set up a Python virtual environment for the project

python3 -m venv airflow_tutorial

 

Step 2: Activate the created virtual environment

On Mac/Linux

source airflow_tutorial/bin/activate

 

On Windows

airflow_tutorial\Scripts\activate

 

Step 3: Install Apache Airflow
Run the following command in your terminal, inside the activated virtual environment.

pip install apache-airflow

 

Step 4: Set up the Airflow directory and initialize the database
Initialize the Airflow metadata database by running this command in your terminal:

airflow db init

This generates the necessary tables and configurations in the ~/airflow directory by default. (On recent Airflow releases, airflow db migrate is the preferred equivalent of this command.)

Step 5: Create an Airflow user
Creating an admin user lets you log in to the Airflow web interface. In your terminal, run:

airflow users create \
    --username admin \
    --firstname FirstName \
    --lastname LastName \
    --role Admin \
    --email admin@example.com

 

After running this command, you will be prompted to enter an admin password of your choice.

Step 6: Start the Airflow webserver
Starting the webserver gives you access to the Airflow UI. Run this command in your terminal:

airflow webserver --port 8080

 

Open the URL shown in your console (http://localhost:8080 by default, given the port above) and log in with the credentials you created in step 5.

Step 7: Start the Airflow scheduler
The scheduler handles the scheduling and triggering of tasks. Open a new terminal window and activate the same virtual environment as in step 2. Then start the scheduler by running this command in your terminal:

airflow scheduler

Step 8: Create and run a DAG of choice
Remember, in step 4 we initialized the airflow directory, which by default lives in your home directory (~/airflow). Create a dags folder inside it and place your DAG files there, for example ~/airflow/dags/dags_tutorial.py.

In your dags_tutorial.py file, write the following code:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

# A DAG represents a workflow, a collection of tasks
with DAG(dag_id="demo", start_date=datetime(2025, 1, 5), schedule="0 0 * * *") as dag:
    # Tasks are represented as operators
    hello = BashOperator(task_id="hello", bash_command="echo hello")

    @task()
    def airflow():
        print("airflow")

    # Set dependencies between tasks
    hello >> airflow()

 

Shortly after you save this file, the scheduler will pick up the DAG and it will appear in the web UI, as shown below.

 

Image by Author

 

Conclusion

 
Apache Airflow is an open-source platform that greatly simplifies the management of multiple workflows and pipelines. It offers both a programmatic, code-first approach and a web UI for monitoring and troubleshooting tasks.

In this article, we learned about this technology and used it to create a simple DAG. I recommend incorporating Airflow into your own projects to become familiar with it quickly. Thanks for reading.
 
 

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.


