Fundamental Concepts

This tutorial walks you through some of the fundamental Airflow concepts, objects, and their usage while writing your first DAG.

Example Pipeline definition

Here is an example of a basic pipeline definition. Do not worry if this looks complicated; a line-by-line explanation follows below.

airflow/example_dags/tutorial.py[source]

    from datetime import datetime, timedelta
    from textwrap import dedent

    # The DAG object; we'll need this to instantiate a DAG
    from airflow import DAG

    # Operators; we need this to operate!
    from airflow.operators.bash import BashOperator

    with DAG(
        "tutorial",
        # These args will get passed on to each operator
        # You can override them on a per-task basis during operator initialization
        default_args={
            "depends_on_past": False,
            "email": ["airflow@example.com"],
            "email_on_failure": False,
            "email_on_retry": False,
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
            # 'queue': 'bash_queue',
            # 'pool': 'backfill',
            # 'priority_weight': 10,
            # 'end_date': datetime(2016, 1, 1),
            # 'wait_for_downstream': False,
            # 'sla': timedelta(hours=2),
            # 'execution_timeout': timedelta(seconds=300),
            # 'on_failure_callback': some_function, # or list of functions
            # 'on_success_callback': some_other_function, # or list of functions
            # 'on_retry_callback': another_function, # or list of functions
            # 'sla_miss_callback': yet_another_function, # or list of functions
            # 'trigger_rule': 'all_success'
        },
        description="A simple tutorial DAG",
        schedule=timedelta(days=1),
        start_date=datetime(2021, 1, 1),
        catchup=False,
        tags=["example"],
    ) as dag:

        # t1, t2 and t3 are examples of tasks created by instantiating operators
        t1 = BashOperator(
            task_id="print_date",
            bash_command="date",
        )

        t2 = BashOperator(
            task_id="sleep",
            depends_on_past=False,
            bash_command="sleep 5",
            retries=3,
        )

        t1.doc_md = dedent(
            """\
        #### Task Documentation
        You can document your task using the attributes `doc_md` (markdown),
        `doc` (plain text), `doc_rst`, `doc_json`, `doc_yaml` which gets
        rendered in the UI's Task Instance Details page.
        ![img](http://montcs.bloomu.edu/~bobmon/Semesters/2012-01/491/import%20soul.png)
        **Image Credit:** Randall Munroe, [XKCD](https://xkcd.com/license.html)
        """
        )

        dag.doc_md = __doc__  # providing that you have a docstring at the beginning of the DAG; OR
        dag.doc_md = """
        This is a documentation placed anywhere
        """  # otherwise, type it like this

        templated_command = dedent(
            """
        {% for i in range(5) %}
            echo "{{ ds }}"
            echo "{{ macros.ds_add(ds, 7)}}"
        {% endfor %}
        """
        )

        t3 = BashOperator(
            task_id="templated",
            depends_on_past=False,
            bash_command=templated_command,
        )

        t1 >> [t2, t3]

It’s a DAG definition file

One thing to wrap your head around (it may not be very intuitive for everyone at first) is that this Airflow Python script is really just a configuration file specifying the DAG’s structure as code. The actual tasks defined here will run in a different context from the context of this script. Different tasks run on different workers at different points in time, which means that this script cannot be used to cross communicate between tasks. Note that for this purpose we have a more advanced feature called XComs.

People sometimes think of the DAG definition file as a place where they can do some actual data processing - that is not the case at all! The script’s purpose is to define a DAG object. It needs to evaluate quickly (seconds, not minutes) since the scheduler will execute it periodically to reflect the changes if any.
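
As a hedged illustration of the difference between parse time and run time (the DAG id and the commented-out query call below are made up, not part of the tutorial file):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Anything at module level runs every time the scheduler parses this file,
    # so avoid heavy computation, network calls, or database queries up here.
    # rows = run_expensive_query()   # hypothetical example of what NOT to do

    with DAG("parse_time_example", start_date=datetime(2021, 1, 1), schedule=None) as dag:
        # Work placed inside a task only executes when the task instance runs,
        # on a worker, not at parse time.
        BashOperator(task_id="do_work", bash_command="echo 'heavy lifting happens at run time'")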

Importing Modules

An Airflow pipeline is just a Python script that happens to define an Airflow DAG object. Let’s start by importing the libraries we will need.

airflow/example_dags/tutorial.py[source]

    from datetime import datetime, timedelta
    from textwrap import dedent

    # The DAG object; we'll need this to instantiate a DAG
    from airflow import DAG

    # Operators; we need this to operate!
    from airflow.operators.bash import BashOperator

See Modules Management for details on how Python and Airflow manage modules.

Default Arguments

We’re about to create a DAG and some tasks, and we have the choice to explicitly pass a set of arguments to each task’s constructor (which would become redundant), or (better!) we can define a dictionary of default parameters that we can use when creating tasks.

airflow/example_dags/tutorial.py[source]

    # These args will get passed on to each operator
    # You can override them on a per-task basis during operator initialization
    default_args={
        "depends_on_past": False,
        "email": ["airflow@example.com"],
        "email_on_failure": False,
        "email_on_retry": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        # 'queue': 'bash_queue',
        # 'pool': 'backfill',
        # 'priority_weight': 10,
        # 'end_date': datetime(2016, 1, 1),
        # 'wait_for_downstream': False,
        # 'sla': timedelta(hours=2),
        # 'execution_timeout': timedelta(seconds=300),
        # 'on_failure_callback': some_function, # or list of functions
        # 'on_success_callback': some_other_function, # or list of functions
        # 'on_retry_callback': another_function, # or list of functions
        # 'sla_miss_callback': yet_another_function, # or list of functions
        # 'trigger_rule': 'all_success'
    },

For more information about the BaseOperator’s parameters and what they do, refer to the airflow.models.BaseOperator documentation.

Also, note that you could easily define different sets of arguments that would serve different purposes. An example of that would be to have different settings between a production and development environment.
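
For instance, a minimal sketch of switching default_args per environment; the DEPLOY_ENV variable and the specific values are illustrative, not something Airflow defines:

    import os
    from datetime import timedelta

    # Hypothetical: pick a default_args set based on the deployment environment.
    default_args_by_env = {
        "dev": {"retries": 0, "email_on_failure": False},
        "prod": {"retries": 3, "retry_delay": timedelta(minutes=5), "email_on_failure": True},
    }

    default_args = default_args_by_env[os.environ.get("DEPLOY_ENV", "dev")]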

Instantiate a DAG

We’ll need a DAG object to nest our tasks into. Here we pass a string that defines the dag_id, which serves as a unique identifier for your DAG. We also pass the default argument dictionary that we just defined and define a schedule of 1 day for the DAG.

airflow/example_dags/tutorial.py[source]

    with DAG(
        "tutorial",
        # These args will get passed on to each operator
        # You can override them on a per-task basis during operator initialization
        default_args={
            "depends_on_past": False,
            "email": ["airflow@example.com"],
            "email_on_failure": False,
            "email_on_retry": False,
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
            # 'queue': 'bash_queue',
            # 'pool': 'backfill',
            # 'priority_weight': 10,
            # 'end_date': datetime(2016, 1, 1),
            # 'wait_for_downstream': False,
            # 'sla': timedelta(hours=2),
            # 'execution_timeout': timedelta(seconds=300),
            # 'on_failure_callback': some_function, # or list of functions
            # 'on_success_callback': some_other_function, # or list of functions
            # 'on_retry_callback': another_function, # or list of functions
            # 'sla_miss_callback': yet_another_function, # or list of functions
            # 'trigger_rule': 'all_success'
        },
        description="A simple tutorial DAG",
        schedule=timedelta(days=1),
        start_date=datetime(2021, 1, 1),
        catchup=False,
        tags=["example"],
    ) as dag:

Operators

An operator defines a unit of work for Airflow to complete. Using operators is the classic approach to defining work in Airflow. For some use cases, it’s better to use the TaskFlow API to define work in a Pythonic context as described in Working with TaskFlow. For now, using operators helps to visualize task dependencies in our DAG code.

All operators inherit from the BaseOperator, which includes all of the required arguments for running work in Airflow. From here, each operator includes unique arguments for the type of work it’s completing. Some of the most popular operators are the PythonOperator, the BashOperator, and the KubernetesPodOperator.

Airflow completes work based on the arguments you pass to your operators. In this tutorial, we use the BashOperator to run a few bash scripts.
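
As a hedged contrast with the BashOperator (this snippet is not part of the tutorial file; the callable name is made up and the import path assumes Airflow 2), a PythonOperator takes a python_callable instead of a bash_command:

    from airflow.operators.python import PythonOperator


    def print_hello():
        print("hello")


    # Placed inside the `with DAG(...) as dag:` block shown above.
    hello_task = PythonOperator(task_id="print_hello", python_callable=print_hello)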

Tasks

To use an operator in a DAG, you have to instantiate it as a task. Tasks determine how to execute your operator’s work within the context of a DAG.

In the following example, we instantiate the BashOperator as two separate tasks in order to run two separate bash scripts. The first argument for each instantiation, task_id, acts as a unique identifier for the task.

airflow/example_dags/tutorial.py[source]

    t1 = BashOperator(
        task_id="print_date",
        bash_command="date",
    )

    t2 = BashOperator(
        task_id="sleep",
        depends_on_past=False,
        bash_command="sleep 5",
        retries=3,
    )

Notice how we pass a mix of operator specific arguments (bash_command) and an argument common to all operators (retries) inherited from BaseOperator to the operator’s constructor. This is simpler than passing every argument for every constructor call. Also, notice that in the second task we override the retries parameter with 3.

The precedence rules for a task are as follows:

  1. Explicitly passed arguments

  2. Values that exist in the default_args dictionary

  3. The operator’s default value, if one exists

A task must include or inherit the arguments task_id and owner; otherwise, Airflow will raise an exception.
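
Applied to the tasks above, the precedence rules work out roughly like this:

    # t1: no explicit retries            -> uses "retries": 1 from default_args
    # t2: retries=3 passed explicitly    -> overrides default_args
    # retry_delay on both tasks          -> timedelta(minutes=5) from default_args
    # any argument set in neither place  -> falls back to the operator's own default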

Templating with Jinja

Airflow leverages the power of Jinja Templating and provides the pipeline author with a set of built-in parameters and macros. Airflow also provides hooks for the pipeline author to define their own parameters, macros and templates.

This tutorial barely scratches the surface of what you can do with templating in Airflow, but the goal of this section is to let you know this feature exists, get you familiar with double curly brackets, and point to the most common template variable: {{ ds }} (today’s “date stamp”).

airflow/example_dags/tutorial.py[source]

    templated_command = dedent(
        """
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
    {% endfor %}
    """
    )

    t3 = BashOperator(
        task_id="templated",
        depends_on_past=False,
        bash_command=templated_command,
    )

Notice that the templated_command contains code logic in {% %} blocks, references parameters like {{ ds }}, and calls a function as in {{ macros.ds_add(ds, 7)}}.

Files can also be passed to the bash_command argument, like bash_command='templated_command.sh', where the file location is relative to the directory containing the pipeline file (tutorial.py in this case). This may be desirable for many reasons, like separating your script’s logic and pipeline code, allowing for proper code highlighting in files composed in different languages, and general flexibility in structuring pipelines. It is also possible to define your template_searchpath as pointing to any folder locations in the DAG constructor call.
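
A minimal sketch of that approach, assuming a file named templated_command.sh exists either next to the DAG file or in one of the listed folders; the search path shown is illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        "tutorial_with_script",
        start_date=datetime(2021, 1, 1),
        schedule=None,
        # Extra folders (besides the DAG file's directory) searched for template files.
        template_searchpath=["/opt/airflow/scripts"],
    ) as dag:
        # The .sh file is located via the search path and rendered as a Jinja
        # template before it is executed.
        t = BashOperator(task_id="templated_from_file", bash_command="templated_command.sh")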

Using that same DAG constructor call, it is possible to define user_defined_macros which allow you to specify your own variables. For example, passing dict(foo='bar') to this argument allows you to use {{ foo }} in your templates. Moreover, specifying user_defined_filters allows you to register your own filters. For example, passing dict(hello=lambda name: 'Hello %s' % name) to this argument allows you to use {{ 'world' | hello }} in your templates. For more information regarding custom filters have a look at the Jinja Documentation.
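
Putting both of those examples into a constructor call might look like the following sketch; the DAG id and task are made up for illustration:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        "custom_macros_example",
        start_date=datetime(2021, 1, 1),
        schedule=None,
        user_defined_macros={"foo": "bar"},  # usable as {{ foo }}
        user_defined_filters={"hello": lambda name: "Hello %s" % name},  # usable as {{ 'world' | hello }}
    ) as dag:
        # Renders to: echo 'bar'; echo "Hello world"
        BashOperator(task_id="greet", bash_command="echo '{{ foo }}'; echo \"{{ 'world' | hello }}\"")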

For more information on the variables and macros that can be referenced in templates, make sure to read through the Templates reference.

Adding DAG and Tasks documentation

We can add documentation for the DAG or for each individual task. DAG documentation only supports markdown so far, while task documentation supports plain text, markdown, reStructuredText, json, and yaml. The DAG documentation can be written as a doc string at the beginning of the DAG file (recommended), or anywhere else in the file. Below you can find some examples of how to implement task and DAG docs, as well as screenshots:

airflow/example_dags/tutorial.py[source]

    t1.doc_md = dedent(
        """\
    #### Task Documentation
    You can document your task using the attributes `doc_md` (markdown),
    `doc` (plain text), `doc_rst`, `doc_json`, `doc_yaml` which gets
    rendered in the UI's Task Instance Details page.
    ![img](http://montcs.bloomu.edu/~bobmon/Semesters/2012-01/491/import%20soul.png)
    **Image Credit:** Randall Munroe, [XKCD](https://xkcd.com/license.html)
    """
    )

    dag.doc_md = __doc__  # providing that you have a docstring at the beginning of the DAG; OR
    dag.doc_md = """
    This is a documentation placed anywhere
    """  # otherwise, type it like this

[Screenshots: task documentation and DAG documentation as rendered in the UI]

Setting up Dependencies

We have tasks t1, t2 and t3 that do not depend on each other. Here are a few ways you can define dependencies between them:

    t1.set_downstream(t2)

    # This means that t2 will depend on t1
    # running successfully to run.
    # It is equivalent to:
    t2.set_upstream(t1)

    # The bit shift operator can also be
    # used to chain operations:
    t1 >> t2

    # And the upstream dependency with the
    # bit shift operator:
    t2 << t1

    # Chaining multiple dependencies becomes
    # concise with the bit shift operator:
    t1 >> t2 >> t3

    # A list of tasks can also be set as
    # dependencies. These operations
    # all have the same effect:
    t1.set_downstream([t2, t3])
    t1 >> [t2, t3]
    [t2, t3] << t1

Note that when executing your script, Airflow will raise exceptions when it finds cycles in your DAG or when a dependency is referenced more than once.

Using time zones

Creating a time zone aware DAG is quite simple. Just make sure to supply time zone aware dates using pendulum. Don’t try to use the standard library timezones, as they are known to have limitations, and we deliberately disallow using them in DAGs.
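
For example, a minimal sketch of a time zone aware start_date using pendulum (the DAG id and time zone here are just illustrative):

    import pendulum

    from airflow import DAG

    with DAG(
        "tz_aware_example",
        # pendulum.datetime attaches the time zone directly to the date.
        start_date=pendulum.datetime(2021, 1, 1, tz="Europe/Amsterdam"),
        schedule="@daily",
        catchup=False,
    ) as dag:
        ...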

Recap

Alright, so we have a pretty basic DAG. At this point your code should look something like this:

airflow/example_dags/tutorial.py[source]

    from datetime import datetime, timedelta
    from textwrap import dedent

    # The DAG object; we'll need this to instantiate a DAG
    from airflow import DAG

    # Operators; we need this to operate!
    from airflow.operators.bash import BashOperator

    with DAG(
        "tutorial",
        # These args will get passed on to each operator
        # You can override them on a per-task basis during operator initialization
        default_args={
            "depends_on_past": False,
            "email": ["airflow@example.com"],
            "email_on_failure": False,
            "email_on_retry": False,
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
            # 'queue': 'bash_queue',
            # 'pool': 'backfill',
            # 'priority_weight': 10,
            # 'end_date': datetime(2016, 1, 1),
            # 'wait_for_downstream': False,
            # 'sla': timedelta(hours=2),
            # 'execution_timeout': timedelta(seconds=300),
            # 'on_failure_callback': some_function, # or list of functions
            # 'on_success_callback': some_other_function, # or list of functions
            # 'on_retry_callback': another_function, # or list of functions
            # 'sla_miss_callback': yet_another_function, # or list of functions
            # 'trigger_rule': 'all_success'
        },
        description="A simple tutorial DAG",
        schedule=timedelta(days=1),
        start_date=datetime(2021, 1, 1),
        catchup=False,
        tags=["example"],
    ) as dag:

        # t1, t2 and t3 are examples of tasks created by instantiating operators
        t1 = BashOperator(
            task_id="print_date",
            bash_command="date",
        )

        t2 = BashOperator(
            task_id="sleep",
            depends_on_past=False,
            bash_command="sleep 5",
            retries=3,
        )

        t1.doc_md = dedent(
            """\
        #### Task Documentation
        You can document your task using the attributes `doc_md` (markdown),
        `doc` (plain text), `doc_rst`, `doc_json`, `doc_yaml` which gets
        rendered in the UI's Task Instance Details page.
        ![img](http://montcs.bloomu.edu/~bobmon/Semesters/2012-01/491/import%20soul.png)
        **Image Credit:** Randall Munroe, [XKCD](https://xkcd.com/license.html)
        """
        )

        dag.doc_md = __doc__  # providing that you have a docstring at the beginning of the DAG; OR
        dag.doc_md = """
        This is a documentation placed anywhere
        """  # otherwise, type it like this

        templated_command = dedent(
            """
        {% for i in range(5) %}
            echo "{{ ds }}"
            echo "{{ macros.ds_add(ds, 7)}}"
        {% endfor %}
        """
        )

        t3 = BashOperator(
            task_id="templated",
            depends_on_past=False,
            bash_command=templated_command,
        )

        t1 >> [t2, t3]

Testing

Running the Script

Time to run some tests. First, let’s make sure the pipeline is parsed successfully.

Let’s assume we are saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg. The default location for your DAGs is ~/airflow/dags.

    python ~/airflow/dags/tutorial.py

If the script does not raise an exception it means that you have not done anything horribly wrong, and that your Airflow environment is somewhat sound.

Command Line Metadata Validation

Let’s run a few commands to validate this script further.

    # initialize the database tables
    airflow db init

    # print the list of active DAGs
    airflow dags list

    # prints the list of tasks in the "tutorial" DAG
    airflow tasks list tutorial

    # prints the hierarchy of tasks in the "tutorial" DAG
    airflow tasks list tutorial --tree

Testing

Let’s test by running the actual task instances for a specific date. The date specified in this context is called the logical date (also called execution date for historical reasons), which simulates the scheduler running your task or DAG for a specific date and time, even though it physically will run now (or as soon as its dependencies are met).

We said the scheduler runs your task for a specific date and time, not at. This is because each run of a DAG conceptually represents not a specific date and time, but an interval between two times, called a data interval. A DAG run’s logical date is the start of its data interval.
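
Concretely, for the daily schedule defined above, the numbers line up roughly like this for the 2015-06-01 logical date used in the commands below (a sketch, not output from Airflow):

    # logical date (a.k.a. execution date):   2015-06-01
    # data interval:                          2015-06-01 00:00 -> 2015-06-02 00:00
    # when the scheduler would trigger it:    on or after 2015-06-02 00:00, once the interval ends
    # what {{ ds }} renders to in templates:  "2015-06-01"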

    # command layout: command subcommand [dag_id] [task_id] [(optional) date]

    # testing print_date
    airflow tasks test tutorial print_date 2015-06-01

    # testing sleep
    airflow tasks test tutorial sleep 2015-06-01

Now remember what we did with templating earlier? See how this template gets rendered and executed by running this command:

    # testing templated
    airflow tasks test tutorial templated 2015-06-01

This should result in displaying a verbose log of events and ultimately running your bash command and printing the result.

Note that the airflow tasks test command runs task instances locally, outputs their log to stdout (on screen), does not bother with dependencies, and does not communicate state (running, success, failed, …) to the database. It simply allows testing a single task instance.

The same applies to airflow dags test, but on a DAG level. It performs a single DAG run of the given DAG id. While it does take task dependencies into account, no state is registered in the database. It is convenient for locally testing a full run of your DAG, provided that, for example, any data one of your tasks expects at some location is already available.

Backfill

Everything looks like it’s running fine, so let’s run a backfill. backfill will respect your dependencies, emit logs into files, and talk to the database to record status. If you have a webserver up, you will be able to track the progress; run airflow webserver if you are interested in following your backfill visually as it progresses.

Note that if you use depends_on_past=True, individual task instances will depend on the success of their previous task instance (that is, previous according to the logical date). Task instances with their logical dates equal to start_date will disregard this dependency because there would be no past task instances created for them.

You may also want to consider wait_for_downstream=True when using depends_on_past=True. While depends_on_past=True causes a task instance to depend on the success of its previous task_instance, wait_for_downstream=True will cause a task instance to also wait for all task instances immediately downstream of the previous task instance to succeed.
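
A hedged sketch of where these flags could be set, here via default_args so every task in the DAG inherits them (the values are illustrative; both can also be passed per task):

    default_args = {
        # Each task instance waits for its own previous (by logical date) run to succeed...
        "depends_on_past": True,
        # ...and also for the tasks immediately downstream of that previous run.
        "wait_for_downstream": True,
    }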

The date range in this context is a start_date and optionally an end_date, which are used to populate the run schedule with task instances from this DAG.

    # optional, start a web server in debug mode in the background
    # airflow webserver --debug &

    # start your backfill on a date range
    airflow dags backfill tutorial \
        --start-date 2015-06-01 \
        --end-date 2015-06-07

What’s Next?

That’s it! You have written, tested and backfilled your very first Airflow pipeline. Merging your code into a repository that has a Scheduler running against it should result in the DAG being triggered and run every day.

Here are a few things you might want to do next:

See also

  • Continue to the next step of the tutorial: Working with TaskFlow

  • Skip to the Core Concepts section for a detailed explanation of Airflow concepts such as DAGs, Tasks, Operators, and more