Logging for Tasks

Airflow writes task logs so that you can view the logs for each task separately in the Airflow UI. Core Airflow provides the FileTaskHandler interface, which writes task logs to a file and includes a mechanism to serve them from workers while tasks are running. The Apache Airflow Community also releases providers for many services (Provider packages), and some of them provide handlers that extend the logging capability of Apache Airflow. You can see all of these providers in Writing logs.

When using the S3, GCS, WASB, or OSS remote logging service, you can delete the local log files after they are uploaded to the remote location by setting the following config:

    [logging]
    remote_logging = True
    remote_base_log_folder = schema://path/to/remote/log
    delete_local_logs = True

Configuring logging

For the default handler, FileTaskHandler, you can specify the directory in which to place log files by setting base_log_folder in airflow.cfg. By default, logs are placed in the logs directory under AIRFLOW_HOME.

Note

For more information on setting the configuration, see Setting Configuration Options.

Log files for tasks are named according to the following default patterns:

  • For normal tasks: dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log.

  • For dynamically mapped tasks: dag_id={dag_id}/run_id={run_id}/task_id={task_id}/map_index={map_index}/attempt={try_number}.log.

These patterns can be adjusted with the log_filename_template option.
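As a rough illustration of the resulting layout, the sketch below fills in the two default patterns quoted above with plain str.format. The helper relative_log_path and all of the values are hypothetical, and this is not how Airflow itself renders log_filename_template; it only shows what the relative paths look like.

```python
# Patterns copied from the text above; the values used below are made up.
NORMAL_PATTERN = "dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log"
MAPPED_PATTERN = (
    "dag_id={dag_id}/run_id={run_id}/task_id={task_id}/"
    "map_index={map_index}/attempt={try_number}.log"
)


def relative_log_path(dag_id, run_id, task_id, try_number, map_index=None):
    """Return a log path relative to base_log_folder (hypothetical helper)."""
    fields = dict(dag_id=dag_id, run_id=run_id, task_id=task_id, try_number=try_number)
    if map_index is not None:
        # Dynamically mapped tasks get an extra map_index path segment.
        return MAPPED_PATTERN.format(map_index=map_index, **fields)
    return NORMAL_PATTERN.format(**fields)


normal = relative_log_path("example_dag", "manual__2024-01-01", "extract", 1)
mapped = relative_log_path("example_dag", "manual__2024-01-01", "extract", 2, map_index=3)
print(normal)
print(mapped)
```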

In addition, you can supply a remote location to store current logs and backups.

Writing to task logs from your code

Airflow uses the standard Python logging framework to write logs, and for the duration of a task, the root logger is configured to write to the task’s log.

Most operators write logs to the task log automatically, because they have a log logger that you can use to write to the task log. This logger is created and configured by LoggingMixin, which all operators derive from. In addition, because of how the root logger is handled, any standard logger (using default settings) that propagates logging to the root will also write to the task log.

So if you want to write to the task log from your own code, you can do any of the following:

  • Log with the self.log logger from BaseOperator

  • Use standard print statements to print to stdout (not recommended, but in some cases it can be useful)

  • Use the standard logger approach of creating a logger using the Python module name and using it to write to the task log

This is the usual way loggers are used directly in Python code:

    import logging
    logger = logging.getLogger(__name__)
    logger.info("This is a log message")
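The propagation behaviour described above can be tried outside Airflow with the standard library alone. In the sketch below, ListHandler is a hypothetical stand-in for the task log handler, and the logger names are invented; the point is only that a logger with default settings reaches whatever handler is attached to the root logger, while a logger with propagate set to False does not.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in for the task log handler: collects messages in a list."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(f"{record.name} {record.levelname} {record.getMessage()}")


# Stand-in for the task-time setup: a handler is attached to the root logger.
task_log = ListHandler()
root = logging.getLogger()
root.addHandler(task_log)
root.setLevel(logging.INFO)

# A logger with default settings propagates its records to the root logger.
logger = logging.getLogger("my_company.my_module")  # hypothetical module name
logger.info("This is a log message")

# A logger that turns propagation off never reaches the root handler.
isolated = logging.getLogger("my_company.isolated")  # hypothetical name
isolated.propagate = False
isolated.info("This message is not captured")

root.removeHandler(task_log)
print(task_log.lines)
```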

Interleaving of logs

Airflow’s remote task logging handlers can broadly be separated into two categories: streaming handlers (such as Elasticsearch, AWS CloudWatch, and GCP operations logging, formerly Stackdriver) and blob storage handlers (e.g. S3, GCS, WASB).

For blob storage handlers, depending on the state of the task, logs could be in several different places and spread across multiple files. For this reason, Airflow needs to check all locations and interleave what it finds, which requires parsing the timestamp of each line. If you are using a custom formatter, you may need to override the default parser by providing a callable name in the Airflow setting [logging] interleave_timestamp_parser.
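The idea of interleaving by timestamp can be sketched with heapq.merge. Everything here is an assumption made for illustration: the line format, the parse_timestamp function (including its signature, one line in and one datetime out), and the three sample sources. It is not Airflow's implementation, and it assumes every line starts with a timestamp and that each source is already in order.

```python
import heapq
from datetime import datetime


def parse_timestamp(line):
    """Hypothetical parser for lines starting with '[YYYY-MM-DDTHH:MM:SS.ffffff]'."""
    stamp = line[1:line.index("]")]
    return datetime.fromisoformat(stamp)


def interleave(*sources):
    """Merge already-ordered log sources into one list ordered by timestamp."""
    return list(heapq.merge(*sources, key=parse_timestamp))


# Made-up log lines found in three different places for the same task.
worker_lines = [
    "[2024-01-01T10:00:00.000000] task started",
    "[2024-01-01T10:00:05.000000] task deferred",
]
triggerer_lines = [
    "[2024-01-01T10:00:06.000000] trigger starting",
    "[2024-01-01T10:00:09.000000] trigger fired",
]
remote_lines = [
    "[2024-01-01T10:00:10.000000] task resumed",
]

merged = interleave(remote_lines, triggerer_lines, worker_lines)
for line in merged:
    print(line)
```

A custom formatter that writes timestamps in another layout would need a different parse_timestamp, which is the situation the interleave_timestamp_parser setting is meant for.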

For streaming handlers, no matter the task phase or location of execution, all log messages can be sent to the logging service with the same identifier so generally speaking there isn’t a need to check multiple sources and interleave.

Troubleshooting

If you want to check which task handler is currently set, you can use the airflow info command, as in the example below.

    $ airflow info
    ...
    airflow on PATH: [True]
    Executor: [SequentialExecutor]
    Task Logging Handlers: [StackdriverTaskHandler]
    SQL Alchemy Conn: [sqlite://///root/airflow/airflow.db]
    DAGs Folder: [/root/airflow/dags]
    Plugins Folder: [/root/airflow/plugins]
    Base Log Folder: [/root/airflow/logs]

You can also run airflow config list to check that the logging configuration options have valid values.

Advanced configuration

Not all configuration options are available from the airflow.cfg file. Some configuration options require that the logging config class be overridden. This can be done via the logging_config_class option in the airflow.cfg file. This option should specify the import path to a configuration compatible with logging.config.dictConfig(). If your file is not in a standard import location, then you should set the PYTHONPATH environment variable.

Follow the steps below to enable a custom logging config class:

  1. Start by setting the PYTHONPATH environment variable to a known directory, e.g. ~/airflow/

    export PYTHONPATH=~/airflow/
  2. Create a directory to store the config file, e.g. ~/airflow/config

  3. Create a file called ~/airflow/config/log_config.py with the following contents:

    from copy import deepcopy
    from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
    LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
  4. At the end of the file, add code to modify the default dictionary configuration.

  5. Update $AIRFLOW_HOME/airflow.cfg to contain:

    [logging]
    remote_logging = True
    logging_config_class = log_config.LOGGING_CONFIG
  6. Restart the application.

See Modules Management for details on how Python and Airflow manage modules.
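The "modify the default dictionary" step in the procedure above might look like the following sketch. To keep it runnable without Airflow installed, DEFAULT_LOGGING_CONFIG here is a small stand-in dictionary, and its formatter, handler, and logger names (plain, console, example.task) are invented; the real dictionary is much larger and uses its own keys. The pattern is the same: deep-copy, then edit the copy.

```python
import logging
import logging.config
from copy import deepcopy

# Stand-in for airflow.config_templates.airflow_local_settings.DEFAULT_LOGGING_CONFIG.
# The keys below are illustrative only and do not mirror the real dictionary.
DEFAULT_LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "loggers": {
        "example.task": {"handlers": ["console"], "level": "INFO", "propagate": False},
    },
}

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Modifications go at the end of the file, after the deep copy,
# so the original default dictionary is left untouched.
LOGGING_CONFIG["formatters"]["plain"]["format"] = (
    "%(asctime)s [%(name)s] %(levelname)s %(message)s"
)
LOGGING_CONFIG["loggers"]["example.task"]["level"] = "DEBUG"

# Sanity check: the modified dictionary is still accepted by dictConfig.
logging.config.dictConfig(LOGGING_CONFIG)
```

Using deepcopy rather than a shallow copy matters here because the configuration is nested; editing a shallow copy would also change the defaults.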

When using remote logging, you can configure Airflow to show a link to an external UI within the Airflow Web UI. Clicking the link redirects you to the external UI.

Some external systems require specific configuration in Airflow for redirection to work but others do not.

Serving logs from workers and triggerer

Most task handlers send logs upon completion of a task. In order to view logs in real time, Airflow starts an HTTP server to serve the logs in the following cases:

  • If SequentialExecutor or LocalExecutor is used, the server runs while airflow scheduler is running.

  • If CeleryExecutor is used, the server runs while airflow worker is running.

In the triggerer, logs are served unless the service is started with the --skip-serve-logs option.

The server runs on the port specified by the worker_log_server_port option in the [logging] section, or by the triggerer_log_server_port option for the triggerer. The defaults are 8793 and 8794, respectively. Communication between the webserver and the worker is signed with the key specified by the secret_key option in the [webserver] section. You must ensure that the key is the same on both sides; otherwise the webserver cannot fetch the logs.

Gunicorn is used as the WSGI server. Its configuration options can be overridden with the GUNICORN_CMD_ARGS environment variable. For details, see Gunicorn settings.
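A quick way to reason about these settings is to compare the relevant options from two hosts. The sketch below parses two hypothetical airflow.cfg fragments with the standard configparser module; the file contents, the key value, and the helpers load and keys_match are all invented for illustration. It only looks at file contents, so values supplied in other ways (for example through environment variables) are not covered.

```python
from configparser import ConfigParser

# Hypothetical airflow.cfg fragments for two hosts; the key is a placeholder.
WEBSERVER_CFG = """
[logging]
worker_log_server_port = 8793

[webserver]
secret_key = example-shared-key
"""

WORKER_CFG = """
[logging]
worker_log_server_port = 8793
triggerer_log_server_port = 8794

[webserver]
secret_key = example-shared-key
"""


def load(text):
    # Interpolation is disabled so '%' characters in values are read literally.
    parser = ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def keys_match(a, b):
    """True when both configs define the same non-empty [webserver] secret_key."""
    key_a = a.get("webserver", "secret_key", fallback="")
    key_b = b.get("webserver", "secret_key", fallback="")
    return bool(key_a) and key_a == key_b


webserver, worker = load(WEBSERVER_CFG), load(WORKER_CFG)

# Fall back to the defaults named in the text when an option is absent.
worker_port = worker.getint("logging", "worker_log_server_port", fallback=8793)
triggerer_port = webserver.getint("logging", "triggerer_log_server_port", fallback=8794)

print(keys_match(webserver, worker), worker_port, triggerer_port)
```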

Implementing a custom file task handler

Note

This is an advanced topic and most users should be able to just use an existing handler from Writing logs.

The providers offer a healthy variety of options covering all the major cloud providers. But should you need to implement logging with a different service, and should you then decide to implement a custom FileTaskHandler, there are a few settings to be aware of, particularly in the context of trigger logging.

Triggers require a shift in the way that logging is set up. In contrast with tasks, many triggers run in the same process, and because triggers run in asyncio, the logging handler must not introduce blocking calls. And because of the variation in handler behavior (some write to file, some upload to blob storage, some send messages over the network as they arrive, some do so in a thread), there needs to be some way to let the triggerer know how to use them.

To accomplish this, a few attributes may be set on the handler, either on the instance or on the class. Inheritance is not respected for these parameters, because subclasses of FileTaskHandler may differ from it in the relevant characteristics. These parameters are described below:

  • trigger_should_wrap: Controls whether this handler should be wrapped by TriggerHandlerWrapper. This is necessary when each instance of the handler creates a file handler that it writes all messages to.

  • trigger_should_queue: Controls whether the triggerer should put a QueueListener between the event loop and the handler, to ensure blocking IO in the handler does not disrupt the event loop.

  • trigger_send_end_marker: Controls whether an END signal should be sent to the logger when the trigger completes. It is used to tell the wrapper to close and remove the individual file handler specific to the trigger that just completed.

  • trigger_supported: If trigger_should_wrap and trigger_should_queue are not True, we generally assume that the handler does not support triggers. But if in this case the handler has trigger_supported set to True, then we’ll still move the handler to root at triggerer start so that it will process trigger messages. Essentially, this should be true for handlers that “natively” support triggers. One such example of this is the StackdriverTaskHandler.
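The attributes listed above are plain class or instance attributes. The sketch below uses stand-in classes rather than the real FileTaskHandler so that it runs without Airflow, and own_flag is a hypothetical helper showing one way a lookup could check only the instance and its exact class; it is meant to illustrate what "inheritance is not respected" implies for a subclass, not to reproduce the triggerer's code.

```python
import logging


class BaseFileLikeHandler(logging.Handler):
    """Stand-in for FileTaskHandler; not the real Airflow class."""

    trigger_should_wrap = True

    def emit(self, record):
        pass  # a real handler would write the record somewhere


class MyServiceTaskHandler(BaseFileLikeHandler):
    """Hypothetical handler for some other logging service."""

    # Must be set explicitly here; the parent's flags are not picked up.
    trigger_should_queue = True


def own_flag(handler, name):
    """Check a flag on the instance or its exact class, ignoring parent classes."""
    on_instance = vars(handler).get(name, False)
    on_class = vars(type(handler)).get(name, False)
    return bool(on_instance or on_class)


handler = MyServiceTaskHandler()
print(own_flag(handler, "trigger_should_queue"))  # set on the subclass itself
print(own_flag(handler, "trigger_should_wrap"))   # only set on the parent class

# The same flag can also be set on a single instance.
other = MyServiceTaskHandler()
other.trigger_should_wrap = True
print(own_flag(other, "trigger_should_wrap"))
```

Under this kind of lookup, a subclass that wants the same behavior as its parent has to restate each attribute it relies on.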