DAG File Processing
DAG File Processing refers to the process of turning Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.
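For example, a file like the one below placed in the DAGs folder would be turned into a single DAG object with one task. This is a minimal illustration, not taken from the text above; it assumes Airflow 2.4+ for the schedule argument and the EmptyOperator import path, and the dag_id and task are arbitrary.

```python
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A file like this in the DAGs folder is parsed into one DAG with one task.
with DAG(
    dag_id="example_dag",                                # arbitrary example id
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,                                       # no automatic runs
) as dag:
    EmptyOperator(task_id="do_nothing")
```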
There are two primary components involved in DAG file processing. The DagFileProcessorManager is a process that runs an infinite loop determining which files need to be processed, and the DagFileProcessorProcess is a separate process that is started to convert an individual file into one or more DAG objects.
The DagFileProcessorManager runs user code. As a result, you can choose to run it as a standalone process on a different host than the scheduler process. If you do, set the configuration AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True and run the airflow dag-processor CLI command; otherwise, starting the scheduler process (airflow scheduler) also starts the DagFileProcessorManager.
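For example, a standalone setup can look like the following. Only the setting and commands named above are used; how you supervise the two processes is a deployment choice.

```bash
# On the DAG-processing host: enable standalone mode, then run the processor.
export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True
airflow dag-processor

# On the scheduler host: with the same setting, `airflow scheduler` will not
# start its own DagFileProcessorManager.
export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True
airflow scheduler
```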
DagFileProcessorManager has the following steps:
1. Check for new files: If the elapsed time since the DAG directory was last refreshed is greater than dag_dir_list_interval, update the file paths list.
2. Exclude recently processed files: Exclude files that have been processed more recently than min_file_process_interval and have not been modified.
3. Queue file paths: Add the discovered files to the file path queue.
4. Process files: Start a new DagFileProcessorProcess for each file, up to a maximum of parsing_processes (see the sketch after this list).
5. Collect results: Collect the results from any finished DAG processors.
6. Log statistics: Print statistics and emit dag_processing.total_parse_time.
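The sketch below shows, in simplified form, how these steps can fit together. It is an illustration of the loop described above, not Airflow's actual implementation: all names, defaults, and the polling cadence are assumptions, and details such as the file-modification check in step 2 and the statistics of step 6 are omitted.

```python
import glob
import time
from collections import deque
from multiprocessing import Process

def parse_file(path: str) -> None:
    """Stand-in for DagFileProcessorProcess; real Airflow imports the file."""
    print(f"parsing {path}")

def manager_loop(dags_folder: str,
                 dag_dir_list_interval: float = 300.0,
                 min_file_process_interval: float = 30.0,
                 parsing_processes: int = 2) -> None:
    file_paths: list[str] = []
    queue: deque = deque()
    last_parsed: dict = {}       # path -> monotonic time it was last queued
    active: list = []            # running processor handles
    last_dir_scan = float("-inf")
    while True:
        now = time.monotonic()
        # Step 1: refresh the file list every dag_dir_list_interval seconds.
        if now - last_dir_scan > dag_dir_list_interval:
            file_paths = glob.glob(f"{dags_folder}/**/*.py", recursive=True)
            last_dir_scan = now
        # Steps 2-3: queue files not processed within min_file_process_interval
        # (the "has the file been modified" check is omitted in this sketch).
        for path in file_paths:
            recent = now - last_parsed.get(path, float("-inf"))
            if recent > min_file_process_interval and path not in queue:
                queue.append(path)
        # Step 4: start a processor per queued file, capped at parsing_processes.
        while queue and len(active) < parsing_processes:
            path = queue.popleft()
            proc = Process(target=parse_file, args=(path,))
            proc.start()
            active.append(proc)
            last_parsed[path] = now
        # Step 5: collect results by reaping finished processors.
        active = [p for p in active if p.is_alive()]
        time.sleep(1.0)
```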
DagFileProcessorProcess has the following steps:
1. Process file: The entire process must complete within dag_file_processor_timeout.
2. The DAG files are loaded as a Python module: This must complete within dagbag_import_timeout.
3. Process modules: Find DAG objects within the Python module (see the sketch after this list).
4. Return DagBag: Provide the DagFileProcessorManager with a list of the discovered DAG objects.
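The core of steps 2 and 3 can be illustrated as follows. This is not Airflow's actual implementation: the module name, function name, and the absence of timeout enforcement are simplifications, though airflow.models.DAG is the real DAG class the processor looks for.

```python
import importlib.util

from airflow.models import DAG

def process_file(filepath: str) -> list:
    # Step 2: load the file as a Python module. In real Airflow this must
    # finish within dagbag_import_timeout; no timeout is enforced here.
    spec = importlib.util.spec_from_file_location("parsed_dag_file", filepath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Step 3: collect every top-level attribute that is a DAG object.
    return [obj for obj in vars(module).values() if isinstance(obj, DAG)]
```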