DAG
As previously described, in Airflow, a DAG or Directed Acyclic Graph represents an entire workflow. It defines what tasks exist, when the workflow runs, and how it handles failures and retries. A Python file defines each DAG. The DAG doesn’t perform any computation itself; it only defines the structure and scheduling of tasks.
Where DAGs live in NexusOne
In NexusOne, Airflow loads DAGs from the /opt/airflow/dags path, while you can author DAGs
in JupyterHub under the /home/<user>/dags path. Any valid Python file placed in the Airflow
DAGs directory is automatically detected and scheduled.
In the JupyterHub path, <user> refers to your NexusOne username.
Core DAG components
Every Airflow DAG consists of a set of core components that control how and when it runs. A combination of default arguments and DAG-level parameters defines these components and controls scheduling, fault tolerance, execution behavior, and platform-level workflows. The core components include:
- dag_id: Unique identifier for the workflow.
- owner: Team or system responsible for the DAG.
- retries: Number of retry attempts for failed tasks.
- retry_delay: Time to wait between retries.
- start_date: Date from which the DAG becomes eligible to run.
- schedule/schedule_interval: When the DAG should run.
- catchup: Whether to backfill past runs or only run from now onward.
DAG owner and authorization
Each DAG includes an owner field in the @dag decorator. This owner identifies the
user, team, or service account under which the workflow executes.
DAG naming conventions
To ensure clarity and maintainability, DAGs should follow a consistent naming pattern, for example:
- ingest_banking_files.py
- transform_sales_daily.py
- quality_customer_checks.py
- platform_health_canary.py
All workflow definitions, such as Python files containing @dag or DAG objects, must live within the dags/ directory. These definitions are then
picked up and scheduled by the Airflow scheduler.
Shared, reusable logic is typically stored in a sibling utils/ directory. Files in this utils/ directory
contain common functions used across multiple DAGs, such as configuration loaders, Spark session
helpers, validation logic, and API clients.
A file tree sample:
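An illustrative layout only: the repository root and the utils file names are assumptions, while the DAG file names follow the naming conventions above.

```
repo/
├── dags/
│   ├── ingest_banking_files.py
│   ├── transform_sales_daily.py
│   ├── quality_customer_checks.py
│   └── platform_health_canary.py
└── utils/
    ├── config.py
    └── spark_helpers.py
```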
Shared helper modules live in utils/, for example:
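A hypothetical helper sketch; the module name, directory layout, and function are assumptions, not NexusOne-specific code:

```python
# utils/config.py — a hypothetical shared helper imported by multiple DAGs.
import json
from pathlib import Path

CONFIG_DIR = Path(__file__).resolve().parent / "configs"   # illustrative layout


def load_pipeline_config(name: str) -> dict:
    """Load a per-pipeline JSON configuration file by name."""
    return json.loads((CONFIG_DIR / f"{name}.json").read_text())
```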
Defining a DAG
The following examples demonstrate how to define a DAG using two approaches. In each approach, the DAG has a known owner configured, a specific start date, a fixed schedule (such as daily or a cron expression), and explicit retry logic to ensure consistent, fault-tolerant execution.
Approach 1: Classic DAG Method
This method uses the DAG object directly instead of decorators, but still supports the same parameters and scheduling logic.
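A minimal sketch of the classic style, assuming Airflow 2.4+ (for the schedule parameter); the DAG name, owner value, and placeholder tasks are illustrative, not NexusOne-specific code:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    "owner": "data-platform",            # illustrative owner
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nx1_classic_pipeline",       # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Placeholder tasks; real DAGs would use operators or @task functions.
    start = EmptyOperator(task_id="start")
    finish = EmptyOperator(task_id="finish")

    start >> finish
```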
Approach 2: TaskFlow API using a decorator
It uses a Python decorator to define scheduling and runtime behavior in a clean and compact format. What the DAG in the following code defines:
- DAG starts running from January 1, 2025.
- Executes daily at 12 AM.
- Tasks retry up to 2 times.
- 5-minute delay between retries.
- No backfill of historical runs.
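The following sketch shows one way such a DAG could look with the @dag decorator; the DAG name, owner value, and task body are illustrative assumptions:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    dag_id="nx1_daily_pipeline",                 # hypothetical name
    start_date=datetime(2025, 1, 1),             # eligible to run from January 1, 2025
    schedule="0 0 * * *",                        # daily at 12 AM (midnight)
    catchup=False,                               # no backfill of historical runs
    default_args={
        "owner": "data-platform",                # illustrative owner
        "retries": 2,                            # retry failed tasks up to 2 times
        "retry_delay": timedelta(minutes=5),     # wait 5 minutes between retries
    },
)
def nx1_daily_pipeline():
    @task
    def hello():
        print("Hello from the TaskFlow API")

    hello()


nx1_daily_pipeline()
```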
Defining tasks
In Airflow, a task is a single unit of work inside a DAG; it defines what actually runs. For example, triggering a Spark job, calling an internal API, performing a data check, or moving data between systems. In NexusOne, tasks are primarily defined using the TaskFlow API with the @task decorator. The classic
operators are still used where appropriate, for example, for Bash commands, Python callables, or Spark submission.
The TaskFlow API also enables automatic
XCom-based
data passing, so that return values from one task get consumed by another without manual wiring.
The following sections describe how to define tasks in different ways.
Use the TaskFlow API
When using the TaskFlow API, tasks are just Python functions decorated with @task.
Airflow turns these functions into tasks and automatically wires them into the DAG.
The following example describes how to use the TaskFlow API. In the code:
- The @task decorator tells Airflow to treat the decorated functions as tasks.
- You define the DAG logic by calling the functions and assigning the results to variables. For example, cfg = read_config().
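A minimal sketch of this pattern, using the task names referenced throughout this guide; the DAG name, configuration values, and task bodies are illustrative:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def nx1_taskflow_tasks():                 # hypothetical DAG name
    @task
    def read_config() -> dict:
        # Illustrative configuration; in practice this might come from a file or an Airflow Variable.
        return {"table": "sales_daily", "mode": "full"}

    @task
    def run_job(cfg: dict) -> str:
        print(f"Running job for {cfg['table']} in {cfg['mode']} mode")
        return "SUCCESS"

    @task
    def update_status(result: str) -> None:
        print(f"Job finished with status: {result}")

    # Calling the functions defines the tasks and wires them together;
    # return values move between tasks via XCom.
    cfg = read_config()
    result = run_job(cfg)
    update_status(result)


nx1_taskflow_tasks()
```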
Pass data automatically using XCom
With @task, any return value from a task gets automatically stored in XCom and passed to downstream
tasks as function arguments. You don’t need to manually call xcom_push or xcom_pull.
From the example in the previous section:
- read_config() returns a dict, which is then passed as cfg into run_job(cfg).
- run_job() returns "SUCCESS", which is then passed as result into update_status(result).
XCom works best for small values such as:
- Configuration dictionaries
- Table or object names
- Mode flags and status values
Use operators
Airflow provides operators, which are prebuilt task types designed to execute specific actions, such as running shell commands, submitting Spark jobs, calling APIs, or triggering other DAGs. Operators define how a task runs, while Airflow manages when it runs. Commonly used operators are:
- BashOperator: Executes shell commands or scripts.
- PythonOperator: Runs a Python callable.
- HttpOperator: Calls internal or external APIs.
- EmailOperator: Sends notification emails.
The following example combines two operators to:
- Run a Python function to execute a data job.
- Run a shell command to archive logs after the job completes.
In the code:
- run_job is a variable holding a PythonOperator task that executes the run_data_job function.
- archive_logs is a variable holding a BashOperator task that defines how to move log files after the job runs.
- The dependency run_job >> archive_logs ensures that the run_job task runs before archive_logs.
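A minimal sketch combining the two operators; the DAG name, file paths, and the run_data_job body are illustrative assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_data_job():
    # Placeholder for the actual data job logic.
    print("Running data job")


with DAG(
    dag_id="nx1_operator_example",       # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_job = PythonOperator(
        task_id="run_job",
        python_callable=run_data_job,
    )

    archive_logs = BashOperator(
        task_id="archive_logs",
        bash_command="mv /tmp/job_logs/*.log /tmp/archive/",   # illustrative paths
    )

    # run_job must finish before archive_logs starts.
    run_job >> archive_logs
```

The >> operator is how classic-style DAGs express ordering explicitly, since no data passes between these two tasks.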
Use @task.pyspark
The @task.pyspark decorator executes Spark code as a native Airflow task. When used inside a
@dag decorator-defined workflow, Airflow automatically provisions a Spark driver. Airflow then attaches
the driver to the configured cluster runtime and injects an active Spark session into the task at runtime.
This removes the need to explicitly configure or manage a Spark connection inside the DAG file.
The task author only needs to focus on the transformation logic, while the underlying environment
handles cluster configuration, authentication, and resource allocation.
The task function receives:
- spark: An initialized SparkSession.
- sc: The SparkContext. It’s optional.
- **context: The standard Airflow runtime context.
In the example after this list:
- The process_data function executes as a Spark app instead of a regular Python task.
- No additional Spark connection or cluster configuration is required in the DAG.
- The environment provides the Spark session automatically.
- Data is read from and written to Iceberg tables through the configured catalog.
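A minimal sketch, assuming the Apache Spark provider’s @task.pyspark decorator is available; the connection ID, catalog, and table names are illustrative assumptions:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False)
def nx1_spark_pipeline():                      # hypothetical DAG name
    @task.pyspark(conn_id="spark_default")     # connection name is an assumption
    def process_data(spark, sc):
        # spark is the injected SparkSession; sc is the optional SparkContext.
        df = spark.read.table("nx1_catalog.sales.raw_orders")        # illustrative Iceberg table
        cleaned = df.filter("amount > 0")
        # Write back through the configured catalog.
        cleaned.writeTo("nx1_catalog.sales.clean_orders").createOrReplace()

    process_data()


nx1_spark_pipeline()
```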
Schedule, trigger, and task dependency
Airflow uses a schedule to control when it creates a new DAG run. It uses a trigger to create a new DAG run immediately, bypassing the schedule. Lastly, it uses task dependencies inside the DAG to define the order in which tasks execute. The following sections describe how to use each feature.
Schedule
The schedule controls when the DAG runs, regardless of whether it’s defined in the TaskFlow API or with the classic DAG method. Ways to express schedules:
- Presets: @daily, @hourly, @weekly
- Cron expressions: 0 2 * * * (daily at 2 AM), */15 * * * * (every 15 minutes)
The following example applies a schedule to the nx1_scheduled_pipeline function. The same schedule value works with both the @dag decorator and the classic DAG class.
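A sketch of such a schedule, reusing the nx1_scheduled_pipeline name from this guide; the start date and task body are illustrative:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="nx1_scheduled_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="0 2 * * *",        # daily at 2:00 AM
    catchup=False,
)
def nx1_scheduled_pipeline():
    @task
    def report():
        print("Scheduled run")

    report()


nx1_scheduled_pipeline()
```

The following tables list commonly used preset and cron schedule values.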
| Format type | Example value | Context |
|---|---|---|
| Preset | @hourly | Runs once every hour |
| Preset | @daily | Runs once per day at midnight |
| Preset | @weekly | Runs once per week |
| Preset | @monthly | Runs once per month |
| Preset | @yearly/@annually | Runs once per year |
| Preset | @once | Runs only one time |
| Cron expression | Context |
|---|---|
| 0 2 * * * | Daily at 2:00 AM |
| 30 1 * * * | Daily at 1:30 AM |
| 0 */6 * * * | Every 6 hours |
| 0 9 * * 1 | Every Monday at 9:00 AM |
| 15 3 1 * * | First day of every month at 3:15 AM |
Airflow defines schedules using a five-field cron format: minute hour day-of-month month day-of-week.
Trigger
There are several ways to trigger a DAG.
- Use the Airflow UI:
Navigate to the DAG you want to trigger in the Airflow web interface and click Trigger DAG.
This initiates a new run of the DAG immediately.
A list of DAGs in the Airflow UI
- Use the command-line tool:
The airflow dags trigger command executes a specified Airflow DAG manually via the command-line tool. It must run in an environment with access to the Airflow installation and its metadata database. The location where the command runs depends on your Airflow setup:

| Environment | Execution method | Command example |
|---|---|---|
| Local/Server | Standard terminal access | airflow dags trigger <dag_id> |
The DAG must be present in the DAGs folder, parsed by Airflow, and unpaused so
it can run on its schedule or be triggered manually.
Task dependency
In the TaskFlow API, Airflow creates dependencies by calling task functions. When you call a @task-decorated function inside a @dag-decorated function, Airflow wires the tasks into a graph
in the order that you call them.
If a task returns a value, it’s passed as an argument to another task function. Then,
Airflow automatically moves that data via XCom under the hood.
In the following example, the DAG defines three tasks using the TaskFlow API:
- read_config() produces a configuration dictionary.
- run_job(cfg) consumes that dictionary and returns a status string.
- update_status(result) consumes the status string to print the final outcome.
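A compact, self-contained sketch of this chain; the DAG name and return values are illustrative:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def nx1_dependency_example():              # hypothetical DAG name
    @task
    def read_config() -> dict:
        return {"table": "sales_daily"}    # illustrative value

    @task
    def run_job(cfg: dict) -> str:
        return "SUCCESS"

    @task
    def update_status(result: str) -> None:
        print(result)

    # Each call wires a dependency: read_config >> run_job >> update_status,
    # and the return values move between tasks through XCom.
    update_status(run_job(read_config()))


nx1_dependency_example()
```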
Connections, hooks, and variables
Airflow provides connections and variables to decouple credentials and configuration from the DAG code. In NexusOne, this is how DAGs securely access databases, APIs, Spark clusters, object storage, and internal services, without hardcoding sensitive values.
- Connections: Securely store system and service credentials.
- Hooks: Programmatic interfaces that use connections to interact with systems.
- Variables: Store runtime configuration values that tasks can read dynamically.
Use an Airflow connection in a task
Airflow connections are typically accessed through hooks. The following is an example using a Postgres connection; similar patterns apply to Trino, HTTP, and so on.
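A minimal sketch, assuming the Postgres provider is installed; the connection ID (nx1_postgres), table, and query are illustrative assumptions:

```python
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def count_customers() -> int:
    # The hook looks up host, credentials, and schema from the Airflow
    # connection, so nothing sensitive is hardcoded in the DAG file.
    hook = PostgresHook(postgres_conn_id="nx1_postgres")   # illustrative connection ID
    record = hook.get_first("SELECT COUNT(*) FROM customers;")
    return record[0]
```

The task would be called inside a @dag-decorated workflow like any other TaskFlow task.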
Create an Airflow variable in the UI
Follow these steps to create an Airflow variable using the UI:
- Open the Airflow UI.
- At the top menu, click Admin > Variables.
- Click the + icon.
- In the Key field, enter the variable name, and add its value.
- Click Save.

Creating an Airflow variable in the UI
Use Airflow variables in a DAG
The following code example shows how the same DAG can access the nx1_source_path and
nx1_target_table variables without any hardcoding.
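A minimal sketch, assuming both variables already exist in the Airflow metadata store; the task body is illustrative:

```python
from airflow.decorators import task
from airflow.models import Variable


@task
def load_to_table():
    # Values are read at runtime, so the DAG file contains no hardcoded
    # paths or table names.
    source_path = Variable.get("nx1_source_path")
    target_table = Variable.get("nx1_target_table")
    print(f"Loading {source_path} into {target_table}")
```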
Monitoring DAGs
Airflow provides full visibility into the state of each DAG and task run. In NexusOne, this is how users and platform teams do the following:
- Debug failures
- Verify that pipelines are running as expected
- Track historical behavior over time.
Airflow exposes the following monitoring information:
- A task status describing the task lifecycle.
- Logs describing the execution detail of a task.
- A DAG run history showing the timeline of each DAG.
Task status
Every task instance in Airflow has a status that indicates its current state, such as:
- running: The task is currently running.
- success: The task finished without errors.
- failed: The task raised an exception or returned an error.
- up_for_retry: The task failed, but it might retry if you configure retries or retry_delay.
- skipped: The task was intentionally not run, usually due to a branch or condition.
- queued: Waiting for executor resources.
Follow these steps to view the status of each task:
- Open the Airflow UI.
- Click on a DAG.
- Click Graph.

A graph representing several tasks in a DAG.
Access logs per task
Airflow captures all stdout, stderr, and Python logging output for each task run. These logs are essential for understanding why a task failed. Follow these steps to access a DAG’s task logs:
- Open the Airflow UI.
- Click on a DAG.
- Click Graph and then select a task.
- Click Log.

Logs of a task in a DAG.
View a DAG run history
For each DAG run, Airflow maintains a run history that includes the following:
- Overall DAG state, such as success, failed, or running.
- DAG ID
- Run ID
- Run type
- The start and end date of the execution.
Follow these steps to view the DAG run history:
- Open the Airflow UI.
- Click Browse > DAG Runs.

DAG run history.
Deployment workflow
Airflow treats DAGs as version-controlled workflow assets and deploys them through a controlled delivery process. You can author DAG definitions in a Git repository and push them into the Airflow environment via an automated or semi-automated deployment pipeline. This approach ensures the following:
- Workflow definitions are versioned and traceable.
- Changes are peer-reviewed and auditable.
- Deployments are consistent across environments.
- No manual edits in production.
The deployment workflow follows these steps:
- You create DAGs and maintain them in a source-controlled repository.
- Changes are peer-reviewed and merged to a designated branch.
- A deployment pipeline validates and syncs DAG files.
- The pipeline places DAGs into the Airflow DAGs directory.
- The Airflow scheduler automatically discovers and registers the DAGs.