Apache Airflow is a workflow orchestration system that represents pipelines as Directed Acyclic Graphs (DAGs) of tasks. It doesn’t perform the actual compute; instead, it determines what runs, when it runs, and in what order. Each task can trigger external systems such as Spark, Trino, and databases. NexusOne uses Airflow as the control layer for data workflows, such as coordinating ingestion, transformation, and validation, and for platform automation across Iceberg, Spark, and Trino.

Exploring the Airflow UI

The Airflow UI provides a web-based interface to view and manage DAGs. It shows task status, schedules, logs, and some configuration controls. You can launch Airflow using the following designated URL:
https://jupyter.<client>.nx1cloud.com/
When you purchase NexusOne, you receive a client name. Replace <client> with your assigned client name.
When you launch Airflow using the previous URL, you should see a page similar to the following image.
[Image: Airflow homepage]

Airflow homepage layout
When interacting with NexusOne using the portal, you can also launch Airflow using the Build feature.

What is a DAG?

As previously described, a DAG, or Directed Acyclic Graph, represents an entire workflow in Airflow. It defines what tasks exist, when the workflow runs, and how it handles failures and retries. A Python file defines each DAG. The DAG doesn’t perform any computation itself; it only defines the structure and scheduling of tasks.
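For illustration, a minimal DAG with two dependent tasks might look like the following sketch. The dag_id, task names, and task logic are hypothetical; a real NexusOne workflow would trigger systems such as Spark or Trino instead of printing.
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    dag_id="minimal_example",          # hypothetical name for this sketch
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
)
def minimal_example():
    @task
    def extract():
        # Placeholder work; a real task would call out to an external system.
        return {"rows": 100}

    @task
    def load(payload: dict):
        print(f"Loaded {payload['rows']} rows")

    # The DAG only declares the ordering; the tasks do the work at run time.
    load(extract())

minimal_example()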

Where DAGs live in NexusOne

In NexusOne, Airflow loads DAGs from the /opt/airflow/dags path, while you can author DAGs in JupyterHub under the /home/<user>/dags path. Any valid Python file placed in the Airflow DAGs directory is automatically detected and scheduled.
In the JupyterHub path, <user> refers to your NexusOne username.

Core DAG components

Every Airflow DAG consists of a set of core components that control how and when it runs. These components are defined through a combination of default arguments and DAG-level parameters, and they govern scheduling, fault tolerance, execution behavior, and platform-level workflows. The core components include the following (see the sketch after this list):
  • dag_id: Unique identifier for the workflow
  • owner: Team or system responsible for the DAG
  • retries: Number of retry attempts for failed tasks
  • retry_delay: Time to wait between retries
  • start_date: Date from which the DAG becomes eligible to run
  • schedule/schedule_interval: When the DAG should run
  • catchup: Whether to backfill past runs or only run from now onward
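A minimal sketch showing how these components fit together follows. The owner, retry values, and dag_id are illustrative defaults rather than NexusOne requirements; retries and retry_delay are passed through default_args, while the remaining components are DAG-level parameters.
from datetime import datetime, timedelta

from airflow.decorators import dag

# Hypothetical defaults; adjust the owner and retry policy for your workflow.
default_args = {
    "owner": "data-engineering",          # team responsible for the DAG
    "retries": 2,                         # retry attempts for failed tasks
    "retry_delay": timedelta(minutes=5),  # wait between retries
}

@dag(
    dag_id="transform_sales_daily",     # unique identifier for the workflow
    default_args=default_args,
    start_date=datetime(2025, 1, 1),    # eligible to run from this date
    schedule="@daily",                  # run once per day
    catchup=False,                      # do not backfill past runs
)
def transform_sales_daily():
    ...                                 # task definitions go here

transform_sales_daily()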

DAG owner and authorization

Each DAG declares an owner, typically supplied through the default_args passed to the @dag decorator. This owner identifies the user, team, or service account under which the workflow executes.
from datetime import datetime

from airflow.decorators import dag

@dag(
    dag_id="example",
    default_args={"owner": "data-engineering"},
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
)
def example_dag():
    ...

example_dag()
A DAG only executes if the configured owner has the necessary permissions to access the required systems, such as data sources, Spark clusters, Iceberg tables, or external services.

DAG naming conventions

To ensure clarity and maintainability, DAGs should follow a consistent naming pattern.
<feature>_<domain>_<purpose>
A few examples are:
  • ingest_banking_files.py
  • transform_sales_daily.py
  • quality_customer_checks.py
  • platform_health_canary.py
Airflow discovers DAGs from a designated directory, commonly named dags/. All workflow definitions, such as Python files containing @dag or DAG objects, must live within this directory; the Airflow scheduler then picks them up and schedules them. Shared, reusable logic is typically stored in a sibling utils/ directory. Files in this utils/ directory contain common functions used across multiple DAGs, such as configuration loaders, Spark session helpers, validation logic, and API clients. A sample file tree:
dags/
├── ingestion/
├── transform/
├── quality/
├── monitoring/
└── utils/
    ├── io_helpers.py
    ├── spark_helpers.py
    └── quality_rules.py
DAG files can import functions from utils/, for example:
from utils.io_helpers import read_config
from utils.spark_helpers import build_spark_session
This separation keeps DAGs focused on orchestration logic while centralizing reusable code into shared modules.
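To make the separation concrete, a minimal sketch of utils/io_helpers.py is shown below. The JSON format and the default config directory path are assumptions for illustration, not a NexusOne convention.
# utils/io_helpers.py -- shared helper imported by multiple DAGs.
import json
from pathlib import Path

def read_config(name: str, config_dir: str = "/opt/airflow/dags/config") -> dict:
    """Load a named JSON configuration file and return it as a dict.

    The default config_dir is an assumption for this sketch; point it at
    wherever your deployment stores pipeline configuration.
    """
    path = Path(config_dir) / f"{name}.json"
    with path.open() as handle:
        return json.load(handle)
A DAG could then call read_config("transform_sales_daily") to load its pipeline settings rather than hard-coding them in the workflow definition.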

Additional resources

  • To learn practical ways to use Airflow in the NexusOne environment, refer to the Airflow hands-on examples page.
  • If you are using the NexusOne portal and want to learn how to launch Airflow, refer to the Monitor page.
  • For more details about Airflow, refer to the Airflow official documentation.