DataHub is a central metadata platform used for discovering datasets, understanding lineage, documenting tables, and managing ownership across an organization. It provides unified visibility into the following:
  • Dataset metadata
  • Schema structure
  • Data lineage
  • Dashboard and pipeline dependencies
  • Business glossary and governance information
DataHub integrates with multiple systems, including data lakes, data warehouses, BI tools, and orchestration frameworks.

Accessing DataHub

DataHub provides a web-based interface that allows users to browse datasets, view lineage, and manage metadata. You can access it through a standard web browser.

Accessing the UI

DataHub is available at the following designated URL:
https://datacatalog.<client>.nx1cloud.com/
When you purchase NexusOne, you receive a client name. Replace <client> with your assigned client name.

Authentication and authorization

DataHub supports multiple authentication mechanisms, such as:
  • Single Sign-On (SSO)
  • OAuth 2.0 or OpenID Connect
After navigating to the URL above, enter the credentials assigned to you when you purchased NexusOne to authenticate. For authorization, DataHub controls which actions users can perform using role-based permissions in Keycloak. The typical roles include:
  • Viewer: Read-only access to datasets, lineage, and documentation
  • Editor: Read and write access to update descriptions, add tags, and modify owners
  • Admin: Authority over all actions, such as ingestion pipelines, glossary terms, and policies
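If you plan to call the DataHub APIs described later in this guide, you may first need an access token from Keycloak. The following is a minimal sketch of a standard OpenID Connect token request; the hostname, realm name, client ID, and password grant shown here are assumptions, so substitute the values provided with your NexusOne subscription.
# Minimal sketch of an OpenID Connect token request to Keycloak.
# The hostname, realm name, and client ID below are assumptions; use the
# values supplied with your NexusOne subscription.
curl -X POST "https://auth.<client>.nx1cloud.com/realms/<realm>/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=password" \
  -d "client_id=<client-id>" \
  -d "username=<your-username>" \
  -d "password=<your-password>"
The JSON response includes an access_token value. If your deployment requires token authentication for API calls, pass this value in an Authorization: Bearer header.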

Browsing and discovering assets

Assets represent items such as datasets, dashboards, pipelines, or domains. DataHub indexes these assets and records who owns them, how they connect to other assets, applied tags, and documentation.

Searching for assets

You can search across all metadata types from the main search bar. Search terms can include:
  • Exact names or partial names, for example sales_daily or sales
  • Column names, for example, customer_id
  • Tags or glossary terms
  • The name of a task or pipeline that produces or transforms a dataset
[Image 01-search-assets: Searching for an asset]
While searching, you can also filter by the following categories:
  • Platforms: Spark, Airflow, Hive/Iceberg, Trino, or Superset
  • Domains: Sales, Marketing, Finance, or Operations
  • Tags: PII, Sensitive, or Analytics
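Search is also available programmatically through the GraphQL API described later in this guide. The following is a minimal sketch using DataHub's searchAcrossEntities query; the search text "sales" is illustrative, and the exact fields available may vary with your DataHub version.
# Minimal sketch: search for datasets matching "sales" through the GraphQL API.
# Replace <client> with your assigned client name.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { searchAcrossEntities(input: { types: [DATASET], query: \"sales\", start: 0, count: 10 }) { total searchResults { entity { urn type } } } }"
      }'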

Dataset overview page

Opening a dataset displays an overview page with several tabs containing details of the metadata. Some of these include:
  • Columns: Column names, data types, descriptions
  • Description: Purpose and usage notes
  • Owners: Responsible team or users
  • Tags: Labels that classify or categorize the dataset
  • Lineage: Upstream and downstream graph
  • Properties: Storage location and system-specific details
  • Data Preview: Sample rows for quick inspection, if enabled
[Image 02-dataset-overview-page: A dataset overview page]

Metadata details

The Metadata details page provides a comprehensive view of a dataset’s technical, business, and operational metadata. It’s the central place where you can understand what a dataset contains, how it’s used, where it came from, and who is responsible for it.

Dataset metadata summary

The summary sidebar provides a high-level summary of a dataset, so you can quickly determine its purpose and context.
[Image 03-dataset-metadata: A dataset metadata summary]
Some of the details the metadata contains include:
  • Documentation: Human-readable explanation of what the dataset contains
  • Owners: Person or team responsible for maintaining the dataset
  • Domain: Organizational unit or functional area the dataset belongs to, such as Finance, Marketing, or Engineering
  • Data Product: The data product the dataset belongs to, a curated group of assets managed by a domain
  • Tags: Informational labels such as PII, Finance, Sensitive, or Analytics
  • Composed of: Platform-specific representation of the dataset, such as Iceberg or Trino
  • Status: Timestamp indicating when someone last modified the metadata
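Much of this summary can also be retrieved programmatically. The following is a minimal sketch of a GraphQL query for a dataset's description, owners, domain, and tags; the URN is the example dataset used later in this guide, and the field names follow the open-source DataHub GraphQL schema, so they may differ slightly in your deployment.
# Minimal sketch: fetch a dataset's description, owners, domain, and tags.
# Replace <client> with your assigned client name; the dataset URN is illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\") { properties { name description } ownership { owners { owner { ... on CorpUser { urn } ... on CorpGroup { urn } } } } domain { domain { urn } } tags { tags { tag { name } } } } }"
      }'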

Schema detail

When you open a dataset, the Columns tab displays its schema, as previously described. Understanding the schema helps you know what data exists, how to use it, and how to interpret it correctly.
[Image 04-schema-details: Schema details of a dataset]
Some of these details include the following:
  • Column name
  • Data types such as STRING, INTEGER, or TIMESTAMP
  • Business-friendly description containing meanings or definitions
  • Tags containing labels for classification
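Column-level details can also be read through the GraphQL API. The sketch below queries the dataset's schema metadata; as above, the URN is illustrative and the field names follow the open-source DataHub GraphQL schema, so verify them against your deployment.
# Minimal sketch: fetch column names, native data types, and descriptions.
# Replace <client> with your assigned client name; the dataset URN is illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\") { schemaMetadata { fields { fieldPath nativeDataType description } } } }"
      }'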

Lineage details

The Lineage tab reveals what produced a dataset and which systems consume it. With this, you can identify dependencies, evaluate the impact of changes, and trace issues back to their origin. Two types of lineage exist on DataHub: dataset and job lineage.

Dataset lineage

Dataset-level lineage displays the upstream and downstream datasets (schemas and tables) connected to the dataset you are viewing.
[Image 05-dataset-lineage: A dataset lineage]
To view it, take the following steps:
  1. Search for the dataset name, schema, or other identifiers indexed in DataHub.
  2. Open the dataset overview page and select the Lineage tab.
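Lineage can also be queried programmatically. The following is a minimal sketch using the GraphQL lineage field to list a dataset's upstream dependencies; the direction can be changed to DOWNSTREAM, the URN is illustrative, and field names may differ slightly with your DataHub version.
# Minimal sketch: list upstream lineage for a dataset through the GraphQL API.
# Replace <client> with your assigned client name; the dataset URN is illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\") { lineage(input: { direction: UPSTREAM, start: 0, count: 10 }) { total relationships { type entity { urn type } } } } }"
      }'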

Job lineage

Job lineage displays the dataset lineage, along with the task or pipeline that produced or transformed it. Some of the supported tasks or pipelines include:
  • Spark jobs, such as batch or streaming
  • Airflow DAGs and task-level lineage
  • SQL-based ETL jobs such as Trino or dbt
[Image 06-job-lineage: A job lineage]
To view it, take the following steps:
  1. Search for the job or pipeline name.
  2. Open the job overview page and select the Lineage tab.

API usage

DataHub provides REST and GraphQL APIs for programmatic metadata updates. These APIs are typically used for automated metadata pipelines, CI/CD workflows, or custom integrations where you need to update metadata without using the UI.

REST API

The REST API supports creating, updating, and querying metadata by sending Metadata Change Events (MCE) or Metadata Aspects. An MCE is a message that describes changes to one or more assets. A Metadata Aspect is a specific piece of metadata about an asset, such as its ownership, tags, or schema. For example, the following request updates a dataset description:
curl -X POST "https://datacatalog.<client>.nx1cloud.com/api/v2/entity?action=ingest" \
  -H "Content-Type: application/json" \
  -d '{
        "entityType": "dataset",
        "aspectName": "datasetProperties",
        "aspect": {
          "description": "Updated description"
        }
      }'
Replace <client> with your assigned client name. The previous command does the following:
  1. Targets a dataset entity
  2. Updates the datasetProperties aspect
  3. Replaces the description with the provided text

GraphQL API

The GraphQL API provides a flexible interface for querying metadata and performing fine-grained updates. Use cases include:
  • Fetching lineage, schema, or ownership programmatically
  • Adding tags or ownership to datasets
  • Automating glossary term assignments
For example, you might use the GraphQL API to fetch dataset profiles for an Iceberg dataset. The following API request and response illustrate this. API request:
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "operationName": "getDataProfiles",
    "variables": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)"
    },
    "query": "query getDataProfiles($urn: String!, $limit: Int, $startTime: Long, $endTime: Long, $filters: FilterInput) {
      dataset(urn: $urn) {
        urn
        type
        datasetProfiles(
          limit: $limit
          startTimeMillis: $startTime
          endTimeMillis: $endTime
          filter: $filters
        ) {
          rowCount
          columnCount
          sizeInBytes
          timestampMillis
          partitionSpec {
            type
            partition
            timePartition {
              startTimeMillis
              durationMillis
              __typename
            }
            __typename
          }
          fieldProfiles {
            fieldPath
            uniqueCount
            uniqueProportion
            nullCount
            nullProportion
            min
            max
            mean
            median
            stdev
            sampleValues
            quantiles {
              quantile
              value
              __typename
            }
            distinctValueFrequencies {
              value
              frequency
              __typename
            }
            __typename
          }
          __typename
        }
        __typename
      }
    }"
  }'
Replace <client> with your assigned client name. The previous command does the following:
  1. Requests data profiles for the dataset retail_banking.completed_accounts in the PROD environment
  2. Attempts to retrieve: row count, column count, dataset size, and more
API response:
{
	"data": {
		"dataset": {
			"urn": "urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)",
			"type": "DATASET",
			"datasetProfiles": [],
			"__typename": "Dataset"
		}
	},
	"extensions": {}
}
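In this example, the datasetProfiles array is empty, which typically means no profiles have been ingested for the dataset or none match the requested time range. The GraphQL API also supports mutations, such as the tag use case listed earlier. The following is a minimal sketch using the addTag mutation; the tag and dataset URNs are illustrative, and the mutation name follows the open-source DataHub GraphQL schema, so confirm it against your deployment.
# Minimal sketch: attach the PII tag to a dataset using the addTag mutation.
# Replace <client> with your assigned client name; the URNs are illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "mutation { addTag(input: { tagUrn: \"urn:li:tag:PII\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\" }) }"
      }'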

When to use APIs

Use API-based updates when you are trying to achieve the following:
  • Integrate DataHub with external systems
  • Automate updates via pipelines using Airflow, Jenkins, or GitHub Actions
  • Enforce metadata standards programmatically
  • Bulk update metadata at scale
However, most users prefer to manage metadata in the UI.

Data quality

DataHub provides capabilities for capturing, monitoring, and visualizing data quality rules and test results across datasets. These rules ensure that you can trust the data consumed and quickly identify issues affecting downstream products, models, or dashboards.

Data quality details

When viewing the data quality of a dataset, DataHub displays the following:
  • A list of all assertions (tests), the automated rules that validate the correctness of data
  • Test results containing pass/fail status with run timestamps
  • Timestamp of the latest execution
  • Column-level and table-level checks
  • Integrated external testing tools, such as custom Spark jobs or checks in Airflow jobs
  • Associated tags
[Image 07-data-quality-details: Data quality details]
You can click a specific assertion to view the following:
  • Full assertion definition
  • Historical pass/fail graph
  • Execution logs or failure summaries
[Image 08-passing-assertion: Details about a passing assertion]
Sometimes, assertions fail.
[Image 09-failing-assertion: Details about a failing assertion]

Data quality ingestion

Data quality metadata is typically ingested through scheduled pipelines. Supported integrations in NX1 include:
  • Custom Spark or SQL scripts sending results through the API
  • Airflow DAGs producing test assertions
  • Other integrations that can create assertions and send the results using the DataHub API
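As a rough sketch of the last option, a pipeline step could report an assertion result to the same REST ingest endpoint shown earlier. The payload below is illustrative only: it mirrors the simplified shape of the REST example above, and the aspect name and result fields follow DataHub's assertion model, so verify the exact payload shape against the DataHub documentation for your version.
# Illustrative only: report a passing assertion run through the REST ingest endpoint.
# The aspect and field names follow DataHub's assertion model; confirm the exact
# payload shape for your DataHub version before using this in a pipeline.
curl -X POST "https://datacatalog.<client>.nx1cloud.com/api/v2/entity?action=ingest" \
  -H "Content-Type: application/json" \
  -d '{
        "entityType": "assertion",
        "aspectName": "assertionRunEvent",
        "aspect": {
          "timestampMillis": 1718000000000,
          "runId": "airflow_daily_checks_2024_06_10",
          "status": "COMPLETE",
          "result": { "type": "SUCCESS" }
        }
      }'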

Troubleshooting data quality issues

When a test fails, perform the following sequence of actions:
  1. Search for the dataset name.
  2. Open the dataset overview page and select the Quality tab.
  3. Identify the failed assertion.
  4. Review the failure summary and logs.
  5. Use the Lineage tab to identify the root cause of the issue. Ask yourself, “Is it an upstream table, or an upstream job?”
  6. Contact the dataset owners or pipeline owners.
  7. Address the issue by either fixing the schema, repairing upstream data, or adjusting the transformation logic.
  8. Re-run the test and confirm that it passes.

DataHub best practices

To maintain a high-quality and trustworthy data catalog, follow these recommended best practices:
  1. Assign owners to every dataset: Ensure each dataset has a clearly identified owner responsible for quality, access, and documentation.
  2. Keep descriptions up to date: Maintain accurate descriptions at both the table and column levels so users can easily understand the dataset’s purpose and contents.
  3. Use standardized glossary terms: Apply approved business terms consistently across datasets to promote shared understanding and improve searchability.
  4. Tag datasets with relevant classifications: Use tags and classifications to support governance, discovery, and compliance workflows.
  5. Review stale or deprecated datasets: Periodically audit unused or superseded datasets and mark them as deprecated when appropriate.
  6. Monitor and maintain ingestion pipelines: Ensure metadata ingestion pipelines run reliably and without errors so the catalog remains accurate and current.
  7. Define and maintain data quality tests: Implement table-level and column-level tests for critical datasets to validate schema, freshness, null values, ranges, or business rules.
  8. Automate test execution within pipelines: Run data quality tests automatically as part of ETL/ELT workflows or orchestration jobs to ensure consistent and reliable validation.
  9. Investigate and resolve failures promptly: Use lineage and test failure details to diagnose root causes and coordinate remediation with upstream dataset owners.
  10. Monitor historical data quality trends: Review test history and recurring failures to detect long-term quality issues and prevent downstream impact.

Additional resources

  • For more details about DataHub, refer to the DataHub official documentation.
  • If you are using the NexusOne portal and want to learn how to launch DataHub, refer to the Govern page.