DataHub is a central metadata platform used for discovering datasets, understanding lineage, documenting tables, and managing ownership across an organization. It provides unified visibility into the following:
  • Dataset metadata
  • Schema structure
  • Data lineage
  • Dashboard and pipeline dependencies
  • Business glossary and governance information
DataHub integrates with multiple systems, including data lakes, data warehouses, BI tools, and orchestration frameworks.

Accessing DataHub

DataHub provides a web-based interface that allows users to browse datasets, view lineage, and manage metadata. You can access it through a standard web browser.

Accessing the UI

DataHub is available at the following designated URL:
https://datacatalog.<client>.nx1cloud.com/
When you purchase NexusOne, you receive a client name. Replace <client> with your assigned client name.

Authentication and authorization

DataHub supports multiple authentication mechanisms, such as:
  • Single Sign-On (SSO)
  • OAuth 2.0 or OpenID Connect
After navigating to the URL above, enter the credentials assigned to you when you purchased NexusOne to authenticate. For authorization, DataHub controls which actions users can perform using role-based permissions in Keycloak. The typical roles include:
  • Viewer: Read-only access to datasets, lineage, and documentation
  • Editor: Read and write access to update descriptions, add tags, and modify owners
  • Admin: Authority over all actions, such as ingestion pipelines, glossary terms, and policies
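If you plan to call the DataHub APIs described later in this guide, you may first need an access token from Keycloak. The following is a minimal sketch of a standard OpenID Connect token request; the hostname, realm name, client ID, and password grant shown here are assumptions, so substitute the values provided with your NexusOne subscription.
# Minimal sketch of an OpenID Connect token request to Keycloak.
# The hostname, realm name, and client ID below are assumptions; use the
# values supplied with your NexusOne subscription.
curl -X POST "https://auth.<client>.nx1cloud.com/realms/<realm>/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=password" \
  -d "client_id=<client-id>" \
  -d "username=<your-username>" \
  -d "password=<your-password>"
The JSON response includes an access_token value. If your deployment requires token authentication for API calls, pass this value in an Authorization: Bearer header.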

Browsing and discovering assets

Assets represent items such as datasets, dashboards, pipelines, or domains. DataHub indexes these assets and records who owns them, how they connect to other assets, applied tags, and documentation.

Searching for assets

You can search across all metadata types from the main search bar. Search terms can include:
  • Exact names or partial names, for example sales_daily or sales
  • Column names, for example, customer_id
  • Tags or glossary terms
  • The name of a task or pipeline that produces or transforms a dataset
[Image 01-search-assets: Searching for an asset]
While searching, you can also filter by the following categories:
  • Platforms: Spark, Airflow, Hive/Iceberg, Trino, or Superset
  • Domains: Sales, Marketing, Finance, or Operations
  • Tags: PII, Sensitive, or Analytics
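Search is also available programmatically through the GraphQL API described later in this guide. The following is a minimal sketch using DataHub's searchAcrossEntities query; the search text "sales" is illustrative, and the exact fields available may vary with your DataHub version.
# Minimal sketch: search for datasets matching "sales" through the GraphQL API.
# Replace <client> with your assigned client name.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { searchAcrossEntities(input: { types: [DATASET], query: \"sales\", start: 0, count: 10 }) { total searchResults { entity { urn type } } } }"
      }'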

Dataset overview page

Opening a dataset displays an overview page with several tabs containing details of the metadata. Some of these include:
  • Columns: Column names, data types, descriptions
  • Description: Purpose and usage notes
  • Owners: Responsible team or users
  • Tags: Labels that classify or categorize the dataset
  • Lineage: Upstream and downstream graph
  • Properties: Storage location and system-specific details
  • Data Preview: Sample rows for quick inspection, if enabled
[Image 02-dataset-overview-page: A dataset overview page]

Metadata details

The Metadata details page provides a comprehensive view of a dataset’s technical, business, and operational metadata. It’s the central place where you can understand what a dataset contains, how it’s used, where it came from, and who is responsible for it.

Dataset metadata summary

The summary sidebar provides a high-level summary of a dataset, so you can quickly determine its purpose and context.
[Image 03-dataset-metadata: A dataset metadata summary]
Some of the details the metadata contains include:
  • Documentation: Human-readable explanation of what the dataset contains
  • Owners: Person or team responsible for maintaining the dataset
  • Domain: Organizational unit or functional area the dataset belongs to, such as Finance, Marketing, or Engineering
  • Data Product: The data product the dataset belongs to, a curated group of assets managed by a domain
  • Tags: Informational labels such as PII, Finance, Sensitive, or Analytics
  • Composed of: Platform-specific representation of the dataset, such as Iceberg or Trino
  • Status: Timestamp indicating when someone last modified the metadata
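Much of this summary can also be retrieved programmatically. The following is a minimal sketch of a GraphQL query for a dataset's description, owners, domain, and tags; the URN is the example dataset used later in this guide, and the field names follow the open-source DataHub GraphQL schema, so they may differ slightly in your deployment.
# Minimal sketch: fetch a dataset's description, owners, domain, and tags.
# Replace <client> with your assigned client name; the dataset URN is illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\") { properties { name description } ownership { owners { owner { ... on CorpUser { urn } ... on CorpGroup { urn } } } } domain { domain { urn } } tags { tags { tag { name } } } } }"
      }'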

Schema detail

When you open a dataset, the Columns tab displays its schema, as previously described. Understanding the schema helps you know what data exists, how to use it, and how to interpret it correctly.
[Image 04-schema-details: Schema details of a dataset]
Some of these details include the following:
  • Column name
  • Data types such as STRING, INTEGER, or TIMESTAMP
  • Business-friendly description containing meanings or definitions
  • Tags containing labels for classification
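Column-level details can also be read through the GraphQL API. The sketch below queries the dataset's schema metadata; as above, the URN is illustrative and the field names follow the open-source DataHub GraphQL schema, so verify them against your deployment.
# Minimal sketch: fetch column names, native data types, and descriptions.
# Replace <client> with your assigned client name; the dataset URN is illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\") { schemaMetadata { fields { fieldPath nativeDataType description } } } }"
      }'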

Lineage details

The Lineage tab reveals what produced a dataset and which systems consume it. With this, you can identify dependencies, evaluate the impact of changes, and trace issues back to their origin. Two types of lineage exist on DataHub: dataset and job lineage.

Dataset lineage

Dataset-level lineage displays the upstream and downstream datasets (schemas and tables) connected to the dataset you are viewing.
[Image 05-dataset-lineage: A dataset lineage]
To view it, take the following steps:
  1. Search for the dataset name, schema, or other identifiers indexed in DataHub.
  2. Open the dataset overview page and select the Lineage tab.
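Lineage can also be queried programmatically. The following is a minimal sketch using the GraphQL lineage field to list a dataset's upstream dependencies; the direction can be changed to DOWNSTREAM, the URN is illustrative, and field names may differ slightly with your DataHub version.
# Minimal sketch: list upstream lineage for a dataset through the GraphQL API.
# Replace <client> with your assigned client name; the dataset URN is illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "query { dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\") { lineage(input: { direction: UPSTREAM, start: 0, count: 10 }) { total relationships { type entity { urn type } } } } }"
      }'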

Job lineage

Job lineage displays the dataset lineage, along with the task or pipeline that produced or transformed it. Some of the supported tasks or pipelines include:
  • Spark jobs, such as batch or streaming
  • Airflow DAGs and task-level lineage
  • SQL-based ETL jobs such as Trino or dbt
[Image 06-job-lineage: A job lineage]
To view it, take the following steps:
  1. Search for the job or pipeline name.
  2. Open the job overview page and select the Lineage tab.

API usage

DataHub provides REST and GraphQL APIs for programmatic metadata updates. These APIs are typically used for automated metadata pipelines, CI/CD workflows, or custom integrations where you need to update metadata without using the UI.

REST API

The REST API supports creating, updating, and querying metadata by sending Metadata Change Events (MCE) or Metadata Aspects. An MCE is a message that describes changes to one or more assets. A Metadata Aspect is a specific piece of metadata about an asset, such as its ownership, tags, or schema. For example, the following request updates a dataset description:
curl -X POST "https://datacatalog.<client>.nx1cloud.com/api/v2/entity?action=ingest" \
  -H "Content-Type: application/json" \
  -d '{
        "entityType": "dataset",
        "aspectName": "datasetProperties",
        "aspect": {
          "description": "Updated description"
        }
      }'
Replace <client> with your assigned client name. The previous command does the following:
  1. Targets a dataset entity
  2. Updates the datasetProperties aspect
  3. Replaces the description with the provided text

GraphQL API

The GraphQL API provides a flexible interface for querying metadata and performing fine-grained updates. Use cases include:
  • Fetching lineage, schema, or ownership programmatically
  • Adding tags or ownership to datasets
  • Automating glossary term assignments
For example, you might use the GraphQL API to fetch dataset profiles for an Iceberg dataset. The following API request and response illustrate this. API request:
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "operationName": "getDataProfiles",
    "variables": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)"
    },
    "query": "query getDataProfiles($urn: String!, $limit: Int, $startTime: Long, $endTime: Long, $filters: FilterInput) {
      dataset(urn: $urn) {
        urn
        type
        datasetProfiles(
          limit: $limit
          startTimeMillis: $startTime
          endTimeMillis: $endTime
          filter: $filters
        ) {
          rowCount
          columnCount
          sizeInBytes
          timestampMillis
          partitionSpec {
            type
            partition
            timePartition {
              startTimeMillis
              durationMillis
              __typename
            }
            __typename
          }
          fieldProfiles {
            fieldPath
            uniqueCount
            uniqueProportion
            nullCount
            nullProportion
            min
            max
            mean
            median
            stdev
            sampleValues
            quantiles {
              quantile
              value
              __typename
            }
            distinctValueFrequencies {
              value
              frequency
              __typename
            }
            __typename
          }
          __typename
        }
        __typename
      }
    }"
  }'
Replace <client> with your assigned client name. The previous command does the following:
  1. Requests data profiles for the dataset retail_banking.completed_accounts in the PROD environment
  2. Attempts to retrieve: row count, column count, dataset size, and more
API response:
{
	"data": {
		"dataset": {
			"urn": "urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)",
			"type": "DATASET",
			"datasetProfiles": [],
			"__typename": "Dataset"
		}
	},
	"extensions": {}
}
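In this example, the datasetProfiles array is empty, which typically means no profiles have been ingested for the dataset or none match the requested time range. The GraphQL API also supports mutations, such as the tag use case listed earlier. The following is a minimal sketch using the addTag mutation; the tag and dataset URNs are illustrative, and the mutation name follows the open-source DataHub GraphQL schema, so confirm it against your deployment.
# Minimal sketch: attach the PII tag to a dataset using the addTag mutation.
# Replace <client> with your assigned client name; the URNs are illustrative.
curl -X POST https://datacatalog.<client>.nx1cloud.com/api/v2/graphql \
  -H "Content-Type: application/json" \
  -d '{
        "query": "mutation { addTag(input: { tagUrn: \"urn:li:tag:PII\", resourceUrn: \"urn:li:dataset:(urn:li:dataPlatform:iceberg,retail_banking.completed_accounts,PROD)\" }) }"
      }'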

When to use APIs

Use API-based updates when you are trying to achieve the following:
  • Integrate DataHub with external systems
  • Automate updates via pipelines using Airflow, Jenkins, or GitHub Actions
  • Enforce metadata standards programmatically
  • Bulk update metadata at scale
However, most users prefer to manage metadata in the UI.

Data quality

DataHub provides capabilities for capturing, monitoring, and visualizing data quality rules and test results across datasets. These rules ensure that you can trust the data consumed and quickly identify issues affecting downstream products, models, or dashboards.

Data quality details

When viewing the data quality of a dataset, DataHub displays the following:
  • A list of all assertions (tests), the automated rules that validate the correctness of data
  • Test results containing pass/fail status with run timestamps
  • Timestamp of the latest execution
  • Column-level and table-level checks
  • Integrated external testing tools, such as custom Spark jobs or checks in Airflow jobs
  • Associated tags
[Image 07-data-quality-details: Data quality details]
You can click a specific assertion to view the following:
  • Full assertion definition
  • Historical pass/fail graph
  • Execution logs or failure summaries
[Image 08-passing-assertion: Details about a passing assertion]
Sometimes, assertions fail.
[Image 09-failing-assertion: Details about a failing assertion]

Data quality ingestion

Data quality metadata is typically ingested through scheduled pipelines. Supported integrations in NX1 include:
  • Custom Spark or SQL scripts sending results through the API
  • Airflow DAGs producing test assertions
  • Other integrations that can create assertions and send the results using the DataHub API
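As a rough sketch of the last option, a pipeline step could report an assertion result to the same REST ingest endpoint shown earlier. The payload below is illustrative only: it mirrors the simplified shape of the REST example above, and the aspect name and result fields follow DataHub's assertion model, so verify the exact payload shape against the DataHub documentation for your version.
# Illustrative only: report a passing assertion run through the REST ingest endpoint.
# The aspect and field names follow DataHub's assertion model; confirm the exact
# payload shape for your DataHub version before using this in a pipeline.
curl -X POST "https://datacatalog.<client>.nx1cloud.com/api/v2/entity?action=ingest" \
  -H "Content-Type: application/json" \
  -d '{
        "entityType": "assertion",
        "aspectName": "assertionRunEvent",
        "aspect": {
          "timestampMillis": 1718000000000,
          "runId": "airflow_daily_checks_2024_06_10",
          "status": "COMPLETE",
          "result": { "type": "SUCCESS" }
        }
      }'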

Troubleshooting data quality issues

When a test fails, perform the following sequence of actions:
  1. Search for the dataset name.
  2. Open the dataset overview page and select the Quality tab.
  3. Identify the failed assertion.
  4. Review the failure summary and logs.
  5. Use the Lineage tab to identify the root cause of the issue. Ask yourself, “Is it an upstream table, or an upstream job?”
  6. Contact the dataset owners or pipeline owners.
  7. Address the issue by either fixing the schema, repairing upstream data, or adjusting the transformation logic.
  8. Re-run the test and confirm that it passes.

DataHub best practices

To maintain a high-quality and trustworthy data catalog, follow these recommended best practices:
  1. Assign owners to every dataset: Ensure each dataset has a clearly identified owner responsible for quality, access, and documentation.
  2. Keep descriptions up to date: Maintain accurate descriptions at both the table and column levels so users can easily understand the dataset’s purpose and contents.
  3. Use standardized glossary terms: Apply approved business terms consistently across datasets to promote shared understanding and improve searchability.
  4. Tag datasets with relevant classifications: Use tags and classifications to support governance, discovery, and compliance workflows.
  5. Review stale or deprecated datasets: Periodically audit unused or superseded datasets and mark them as deprecated when appropriate.
  6. Monitor and maintain ingestion pipelines: Ensure metadata ingestion pipelines run reliably and without errors so the catalog remains accurate and current.
  7. Define and maintain data quality tests: Implement table-level and column-level tests for critical datasets to validate schema, freshness, null values, ranges, or business rules.
  8. Automate test execution within pipelines: Run data quality tests automatically as part of ETL/ELT workflows or orchestration jobs to ensure consistent and reliable validation.
  9. Investigate and resolve failures promptly: Use lineage and test failure details to diagnose root causes and coordinate remediation with upstream dataset owners.
  10. Monitor historical data quality trends: Review test history and recurring failures to detect long-term quality issues and prevent downstream impact.

Additional resources

  • For more details about DataHub, refer to the DataHub official documentation.
  • If you are using the NexusOne portal and want to learn how to launch DataHub, refer to the Govern page.