This quickstart guide gets you up to speed with the NexusOne platform. It walks you through how to ingest a file, check its quality, transform it, and track its lineage. The ingested file contains a small sample table of people from various cities and states in the United States of America, with the following columns:
  • ID
  • name
  • age
  • address
  • city
  • state
  • email
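The file is a plain CSV with the columns above as its header row. The sketch below parses a few hypothetical rows that mirror that schema (the names and values are illustrative, not the actual contents of customers.csv):

```python
import csv
import io

# Hypothetical rows that follow the quickstart file's schema; the actual
# contents of customers.csv will differ.
sample = (
    "ID,name,age,address,city,state,email\n"
    "1,Alice Smith,34,12 Oak St,Austin,TX,alice@example.com\n"
    "2,Bob Jones,41,9 Elm Ave,Denver,CO,bob@example.com\n"
)

reader = csv.DictReader(io.StringIO(sample))
print(reader.fieldnames)
# ['ID', 'name', 'age', 'address', 'city', 'state', 'email']
for row in reader:
    print(row["name"], row["city"])
```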
If multiple users are following this guide, check whether a domain or ingest details with the names used here already exist. If they do, choose different names when creating your own or delete the existing ones.

Prerequisites

  • Appropriate permissions: datahub-admin, nx1_engineer, nx1_ingest, nx1_monitor, nx1_s3_admin, and nx1_quality

Create a domain on DataHub

A DataHub domain categorizes related ingested data. NexusOne requires an existing domain before you can categorize ingested data or discover insights. Follow these steps to create a domain:
  1. Log in to NexusOne.
  2. On the top navigation bar, hover the mouse over Govern and then click Data Catalog. This opens the DataHub app hosted on NexusOne in a new tab.
  3. If asked to log in, enter your NexusOne credentials or click Log in with SSO.
  4. Navigate to the Domains tab in the far left, and then click Create.
  5. Enter quickstart_domain in the Name field.
  6. Click Create.
    01-quickstart-domain

    A created domain on DataHub
  7. Click the browser tab that has NexusOne to return to the NexusOne homepage.

Ingest a source and destination file

After creating the domain, ingest a source and destination file. The source file contains rows of data, while the destination file has no rows.
  1. On the NexusOne homepage, navigate to Ingest > File.
  2. In the File Details section, click Public File URL.
  3. In the File URL field, enter the following URL to a CSV file:
    https://rapid-file-tutorial.s3.us-east-1.amazonaws.com/customers.csv
    
  4. In the Ingest Details section, add the following information in the fields:
    • Name: csv_quickstart_source
    • Schema: csv_quickstart_schema_source
    • Table: csv_quickstart_table_source
    • Schedule: Every 3 hours
    • Mode: append
    • Domain: quickstart_domain
    • Tags: Don’t add any tags
  5. Click Ingest. Wait a few minutes until a success message appears.
  6. Click View Jobs. After a few minutes, your job name, csv_quickstart_source, appears in the list along with its current status. Ensure that the status is Completed. If the status is New, wait a little longer.
  7. Return to the NexusOne homepage, and then click Discover to launch Superset.
  8. Click New and then SQL query.
  9. Paste the following SQL command and then click the play icon to see the table from the source file.
    SELECT * FROM csv_quickstart_schema_source.csv_quickstart_table_source
    
    It should show you a result similar to the following image. Rows 6 and 11 are duplicates.
    02-source-table

    Table from the source file

Check data quality

After ingesting all files, follow these steps to check for data quality:
  1. On the NexusOne homepage, click Quality.
  2. Click Select catalogs, and then select the iceberg checkbox.
  3. Select a schema and table.
    • schema: iceberg.csv_quickstart_schema_source
    • table: iceberg.csv_quickstart_table_source
  4. In the Describe a rule in detail field, enter the following rule:
    Check if multiple rows with the same name exist in the table.
    
  5. Click Send. NexusOne generates one or more SQL commands.
  6. Pick one of the commands, copy it, and then click Accept.
  7. On the NexusOne homepage, click Discover to launch Superset.
  8. Click New and then SQL query.
  9. Paste the SQL command and then click the play icon to execute the data quality rule. It should show you a result similar to the following image, indicating that duplicate rows exist.
03-duplicate-rows

Result from a data quality check
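A natural-language rule like "check if multiple rows with the same name exist" typically compiles to a GROUP BY/HAVING query. The sketch below shows one plausible shape for that generated SQL, run against an in-memory SQLite table standing in for the Iceberg table (the exact SQL NexusOne generates may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csv_quickstart_table_source (ID INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO csv_quickstart_table_source VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice")],  # 'Alice' appears twice
)

# One plausible shape for the generated quality rule: group rows by name
# and keep only names that occur more than once.
duplicates = conn.execute(
    """
    SELECT name, COUNT(*) AS occurrences
    FROM csv_quickstart_table_source
    GROUP BY name
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)
# [('Alice', 2)]
```

An empty result from such a query would indicate that the table passes the rule.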

Transform the data

Because multiple rows with the same name exist in the table, you can perform deduplication to remove the duplicate rows and send the transformed result into a new table.
  1. On the NexusOne homepage, click Engineer.
  2. Click Lakehouse.
  3. Select a schema and table.
    • schema: csv_quickstart_schema_source
    • table: csv_quickstart_table_source
  4. Click Next: Transform.
  5. In the Job Name field, enter transform_csv_quickstart.
  6. In the Transform Prompt field, enter the following rule:
    Perform deduplication by using SELECT to remove any row with a duplicate name.
    Keep all columns, and for duplicates, keep the row with the smallest ID.
    
  7. Select the Show preview? checkbox to preview the result of the generated query.
  8. Click Transform. NexusOne displays a data preview.
  9. Click Finalize to proceed to the next step.
  10. Enter the following information in the fields:
    • Destination Schema: csv_quickstart_schema_destination
    • Destination Table: csv_quickstart_table_destination
    • Domain: quickstart_domain
    • Tags: Don’t select any tags.
    • Schedule: Every 3 hours
    • Mode: append
  11. Click Schedule. The job runs on its first scheduled execution, performing the data transformation.
  12. Go back to the NexusOne homepage, and then click Discover to launch Superset.
  13. Click New and then SQL query.
  14. Paste the following SQL command and then click the play icon to see the transformed result.
    SELECT * FROM csv_quickstart_schema_destination.csv_quickstart_table_destination
    
    It should show you a result similar to the following image. Notice that only one record from each set of duplicate rows remains.
    04-transformed-table

    Results from the transformed source table appearing in the destination table
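The transform prompt above asks for deduplication by name, keeping the row with the smallest ID. One common SQL shape for this is to select only the rows whose ID is the minimum within each name group; the sketch below demonstrates it with an in-memory SQLite table (the SQL NexusOne actually generates from the prompt may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (ID INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, "Alice", "Austin"), (2, "Bob", "Denver"), (3, "Alice", "Boston")],
)

# Keep all columns, but for rows sharing a name keep only the row with
# the smallest ID -- one way to express the quickstart's transform prompt.
deduped = conn.execute(
    """
    SELECT * FROM source
    WHERE ID IN (SELECT MIN(ID) FROM source GROUP BY name)
    ORDER BY ID
    """
).fetchall()
print(deduped)
# [(1, 'Alice', 'Austin'), (2, 'Bob', 'Denver')]
```

The duplicate 'Alice' row with ID 3 is dropped, matching the behavior described for the destination table.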

Trace data lineage on DataHub

In NexusOne, the Iceberg catalog uses a Hive Metastore. You can see how the data flowed across your pipeline by following these steps:
  1. Navigate to the NexusOne homepage.
  2. On the top navigation bar, hover the mouse over Connect and then click Hosted Apps.
  3. Click Global Data Catalog to launch DataHub.
  4. In DataHub, enter csv_quickstart_schema_destination in the top search bar and then press the Enter or Return key on your keyboard to search for your transformed table.
  5. Click Hive. This applies the filter, Platform equals Hive.
  6. Click the transformed table, csv_quickstart_schema_destination.csv_quickstart_table_destination.
  7. Click Lineage.
  8. When the diagrams load, click the left-arrow icon on the top-left box to open the full lineage. You should see an image similar to the following:
05-data-lineage

Data lineage of the transformed table
A breakdown of the data lineage image:
  1. From the left, you see an S3 source bucket storing the raw ingested CSV file.
  2. The metadata of the CSV file is available using a Hive Metastore, which uses an Iceberg table format.
  3. That table is then transformed and written back out to an S3 destination bucket.
  4. The metadata from the destination bucket is available using a Hive Metastore, which uses an Iceberg table format.

Clean up

Delete the recurring scheduled jobs and the domain you created. Follow these steps to delete the jobs:
  1. On the NexusOne homepage, click Monitor.
  2. For each job name, csv_quickstart_source and transform_csv_quickstart, click the three dots ... menu, and then click Delete job.
Follow these steps to delete the domain:
  1. On the top navigation bar, hover the mouse over Govern and then click Data Catalog. This opens the DataHub app hosted on NexusOne in a new tab.
  2. If asked to log in, enter your NexusOne credentials or click Log in with SSO.
  3. Navigate to the Domains tab in the far left.
  4. On the domain name, quickstart_domain, click the three dots ... menu, and then select Delete. Confirm your selection by clicking Yes.
Congratulations, you have successfully completed the guide.

Additional details

NexusOne offers many more features; browse through the documentation for more information about each of them.