Apache Iceberg is an open table format for large-scale analytical datasets in data lakes. It provides ACID transaction guarantees, schema evolution, and time travel. Traditional table formats, such as those used by Hive or Impala, tie metadata and read/write transactions to a single compute engine, such as Spark or Flink, so only that engine can manage a table while keeping it consistent. Iceberg instead separates table metadata from the compute engine, which allows multiple engines to read and write data concurrently against a consistent table state.

Because NexusOne requires multi-engine access, strong consistency, and long-term schema evolution in its data lakehouse, the NexusOne team adopted Iceberg as its foundational table format. NexusOne uses a Hive Metastore as the Iceberg catalog and stores data files in Amazon S3.

Key features

  • ACID transactions: Keeps data consistent using serializable or snapshot isolation. Iceberg defaults to serializable isolation, so multiple jobs can read from and write to a table at the same time, with conflicting writes detected at commit time and retried.
  • Schema evolution: Add, drop, or rename columns without rewriting data, with schema changes tracked in table metadata, so all readers see the updated schema.
  • Partitioning flexibility: Change a table’s partitioning scheme, for example, from partitioning by region to partitioning by date, without rewriting existing files. New data uses the new scheme, old data remains accessible under the old one, and queries scan only relevant partitions.
  • Time travel: Query historical table states using snapshot IDs or timestamps for auditing, debugging, and reproducible analytics.
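
The commit and time travel behavior in the list above can be illustrated with a toy model. The following Python sketch is not Iceberg's implementation or API: ToyTable, Snapshot, CommitConflict, commit_with_retry, and the file names are hypothetical names chosen for illustration. It assumes a writer may commit only if the table's current snapshot is still the one the writer read, and that a time travel read resolves to the latest snapshot committed at or before the requested timestamp.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: Tuple[str, ...]  # data files visible in this snapshot


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


@dataclass
class ToyTable:
    snapshots: List[Snapshot] = field(default_factory=list)

    def current_id(self) -> Optional[int]:
        return self.snapshots[-1].snapshot_id if self.snapshots else None

    def commit(self, base_id: Optional[int], new_files: Tuple[str, ...],
               timestamp_ms: int) -> Snapshot:
        # Optimistic check: succeed only if nobody committed since base_id was read.
        if base_id != self.current_id():
            raise CommitConflict(f"table moved on from snapshot {base_id}")
        previous = self.snapshots[-1].files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, timestamp_ms, previous + new_files)
        self.snapshots.append(snapshot)
        return snapshot

    def snapshot_as_of(self, timestamp_ms: int) -> Snapshot:
        # Time travel: latest snapshot committed at or before the timestamp.
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not candidates:
            raise ValueError("no snapshot at or before that timestamp")
        return candidates[-1]


def commit_with_retry(table: ToyTable, new_files: Tuple[str, ...],
                      timestamp_ms: int, max_attempts: int = 3) -> Snapshot:
    for _ in range(max_attempts):
        base_id = table.current_id()  # re-read the latest state on every attempt
        try:
            return table.commit(base_id, new_files, timestamp_ms)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToyTable()
commit_with_retry(table, ("a.parquet",), timestamp_ms=100)
stale_base = table.current_id()              # writer A reads the table here
commit_with_retry(table, ("b.parquet",), timestamp_ms=200)  # writer B commits first
try:
    table.commit(stale_base, ("c.parquet",), timestamp_ms=300)  # writer A is now stale
    conflict_detected = False
except CommitConflict:
    conflict_detected = True
commit_with_retry(table, ("c.parquet",), timestamp_ms=300)  # retry on fresh state
```

Earlier snapshots are never modified in this model, which is why a read as of an older timestamp keeps returning the same set of files after later commits.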

Iceberg components

Iceberg’s architecture comprises the following components from top to bottom:
  1. Catalog: Contains a pointer to an Iceberg table’s metadata file
  2. Metadata layer: Contains all information about a table. It comprises the following:
    1. Metadata file: A metadata.json file that records the table’s snapshots, each capturing the table’s content at a specific point in time
    2. Manifest lists: Track all manifest files in a snapshot
    3. Manifest files: Track data files
  3. Data files: Parquet, ORC, or Avro files storing the table’s data
  4. Storage layer: S3, HDFS, or other storage systems holding the metadata and data files
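
The top-to-bottom lookup through these components can be sketched in Python. This is a simplified in-memory model, not the Iceberg library: the class names, the plan_files function, the table name, and every path are hypothetical, and real metadata carries many more fields. It assumes each layer is stored as a separate object that the layer above references by path.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Manifest:
    data_files: List[str]      # paths of Parquet, ORC, or Avro data files


@dataclass
class ManifestList:
    manifests: List[str]       # paths of the manifest files in one snapshot


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: str         # path of this snapshot's manifest list


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: List[Snapshot]


def plan_files(catalog: Dict[str, str], storage: Dict[str, object], table: str) -> List[str]:
    """Follow catalog -> metadata file -> snapshot -> manifest list -> manifests."""
    metadata = storage[catalog[table]]  # the catalog holds only a pointer
    snapshot = next(s for s in metadata.snapshots
                    if s.snapshot_id == metadata.current_snapshot_id)
    manifest_list = storage[snapshot.manifest_list]
    files: List[str] = []
    for manifest_path in manifest_list.manifests:
        files.extend(storage[manifest_path].data_files)
    return files


# Hypothetical storage layer contents, keyed by path.
storage = {
    "meta/v2.metadata.json": TableMetadata(
        current_snapshot_id=2,
        snapshots=[Snapshot(1, "meta/snap-1.avro"), Snapshot(2, "meta/snap-2.avro")],
    ),
    "meta/snap-1.avro": ManifestList(["meta/m1.avro"]),
    "meta/snap-2.avro": ManifestList(["meta/m1.avro", "meta/m2.avro"]),
    "meta/m1.avro": Manifest(["data/part-0.parquet"]),
    "meta/m2.avro": Manifest(["data/part-1.parquet", "data/part-2.parquet"]),
}
catalog = {"db.events": "meta/v2.metadata.json"}

current_files = plan_files(catalog, storage, "db.events")
```

In this model, snapshot 2 reuses the manifest written for snapshot 1 and adds one new manifest, so a commit only writes metadata for the files it changes.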

Environment configuration

This section describes how the NexusOne team configured Iceberg. The NexusOne environment currently supports the following:
  • Iceberg version: 1.8
  • Table format version: 2.0
  • File format: Defaults to Parquet, but also supports Avro and ORC
  • Catalog type: Hive Metastore
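
As an illustration of how settings like these are commonly expressed for Spark, the following Python sketch builds the session configuration and a CREATE TABLE statement as plain data. The configuration keys and table properties are standard Iceberg ones, but the catalog name nexus, the metastore URI, the table name, the columns, and both helper functions are hypothetical. The source does not state which values NexusOne actually uses or whether it sets these properties explicitly.

```python
from typing import Dict


def iceberg_spark_conf(catalog_name: str, metastore_uri: str) -> Dict[str, str]:
    """Spark settings for an Iceberg catalog backed by a Hive Metastore."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",          # catalog type: Hive Metastore
        f"{prefix}.uri": metastore_uri,    # hypothetical Thrift endpoint
    }


# Table properties matching the list above: format version 2, Parquet by default.
TABLE_PROPERTIES = {"format-version": "2", "write.format.default": "parquet"}


def create_table_sql(table: str, columns: str, properties: Dict[str, str]) -> str:
    """Render a Spark SQL statement that creates an Iceberg table."""
    props = ", ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return f"CREATE TABLE {table} ({columns}) USING iceberg TBLPROPERTIES ({props})"


conf = iceberg_spark_conf("nexus", "thrift://metastore.example:9083")
sql = create_table_sql("nexus.db.events", "id BIGINT, ts TIMESTAMP", TABLE_PROPERTIES)
```

Each entry in conf could be passed to a Spark session builder through its config method, and the sql string could be run through spark.sql once the Iceberg runtime is on the classpath.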

Additional resources

  • To learn about best practices when using Iceberg in the NexusOne environment, refer to the Iceberg page.
  • For more details about Apache Iceberg, refer to the Apache Iceberg official documentation.
  • For more details about how NexusOne integrates Apache Spark, refer to the Apache Spark page.