Key features
- ACID transactions: Keep data consistent using serializable or snapshot isolation. Iceberg defaults to serializable isolation and lets multiple jobs read from and write to a table at the same time, with conflicting writes detected at commit and retried.
- Schema evolution: Add, drop, or rename columns without rewriting data. Schema changes are tracked in table metadata, so all readers see the updated schema.
- Partitioning flexibility: Change a table’s partitioning scheme, for example, to partition by date instead of by region, without touching existing files. New data uses the new partitioning scheme, old data remains accessible, and queries scan only relevant partitions.
- Time travel: Query historical table states using snapshot IDs or timestamps for auditing, debugging, and reproducible analytics.
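As a rough illustration of how these features are usually exercised from Spark SQL, the sketch below only builds the statement strings and prints them; it doesn't run anything against a cluster. The table name nexus.db.events, the column names, the snapshot ID, and the timestamp are all hypothetical. The partition statements assume the Iceberg Spark SQL extensions are enabled, and the time travel syntax assumes Spark 3.3 or later.

```python
def iceberg_feature_sql(table: str, snapshot_id: int, as_of: str) -> dict:
    """Return example Spark SQL statements, grouped by Iceberg feature.

    All column names below are hypothetical placeholders.
    """
    return {
        # Switch MERGE operations from the default serializable level to snapshot.
        "isolation": [
            f"ALTER TABLE {table} SET TBLPROPERTIES "
            "('write.merge.isolation-level' = 'snapshot')",
        ],
        # Metadata-only column changes; no data files are rewritten.
        "schema_evolution": [
            f"ALTER TABLE {table} ADD COLUMNS (device_type string)",
            f"ALTER TABLE {table} RENAME COLUMN payload TO body",
            f"ALTER TABLE {table} DROP COLUMN legacy_flag",
        ],
        # Move from partitioning by region to partitioning by day.
        "partition_evolution": [
            f"ALTER TABLE {table} ADD PARTITION FIELD day(event_ts)",
            f"ALTER TABLE {table} DROP PARTITION FIELD region",
        ],
        # Read the table as of a snapshot ID or a point in time.
        "time_travel": [
            f"SELECT * FROM {table} VERSION AS OF {snapshot_id}",
            f"SELECT * FROM {table} TIMESTAMP AS OF '{as_of}'",
        ],
    }


statements = iceberg_feature_sql(
    "nexus.db.events", 1234567890123456789, "2024-01-01 00:00:00"
)
for feature, sqls in statements.items():
    for sql in sqls:
        print(f"{feature}: {sql}")
```

In a Spark session, each string could be passed to spark.sql(...). The isolation-level property is shown for merges only; Iceberg exposes separate properties for updates and deletes.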
Iceberg components
Iceberg’s architecture comprises the following components from top to bottom:
- Catalog: Contains a pointer to an Iceberg table’s metadata file
- Metadata file: Contains all information about a table. It’s a metadata.json file, and it comprises the following:
  - Snapshots: Records the table’s content at specific points in time
  - Manifest lists: Tracks all manifest files in a snapshot
  - Manifest files: Tracks data files
- Data files: Parquet, ORC, or Avro files storing the table’s data
- Storage layer: S3, HDFS, or other object stores holding the metadata and data files
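To make the top-to-bottom lookup concrete, here is a toy in-memory model of the component chain. It isn't Iceberg's API: every class and function name is invented for illustration, and the manifest list is embedded directly in each snapshot, whereas a real snapshot references a separate manifest list file in storage. The sketch shows how a reader would go from the catalog pointer to the data files of either the current snapshot or an older one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ManifestFile:
    data_files: List[str]  # paths of Parquet/ORC/Avro files


@dataclass
class ManifestList:
    manifests: List[ManifestFile]  # all manifests that make up one snapshot


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList


@dataclass
class TableMetadata:
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None


class Catalog:
    """Maps a table name to its current metadata (a file pointer in practice)."""

    def __init__(self) -> None:
        self._pointers: Dict[str, TableMetadata] = {}

    def register(self, name: str, metadata: TableMetadata) -> None:
        self._pointers[name] = metadata

    def load(self, name: str) -> TableMetadata:
        return self._pointers[name]


def plan_files(metadata: TableMetadata, snapshot_id: Optional[int] = None) -> List[str]:
    """Walk snapshot -> manifest list -> manifests -> data files."""
    target = metadata.current_snapshot_id if snapshot_id is None else snapshot_id
    snapshot = next(s for s in metadata.snapshots if s.snapshot_id == target)
    return [
        path
        for manifest in snapshot.manifest_list.manifests
        for path in manifest.data_files
    ]


# Snapshot 1 holds one manifest; snapshot 2 reuses it and adds another.
m1 = ManifestFile(["s3://bucket/events/data/a.parquet"])
m2 = ManifestFile(["s3://bucket/events/data/b.parquet"])
metadata = TableMetadata(
    snapshots=[Snapshot(1, ManifestList([m1])), Snapshot(2, ManifestList([m1, m2]))],
    current_snapshot_id=2,
)
catalog = Catalog()
catalog.register("db.events", metadata)

table = catalog.load("db.events")
current_files = plan_files(table)
older_files = plan_files(table, snapshot_id=1)
print(current_files)
print(older_files)
```

Because the second snapshot reuses the first manifest rather than copying it, reading an older snapshot only means starting the walk from a different snapshot entry.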
Environment configuration
This section describes how the NexusOne team configured Iceberg. The NexusOne environment currently supports the following versions and settings:
- Iceberg version: 1.8
- Table format version: 2.0
- File format: Defaults to Parquet, but also supports Avro and ORC
- Catalog type: Hive Metastore
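For orientation, a Spark setup matching the settings above might look roughly like the following sketch. The catalog name nexus and the metastore URI are placeholders, not the actual NexusOne values, and the real environment may set these properties elsewhere. The property keys themselves are standard Iceberg Spark catalog and table properties; note that the format-version table property takes the integer 2.

```python
CATALOG_NAME = "nexus"  # hypothetical catalog name
METASTORE_URI = "thrift://metastore.example.internal:9083"  # hypothetical URI


def iceberg_spark_conf(catalog: str, metastore_uri: str) -> dict:
    """Spark properties for an Iceberg catalog backed by a Hive Metastore."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg-specific SQL such as ALTER TABLE ... ADD PARTITION FIELD.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": metastore_uri,
    }


# Table-level properties corresponding to the listed format version and file format.
TABLE_PROPERTIES = {
    "format-version": "2",
    "write.format.default": "parquet",  # "avro" and "orc" are the alternatives
}


def to_spark_submit_args(conf: dict) -> list:
    """Flatten a properties dict into --conf arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = iceberg_spark_conf(CATALOG_NAME, METASTORE_URI)
submit_args = to_spark_submit_args(conf)
print(" ".join(submit_args))
```

The same dictionary could instead be applied with SparkSession.builder.config(key, value) for each entry when building a session in code.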
Additional resources
- To learn about best practices when using Iceberg in the NexusOne environment, refer to the Iceberg page.
- For more details about Apache Iceberg, refer to the Apache Iceberg official documentation.
- For more details about how NexusOne integrates Apache Spark, refer to the Apache Spark page.