The Spark best practices page describes several efficient ways to use Spark. These guidelines for building efficient, maintainable ETL pipelines include the following:
- Incremental loading (preferred):
  - Load only new or changed data since the last run.
  - Use watermarks or timestamp columns.
  - Dramatically reduces processing time and costs.
  - Example: `WHERE updated_at > '2025-01-01'`.
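As a minimal sketch of the watermark idea, the helper below builds the filter query for one incremental run and advances the watermark afterwards. The table name `raw.orders`, the function names, and the way the watermark is stored between runs are all assumptions made for illustration; only the `updated_at` filter comes from the guideline above.

```python
from datetime import datetime


def build_incremental_query(table: str, watermark_col: str, last_watermark: datetime) -> str:
    """Return a Spark SQL query that selects only rows changed after the last run."""
    # Render the watermark as a timestamp literal that Spark SQL can parse.
    literal = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {watermark_col} > TIMESTAMP '{literal}'"


def next_watermark(rows, watermark_col: str, previous: datetime) -> datetime:
    """Advance the watermark to the newest timestamp seen, or keep the old one if no rows arrived."""
    return max((row[watermark_col] for row in rows), default=previous)


# Hypothetical table name and starting watermark, for illustration only.
query = build_incremental_query("raw.orders", "updated_at", datetime(2025, 1, 1))
print(query)
# SELECT * FROM raw.orders WHERE updated_at > TIMESTAMP '2025-01-01 00:00:00'
```

Assuming `spark` is an existing SparkSession, `spark.sql(query)` would return only the new or changed rows. Persisting the value from `next_watermark` after a successful write is what lets the next run pick up where this one stopped.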
- Deduplication:
  - Raw data often contains duplicates.
  - Use window functions with `row_number()` to identify the latest records.
  - Or use `dropDuplicates()` for simple cases.
  - Always deduplicate before final table writes.
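The window-function approach can be sketched as a small query builder. It ranks rows within each business key by a timestamp and keeps only the newest one. The table and column names (`raw.orders`, `order_id`, `updated_at`) and the function name are hypothetical, and the query is one possible phrasing rather than a prescribed pattern.

```python
from typing import Sequence


def build_dedup_query(table: str, key_cols: Sequence[str], order_col: str) -> str:
    """Return a Spark SQL query that keeps the latest row per key using ROW_NUMBER()."""
    if not key_cols:
        raise ValueError("at least one key column is required")
    keys = ", ".join(key_cols)
    return (
        "SELECT * FROM (\n"
        # rn = 1 marks the newest row within each key group.
        f"  SELECT *, ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY {order_col} DESC) AS rn\n"
        f"  FROM {table}\n"
        ") ranked\n"
        "WHERE rn = 1"
    )


# Hypothetical table and columns, for illustration only.
print(build_dedup_query("raw.orders", ["order_id"], "updated_at"))
```

The result still carries the helper column `rn`, so drop it before writing, for example with `spark.sql(query).drop("rn")` on an existing SparkSession. For the simple case where any one row per key is acceptable, `df.dropDuplicates(["order_id"])` is shorter, but it does not let you choose which duplicate survives.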
- Iceberg `MERGE` for upserts:
  - Recommended approach for slowly changing dimensions.
  - Provides ACID guarantees and efficient updates.
  - Handles both inserts and updates in a single operation.
  - Better than the delete and insert pattern.
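To illustrate the upsert pattern, the sketch below assembles a `MERGE INTO` statement from a list of key columns and a list of columns to update. The target `catalog.db.customers`, the source view `staged_customers`, the column names, and the function name are all assumptions; adapt them to your own tables.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, key_cols: Sequence[str], update_cols: Sequence[str]) -> str:
    """Return a MERGE INTO statement that updates matching rows and inserts new ones."""
    if not key_cols or not update_cols:
        raise ValueError("key_cols and update_cols must both be non-empty")
    # Rows match when every key column is equal in target (t) and source (s).
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical target table, source view, and columns, for illustration only.
print(build_merge_sql("catalog.db.customers", "staged_customers", ["customer_id"], ["name", "email"]))
```

Assuming `spark` is a SparkSession configured with an Iceberg catalog, the statement can be run with `spark.sql(...)`. Deduplicate the source on the key columns first, because a target row that matches more than one source row generally makes the merge fail.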
- Partitioning strategy:
  - Partition by common filter columns such as date, region, or category.
  - Avoid over-partitioning; `<1 GB` per partition is ideal.
  - Use partition pruning in queries for performance.
  - Consider bucketing for high-cardinality columns.
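The partitioning guidelines can be sketched as a table definition plus a query that filters on the partition columns. This example assumes an Iceberg table and uses Iceberg's `days(...)` and `bucket(N, col)` partition transforms; the table name, columns, and bucket count of 16 are hypothetical choices, not recommendations.

```python
from typing import Sequence, Tuple


def build_partitioned_ddl(table: str, columns: Sequence[Tuple[str, str]], partition_exprs: Sequence[str]) -> str:
    """Return a CREATE TABLE statement for an Iceberg table with the given partition spec."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(partition_exprs)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "USING iceberg\n"
        f"PARTITIONED BY ({parts})"
    )


# Hypothetical schema: date and region are common filters; user_id is high-cardinality.
ddl = build_partitioned_ddl(
    "catalog.db.events",
    [("event_ts", "timestamp"), ("region", "string"), ("user_id", "bigint")],
    ["days(event_ts)", "region", "bucket(16, user_id)"],
)

# Filtering on the partition columns lets the engine skip unrelated partitions.
pruned_query = (
    "SELECT * FROM catalog.db.events "
    "WHERE event_ts >= TIMESTAMP '2025-01-01 00:00:00' AND region = 'EU'"
)
print(ddl)
```

Bucketing `user_id` instead of partitioning on it directly keeps the partition count bounded, which is one way to avoid over-partitioning on a high-cardinality column.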
Additional resources
- To get an overview of Spark, refer to the Spark in NexusOne page.
- To learn practical ways to use Spark in the NexusOne environment, refer to the Spark hands-on examples page.
- For more details about Spark, refer to the Spark and Iceberg official documentation.