Deployment context
Within NexusOne, Spark jobs can execute in the following environments:

- Jupyter Notebooks: An interactive REPL environment with a pre-initialized SparkSession. Optimal for exploratory data analysis, prototyping, and collaborative development. If you are using the NexusOne portal, you can launch a Jupyter Notebook using the NexusOne Build feature.
- Apache Airflow: Provides production orchestration through PySpark decorators.
  - Batch scheduling: Jobs run as scheduled cron tasks with continuous monitoring and configured alerting.
  - Execution method: Jobs run as containerized Spark applications in Kubernetes pods with full lifecycle management.
- Command line via Kyuubi: Provides command-line submission for jobs launched from outside the Kubernetes cluster.
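For the Airflow environment, a DAG might look like the following minimal sketch. It assumes Airflow with the Apache Spark provider package installed (which supplies the `@task.pyspark` decorator) and a Spark connection named `spark_default`; the DAG id and schedule are illustrative, not NexusOne defaults.

```python
# Hypothetical Airflow DAG using the @task.pyspark decorator.
# Assumes apache-airflow plus the apache-airflow-providers-apache-spark
# package; the "spark_default" connection id is an assumption.
import pendulum
from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def daily_spark_pipeline():
    @task.pyspark(conn_id="spark_default")
    def transform(spark=None, sc=None):
        # Airflow injects a SparkSession (`spark`) and SparkContext (`sc`).
        df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
        return df.count()

    transform()


daily_spark_pipeline()
```

Once deployed, the scheduler runs this task on the daily cron cadence described above, with monitoring and alerting handled by Airflow.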
Current version information
- Python version: 3.12
- Spark version: 3.5.6
Supported Spark interfaces
- PySpark via the Python API: Programmatic access to DataFrame and SQL APIs for pipeline development and data engineering workflows.
- Spark SQL: Write standard SQL queries that execute on Spark. Perfect for analysts and data warehouse teams who prefer SQL syntax for querying and transforming data.
Supported library and table format
- Apache Iceberg: 1.8
The Apache Iceberg table format provides snapshot isolation and schema evolution.
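To use Iceberg tables from Spark, the session needs an Iceberg catalog configured. The following is an illustrative sketch for local experimentation, not the NexusOne defaults: the catalog name (`local`), warehouse path, and runtime jar coordinates are all assumptions.

```python
from pyspark.sql import SparkSession

# Illustrative Iceberg catalog configuration; catalog name, warehouse
# path, and jar coordinates are assumptions for local use.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Schema evolution is then a metadata-only DDL operation, e.g.:
# spark.sql("CREATE TABLE local.db.events (id BIGINT, name STRING) USING iceberg")
# spark.sql("ALTER TABLE local.db.events ADD COLUMN ts TIMESTAMP")
```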
Default Spark configuration
Additional resources
- To learn practical ways to use Spark in the NexusOne environment, refer to the Spark hands-on examples page.
- To learn about best practices when using Spark, refer to the Spark best practices page.
- For more details about Spark, refer to the Spark and Iceberg official documentation.