Environments and runtime specification
- Platform-managed environment with pre-installed packages
- Python version: 3.x (NexusOne ships with the latest supported version)
Capabilities
JupyterHub offers several core features and supports a range of common use cases.
Key features
- A browser-based notebook interface that requires no local installation.
- An integrated Spark session that’s available directly from Python notebooks.
- Built-in support for Trino SQL queries.
- Ability to mix documentation, code, and output in a single file.
- Access to shared utilities (utils) and an Airflow DAGs folder (dags).
- Optional terminal access for scripting and running batch workflows.
Use cases
JupyterHub is often recommended for the following use cases:
- Building or testing PySpark transformations
- Data quality validation workflows
- Rapid prototyping before building an Airflow DAG
- Small-scale analytics and interactive SQL queries
- Investigating raw data stored in Iceberg, Hive, and S3
- Trying out Python libraries, utilities, or custom logic
- Debugging table issues, lineage, or schema mismatches
Authentication
JupyterHub uses NexusOne’s standard authentication flow:
- Single Sign-On (SSO)
- OAuth 2.0 or OpenID Connect
User access and permissions
NexusOne’s security layer manages access as follows:
- IAM groups grant JupyterHub access.
- Ranger policies grant access to datasets, Hive metadata, Iceberg tables, and S3 object storage.
- IAM users can view or interact with data sources and storage locations they’re authorized for.
Exploring the JupyterHub UI
This section explores the JupyterHub UI and how to navigate it. Before exploring the UI, launch JupyterHub using your designated URL. When you purchase NexusOne, you receive a client name; replace client with your assigned client name in the URL.
JupyterHub homepage layout
File browser
As seen in the previous image, the top left panel displays your workspace directory. When you launch a JupyterHub environment in NexusOne, you should see the following folders:
- dags/: This folder is automatically synced with Airflow.
  - Any .py file placed here becomes visible to the Airflow scheduler.
  - Use this location to develop, test, and maintain Airflow DAGs.
  - Don’t store temporary notebooks or test files here.
- utils/: A shared utilities folder, mounted into every Jupyter session.
  - Contains common helper functions, shared logic, configuration parsers, and reusable modules.
  - You can import utilities from here directly into notebooks or DAGs, as in the sketch below.
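A minimal sketch, assuming a hypothetical helper module date_helpers with a function parse_partition_date lives in utils/; use the modules your team actually keeps there:

```python
# Sketch: import a shared helper from the mounted utils/ folder.
# The module and function names are hypothetical examples.
import sys

# Adjust the path to where utils/ is mounted in your workspace if it
# isn't already on the Python path.
sys.path.append("utils")

from date_helpers import parse_partition_date

run_date = parse_partition_date("2024-01-15")
print(run_date)
```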
In the file browser, you can also perform actions such as:
- Creating new folders
- Uploading or downloading files
- Managing files by renaming, copying, or deleting them
Main launcher
As seen in the previous image, the center panel displays the main launcher. It lets you create notebooks, terminals, and other file types. The Launcher contains the following sections: Notebook, Console, and Other.
Notebook section
This section displays the available notebook kernels.
- Python 3 (ipykernel): The primary kernel used for the following use cases:
- Data exploration
- ETL validation workflows
- PySpark development
- Trino queries through Python libraries
- R Kernel: Used for R-based analysis and visualizations
Console section
An interface that sends commands directly to the Python or R kernel and returns output immediately. It does this without creating notebook cells or storing the interaction as a document. It’s useful for:
- Quick variable checks
- Running isolated code without creating notebooks
- Testing Python or Spark one-liners
Other section
This includes tools and file templates inside JupyterHub.
- Terminal: Opens a real shell session on the Jupyter server for the following:
- Navigating directories
- Running scripts
- Installing user-level Python packages
- Text file: Creates an empty .txt file
- Markdown file: Creates an empty .md file, useful for writing documentation or notes
- Python file: Creates an empty .py file
- R file: Creates an empty .r file
- Show contextual help: Opens a help panel with documentation for JupyterLab features
Status bar
As seen in the previous image, the bottom panel provides information about the current status of the environment, such as:
- Active terminal count
- CPU usage
- Current kernel
- Memory usage
- Notebook mode
Sidebar tools
As seen in the previous image, the top left sidebar provides quick access to all major tools available in your environment. Each icon represents a specific feature, and understanding these helps you navigate and manage your work efficiently.
File browser
You can view your workspace directory and all the files and folders available in your Jupyter environment. Some of the key features include the following:
- Access mounted platform folders such as:
  - The Airflow DAGs folder, dags/
  - The shared helper library, utils/
- Upload and download files
- Create, rename, delete, or move files
- Use drag-and-drop to add files from your local machine or to organize files into folders within JupyterHub.
Common tasks include the following:
- Creating folders for your projects
- Editing or placing DAG files
- Opening notebooks
Running terminals and kernels
You can display all currently running processes, such as:
- Active kernels
- Active notebooks
- Open tabs
- Open terminals

This view is useful for:
- Restarting a stuck notebook kernel
- Cleaning up forgotten terminals
- Monitoring your active background jobs
Table of contents
You can automatically generate a structured outline of your notebook based on markdown headers. Some of the key features include the following:
- Navigating large notebooks easily
- Jumping quickly between sections
- Maintaining readable notebooks with proper documentation
This is especially useful for:
- Notebooks with heavy documentation
- Presentation-style notebook flows
Object storage browser for S3
You can access S3 buckets or your object store using the UI. Some of the key features include the following:
- Browsing all authorized buckets
- Downloading files
- Previewing CSV, JSON, and text files
- Uploading new objects
- Navigating through large datasets
- Copying full S3 URIs for direct use in PySpark, Trino, or spark-submit jobs

Keep the following in mind:
- Ranger policies control access to each bucket.
- If you don’t see a bucket or path, you aren’t authorized to access it.
Common use cases include:
- Checking raw vs. processed data
- Inspecting files before running Spark jobs
- Validating daily partitions or incremental loads
Extension manager
You can turn on or turn off JupyterLab extensions, which are UI plugins. Some key sections include:
- WARNING: Alerts for extensions that are incompatible, blocked, or require server-side changes
- INSTALLED: Lists all extensions currently installed and enabled in your Jupyter environment
- DISCOVER: Browse available extensions and request administrator approval to install them

Common use cases include:
- Enabling additional syntax themes
- Enabling development helpers
- Checking installed UI extensions
Jupyter AI chat
You can chat with an AI assistant directly inside JupyterHub. Some of the key features include the following:
- Generating code snippets
- Explaining Python/Spark errors
- Assisting with documentation
- Helping with SQL queries
- Providing inline suggestions while working

Note: The assistant doesn’t access your data unless you paste it into the chat.
Common use cases include:
- Getting quick help or code explanations
- Producing sample transformations
JupyterHub hands-on examples
This section describes several hands-on examples of using JupyterHub.
Jupyter Notebook general usage and features
Jupyter notebooks combine code, output, and documentation in a single document. They’re the primary way to work with Spark, Trino, and Python in JupyterHub.
Create, rename, and organize a Jupyter Notebook
To create a Notebook, see Launching a Jupyter Notebook. Follow these steps to rename a Notebook file:
- In the file browser, click the notebook name.
- Click File at the top left corner of the JupyterHub page, then click Rename Notebook.
- Enter a new name and then click Rename.
To keep your workspace organized, you can also:
- Create folders for projects.
- Drag and drop notebooks between folders.
- Delete obsolete notebooks.
Writing and running code cells
A notebook contains cells. You use cells to write Python, Spark, or Trino client code. To run a cell, press Shift + Enter or Shift + Return on your keyboard. Outputs such as DataFrames, logs, plots, and query results appear directly under the cell. Within a cell, you can freely mix:
- Pure Python logic
- PySpark transformations
- Trino queries
- Utility imports from utils
Markdown support for documentation
You can add Markdown cells to document your work. To do this, you can:
- Change the cell type to Markdown using the toolbar or Esc + M on your keyboard.
- Use headings, bullet points, and code fences to explain:
- Purpose of the notebook
- Input datasets and assumptions
- Key checks and results
Save a Jupyter Notebook
Jupyter automatically saves your Notebook periodically, but you can also manually save it at any time. You can do this in either of the following ways:
- Click File > Save Notebook at the top left corner of the page.
- Use the keyboard shortcut:
- Windows users: Ctrl + S
- macOS users: Command + S
Export a Jupyter Notebook
You can export Notebooks in multiple formats. To do this, click File > Save and Export Notebook As at the top left corner of the page. You can export notebooks as the following file types:
- Executable Script: Python file
- HTML: Static, view-only format
- LaTeX, ReStructuredText, AsciiDoc, Reveal.js Slides, other advanced formats
- Markdown: Raw markdown representation
- Notebook: Default editable format
- PDF: Printable format
To download a notebook in its native .ipynb format, click File > Download at the top left corner of the page.
Share a Jupyter Notebook
JupyterHub currently doesn’t support built-in read-only notebook sharing within the platform. To share notebooks, you can do either of the following:
- Export as .ipynb and share via email or a shared workspace.
- Export as HTML or PDF for static, non-editable viewing.
Jupyter Notebook terminal
You can interact with the terminal to run PySpark, submit a Spark job, or manage a virtual environment.
Launch the terminal
JupyterHub includes an integrated terminal that provides a Linux shell inside your Jupyter workspace. You can open it in either of the following ways:
- From the Launcher tab, click Terminal.
- Click + on the left sidebar to open the Launcher tab, and then click Terminal.
Common uses of the terminal include:
- Inspecting files and directory structures
- Viewing log files and debugging output
- Running Python scripts
- Interacting with pyspark, if available
- Installing Python packages at the user level
- Managing virtual environments
- Moving, copying, and organizing notebook files
- Quick data exploration using shell tools
Check your Spark version
Before submitting jobs, you can verify the Spark distribution installed on the system by running a version command in the terminal, for example spark-submit --version.
Submit a PySpark job with spark-submit
For long-running or heavy batch workloads, submit PySpark jobs to the cluster using
spark-submit rather than using an interactive Notebook.
With spark-submit available on your Spark client/edge node, a typical command looks
like the following:
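The exact values depend on your job and cluster; the following sketch mirrors the options described below, and the script path and S3 locations are hypothetical:

```bash
# Sketch of a typical spark-submit invocation (paths and values are illustrative)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "DailySalesETL" \
  --conf spark.sql.shuffle.partitions=200 \
  /path/to/etl_job.py \
  --input s3a://my-bucket/raw/sales/ \
  --output s3a://my-bucket/processed/sales/
```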
spark-submit options with example usage:
- --master <master-url>: Specifies the cluster manager to connect to. A few examples:
  - --master yarn: Submits the job to the YARN cluster.
  - --master local[2]: Runs Spark locally inside the Jupyter container using 2 CPU threads. Doesn’t use a cluster or YARN.
- --deploy-mode <mode>: Specifies where to run the Spark driver. For example:
  - --deploy-mode cluster: Runs the Spark driver inside the cluster. Recommended for production and long-running ETL jobs.
- --name <app-name>: Sets a human-readable name for the app. For example:
  - --name "DailySalesETL": Appears in the Spark UI, or the YARN UI if running on YARN.
- --conf <key>=<value>: Sets a Spark configuration property for the job. For example:
  - --conf spark.sql.shuffle.partitions=200: Configures shuffle partitions for Spark SQL.
- /path/to/etl_job.py: The PySpark script that contains your ETL logic.
- --input <input-path> / --output <output-path>: Script arguments, typically pointing to raw and processed locations in object storage.
Example script: count_from_s3.py
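A minimal sketch of such a job, assuming a Parquet dataset and a hypothetical bucket path:

```python
# count_from_s3.py - minimal sketch of a PySpark job submitted with spark-submit.
# The S3 path is a hypothetical example; replace it with a location you're
# authorized to read through Ranger.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CountFromS3").getOrCreate()

    # Read the dataset and report its row count.
    df = spark.read.parquet("s3a://my-bucket/raw/events/")
    print(f"Row count: {df.count()}")

    spark.stop()
```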

Sample output from a spark-submit command
Run PySpark
If you installed PySpark in your environment, you can launch it directly in the terminal by running pyspark, or create run_spark_direct.py with the following code:
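A minimal sketch of run_spark_direct.py; the DataFrame contents are illustrative:

```python
# run_spark_direct.py - minimal sketch for running PySpark directly with
# `python run_spark_direct.py` (or pasting into a pyspark shell).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RunSparkDirect").getOrCreate()

# Create a small in-memory DataFrame to confirm the session works.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
```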

An output after running PySpark from the terminal
Install packages for your user
Use the Terminal in Jupyter to install packages for your user account. This makes them available to all notebooks you run, but doesn’t affect other users or system-wide packages. It’s recommended for environment-wide installations across all notebooks within the same user workspace. Open the Terminal in Jupyter and run a user-level pip install (see the sketch after these lists).
- This makes the package available to all notebooks in your workspace.
- Supports installing multiple dependencies at once using requirements.txt.
It doesn’t affect:
- System-level Python
- Preinstalled libraries
- Platform-managed packages
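A minimal sketch of the terminal commands; the package names are examples:

```bash
# Sketch: user-level installation (package names are examples)
pip install --user pandas requests

# Or install several dependencies at once from a requirements file
pip install --user -r requirements.txt
```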
Install an isolated package in the terminal
Users who need isolated dependencies can create a dedicated virtual environment inside their workspace. Follow these steps to install an isolated package (a sketch of the commands follows the list):
- Create a virtual environment.
- Activate the environment in the terminal.
- Install packages inside the venv, for example, pandas.
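A minimal sketch of these steps in the terminal; the environment path is an example:

```bash
# Sketch: create and use an isolated virtual environment in your workspace
python -m venv ~/venvs/myproject        # 1. create the environment
source ~/venvs/myproject/bin/activate   # 2. activate it in the terminal
pip install pandas                      # 3. install packages inside the venv
```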
Using Jupyter Notebook UI
This section focuses on using Jupyter Notebooks to run PySpark and query Trino.
Launch a Jupyter Notebook
You can use either the Launcher or the file browser to launch a Jupyter Notebook. Steps when using the Launcher:
- From the Launcher tab, click Python 3 under Notebook.
- A notebook opens in a new tab with an empty code cell where you can start entering code.
The Python 3 kernel is pre-configured with Spark and Trino client libraries.
Install an isolated package using Jupyter Notebook
Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
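A minimal sketch, assuming the %pip magic is used to install into the environment backing the active kernel (the package name is an example); for fully isolated dependencies, pair this with the virtual environment approach described above:

```python
# Sketch: install a package from a notebook cell into the active kernel's
# environment. The package name is an example.
%pip install pandas
```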
- Virtual environments are optional but recommended for complex dependencies.
- Package installations are persistent across notebook sessions.
- If a package installation fails, restarting the kernel usually resolves path issues.
- If two versions of a library conflict, use a virtual environment to isolate them.
Start a Spark session in Jupyter Notebook
- Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
- Add a new cell and enter the following code to validate the session. Run it using Shift + Enter or Shift + Return on your keyboard.
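A minimal sketch of the two cells, assuming the platform’s pre-configured Spark settings supply the cluster details (the application name is an example):

```python
# Cell 1: start a Spark session. Master and cluster settings come from the
# platform's Spark configuration, so builder options are kept minimal.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JupyterExploration").getOrCreate()
```

```python
# Cell 2: validate the session.
print(spark.version)      # Spark version the session is running
spark.range(5).show()     # tiny DataFrame to confirm execution works
```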

Behind the scenes, the session relies on the following, which are already set up in the environment:
- The Java Virtual Machine (JVM)
- Hadoop and HDFS client libraries
- Py4J, the Python-Java bridge
- Spark environment detection
- Cluster configurations, such as YARN, executors, or memory settings
Connect to Trino from Jupyter Notebook
You can run SQL queries against Trino directly from the same notebook. To do this, create a Trino connection and execute a sample query such as the one in the following code.
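A minimal sketch using the trino Python client; the host, catalog, schema, and query are hypothetical, and authentication details depend on your NexusOne deployment:

```python
# Sketch: query Trino from a notebook using the `trino` Python client.
import trino

conn = trino.dbapi.connect(
    host="trino.client.example.com",  # hypothetical endpoint
    port=443,
    user="your-username",
    http_scheme="https",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
cur.execute("SELECT table_name FROM information_schema.tables LIMIT 10")
for row in cur.fetchall():
    print(row)
```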
Run a Spark SQL command in Jupyter Notebook
You can use the notebook to explore Hive/Iceberg namespaces and run a Spark SQL command.
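A minimal sketch, assuming an active Spark session from the previous section; the namespace and table names are hypothetical:

```python
# Sketch: explore catalogs and run a Spark SQL query from the notebook.
spark.sql("SHOW NAMESPACES").show()
spark.sql("SHOW TABLES IN analytics").show()
spark.sql("SELECT COUNT(*) AS row_count FROM analytics.daily_sales").show()
```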
An output after running a Spark SQL command
Shared folders: utils and dags
In NexusOne, JupyterHub mounts two shared directories inside every user pod: utils and dags.
These folders support code reuse and Airflow integration across the platform.
utils/: Shared utility code
In NexusOne, the utils directory contains shared helper modules that you can import directly
into any notebook or PySpark job. It also provides the following benefits:
- Available as a mount in every Jupyter environment
- A central place for storing reusable Python utilities
- Standardized common logic across teams
- Automatic availability to all users and sessions
dags/: An Airflow DAG directory
In NexusOne, the dags folder is a live mount connected to an Airflow instance.
Any valid .py file placed here becomes an Airflow DAG automatically.
It provides the following benefits:
- Writing DAGs directly from Jupyter
- Automatic syncing of changes to the Airflow scheduler
- Enforcement of unique dag_id values
- Alignment with NexusOne naming conventions and validation rules
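For illustration, a minimal DAG sketch, assuming Airflow 2.x; the dag_id, task, schedule, and file name are hypothetical examples, so follow your team’s naming conventions:

```python
# dags/example_daily_job.py - minimal sketch of a DAG developed in JupyterHub.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_daily_job():
    # Replace with real logic, ideally imported from utils/
    print("Running daily job")


with DAG(
    dag_id="example_daily_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_daily_job", python_callable=run_daily_job)
```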

Executed DAG in Airflow
Keep DAG code clean, modular, and backed by helpers in utils/.
Best practices for users
To maintain a reproducible and performant workflow, follow these recommended JupyterHub best practices:
Use small and modular cells
Separate imports, transformations, visualizations, and outputs. This helps with the following:- Easier debugging and re-running individual steps
- Improved clarity and code reuse.
Manage Spark resources efficiently
To ensure your Spark jobs run reliably and make optimal use of cluster resources, follow these caching best practices (see the sketch after this list):
- Cache only when necessary, using cache() followed by an action such as count().
- Clear the cache with unpersist() once finished.
- Avoid caching large DataFrames unless they’re repeatedly reused.
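A minimal sketch of this pattern; the path and column names are hypothetical:

```python
# Sketch: cache a DataFrame that's reused by several actions, then release it.
df = spark.read.parquet("s3a://my-bucket/processed/sales/")

df.cache()
df.count()                            # action that materializes the cache

df.groupBy("date").count().show()     # both aggregations reuse the cached data
df.groupBy("region").count().show()

df.unpersist()                        # clear the cache once finished
```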
Document workflows
Use Markdown cells to annotate notebooks using the following:- Problem statements.
- Assumptions.
- Data sources, such as S3 paths, SQL queries, and catalog entries.
- Important decisions or validation logic.
Use spark-submit for heavy jobs
For long-running ETL or large datasets, adhere to the following:
- Avoid running inside a notebook
- Use spark-submit from the Jupyter terminal
- This ensures better resource handling and cleaner logs
Keep the code reusable
To keep the code reusable, adhere to the following:
- Move repeated logic into utils/
- Import helpers in notebooks and DAGs
- Avoid duplicating transformation code across projects