JupyterHub is an interactive development environment for running Python, PySpark, and Trino SQL workloads. It’s designed for data engineers, analysts, and data scientists who need a flexible workspace for exploration, testing, validation, and development of data workflows.

Environments and runtime specification

  • Platform-managed environment with pre-installed packages
  • Python version: 3.x (NexusOne ships with the latest supported version)

Capabilities

JupyterHub offers several core features and supports a range of common use cases.

Key features

  • A browser-based notebook interface that requires no local installation.
  • An integrated Spark session that’s available directly from Python notebooks.
  • Built-in support for Trino SQL queries.
  • Ability to mix documentation, code, and output in a single file.
  • Access to the shared utilities folder (utils/) and the Airflow DAGs folder (dags/).
  • Optional terminal access for scripting and running batch workflows.

Use cases

JupyterHub is often recommended for the following use cases:
  • Building or testing PySpark transformations
  • Data quality validation workflows
  • Rapid prototyping before building an Airflow DAG
  • Small-scale analytics and interactive SQL queries
  • Investigating raw data stored in Iceberg, Hive, and S3
  • Trying out Python libraries, utilities, or custom logic
  • Debugging table issues, lineage, or schema mismatches

Authentication

JupyterHub uses NexusOne’s standard authentication flow:
  • Single Sign-On (SSO)
  • OAuth 2.0 or OpenID Connect
After navigating to the URL, you are automatically redirected to the configured login provider and granted access upon successful authentication.

User access and permissions

NexusOne’s security layer manages access as follows:
  • IAM groups grant access to JupyterHub itself.
  • Ranger policies grant access to datasets, Hive metadata, Iceberg tables, and S3 object storage.
  • Authorized IAM users can view and interact with these data sources and storage locations.

Exploring the JupyterHub UI

This section describes the JupyterHub UI and how to navigate it. Before exploring the UI, launch JupyterHub using the following designated URL:
https://jupyter.<client>.nx1cloud.com/
When you purchase NexusOne, you receive a client name. Replace <client> with your assigned client name.
When you launch JupyterHub using the previous URL, you should see a page similar to the following image.

JupyterHub homepage layout

File browser

As seen in the previous image, the top left panel displays your workspace directory. When you launch a JupyterHub environment in NexusOne, you should see the following folders:
  • dags/
    • This folder is automatically synced with Airflow.
    • Any .py file placed here becomes visible to the Airflow scheduler.
    • Use this location to develop, test, and maintain Airflow DAGs.
    • Don’t store temporary notebooks or test files here.
  • utils/
    • This is a shared utilities folder, mounted into every Jupyter session.
    • Contains common helper functions, shared logic, configuration parsers, and reusable modules.
    • You can import utilities from here directly into notebooks or DAGs:
      from utils.common_helpers import clean_dataframe
      
Beyond the default folders, the file browser has additional features, such as:
  • Creating new folders
  • Uploading or downloading files
  • Managing files by renaming, copying, or deleting them

Main launcher

As seen in the previous image, the center panel displays the main launcher. It lets you create notebooks, terminals, and other file types. The Launcher contains the following sections: Notebook, Console, and Other.

Notebook section

This section displays the available notebook kernels.
  • Python 3 ipykernel: The primary kernel used for the following use cases:
    • Data exploration
    • ETL validation workflows
    • PySpark development
    • Trino queries through Python libraries
  • R Kernel: Used for R-based analysis and visualizations
Clicking one of these icons creates a new notebook associated with the selected kernel.

Console section

The console is an interface that sends commands directly to the Python or R kernel and returns output immediately, without creating notebook cells or storing the interaction as a document. It’s useful for:
  • Quick variable checks
  • Running isolated code without creating notebooks
  • Testing Python or Spark one-liners

Other section

This includes tools and file templates inside JupyterHub.
  • Terminal: Opens a real shell session on the Jupyter server for the following:
    • Navigating directories
    • Running scripts
    • Installing user-level Python packages
  • Text file: Creates an empty file with a .txt extension
  • Markdown file: Creates an empty file with a .md extension. It’s useful for writing documentation or notes.
  • Python file: Creates an empty file with a .py extension
  • R file: Creates an empty file with a .r extension
  • Show contextual help: Opens a help panel with documentation for JupyterLab features

Status bar

As seen in the previous image, the bottom panel provides information about the current status of the environment, such as:
  • Active terminal count
  • CPU usage
  • Current kernel
  • Memory usage
  • Notebook mode
This helps users confirm they’re using the correct kernel and monitor memory during Spark-heavy notebooks.

Left sidebar

The left sidebar, shown at the top left of the previous image, provides quick access to all major tools available in your environment. Each icon represents a specific feature, and understanding these helps you navigate and manage your work efficiently.

File browser

You can view your workspace directory and all the files and folders available in your Jupyter environment. Some of the key features include the following:
  • Access mounted platform folders such as:
    • The Airflow DAGs folder, dags/
    • The shared helper library, utils/
  • Upload and download files
  • Create, rename, delete, or move files
  • Use drag-and-drop to add files from your local machine or to organize files into the appropriate folders within JupyterHub.
Some of the typical uses are for:
  • Creating folders for your projects
  • Editing or placing DAG files
  • Opening notebooks

Running terminals and kernels

You can display all currently running processes, such as:
  • Active kernels
  • Active notebooks
  • Open tabs
  • Open terminals

Running terminals and kernels sidebar
This allows you to stop unused kernels to free up memory, prevent accidental resource consumption, and close runaway Spark sessions or long-running Python cells. Some of the typical uses are for:
  • Restarting a stuck notebook kernel
  • Cleaning up forgotten terminals
  • Monitoring your active background jobs

Table of contents

You can automatically generate a structured outline of your notebook based on Markdown headers. Some of the key features include the following:
  • Navigating large notebooks easily
  • Jumping quickly between sections
  • Maintaining readable notebooks with proper documentation
Some of the typical uses are for:
  • Notebooks with heavy documentation
  • Presentation-style notebook flows

Object storage browser for S3

You can access S3 buckets or your object store using the UI. Some of the key features include the following:
  • Browsing all authorized buckets
  • Downloading files
  • Previewing CSV, JSON, and text files
  • Uploading new objects
  • Navigating through large datasets
  • Copying full S3 URIs for direct use in PySpark, Trino, or spark-submit jobs (see the example at the end of this section)

Object storage browser for S3
Access control:
  • Ranger policies control access to the bucket.
  • If you don’t see a bucket or path, you aren’t authorized to access it.
Some of the typical uses are for:
  • Checking raw vs. processed data
  • Inspecting files before running Spark jobs
  • Validating daily partitions or incremental loads
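For instance, a URI copied from the object storage browser can be pasted straight into a PySpark read. The following is a minimal sketch; the bucket and prefix are placeholders, and it assumes your Spark session is configured with access to that location:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BrowseS3Example").getOrCreate()

# Path copied from the object storage browser (placeholder bucket and prefix)
input_path = "s3a://<bucket>/<prefix>/data.csv"

df = spark.read.option("header", "true").csv(input_path)
df.show(5)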

Extension manager

You can turn on or turn off JupyterLab extensions, which are UI plugins. Some key sections include:
  • WARNING: Alerts for extensions that are incompatible, blocked, or require server-side changes
  • INSTALLED: Lists all extensions currently installed and enabled in your Jupyter environment
  • DISCOVER: Browse discoverable extensions that you can request for administrator approval

Extension manager
Some of the typical uses are for:
  • Enabling additional syntax themes
  • Enabling development helpers
  • Checking installed UI extensions

Jupyter AI chat

You can chat with an AI assistant directly inside JupyterHub. Some of the key features include the following:
  • Generating code snippets
  • Explaining Python/Spark errors
  • Assisting with documentation
  • Helping with SQL queries
  • Providing inline suggestions while working

Jupyter AI chat
It doesn’t access your data unless you paste it.
Some of the typical uses are for:
  • Getting quick help or code explanations
  • Producing sample transformations

JupyterHub hands-on examples

This section describes several hands-on examples of using JupyterHub.

Jupyter Notebook general usage and features

Jupyter notebooks combine code, output, and documentation in a single document. They’re the primary way to work with Spark, Trino, and Python in JupyterHub.

Create, rename, and organize a Jupyter Notebook

To create a notebook, see Launch a Jupyter Notebook. Follow these steps to rename a notebook file.
  1. In the file browser, click the notebook name.
  2. Click File at the top left corner of the JupyterHub page, and then click Rename Notebook.
  3. Enter a new name and then click Rename.
To organize Notebooks, you can use the file browser to:
  • Create folders for projects.
  • Drag and drop notebooks between folders.
  • Delete obsolete notebooks.

Writing and running code cells

A notebook contains cells. You use these cells to write Python, Spark, or Trino client code. To run a cell, press Shift + Enter or Shift + Return on your keyboard. Outputs such as DataFrames, logs, plots, and query results appear directly under the cell. Within a single cell, you can freely mix the following (see the sketch after this list):
  • Pure Python logic
  • PySpark transformations
  • Trino queries
  • Utility imports from utils
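As an illustration, a single cell might combine plain Python, a PySpark transformation, and a shared helper import. This is a minimal sketch; clean_dataframe is the example helper name used elsewhere in this guide, and your actual utilities may differ:
from pyspark.sql import SparkSession
from utils.common_helpers import clean_dataframe  # shared helper (example name)

spark = SparkSession.builder.appName("MixedCellExample").getOrCreate()

# Plain Python logic
threshold = 100

# PySpark transformation
df = spark.createDataFrame([(1, 150), (2, 80)], ["id", "amount"])
filtered = df.filter(df.amount > threshold)

# Shared helper from utils/
cleaned = clean_dataframe(filtered)
cleaned.show()
Trino queries can be mixed into the same cell in the same way; see Connect to Trino from Jupyter Notebook later in this guide.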

Markdown support for documentation

You can add Markdown cells to document your work. To do this, you can:
  • Change the cell type to Markdown using the toolbar, or press Esc and then M on your keyboard.
  • Use headings, bullet points, and code fences to explain:
    • Purpose of the notebook
    • Input datasets and assumptions
    • Key checks and results
An example of a Markdown cell:
## Daily Load Check

This notebook validates row counts and key fields for
a table against the expected partition.

Save a Jupyter Notebook

Jupyter automatically saves your Notebook periodically, but you can also manually save it at any time. You can do this in either of the following ways:
  • Click File > Save Notebook at the top left corner of the page.
  • Use the keyboard shortcut:
    • Windows users: Ctrl + S
    • macOS users: Command + S
Jupyter Notebook creates a checkpoint, which is a saved snapshot of your Notebook, each time you manually save it.

Export a Jupyter Notebook

Jupyter Notebook can export notebooks in multiple formats. To do this, click File > Save and Export Notebook As at the top left corner of the page. You can export notebooks as the following file types:
  • Executable Script: Python file
  • HTML: Static, view-only format
  • LaTeX, reStructuredText, AsciiDoc, Reveal.js Slides, and other advanced formats
  • Markdown: Raw Markdown representation
  • Notebook: Default editable format
  • PDF: Printable format
To download the notebook in its native .ipynb format, click File > Download at the top left corner of the page.

Share a Jupyter Notebook

JupyterHub currently doesn’t support built-in read-only notebook sharing within the platform. To share notebooks, do either of the following:
  • Export as .ipynb and share via email or a shared workspace.
  • Export as HTML or PDF for static, non-editable viewing.

Jupyter Notebook terminal

You can interact with the terminal to run PySpark, submit a Spark job, or manage a virtual environment.

Launch the terminal

JupyterHub includes an integrated terminal that provides a Linux shell inside your Jupyter workspace. You can open it in either of the following ways:
  • From the Launcher tab, click Terminal.
  • Click + on the left sidebar to open the Launcher tab, and then click Terminal.
Either of the previous steps opens a shell session in your user environment, similar to a lightweight VM. The terminal is useful for tasks such as:
  • Inspecting files and directory structures
  • Viewing log files and debugging output
  • Running Python scripts
  • Interacting with pyspark, if available
  • Installing Python packages at the user level
  • Managing virtual environments
  • Moving, copying, and organizing notebook files
  • Quick data exploration using shell tools

Check your Spark version

Before submitting jobs, you can verify the Spark distribution installed on the system by running:
spark-submit --version

Submit a PySpark job with spark-submit

For long-running or heavy batch workloads, submit PySpark jobs to the cluster using spark-submit rather than using an interactive Notebook. With spark-submit available on your Spark client/edge node, a typical command looks like the following:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "DailySalesETL" \
  --conf spark.sql.shuffle.partitions=200 \
  /path/to/etl_job.py \
  --input s3://datalake/raw/sales/ \
  --output s3://datalake/processed/daily_sales/
These are some key spark-submit options with example usage:
  • --master <master-url>: Specifies the cluster manager to connect to. A few examples:
    • --master yarn: Submits the job to the YARN cluster.
    • --master local[2]: Runs Spark locally inside the Jupyter container using 2 CPU threads. Doesn’t use a cluster or YARN.
  • --deploy-mode <mode>: Specifies where to run the Spark driver. For example:
    • --deploy-mode cluster: Runs the Spark driver inside the cluster. Recommended for production and long-running ETL jobs.
  • --name <app-name>: Sets a human-readable name for the app. For example:
    • --name "DailySalesETL": Appears in the Spark UI, or in the YARN UI if running on YARN.
  • --conf <key>=<value>: Sets a Spark configuration property for the job. For example:
    • --conf spark.sql.shuffle.partitions=200: Configures the number of shuffle partitions for Spark SQL.
  • /path/to/etl_job.py: The PySpark script that contains your ETL logic.
  • --input <input-path> / --output <output-path>: Script arguments, typically pointing to raw and processed locations in object storage.
Below is a sample code example of how to run a Spark job in local mode, read a CSV file stored in S3, and perform a simple row count. In your JupyterHub workspace, save the following code in count_from_s3.py:
from pyspark.sql import SparkSession

def main():
    spark = (
        SparkSession.builder
        .appName("CountFromS3")
        .getOrCreate()
    )

    # Replace with your S3 path
    input_path = "s3a://rapid-file-tutorial/customers.csv"

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(input_path)
    )

    total = df.count()
    print(f">>>>>>>>Total rows: {total}")

    spark.stop()

if __name__ == "__main__":
    main()
Now open a Jupyter terminal and run:
spark-submit \
  --master local[2] \
  --name "SampleSparkSubmitJob" \
  count_from_s3.py

Sample output from a spark-submit command

Run PySpark

If you installed PySpark in your environment, you can launch it directly in the terminal:
pyspark
This opens an interactive Spark shell using the same configuration as your Jupyter kernel. You may still see standard Spark initialization logs in the output; this is expected. You can also run Spark through a Python file. For example, create a file called run_spark_direct.py with the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TerminalSparkDemo").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
print("Count:", df.count())

spark.stop()
Now run the file directly from the terminal:
python3 run_spark_direct.py

An output after running PySpark from the terminal

Install packages for your user

Use the terminal in Jupyter to install packages for your user account. This makes them available to all notebooks you run, but doesn’t affect other users or system-wide packages. It’s the recommended approach when a package should be available across all notebooks in the same user workspace. Open a terminal in Jupyter and run:
pip install --user numpy
  • This makes the package available to all notebooks in your workspace.
  • Supports installing multiple dependencies at once using requirements.txt.
User-level installations typically go to your per-user site-packages directory, for example:
~/.local/lib/python3.x/site-packages/
This prevents conflicts with:
  • System-level Python
  • Preinstalled libraries
  • Platform-managed packages
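To confirm exactly where user-level installs land in your environment, you can check Python's user site directory. A minimal check, which can run in a notebook cell or via python3 in the terminal:
import site

# Directory used by pip install --user (typically under your home directory)
print(site.getusersitepackages())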

Install an isolated package in the terminal

Users who need isolated dependencies can create a dedicated virtual environment inside their workspace. Follow these steps to install an isolated package.
  1. Create a virtual environment.
    python -m venv myenv
    
  2. Activate the environment in the terminal.
    source myenv/bin/activate
    
  3. Install packages inside the venv. For example, Pandas.
    pip install pandas
    

Using Jupyter Notebook UI

This section focuses on using Jupyter Notebooks to run PySpark and query Trino.

Launch a Jupyter Notebook

You can use either the Launcher or the File menu to launch a Jupyter Notebook. Steps when using the Launcher:
  1. From the Launcher tab, click Python 3 under Notebook.
  2. A notebook opens in a new tab with an empty code cell. Enter your code in the cell.
The Python 3 kernel is pre-configured with Spark and Trino client libraries.
When using the File menu, there is only one step: click File > New > Notebook at the top left corner of the page, and then select Python 3.

Install an isolated package using Jupyter Notebook

Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
!python3 -m venv myenv

A venv folder created using Jupyter Notebook
A few things to note:
  • Virtual environments are optional but recommended for complex dependencies.
  • Package installations are persistent across notebook sessions.
  • If a package installation fails, restarting the kernel usually resolves path issues.
  • If two versions of a library conflict, use a virtual environment to isolate them.
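Note that a virtual environment created this way isn’t automatically active in the notebook’s kernel. One simple option, sketched below under the assumption that the myenv folder from the previous cell exists in your working directory, is to call the environment’s own pip and python directly from notebook cells:
!myenv/bin/pip install pandas
!myenv/bin/python -c "import pandas; print(pandas.__version__)"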

Start a Spark session in Jupyter Notebook

  1. Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("Example").getOrCreate()
    
  2. Add a new cell and enter the following code to validate the session. Run it using Shift + Enter or Shift + Return on your keyboard.
    spark.range(10).show()
    
You should see output similar to the following.

A range output after starting a Spark session in a Notebook
When starting a Spark session inside Jupyter, you may see multiple information or warning messages printed to the notebook output. These messages are normal, and the following underlying components generate them during Spark initialization:
  • The Java Virtual Machine (JVM)
  • Hadoop and HDFS client libraries
  • Py4J - Python and Java bridge
  • Spark environment detection
  • Cluster configurations, such as YARN, executors, or memory settings
These warnings don’t indicate errors or affect the capability of the session.

Connect to Trino from Jupyter Notebook

You can run SQL queries against Trino directly from the same notebook. To do this, create a Trino Connection and execute a sample query such as the one in the following code.
import trino

conn = trino.dbapi.connect(
    host='trino-app.alliant.nx1cloud.com',
    port=443,
    user='admin',
    catalog='iceberg',
    schema='default',
    http_scheme='https',
    auth=trino.auth.BasicAuthentication('admin', '<your-password>'),
)

cur = conn.cursor()

cur.execute('show catalogs')
print(cur.fetchall())

cur.execute("set session authorization '<user>'")
print(cur.fetchall())
cur.execute("show catalogs")
print(cur.fetchall())
The code reads data through Trino and returns Python objects that you can explore or convert into DataFrames.

A Python object output after connecting to Trino from a Notebook
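To explore the results as a DataFrame, you can load the cursor output into pandas. The following sketch assumes pandas is available and reuses the conn object from the previous code; the query is only an example:
import pandas as pd

cur = conn.cursor()
cur.execute("SHOW CATALOGS")

# Build a DataFrame from the rows and the cursor's column metadata
rows = cur.fetchall()
columns = [col[0] for col in cur.description]
df = pd.DataFrame(rows, columns=columns)
print(df.head())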

Run a Spark SQL command in Jupyter Notebook

You can use the notebook to explore Hive/Iceberg namespaces and run a Spark SQL command.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

# List tables in a namespace
spark.sql("SHOW TABLES IN cosmo").show()

An output after running a Spark SQL command

Shared folders: utils and dags

In NexusOne, JupyterHub mounts two shared directories inside every user pod: utils and dags. These folders support code reuse and Airflow integration across the platform.

utils/: Shared utility code

In NexusOne, the utils directory contains shared helper modules that you can import directly into any notebook or PySpark job. It also provides the following benefits:
  • Available as a mount in every Jupyter environment
  • A central place for storing reusable Python utilities
  • Standardized common logic across teams
  • Automatic availability to all users and sessions
For example:
from utils.common_helpers import clean_dataframe
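The import above uses an example module name. As an illustration of what a shared module in utils/ might contain, here is a hypothetical common_helpers.py sketch; the actual utilities on your platform will differ:
# utils/common_helpers.py (hypothetical example)
from pyspark.sql import DataFrame

def clean_dataframe(df: DataFrame) -> DataFrame:
    """Normalize column names and drop fully empty rows."""
    normalized = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])
    return normalized.dropna(how="all")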

dags/: The Airflow DAG directory

In NexusOne, the dags folder is a live mount connected to an Airflow instance. Any valid .py file placed here automatically becomes an Airflow DAG. It provides the following benefits:
  • Writing DAGs directly from Jupyter
  • Automatic syncing of changes to the Airflow scheduler
  • Enforcement of unique dag_id values
  • Alignment with NexusOne naming conventions and validation rules
For example, a DAG file location in JupyterHub:
/home/user/dags/asample_dag.py

Executed DAG in Airflow
Keep DAG code clean, modular, and backed by helpers in utils/.
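As an illustration, a minimal DAG file placed at a path like the one above might look like the following. This is a sketch assuming Airflow 2.x; the dag_id, schedule, and task are placeholders, and your DAGs must still follow NexusOne naming conventions and validation rules:
# dags/asample_dag.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello from a DAG authored in JupyterHub")

with DAG(
    dag_id="asample_dag",            # must be unique across the platform
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # trigger manually while developing
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )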

Best practices for users

To maintain a reproducible and performant workflow, follow these recommended JupyterHub best practices:

Use small and modular cells

Separate imports, transformations, visualizations, and outputs. This helps with the following:
  • Easier debugging and re-running of individual steps
  • Improved clarity and code reuse

Manage Spark resources efficiently

To ensure your Spark jobs run reliably and make optimal use of cluster resources, follow these caching best practices:
  • Cache only when necessary, using cache() followed by an action such as count() to materialize it:
    df.cache()
    df.count()
    
  • Clear the cache once finished.
    spark.catalog.clearCache()
    
  • Avoid caching large DataFrames unless repeatedly reused.

Document workflows

Use Markdown cells to annotate notebooks using the following:
  • Problem statements.
  • Assumptions.
  • Data sources, such as S3 paths, SQL queries, and catalog entries.
  • Important decisions or validation logic.
Readable notebooks improve handoffs and review cycles.

Use spark-submit for heavy jobs

For long-running ETL or large datasets, adhere to the following:
  • Avoid running them inside a notebook
  • Use spark-submit from the Jupyter terminal instead
  • This ensures better resource handling and cleaner logs
For example:
spark-submit --master local[2] my_job.py

Keep the code reusable

To keep the code reusable, adhere to the following:
  • Move repeated logic into utils/
  • Import helpers in notebooks and DAGs
  • Avoid duplicating transformation code across projects

Additional resource

For more details about JupyterHub, refer to the official JupyterHub documentation.