JupyterHub is an interactive development environment for running Python, PySpark, and Trino SQL workloads. It’s designed for data engineers, analysts, and data scientists who need a flexible workspace for exploration, testing, validation, and development of data workflows.

Environments and runtime specification

  • Platform-managed environment with pre-installed packages
  • Python version: 3.x (NexusOne ships with the latest supported version)

Capabilities

JupyterHub offers several core features and supports a range of common use cases.

Key features

  • A browser-based notebook interface that requires no local installation.
  • An integrated Spark session that’s available directly from Python notebooks.
  • Built-in support for Trino SQL queries.
  • Ability to mix documentation, code, and output in a single file.
  • Access to the shared utilities folder (utils/) and the Airflow DAGs folder (dags/).
  • Optional terminal access for scripting and running batch workflows.

Use cases

JupyterHub is often recommended for the following use cases:
  • Building or testing PySpark transformations
  • Data quality validation workflows
  • Rapid prototyping before building an Airflow DAG
  • Small-scale analytics and interactive SQL queries
  • Investigating raw data stored in Iceberg, Hive, and S3
  • Trying out Python libraries, utilities, or custom logic
  • Debugging table issues, lineage, or schema mismatches

Authentication

JupyterHub uses NexusOne’s standard authentication flow:
  • Single Sign-On (SSO)
  • OAuth 2.0 or OpenID Connect
After navigating to the URL, you are automatically redirected to the configured login provider and granted access upon successful authentication.

User access and permissions

NexusOne’s security layer manages access as follows:
  • IAM groups grant access to JupyterHub itself.
  • Ranger policies grant access to datasets, Hive metadata, Iceberg tables, and S3 object storage.
  • Authorized IAM users can view and interact with these data sources and storage locations.

Exploring the JupyterHub UI

This section describes the JupyterHub UI and how to navigate it. Before exploring the UI, launch JupyterHub using the following designated URL:
https://jupyter.<client>.nx1cloud.com/
When you purchase NexusOne, you receive a client name. Replace <client> with your assigned client name.
When you launch JupyterHub using the previous URL, you should see a page similar to the following image.

JupyterHub homepage layout

File browser

As seen in the previous image, the top left panel displays your workspace directory. When you launch a JupyterHub environment in NexusOne, you should see the following folders:
  • dags/
    • This folder is automatically synced with Airflow.
    • Any .py file placed here becomes visible to the Airflow scheduler.
    • Use this location to develop, test, and maintain Airflow DAGs.
    • Don’t store temporary notebooks or test files here.
  • utils/
    • This is a shared utilities folder, mounted into every Jupyter session.
    • Contains common helper functions, shared logic, configuration parsers, and reusable modules.
    • You can import utilities from here directly into notebooks or DAGs:
      from utils.common_helpers import clean_dataframe
      
Beyond the default folders, the file browser has additional features, such as:
  • Creating new folders
  • Uploading or downloading files
  • Managing files by renaming, copying, or deleting them

Main launcher

As seen in the previous image, the center panel displays the main launcher. It lets you create notebooks, terminals, and other file types. The Launcher contains the following sections: Notebook, Console, and Other.

Notebook section

This section displays the available notebook kernels.
  • Python 3 ipykernel: The primary kernel used for the following use cases:
    • Data exploration
    • ETL validation workflows
    • PySpark development
    • Trino queries through Python libraries
  • R Kernel: Used for R-based analysis and visualizations
Clicking one of these icons creates a new notebook associated with the selected kernel.

Console section

The console is an interface that sends commands directly to the Python or R kernel and returns output immediately, without creating notebook cells or storing the interaction as a document. It’s useful for:
  • Quick variable checks
  • Running isolated code without creating notebooks
  • Testing Python or Spark one-liners

Other section

This includes tools and file templates inside JupyterHub.
  • Terminal: Opens a real shell session on the Jupyter server for the following:
    • Navigating directories
    • Running scripts
    • Installing user-level Python packages
  • Text file: Creates an empty file with a .txt extension
  • Markdown file: Creates an empty file with a .md extension. It’s useful for writing documentation or notes.
  • Python file: Creates an empty file with a .py extension
  • R file: Creates an empty file with a .r extension
  • Show contextual help: Opens a help panel with documentation for JupyterLab features

Status bar

As seen in the previous image, the bottom panel provides information about the current status of the environment, such as:
  • Active terminal count
  • CPU usage
  • Current kernel
  • Memory usage
  • Notebook mode
This helps users confirm they’re using the correct kernel and monitor memory during Spark-heavy notebooks.

Left sidebar

The left sidebar, shown at the top left of the previous image, provides quick access to all major tools available in your environment. Each icon represents a specific feature, and understanding these helps you navigate and manage your work efficiently.

File browser

You can view your workspace directory and all the files and folders available in your Jupyter environment. Some of the key features include the following:
  • Access mounted platform folders such as:
    • The Airflow DAGs folder, dags/
    • The shared helper library, utils/
  • Upload and download files
  • Create, rename, delete, or move files
  • Use drag-and-drop to add files from your local machine or to organize files into the appropriate folders within JupyterHub.
Some of the typical uses are for:
  • Creating folders for your projects
  • Editing or placing DAG files
  • Opening notebooks

Running terminals and kernels

You can display all currently running processes, such as:
  • Active kernels
  • Active notebooks
  • Open tabs
  • Open terminals

Running terminals and kernels sidebar
This allows you to stop unused kernels to free up memory, prevent accidental resource consumption, and close runaway Spark sessions or long-running Python cells. Some of the typical uses are for:
  • Restarting a stuck notebook kernel
  • Cleaning up forgotten terminals
  • Monitoring your active background jobs

Table of contents

You can automatically generate a structured outline of your notebook based on Markdown headers. Some of the key features include the following:
  • Navigating large notebooks easily
  • Jumping quickly between sections
  • Maintaining readable notebooks with proper documentation
Some of the typical uses are for:
  • Notebooks with heavy documentation
  • Presentation-style notebook flows

Object storage browser for S3

You can access S3 buckets or your object store using the UI. Some of the key features include the following:
  • Browsing all authorized buckets
  • Downloading files
  • Previewing CSV, JSON, and text files
  • Uploading new objects
  • Navigating through large datasets
  • Copying full S3 URIs for direct use in PySpark, Trino, or spark-submit jobs (see the example at the end of this section)

Object storage browser for S3
Access control:
  • Ranger policies control access to the bucket.
  • If you don’t see a bucket or path, you aren’t authorized to access it.
Some of the typical uses are for:
  • Checking raw vs. processed data
  • Inspecting files before running Spark jobs
  • Validating daily partitions or incremental loads
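For instance, a URI copied from the object storage browser can be pasted straight into a PySpark read. The following is a minimal sketch; the bucket and prefix are placeholders, and it assumes your Spark session is configured with access to that location:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BrowseS3Example").getOrCreate()

# Path copied from the object storage browser (placeholder bucket and prefix)
input_path = "s3a://<bucket>/<prefix>/data.csv"

df = spark.read.option("header", "true").csv(input_path)
df.show(5)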

Extension manager

You can turn on or turn off JupyterLab extensions, which are UI plugins. Some key sections include:
  • WARNING: Alerts for extensions that are incompatible, blocked, or require server-side changes
  • INSTALLED: Lists all extensions currently installed and enabled in your Jupyter environment
  • DISCOVER: Browse discoverable extensions that you can request for administrator approval

Extension manager
Some of the typical uses are for:
  • Enabling additional syntax themes
  • Enabling development helpers
  • Checking installed UI extensions

Jupyter AI chat

You can chat with an AI assistant directly inside JupyterHub. Some of the key features include the following:
  • Generating code snippets
  • Explaining Python/Spark errors
  • Assisting with documentation
  • Helping with SQL queries
  • Providing inline suggestions while working

Jupyter AI chat
It doesn’t access your data unless you paste it.
Some of the typical uses are for:
  • Getting quick help or code explanations
  • Producing sample transformations

JupyterHub hands-on examples

This section describes several hands-on examples of using JupyterHub.

Jupyter Notebook general usage and features

Jupyter notebooks combine code, output, and documentation in a single document. They’re the primary way to work with Spark, Trino, and Python in JupyterHub.

Create, rename, and organize a Jupyter Notebook

To create a notebook, see Launch a Jupyter Notebook. Follow these steps to rename a notebook file.
  1. In the file browser, click the notebook name.
  2. Click File at the top left corner of the JupyterHub page, and then click Rename Notebook.
  3. Enter a new name and then click Rename.
To organize Notebooks, you can use the file browser to:
  • Create folders for projects.
  • Drag and drop notebooks between folders.
  • Delete obsolete notebooks.

Writing and running code cells

A notebook contains cells. You use these cells to write Python, Spark, or Trino client code. To run a cell, press Shift + Enter or Shift + Return on your keyboard. Outputs such as DataFrames, logs, plots, and query results appear directly under the cell. Within a single cell, you can freely mix the following (see the sketch after this list):
  • Pure Python logic
  • PySpark transformations
  • Trino queries
  • Utility imports from utils
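As an illustration, a single cell might combine plain Python, a PySpark transformation, and a shared helper import. This is a minimal sketch; clean_dataframe is the example helper name used elsewhere in this guide, and your actual utilities may differ:
from pyspark.sql import SparkSession
from utils.common_helpers import clean_dataframe  # shared helper (example name)

spark = SparkSession.builder.appName("MixedCellExample").getOrCreate()

# Plain Python logic
threshold = 100

# PySpark transformation
df = spark.createDataFrame([(1, 150), (2, 80)], ["id", "amount"])
filtered = df.filter(df.amount > threshold)

# Shared helper from utils/
cleaned = clean_dataframe(filtered)
cleaned.show()
Trino queries can be mixed into the same cell in the same way; see Connect to Trino from Jupyter Notebook later in this guide.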

Markdown support for documentation

You can add Markdown cells to document your work. To do this, you can:
  • Change the cell type to Markdown using the toolbar, or press Esc and then M on your keyboard.
  • Use headings, bullet points, and code fences to explain:
    • Purpose of the notebook
    • Input datasets and assumptions
    • Key checks and results
An example of a Markdown cell:
## Daily Load Check

This notebook validates row counts and key fields for
a table against the expected partition.

Save a Jupyter Notebook

Jupyter automatically saves your Notebook periodically, but you can also manually save it at any time. You can do this in either of the following ways:
  • Click File > Save Notebook at the top left corner of the page.
  • Use the keyboard shortcut:
    • Windows users: Ctrl + S
    • macOS users: Command + S
Jupyter Notebook creates a checkpoint, which is a saved snapshot of your Notebook, each time you manually save it.

Export a Jupyter Notebook

Jupyter Notebook can export notebooks in multiple formats. To do this, click File > Save and Export Notebook As at the top left corner of the page. You can export notebooks as the following file types:
  • Executable Script: Python file
  • HTML: Static, view-only format
  • LaTeX, reStructuredText, AsciiDoc, Reveal.js Slides, and other advanced formats
  • Markdown: Raw Markdown representation
  • Notebook: Default editable format
  • PDF: Printable format
To download the notebook in its native .ipynb format, click File > Download at the top left corner of the page.

Share a Jupyter Notebook

JupyterHub currently doesn’t support built-in read-only notebook sharing within the platform. To share notebooks, do either of the following:
  • Export as .ipynb and share via email or a shared workspace.
  • Export as HTML or PDF for static, non-editable viewing.

Jupyter Notebook terminal

You can interact with the terminal to run PySpark, submit a Spark job, or manage a virtual environment.

Launch the terminal

JupyterHub includes an integrated terminal that provides a Linux shell inside your Jupyter workspace. You can open it in either of the following ways:
  • From the Launcher tab, click Terminal.
  • Click + on the left sidebar to open the Launcher tab, and then click Terminal.
Either of the previous steps opens a shell session in your user environment, similar to a lightweight VM. The terminal is useful for tasks such as:
  • Inspecting files and directory structures
  • Viewing log files and debugging output
  • Running Python scripts
  • Interacting with pyspark, if available
  • Installing Python packages at the user level
  • Managing virtual environments
  • Moving, copying, and organizing notebook files
  • Quick data exploration using shell tools

Check your Spark version

Before submitting jobs, you can verify the Spark distribution installed on the system by running:
spark-submit --version

Submit a PySpark job with spark-submit

For long-running or heavy batch workloads, submit PySpark jobs to the cluster using spark-submit rather than using an interactive Notebook. With spark-submit available on your Spark client/edge node, a typical command looks like the following:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "DailySalesETL" \
  --conf spark.sql.shuffle.partitions=200 \
  /path/to/etl_job.py \
  --input s3://datalake/raw/sales/ \
  --output s3://datalake/processed/daily_sales/
These are some key spark-submit options with example usage:
  • --master <master-url>: Specifies the cluster manager to connect to. A few examples:
    • --master yarn: Submits the job to the YARN cluster.
    • --master local[2]: Runs Spark locally inside the Jupyter container using 2 CPU threads. Doesn’t use a cluster or YARN.
  • --deploy-mode <mode>: Specifies where to run the Spark driver. For example:
    • --deploy-mode cluster: Runs the Spark driver inside the cluster. Recommended for production and long-running ETL jobs.
  • --name <app-name>: Sets a human-readable name for the app. For example:
    • --name "DailySalesETL": Appears in the Spark UI, or in the YARN UI if running on YARN.
  • --conf <key>=<value>: Sets a Spark configuration property for the job. For example:
    • --conf spark.sql.shuffle.partitions=200: Configures the number of shuffle partitions for Spark SQL.
  • /path/to/etl_job.py: The PySpark script that contains your ETL logic.
  • --input <input-path> / --output <output-path>: Script arguments, typically pointing to raw and processed locations in object storage.
Below is a sample code example of how to run a Spark job in local mode, read a CSV file stored in S3, and perform a simple row count. In your JupyterHub workspace, save the following code in count_from_s3.py:
from pyspark.sql import SparkSession

def main():
    spark = (
        SparkSession.builder
        .appName("CountFromS3")
        .getOrCreate()
    )

    # Replace with your S3 path
    input_path = "s3a://rapid-file-tutorial/customers.csv"

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(input_path)
    )

    total = df.count()
    print(f">>>>>>>>Total rows: {total}")

    spark.stop()

if __name__ == "__main__":
    main()
Now open a Jupyter terminal and run:
spark-submit \
  --master local[2] \
  --name "SampleSparkSubmitJob" \
  count_from_s3.py

Sample output from a spark-submit command

Run PySpark

If you installed PySpark in your environment, you can launch it directly in the terminal:
pyspark
This opens an interactive Spark shell using the same configuration as your Jupyter kernel. You may still see standard Spark initialization logs in the output; this is expected. You can also run Spark through a Python file. For example, create a file called run_spark_direct.py with the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TerminalSparkDemo").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
print("Count:", df.count())

spark.stop()
Now run the file directly from the terminal:
python3 run_spark_direct.py

An output after running PySpark from the terminal

Install packages for your user

Use the terminal in Jupyter to install packages for your user account. This makes them available to all notebooks you run, but doesn’t affect other users or system-wide packages. It’s the recommended approach when a package should be available across all notebooks in the same user workspace. Open a terminal in Jupyter and run:
pip install --user numpy
  • This makes the package available to all notebooks in your workspace.
  • Supports installing multiple dependencies at once using requirements.txt.
User-level installations typically go to your per-user site-packages directory, for example:
~/.local/lib/python3.x/site-packages/
This prevents conflicts with:
  • System-level Python
  • Preinstalled libraries
  • Platform-managed packages
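To confirm exactly where user-level installs land in your environment, you can check Python's user site directory. A minimal check, which can run in a notebook cell or via python3 in the terminal:
import site

# Directory used by pip install --user (typically under your home directory)
print(site.getusersitepackages())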

Install an isolated package in the terminal

Users who need isolated dependencies can create a dedicated virtual environment inside their workspace. Follow these steps to install an isolated package.
  1. Create a virtual environment.
    python -m venv myenv
    
  2. Activate the environment in the terminal.
    source myenv/bin/activate
    
  3. Install packages inside the venv. For example, Pandas.
    pip install pandas
    

Using Jupyter Notebook UI

This section focuses on using Jupyter Notebooks to run PySpark and query Trino.

Launch a Jupyter Notebook

You can use either the Launcher or the File menu to launch a Jupyter Notebook. Steps when using the Launcher:
  1. From the Launcher tab, click Python 3 under Notebook.
  2. A notebook opens in a new tab with an empty code cell. Enter your code in the cell.
The Python 3 kernel is pre-configured with Spark and Trino client libraries.
When using the File menu, there is only one step: click File > New > Notebook at the top left corner of the page, and then select Python 3.

Install an isolated package using Jupyter Notebook

Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
!python3 -m venv myenv

A venv folder created using Jupyter Notebook
A few things to note:
  • Virtual environments are optional but recommended for complex dependencies.
  • Package installations are persistent across notebook sessions.
  • If a package installation fails, restarting the kernel usually resolves path issues.
  • If two versions of a library conflict, use a virtual environment to isolate them.
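Note that a virtual environment created this way isn’t automatically active in the notebook’s kernel. One simple option, sketched below under the assumption that the myenv folder from the previous cell exists in your working directory, is to call the environment’s own pip and python directly from notebook cells:
!myenv/bin/pip install pandas
!myenv/bin/python -c "import pandas; print(pandas.__version__)"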

Start a Spark session in Jupyter Notebook

  1. Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("Example").getOrCreate()
    
  2. Add a new cell and enter the following code to validate the session. Run it using Shift + Enter or Shift + Return on your keyboard.
    spark.range(10).show()
    
You should see output similar to the following.

A range output after starting a Spark session in a Notebook
When starting a Spark session inside Jupyter, you may see multiple information or warning messages printed to the notebook output. These messages are normal, and the following underlying components generate them during Spark initialization:
  • The Java Virtual Machine (JVM)
  • Hadoop and HDFS client libraries
  • Py4J - Python and Java bridge
  • Spark environment detection
  • Cluster configurations, such as YARN, executors, or memory settings
These warnings don’t indicate errors or affect the capability of the session.

Connect to Trino from Jupyter Notebook

You can run SQL queries against Trino directly from the same notebook. To do this, create a Trino Connection and execute a sample query such as the one in the following code.
import trino

conn = trino.dbapi.connect(
    host='trino-app.alliant.nx1cloud.com',
    port=443,
    user='admin',
    catalog='iceberg',
    schema='default',
    http_scheme='https',
    auth=trino.auth.BasicAuthentication('admin', '<your-password>'),
)

cur = conn.cursor()

cur.execute('show catalogs')
print(cur.fetchall())

cur.execute("set session authorization '<user>'")
print(cur.fetchall())
cur.execute("show catalogs")
print(cur.fetchall())
The code reads data through Trino and returns Python objects that you can explore or convert into DataFrames.

A Python object output after connecting to Trino from a Notebook
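To explore the results as a DataFrame, you can load the cursor output into pandas. The following sketch assumes pandas is available and reuses the conn object from the previous code; the query is only an example:
import pandas as pd

cur = conn.cursor()
cur.execute("SHOW CATALOGS")

# Build a DataFrame from the rows and the cursor's column metadata
rows = cur.fetchall()
columns = [col[0] for col in cur.description]
df = pd.DataFrame(rows, columns=columns)
print(df.head())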

Run a Spark SQL command in Jupyter Notebook

You can use the notebook to explore Hive/Iceberg namespaces and run a Spark SQL command.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

# List tables in a namespace
spark.sql("SHOW TABLES IN cosmo").show()

An output after running a Spark SQL command

Shared folders: utils and dags

In NexusOne, JupyterHub mounts two shared directories inside every user pod: utils and dags. These folders support code reuse and Airflow integration across the platform.

utils/: Shared utility code

In NexusOne, the utils directory contains shared helper modules that you can import directly into any notebook or PySpark job. It also provides the following benefits:
  • Available as a mount in every Jupyter environment
  • A central place for storing reusable Python utilities
  • Standardized common logic across teams
  • Automatic availability to all users and sessions
For example:
from utils.common_helpers import clean_dataframe
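The import above uses an example module name. As an illustration of what a shared module in utils/ might contain, here is a hypothetical common_helpers.py sketch; the actual utilities on your platform will differ:
# utils/common_helpers.py (hypothetical example)
from pyspark.sql import DataFrame

def clean_dataframe(df: DataFrame) -> DataFrame:
    """Normalize column names and drop fully empty rows."""
    normalized = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])
    return normalized.dropna(how="all")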

dags/: The Airflow DAG directory

In NexusOne, the dags folder is a live mount connected to an Airflow instance. Any valid .py file placed here automatically becomes an Airflow DAG. It provides the following benefits:
  • Writing DAGs directly from Jupyter
  • Automatic syncing of changes to the Airflow scheduler
  • Enforcement of unique dag_id values
  • Alignment with NexusOne naming conventions and validation rules
For example, a DAG file location in JupyterHub:
/home/user/dags/asample_dag.py

Executed DAG in Airflow
Keep DAG code clean, modular, and backed by helpers in utils/.
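As an illustration, a minimal DAG file placed at a path like the one above might look like the following. This is a sketch assuming Airflow 2.x; the dag_id, schedule, and task are placeholders, and your DAGs must still follow NexusOne naming conventions and validation rules:
# dags/asample_dag.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello from a DAG authored in JupyterHub")

with DAG(
    dag_id="asample_dag",            # must be unique across the platform
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # trigger manually while developing
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )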

Best practices for users

To maintain a reproducible and performant workflow, follow these recommended JupyterHub best practices:

Use small and modular cells

Separate imports, transformations, visualizations, and outputs. This helps with the following:
  • Easier debugging and re-running of individual steps
  • Improved clarity and code reuse

Manage Spark resources efficiently

To ensure your Spark jobs run reliably and make optimal use of cluster resources, follow these caching best practices:
  • Cache only when necessary, using cache() followed by an action such as count() to materialize it:
    df.cache()
    df.count()
    
  • Clear the cache once finished.
    spark.catalog.clearCache()
    
  • Avoid caching large DataFrames unless repeatedly reused.

Document workflows

Use Markdown cells to annotate notebooks using the following:
  • Problem statements.
  • Assumptions.
  • Data sources, such as S3 paths, SQL queries, and catalog entries.
  • Important decisions or validation logic.
Readable notebooks improve handoffs and review cycles.

Use spark-submit for heavy jobs

For long-running ETL or large datasets, adhere to the following:
  • Avoid running them inside a notebook
  • Use spark-submit from the Jupyter terminal instead
  • This ensures better resource handling and cleaner logs
For example:
spark-submit --master local[2] my_job.py

Keep the code reusable

To keep the code reusable, adhere to the following:
  • Move repeated logic into utils/
  • Import helpers in notebooks and DAGs
  • Avoid duplicating transformation code across projects

Additional resource

For more details about JupyterHub, refer to the official JupyterHub documentation.