This page provides hands-on examples that demonstrate how to use JupyterHub.

Jupyter Notebook general usage and features

Jupyter notebooks combine code, output, and documentation in a single document. They’re the primary way to work with Spark, Trino, and Python in JupyterHub.

Create, rename, and organize a Jupyter Notebook

To create a Notebook, see Launching a Jupyter Notebook. Follow these steps to rename a Notebook file.
  1. In the file browser, double-click the notebook to open it.
  2. Click File at the top left corner of the JupyterHub page, then click Rename Notebook.
  3. Enter a new name, and then click Rename.
To organize Notebooks, you can use the file browser to:
  • Create folders for projects.
  • Drag and drop notebooks between folders.
  • Delete obsolete notebooks.

Writing and running code cells

A notebook contains cells. You use these cells to write Python, Spark, or Trino client code. To run a cell, press Shift + Enter or Shift + Return on your keyboard. Outputs such as DataFrames, logs, plots, and query results appear directly under the cell. Within a cell, you can freely mix:
  • Pure Python logic
  • PySpark transformations
  • Trino queries
  • Utility imports from utils

Markdown support for documentation

You can add Markdown cells to document your work. To do this, you can:
  • Change the cell type to Markdown using the toolbar, or by pressing Esc and then M on your keyboard.
  • Use headings, bullet points, and code fences to explain:
    • Purpose of the notebook
    • Input datasets and assumptions
    • Key checks and results
An example of a Markdown cell:
## Daily Load Check

This notebook validates row counts and key fields for
a table against the expected partition.

Save a Jupyter Notebook

Jupyter automatically saves your Notebook periodically, but you can also manually save it at any time. You can do this in either of the following ways:
  • Click File > Save Notebook at the top left corner of the page.
  • Use the keyboard shortcut:
    • Windows users: Ctrl + S
    • macOS users: Command + S
Jupyter Notebook creates a checkpoint, which is a saved snapshot of your Notebook, each time you manually save it.

Export a Jupyter Notebook

Jupyter Notebook can export notebooks in multiple formats. To do this, click File > Save and Export Notebook As at the top left corner of the page. You can export notebooks as the following file types:
  • Executable Script: Python file
  • HTML: Static, view-only format
  • LaTeX, ReStructuredText, AsciiDoc, Reveal.js Slides, other advanced formats
  • Markdown: Raw markdown representation
  • Notebook: Default editable format
  • PDF: Printable format
To download a notebook in its native .ipynb format, click File > Download at the top left corner of the page.

Share a Jupyter Notebook

JupyterHub currently doesn’t support built-in read-only notebook sharing within the platform. To share notebooks, you can do either of the following:
  • Export as a .ipynb file and share it via email or a shared workspace.
  • Export as HTML or PDF for static, non-editable viewing.

Jupyter Notebook terminal

You can interact with the terminal to run PySpark, submit a Spark job, or manage a virtual environment.

Launch the terminal

JupyterHub includes an integrated terminal that provides a Linux shell inside your Jupyter workspace. You can open it in either of the following ways:
  • From the Launcher tab, click Terminal.
  • Click + on the left sidebar to open the Launcher tab, and then click Terminal.
Either of the previous steps opens a shell session in your user environment, similar to a lightweight VM. The terminal is useful for tasks such as:
  • Inspecting files and directory structures
  • Viewing log files and debugging output
  • Running Python scripts
  • Interacting with pyspark, if available
  • Installing Python packages at the user level
  • Managing virtual environments
  • Moving, copying, and organizing notebook files
  • Quick data exploration using shell tools
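As a quick sketch of a few of these tasks, the following terminal commands create and inspect a small sample file (the directory and file names here are hypothetical):

```shell
# Create a working directory with a small sample CSV
mkdir -p demo
printf 'id,name\n1,Alice\n2,Bob\n' > demo/customers.csv

# Inspect files and directory structure
ls -lh demo

# View the first lines of a data file
head -n 3 demo/customers.csv

# Quick data exploration: count data rows (excluding the header)
tail -n +2 demo/customers.csv | wc -l
```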

Check your Spark version

Before submitting jobs, you can verify the Spark distribution installed on the system by running:
spark-submit --version

Submit a PySpark job with spark-submit

For long-running or heavy batch workloads, submit PySpark jobs to the cluster using spark-submit rather than using an interactive Notebook. With spark-submit available on your Spark client/edge node, a typical command looks like the following:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "DailySalesETL" \
  --conf spark.sql.shuffle.partitions=200 \
  /path/to/etl_job.py \
  --input s3://datalake/raw/sales/ \
  --output s3://datalake/processed/daily_sales/
These are some key spark-submit options with example usage:
  • --master <master-url>: Specifies the cluster manager to connect to. A few examples:
    • --master yarn: Submits the job to the YARN cluster.
    • --master local[2]: Runs Spark locally inside the Jupyter container using 2 CPU threads. Doesn’t use a cluster or YARN.
  • --deploy-mode <mode>: Specifies where to run the Spark driver. For example:
    • --deploy-mode cluster: Runs the Spark driver inside the cluster. Recommended for production and long-running ETL jobs.
  • --name <app-name>: Sets a human-readable name for the app. For example:
    • --name "DailySalesETL": Appears in the Spark UI, or in the YARN UI if running on YARN.
  • --conf <key>=<value>: Sets a Spark configuration property for the job. For example:
    • --conf spark.sql.shuffle.partitions=200: Configures the number of shuffle partitions for Spark SQL.
  • /path/to/etl_job.py: The PySpark script that contains your ETL logic.
  • --input <input-path>/--output <output-path>: Script arguments, typically pointing to raw and processed locations in object storage.
The following example shows how to run a Spark job in local mode, read a CSV file stored in S3, and perform a simple row count. In your JupyterHub workspace, save the following code as count_from_s3.py:
from pyspark.sql import SparkSession

def main():
    spark = (
        SparkSession.builder
        .appName("CountFromS3")
        .getOrCreate()
    )

    # Replace with your S3 path
    input_path = "s3a://rapid-file-tutorial/customers.csv"

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(input_path)
    )

    total = df.count()
    print(f">>>>>>>>Total rows: {total}")

    spark.stop()

if __name__ == "__main__":
    main()
Now open a Jupyter terminal and run:
spark-submit \
  --master local[2] \
  --name "SampleSparkSubmitJob" \
  count_from_s3.py

Sample output from a spark-submit command

Run PySpark

If you installed PySpark in your environment, you can launch it directly in the terminal:
pyspark
This opens an interactive Spark shell using the same configuration as your Jupyter kernel. You may still see standard Spark initialization logs in the output; this is expected. You can also run Spark from a Python file. For example, create a file called run_spark_direct.py with the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TerminalSparkDemo").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
print("Count:", df.count())

spark.stop()
Now run the file directly from the terminal:
python3 run_spark_direct.py

An output after running PySpark from the terminal

Install packages for your user

Use the terminal in Jupyter to install packages for your user account. This makes them available to all notebooks you run, but doesn’t affect other users or system-wide packages. It’s the recommended approach for installations you want available across all notebooks within the same user workspace. Open a terminal in Jupyter and run:
pip install --user numpy
  • The package becomes available to all notebooks in your workspace.
  • You can install multiple dependencies at once using a requirements.txt file (pip install --user -r requirements.txt).
User-level installations go to your per-user site-packages directory, keeping them separate from the shared environment.
This prevents conflicts with:
  • System-level Python
  • Preinstalled libraries
  • Platform-managed packages
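You can confirm the exact locations on your own system with a quick check like this:

```python
import site
import sys

# Base interpreter prefix (the shared environment, e.g. a conda install)
print(sys.prefix)

# Where `pip install --user` places packages for your account
print(site.getusersitepackages())
```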

Install an isolated package in the terminal

Users who need isolated dependencies can create a dedicated virtual environment inside their workspace. Follow these steps to install an isolated package.
  1. Create a virtual environment.
    python -m venv myenv
    
  2. Activate the environment in the terminal.
    source myenv/bin/activate
    
  3. Install packages inside the venv. For example, Pandas.
    pip install pandas
    

Using Jupyter Notebook UI

This section focuses on using Jupyter Notebooks to run PySpark and query Trino.

Launch a Jupyter Notebook

You can use either the Launcher or the File menu to launch a Jupyter Notebook. Steps when using the Launcher:
  1. From the Launcher tab, click Python 3 under Notebook.
  2. A notebook opens in a new tab with an empty code cell, where you can start writing code.
The Python 3 kernel is pre-configured with Spark and Trino client libraries.
When using the File menu, there is only one step: click File > New > Notebook at the top left corner of the page, and then select Python 3 as the kernel.

Install an isolated package using Jupyter Notebook

Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
!python3 -m venv myenv

A venv folder created using Jupyter Notebook
A few things to note:
  • Virtual environments are optional but recommended for complex dependencies.
  • Package installations are persistent across notebook sessions.
  • If a package installation fails, restarting the kernel usually resolves path issues.
  • If two versions of a library conflict, use a virtual environment to isolate them.
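After an installation completes, you can confirm the package is visible to your current kernel with a quick check like the following sketch (numpy here is just an example name):

```python
import importlib.util

def is_importable(package_name: str) -> bool:
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None

# Example: check a package you just installed (replace "numpy" as needed)
print(is_importable("numpy"))
```

If this returns False right after an install, restarting the kernel usually refreshes the search path.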

Start a Spark session in Jupyter Notebook

  1. Enter the following code in a new notebook cell and run it using Shift + Enter or Shift + Return on your keyboard.
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("Example").getOrCreate()
    
  2. Add a new cell and enter the following code to validate the session. Run it using Shift + Enter or Shift + Return on your keyboard.
    spark.range(10).show()
    
You should see output similar to the following.

A range output after starting a Spark session in a Notebook
When starting a Spark session inside Jupyter, you may see multiple information or warning messages printed to the notebook output. These messages are normal, and the following underlying components generate them during Spark initialization:
  • The Java Virtual Machine (JVM)
  • Hadoop and HDFS client libraries
  • Py4J - Python and Java bridge
  • Spark environment detection
  • Cluster configurations, such as YARN, executors, or memory settings
These warnings don’t indicate errors and don’t affect the session’s functionality.

Connect to Trino from Jupyter Notebook

You can run SQL queries against Trino directly from the same notebook. To do this, create a Trino connection and execute a sample query, such as the one in the following code.
import trino

conn = trino.dbapi.connect(
    host='trino-app.alliant.nx1cloud.com',
    port=443,
    user='admin',
    catalog='iceberg',
    schema='default',
    http_scheme='https',
    auth=trino.auth.BasicAuthentication('admin', '<your-password>'),
)

cur = conn.cursor()

cur.execute('show catalogs')
print(cur.fetchall())

cur.execute("set session authorization '<user>'")
print(cur.fetchall())
cur.execute("show catalogs")
print(cur.fetchall())
The code reads data through Trino and returns Python objects that you can explore or convert into DataFrames.
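For example, a small helper (hypothetical, not part of the trino client) can pair the cursor's raw row tuples with column names, yielding dictionaries that are easy to inspect or pass to pandas.DataFrame:

```python
def rows_to_dicts(columns, rows):
    """Pair each row tuple with column names, yielding one dict per row."""
    return [dict(zip(columns, row)) for row in rows]

# With a live cursor you would use:
#   columns = [desc[0] for desc in cur.description]
#   records = rows_to_dicts(columns, cur.fetchall())
# Illustrated here with sample data:
records = rows_to_dicts(["catalog"], [("iceberg",), ("system",)])
print(records)  # → [{'catalog': 'iceberg'}, {'catalog': 'system'}]
```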

A Python object output after connecting to Trino from a Notebook

Run a Spark SQL command in Jupyter Notebook

You can use the notebook to explore Hive/Iceberg namespaces and run a Spark SQL command.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

# List tables in a namespace
spark.sql("SHOW TABLES IN cosmo").show()

An output after running a Spark SQL command

Shared folders: utils and dags

In NexusOne, JupyterHub mounts two shared directories inside every user pod: utils and dags. These folders support code reuse and Airflow integration across the platform.

utils/: Shared utility code

In NexusOne, the utils directory contains shared helper modules that you can import directly into any notebook or PySpark job. It provides the following benefits:
  • Available as a mount in every Jupyter environment
  • A central place for storing reusable Python utilities
  • Standardized common logic across teams
  • Automatic availability to all users and sessions
For example:
from utils.common_helpers import clean_dataframe
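As an illustration only (the real contents of utils are platform-specific, and this function name is hypothetical), a shared helper module might hold small, pure functions like this:

```python
# utils/common_helpers.py (hypothetical module layout)

def normalize_column_names(names):
    """Standardize column names: trim, lowercase, and replace spaces with underscores."""
    return [n.strip().lower().replace(" ", "_") for n in names]

print(normalize_column_names([" Order ID", "Customer Name "]))
# → ['order_id', 'customer_name']
```

Keeping such logic in utils lets every notebook and job apply the same conventions instead of re-implementing them.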

dags/: An Airflow DAG directory

In NexusOne, the dags folder is a live mount connected to an Airflow instance. Any valid .py file placed here automatically becomes an Airflow DAG. The folder provides the following benefits:
  • The ability to write DAGs directly from Jupyter
  • Automatic syncing of changes to the Airflow scheduler
  • Enforcement of unique dag_id values
  • Alignment with NexusOne naming conventions and validation rules
For example, a DAG file location in JupyterHub:
/home/user/dags/asample_dag.py

Executed DAG in Airflow
Keep DAG code clean, modular, and backed by helpers in utils/.

Additional resource