This page describes best practices for using JupyterHub to maintain a reproducible and performant workflow.

Use small and modular cells

Separate imports, transformations, visualizations, and outputs. This helps with the following:
  • Easier debugging and re-running of individual steps
  • Improved clarity and code reuse
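
The split above can be sketched as follows. This is a hypothetical example with stand-in data; the cell markers show where the notebook cell boundaries would fall.

```python
# Cell 1: imports only
import statistics

# Cell 2: load or define the data (a stand-in list instead of a real source)
raw = [3, 1, 4, 1, 5, 9, 2, 6]

# Cell 3: transformation, kept separate so it can be re-run on its own
cleaned = sorted(x for x in raw if x > 1)

# Cell 4: output / summary
print("mean:", statistics.mean(cleaned))
```

Because each cell does one thing, a failed transformation can be fixed and re-run without repeating the import or load steps.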

Manage Spark resources efficiently

To ensure your Spark jobs run reliably and make optimal use of cluster resources, follow these caching best practices:
  • Cache only when necessary, using cache() followed by an action such as count():
    df.cache()  # marks the DataFrame for caching (lazy)
    df.count()  # action that materializes the cache
    
  • Clear the cache once finished:
    spark.catalog.clearCache()
    
  • Avoid caching large DataFrames unless repeatedly reused.

Document workflows

Use Markdown cells to annotate notebooks with the following:
  • Problem statements.
  • Assumptions.
  • Data sources, such as S3 paths, SQL queries, and catalog entries.
  • Important decisions or validation logic.
Readable notebooks improve handoffs and review cycles.

Use spark-submit for heavy jobs

For long-running ETL or large datasets, adhere to the following:
  • Avoid running the job inside a notebook
  • Use spark-submit from the Jupyter terminal instead
This ensures better resource handling and cleaner logs.
For example:
spark-submit --master local[2] my_job.py

Keep the code reusable

To keep the code reusable, adhere to the following:
  • Move repeated logic into utils/
  • Import helpers in notebooks and DAGs
  • Avoid duplicating transformation code across projects
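
A hypothetical example of the layout above: a helper lives in utils/ and is imported everywhere it is needed instead of being copy-pasted. The module and function names are illustrative.

```python
# utils/transforms.py
def normalize(values):
    """Scale values into [0, 1]; shared by notebooks and DAGs."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# In a notebook or Airflow DAG:
# from utils.transforms import normalize
scaled = normalize([10, 20, 30])
```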

Additional resource