Use small and modular cells
Separate imports, transformations, visualizations, and outputs into their own cells. This helps with the following:
- Easier debugging and re-running of individual steps
- Improved clarity and code reuse
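As a sketch, a notebook might be organized into cells like this; the cell boundaries are shown as comments, and the sample data and column names are illustrative:

```python
# Cell 1: imports
import json

# Cell 2: load raw data (hypothetical inline sample standing in for a real source)
raw_records = [
    {"city": "Berlin", "temp_c": 21},
    {"city": "Oslo", "temp_c": 14},
]

# Cell 3: transformation -- derive a Fahrenheit column from the Celsius one
transformed = [
    {**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in raw_records
]

# Cell 4: output -- serialize the result
print(json.dumps(transformed, indent=2))
```

Because each step lives in its own cell, the transformation in Cell 3 can be fixed and re-run without reloading the data in Cell 2.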
Manage Spark resources efficiently
To ensure your Spark jobs run reliably and make optimal use of cluster resources, follow these caching best practices:
- Cache only when necessary, using `cache()` (a subsequent action such as `count()` materializes the cache).
- Clear the cache once finished.
- Avoid caching large DataFrames unless they are repeatedly reused.
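A minimal sketch of this caching lifecycle in PySpark, assuming a live SparkSession named `spark`; the source path and column name are illustrative:

```python
# Sketch only: assumes an existing SparkSession `spark` and a hypothetical path.
df = spark.read.parquet("s3://example-bucket/events/")

# Cache only because df is reused by several actions below.
df.cache()
df.count()  # an action that materializes the cache eagerly

daily = df.groupBy("event_date").count()
daily.show()

# Clear the cache once finished so executors can free the memory.
df.unpersist()
```

Without the `unpersist()` call at the end, the cached partitions occupy executor memory for the lifetime of the session.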
Document workflows
Use Markdown cells to annotate notebooks with the following:
- Problem statements
- Assumptions
- Data sources, such as S3 paths, SQL queries, and catalog entries
- Important decisions or validation logic
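For example, a notebook's first Markdown cell might follow a template like this; every name and path below is a placeholder:

```markdown
## Daily sales aggregation

**Problem:** Aggregate raw sales events into daily totals per region.

**Assumptions:** Events arrive at most 24 h late; duplicates are removed upstream.

**Data sources:** `s3://example-bucket/sales/raw/` (hypothetical path)

**Validation:** Row counts are compared against the previous run before writing.
```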
Use spark-submit for heavy jobs
For long-running ETL jobs or large datasets, adhere to the following:
- Avoid running the job inside a notebook.
- Use `spark-submit` from the Jupyter terminal instead.
- This ensures better resource handling and cleaner logs.
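A sketch of such a submission from the Jupyter terminal; the cluster manager, resource sizes, and script path are placeholders for your environment:

```shell
# Placeholder cluster manager, resource sizes, and script path.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 8g \
  --name daily-etl \
  jobs/daily_etl.py
```

In cluster deploy mode the driver runs on the cluster rather than in your Jupyter session, so the job survives a closed terminal and its logs land with the cluster manager instead of the notebook.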
Keep the code reusable
To keep the code reusable, adhere to the following:
- Move repeated logic into `utils/`.
- Import the helpers in notebooks and DAGs.
- Avoid duplicating transformation code across projects.
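As a sketch, repeated logic could live in a `utils/` module and be imported from both notebooks and DAGs; the module and function names here are hypothetical:

```python
# utils/cleaning.py (hypothetical module; shown inline for a self-contained example)
def normalize_columns(columns):
    """Lower-case column names and replace spaces with underscores."""
    return [c.strip().lower().replace(" ", "_") for c in columns]


# In a notebook or DAG you would instead write:
#   from utils.cleaning import normalize_columns
print(normalize_columns(["Order ID", " Customer Name "]))
```

Keeping the helper in one module means a fix to the naming rule propagates to every notebook and DAG that imports it, instead of being patched in several copies.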
Additional resources
- To get an overview of JupyterHub, refer to the JupyterHub in NexusOne page.
- To learn practical ways to use JupyterHub in the NexusOne environment, refer to the JupyterHub hands-on examples page.
- For more details about JupyterHub, refer to the JupyterHub official documentation.