Introduction
There are not many things that can dramatically improve your code quality overnight, but a `.pre-commit-config.yaml` file is one of them! Read this post to find out why a `.pre-commit-config.yaml` file is the first file I create for my data science projects. See the bottom of this post for an example `.pre-commit-config.yaml` file I use as a template for my data science projects.
Working collaboratively with others on data science codebases is challenging.
Particularly if, like me, you are not from a software engineering background and are finding your way around collaboration and code quality tools like `git`, `pylint` and `pytest`.
When working with others on data science projects, it is not enough that your code ‘works’ or that you can produce beautiful data visualisations and performant ML models.
Your code needs to be readable and reproducible if you want to be efficient and complete projects with lasting impact.
The Challenge
When developing collaboratively, we typically use git for version control and commit any changes to a feature branch that is reviewed before merging into the main codebase.
The problem is - nothing stops us from committing low-quality code to the git repository.
Low-quality code could include:
- code that is not PEP 8 compliant
- unused imports
- variables referenced before assignment
- code that fails tests
While you might have other tools like a CI/CD pipeline to run these checks, the CI/CD pipeline only runs after the bad code has been committed to the repository.
In data science projects, there is an even bigger risk of committing buggy code to the repository when using Jupyter notebooks. Jupyter notebooks are great for prototyping models and visualisations; however, they are notorious for breeding bad coding habits, and it is difficult to spot bugs across many different notebook cells.
So what steps can we take to improve the quality of code, enforce consistent code formatting and catch these errors before even adding a commit with bad code to the git history?
The Solution
Git hooks using pre-commit – a framework for managing multi-language pre-commit hooks.
Pre-commit
Pre-commit is a multi-language package manager for pre-commit hooks. You specify a list of checks (hooks) in a configuration file which will be automatically executed when the `git commit` command is called. If any of these checks fail, the commit will be aborted, allowing you to fix the errors before the code is committed to the repository's history.
The pre-commit package manages the installation of any required dependencies for your hooks and can even auto-fix any errors (e.g. code formatting) after running the scripts.
What are git hooks?
Git hooks are scripts that run automatically every time a commit request is made to a Git repository.
Why are git hooks useful?
For many projects you work on professionally, there may (should) already be a CI/CD pipeline to check the code quality which runs after committing the new code to the repository. However, it can be very useful to also specify a standard set of checks to run before committing the code.
Some benefits of git hooks include:
- 🛑 Identify broken/bad code immediately: stop bad code being added to the commit history altogether and catch errors before running time-consuming CI/CD pipelines
- 🎯 Improve consistency: ensure all code committed to the repo (including notebook code) is PEP8 compliant and follows the same standards
- ⌛️ Save time at code reviews: allow your code reviewer to focus on changes to the code logic rather than spotting bugs or deciphering inconsistent code formatting
- 🤟 Project agnostic: Not just for Python projects - support multiple languages and file types
Most importantly, using the pre-commit package, all of these checks are done automatically so you don’t need to remember to run the scripts each time before committing to the repository.
What git hooks are available with pre-commit?
In-built git hooks
Pre-commit comes with several in-built hooks. These include a number of standard checks you might want to apply to your repo, such as:
- checking the correct syntax of configuration files (e.g. YAML, TOML, JSON)
- checking for large files (you should try to avoid adding large files [e.g. >1MB] to version control)
- checking for artefacts accidentally left from development (e.g. breakpoints)
- and many more…
A full list is available in the pre-commit documentation.
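As a sketch, the checks listed above map onto hook ids from the pre-commit-hooks project like so (the `rev` pin below is illustrative; check the project's releases for the latest version):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0  # illustrative pin; use the latest release
    hooks:
      - id: check-yaml               # configuration file syntax
      - id: check-toml
      - id: check-json
      - id: check-added-large-files  # rejects large files (default limit 500kB)
        args: [--maxkb=1000]         # optionally raise the limit to ~1MB
      - id: debug-statements         # leftover breakpoints/debugger imports
```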
Custom hooks
In addition to the in-built checks, you can specify your own custom hooks.
Custom hooks are defined in a configuration file (`.pre-commit-config.yaml` – see an example at the end of this post).
Pre-commit supports custom hooks written in many different languages. As long as the program run by a hook can be installed as a package (either from a public git repo or locally) and exposes an executable, it can be used. This is extremely powerful, as it means you can run pretty much any type of check you want.
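For example, a minimal local custom hook that runs your own test suite before each commit (assuming `pytest` is already installed in your environment) might look like this:

```yaml
repos:
  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: pytest          # the executable to run
        language: system       # use whatever pytest is on your PATH
        pass_filenames: false  # run the whole suite, not per-file
        types: [python]        # only trigger when Python files change
```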
Let’s see pre-commit in action!
Demo: Pre-Commit in action
1. Installation
To get started with pre-commit, we need to install the `pre-commit` package.

```shell
# using pip
pip install pre-commit

# using homebrew
brew install pre-commit

# using conda
conda install -c conda-forge pre-commit
```
2. Create a configuration file
The configuration for pre-commit hooks is defined in a `.pre-commit-config.yaml` file.
Instructions on how to format the file are described in the documentation. A very simple configuration file could look something like this:
```yaml
# example .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: requirements-txt-fixer
      - id: trailing-whitespace
```

In this example, pre-commit will check all YAML files for correct syntax, ensure the `requirements.txt` file is alphabetically sorted (which helps with readability) and check each file for unnecessary trailing whitespace. All of these hooks are taken directly from the pre-commit in-built hooks and do not require any additional installation.
3. Install the git hooks (❗)
This step is very important (and easy to forget). In order to run the git hooks specified in the `.pre-commit-config.yaml` file automatically on each commit, you need to install the git hooks.
Ensure you are in the project directory and then run the following command:

```shell
pre-commit install
```

This will install the hook script at `.git/hooks/pre-commit`.
Note that this only needs to be done once, or whenever you clone the git repo to a new machine. After installing for the first time, any changes to the pre-commit configuration file will automatically be applied on your next commit.
4. Run the hooks
Now the pre-commit hooks have been installed, you can use your normal `git add` and `git commit` workflow. The checks will run automatically when the `git commit` command is called.
You will see which checks have been successful in the terminal output. Note that the code changes will not be committed unless you pass all checks. If a check fails, you should inspect the terminal output to see what failed and then make the appropriate fixes before trying to add the updated code to the repo.
```shell
# normal git workflow
git add .
git commit -m "Informative message about important changes"
```
In the animation above, I have made some changes to a `requirements.txt` file and added a `.pre-commit-config.yaml` file to the repo for the first time. After installing and running the hooks, two issues are found:
- The `requirements.txt` file was not alphabetically sorted
- Trailing whitespace was identified in the pre-commit configuration file
In this case, pre-commit automatically makes these fixes for me (shown by the newly modified files when running the `git status` command). Then I run through the commit workflow again to successfully pass the checks and commit the code to the repo.
Note that the hooks are only run against files that have been changed since the last commit. To run against all files in a repo you can use the following command:
```shell
# run against all files in the repo
pre-commit run --all-files
```
The `pre-commit run` command can also be useful for running the checks without having to stage the files first (i.e. without running `git add`).
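A couple of other `pre-commit run` variants can also be handy (the hook id and file paths below are illustrative; substitute ids and paths from your own project):

```shell
# run a single hook (referenced by its id) against all files
pre-commit run trailing-whitespace --all-files

# run all hooks against a specific set of files
pre-commit run --files src/train.py src/evaluate.py
```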
(Overriding the checks)
It is possible to override and ignore the checks by passing the `--no-verify` flag to the `git commit` command. For example:
```shell
git add .
git commit -m "This code does not pass checks" --no-verify
```
However, you should try to avoid overriding the checks - they are there for a good reason! Only use this if you are certain the files you are committing do not need to be checked.
Example pre-commit configuration file
Below is a template `.pre-commit-config.yaml` file which I tend to use as a starting point for new Python projects. It includes the following checks:
Standard file checks
Utilises many of the ‘out-of-the-box’ checks provided by pre-commit. I find the `requirements-txt-fixer` and `check-json` checks the most useful.
- ✅ Requirements.txt fixer - automatically sorts requirements alphabetically to make them consistent across projects and easier to read
- ✅ JSON check - checks that JSON files are valid and correctly formatted
- ✅ Large file check - you should try to avoid adding large files (e.g. images) to a git repo
Python file checks
Standard checks on `.py` files. Ensure you have `isort`, `black`, `mypy` and `flake8` installed.
- ✅ isort: automatically sorts import statements in .py files
- ✅ black: checks code formatting and automatically applies PEP 8 compliant code formatting
- ✅ mypy: type checking
- ✅ flake8: checks for style consistency and errors (e.g. variable used before assignment)
Jupyter Notebook checks (nbqa)
Similar to the Python file checks. These checks use `nbqa` to apply `black`, `flake8` and `isort` to Jupyter notebooks! I find these extremely useful. It is surprisingly easy to leave unused imports or use variables before assignment in a Jupyter notebook.
- ✅ nbqa-black: code formatter for Jupyter Notebook
- ✅ nbqa-flake8: style consistency and error checking for Jupyter Notebook
- ✅ nbqa-isort: consistent import statement ordering for Jupyter Notebook
By passing the `--nbqa-mutate` argument, nbqa will automatically reformat the notebooks.
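Putting the checks above together, the template could be sketched along these lines. The hook repositories are the projects' real locations, but the `rev` pins are illustrative - run `pre-commit autoupdate` to pin the current releases for your project:

```yaml
# sketch of a .pre-commit-config.yaml template for Python data science projects
# (rev values are illustrative; update them to current releases)
repos:
  # standard file checks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: requirements-txt-fixer
      - id: check-json
      - id: check-added-large-files
  # Python file checks
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.3.0
    hooks:
      - id: mypy
  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
  # Jupyter Notebook checks
  - repo: https://github.com/nbQA-dev/nbQA
    rev: 0.13.1
    hooks:
      - id: nbqa-black
        args: [--nbqa-mutate]
      - id: nbqa-flake8
        args: [--nbqa-mutate]
      - id: nbqa-isort
        args: [--nbqa-mutate]
```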
Happy coding!
Resources
- Pre-commit
- nbqa - auto-formatting for Jupyter notebooks, highly recommended!!
Further Reading
- How to extract bucket and file name from a Google Cloud Storage URI
- How to set up an amazing terminal for data science with oh-my-zsh plugins
- Data Science Setup on MacOS
- Do programmers need to be able to type fast?
- Voilà! Interactive Python Dashboards Straight from your Jupyter Notebook
- Visualising Asset Price Correlations
- Matplotlib: Plotting Subplots in a Loop
- Gitmoji: Add Emojis to Your Git Commit Messages!
- Which Python String Formatting Method Should You Be Using in Your Data Science Project