Introduction

There are not many things that can dramatically improve your code quality overnight, but a .pre-commit-config.yaml file is one of them!

Read this post to find out why a .pre-commit-config.yaml is the first file I create for my data science projects.

See the bottom of this post for an example .pre-commit-config.yaml file I use as a template for my data science projects.

Working collaboratively with others on data science codebases is challenging.

This is particularly true if, like me, you are not from a software engineering background and are still finding your way around collaboration and code quality tools like git, pylint and pytest.

When working with others on data science projects, it is not enough that your code ‘works’ or that you can produce beautiful data visualisations and performant ML models.

Your code needs to be readable and reproducible if you want to be efficient and complete projects with lasting impact.

The Challenge

When developing collaboratively, we typically use git for version control and commit any changes to a feature branch that is reviewed before merging into the main codebase.

The problem is that nothing stops us from committing low-quality code to the git repository.

Low-quality code could include code that:

  • is not PEP8 compliant
  • contains unused imports
  • references variables before assignment
  • fails tests
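
To make one of these issues concrete, here is a minimal sketch of the kind of check a linter performs. It uses Python's standard-library ast module to flag imports that are never used. The function name find_unused_imports and the SAMPLE snippet are my own illustrative assumptions; real tools such as flake8 cover far more cases (scoping, star imports, `__all__` and so on).

```python
import ast
from typing import List


def find_unused_imports(source: str) -> List[str]:
    """Return the imported names that are never referenced in `source`.

    Toy illustration only: it ignores scoping, `__all__` and star imports.
    """
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


# Hypothetical snippet: `os` is imported but never used.
SAMPLE = """
import os
import math


def area(radius):
    return math.pi * radius ** 2
"""

print(find_unused_imports(SAMPLE))  # ['os']
```

Nothing in a plain git workflow would stop the SAMPLE code above from being committed, which is exactly the gap the rest of this post addresses.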

While you might have other tools like a CI/CD pipeline to run these checks, the CI/CD pipeline only runs after the bad code has been committed to the repository.

In data science projects, there is an even bigger risk of committing buggy code to the repository when using Jupyter notebooks. Jupyter notebooks are great for prototyping models and visualisations. However, they are notorious for breeding bad coding habits, and it is difficult to spot bugs spread across many different notebook cells.

So what steps can we take to improve the quality of code, enforce consistent code formatting and catch these errors before even adding a commit with bad code to the git history?

The solution

Git hooks using pre-commit – a framework for managing multi-language pre-commit hooks.

Pre-commit

Pre-commit is a multi-language package manager for pre-commit hooks. You specify a list of checks (hooks) in a configuration file, and they are executed automatically whenever the git commit command is called. If any of these checks fail, the commit is aborted, allowing you to fix the errors before the code is committed to the repository’s history.

The pre-commit package manages the installation of any required dependencies for your hooks, and some hooks can even fix the errors they find (e.g. code formatting) automatically.

What are git hooks?

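Git hooks are scripts that git runs automatically when certain events occur in a repository, such as committing, merging or pushing. A pre-commit hook runs just before a commit is created and can abort the commit by exiting with a non-zero status.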

Why are git hooks useful?

For many projects you work on professionally, there may (should) already be a CI/CD pipeline to check the code quality which runs after committing the new code to the repository. However, it can be very useful to also specify a standard set of checks to run before committing the code.

Some benefits of git hooks include:

  • 🛑 Identify broken/bad code immediately: stop bad code being added to the commit history altogether and catch errors before running time-consuming CI/CD pipelines
  • 🎯 Improve consistency: ensure all code committed to the repo (including notebook code) is PEP8 compliant and follows the same standards
  • ⌛️ Save time at code reviews: allow your code reviewer to focus on changes to the code logic rather than spotting bugs or deciphering inconsistent code formatting
  • 🤟 Project agnostic: not just for Python projects - supports multiple languages and file types

Most importantly, using the pre-commit package, all of these checks are done automatically so you don’t need to remember to run the scripts each time before committing to the repository.

What git hooks are available with pre-commit?

In-built git hooks

Pre-commit comes with several in-built hooks, maintained by the pre-commit team in the pre-commit-hooks repository. These include a number of standard checks you might want to apply to your repo, such as:

  • checking the correct syntax for configuration files (e.g. yaml, toml, json etc.)
  • checking for large files (you should try to avoid adding large files [e.g. >1MB] to version control)
  • checking for artefacts accidentally left from development (e.g. breakpoints)
  • and many more…

A full list is available in the pre-commit documentation.
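
As a rough idea of what a syntax-checking hook does under the hood, the sketch below tries to parse a JSON document with Python's standard-library json module and reports the first error it finds. The function name json_syntax_error is a hypothetical helper of mine, not part of pre-commit, and the real hooks may behave differently in the details.

```python
import json
from typing import Optional


def json_syntax_error(text: str) -> Optional[str]:
    """Return a description of the first syntax error, or None if `text` parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


# A valid document produces no error message.
print(json_syntax_error('{"name": "demo", "version": 1}'))  # None

# A trailing comma is not valid JSON, so an error message is returned.
print(json_syntax_error('{"name": "demo",}') is not None)  # True
```

A hook built on this idea would exit with a non-zero status whenever an error message is returned, which is what causes the commit to be aborted.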

Custom hooks

In addition to the in-built checks, you can specify your own custom hooks.

Custom hooks are defined in a configuration file (.pre-commit-config.yaml – see an example at the end of this post).

Pre-commit supports custom hooks written in many different languages. As long as the program run by a hook can be installed as a package (either from a public git repo or locally) and exposes an executable, it can be used. This is extremely powerful as it means you can run pretty much any type of check you want.
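
As a sketch of what such an executable could look like, the hypothetical script below flags leftover debugger calls. It relies on two things pre-commit does for any hook: it passes the filenames to check as command-line arguments, and it treats a non-zero exit code as a failure. The names find_debug_calls and main, and the regular expression, are assumptions of mine for illustration only.

```python
import argparse
import re
from typing import List, Optional, Sequence

# Assumption: these two calls are the only debugger artefacts we care about.
DEBUG_CALL = re.compile(r"\b(?:breakpoint|pdb\.set_trace)\(\)")


def find_debug_calls(text: str) -> List[int]:
    """Return the 1-based line numbers that contain a debugger call."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if DEBUG_CALL.search(line)
    ]


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Check every file passed on the command line; return 1 if any fails."""
    parser = argparse.ArgumentParser()
    parser.add_argument("filenames", nargs="*")
    args = parser.parse_args(argv)

    status = 0
    for filename in args.filenames:
        with open(filename, encoding="utf-8") as handle:
            for lineno in find_debug_calls(handle.read()):
                print(f"{filename}:{lineno}: leftover debugger call")
                status = 1
    return status


# When used as a real hook, the script would end with:
#     raise SystemExit(main())
print(find_debug_calls("x = 1\nbreakpoint()\nprint(x)\n"))  # [2]
```

The check is deliberately naive (it would also flag a debugger call inside a comment). To run something like it through pre-commit, the script could be registered under a `repo: local` entry in the configuration file, with the hook's `entry` pointing at the script.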

Let’s see pre-commit in action!

Demo: Pre-Commit in action

1. Installation

To get started with pre-commit we need to install the pre-commit package.

# using pip
pip install pre-commit

# using homebrew
brew install pre-commit

# using conda
conda install -c conda-forge pre-commit

2. Create a configuration file

The configuration for pre-commit hooks is defined in a .pre-commit-config.yaml file.

Instructions on how to format the file are described in the documentation. A very simple configuration file could look something like this:

# example .pre-commit-config.yaml
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: requirements-txt-fixer
    -   id: trailing-whitespace

In this example, pre-commit will check all yaml files for correct syntax, ensure the requirements.txt file is alphabetically sorted (which helps with readability) and check each file for unnecessary trailing whitespace. All of these hooks come from the pre-commit-hooks repository, and pre-commit installs them for you, so no additional installation is required.
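
To give a feel for what the two "fixer" style hooks in this configuration do, here is a simplified sketch in plain Python. It is not the code of the real hooks: sort_requirements, strip_trailing_whitespace and run_fixer are hypothetical names, and the real requirements-txt-fixer handles details (such as comments) that this version ignores.

```python
from typing import Callable, Tuple


def sort_requirements(text: str) -> str:
    """Sort requirement lines alphabetically, ignoring case and blank lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return "".join(line + "\n" for line in sorted(lines, key=str.lower))


def strip_trailing_whitespace(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    return "".join(line.rstrip(" \t") + "\n" for line in text.splitlines())


def run_fixer(fixer: Callable[[str], str], text: str) -> Tuple[str, bool]:
    """Apply `fixer`; the bool is True when the content had to be changed."""
    fixed = fixer(text)
    return fixed, fixed != text


print(run_fixer(sort_requirements, "pandas\nnumpy\nblack\n"))
# ('black\nnumpy\npandas\n', True)
print(run_fixer(strip_trailing_whitespace, "repos:  \n- repo: local\n"))
# ('repos:\n- repo: local\n', True)
```

The boolean matters because pre-commit reports a hook as failed when it modifies a file. The fix is already applied on disk, so you only need to stage the change and commit again.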

3. Install the git hooks (❗)

This step is very important (and easy to forget). In order to run the git hooks specified in the .pre-commit-config.yaml file automatically on each commit, you need to install the git hooks.

Ensure you are in the project directory and then run the following command:

pre-commit install

This will install a hook script at .git/hooks/pre-commit.

Note that this only needs to be done once per clone of the repo, for example after cloning it to a new machine. After installing for the first time, any changes to the pre-commit configuration file will automatically be applied on your next commit.

4. Run the hooks

Now that the pre-commit hooks have been installed, you can use your normal git add and git commit workflow. The checks will run automatically upon calling the git commit command.

You will see which checks have been successful in the terminal output. Note that the code changes will not be committed unless you pass all checks. If a check fails, you should inspect the terminal output to see what failed and then make the appropriate fixes before trying to add the updated code to the repo.

# normal git workflow
git add .
git commit -m "Informative message about important changes"

[Animation: example pre-commit workflow]

In the animation above, I have made some changes to a requirements.txt file and added a .pre-commit-config.yaml file to the repo for the first time. After installing and running the hooks, two issues are found:

  1. The requirements.txt file was not alphabetically sorted
  2. Trailing whitespace was identified in the pre-commit configuration file

In this case pre-commit automatically makes these fixes for me (shown by the newly modified files when running the git status command). Then I run through the commit workflow again to successfully pass the checks and commit the code to the repo.

Note that by default the hooks only run against the files staged for the commit. To run against all files in a repo you can use the following command:

# run against all files in the repo
pre-commit run --all-files

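The pre-commit run command is also useful for running the checks without making a commit. On its own it checks the currently staged files, while the --all-files option (or --files followed by specific paths) checks files regardless of whether they have been staged with git add.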

(Overriding the checks)

It is possible to skip the checks by passing the --no-verify flag to the git commit command. For example:

git add .
git commit -m "This code does not pass checks" --no-verify

However, you should try to avoid overriding the checks - they are there for a good reason! Only use this if you are certain the files you are committing do not need to be checked.


Example pre-commit configuration file

Below is the .pre-commit-config.yaml file which I tend to use as a basic template for new Python projects. It includes the following checks:

Standard file checks

Utilises many of the ‘out-of-the-box’ checks provided by pre-commit. I find the requirements-txt-fixer and check-json checks the most useful.

  • ✅ Requirements.txt fixer - automatically orders requirements in alphabetical order to make it consistent across projects and easier to read
  • ✅ JSON check - checks that JSON files parse correctly (auto-formatting is handled by the separate pretty-format-json hook)
  • ✅ Large file check - you should try to avoid adding large files (e.g. images) to a git repo

Python file checks

Standard checks on .py files. Ensure you have isort, black, mypy and flake8 installed.

  • ✅ isort: automatically sorts import statements in .py files
  • ✅ black: checks code formatting and automatically applies pep8 compliant code formatting
  • ✅ mypy: type checking
  • ✅ flake8: checks for style consistency and errors (e.g. variable used before assignment)

Jupyter Notebook checks (nbqa)

Similar to the Python file checks. These checks use nbqa to apply black, flake8 and isort to Jupyter notebooks! I find these extremely useful. It is surprisingly easy to keep unused imports or use variables before assignment in a Jupyter notebook.

  • ✅ nbqa-black: code formatter for Jupyter Notebook
  • ✅ nbqa-flake8: style consistency and error checking for Jupyter Notebook
  • ✅ nbqa-isort: consistent import statement for Jupyter Notebook

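By passing the --nbqa-mutate flag, nbqa will automatically reformat the notebooks in place. Recent nbqa releases appear to apply such fixes by default, so check the nbqa documentation for the version you have installed.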
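
Why is a dedicated tool needed for notebooks at all? A .ipynb file is a JSON document, so Python linters cannot read it directly. The sketch below is not nbqa's implementation, only a rough illustration of the underlying idea: pull the source out of the code cells so that it can be handed to ordinary Python tooling. The function name notebook_code and the NOTEBOOK example are assumptions of mine.

```python
import json


def notebook_code(notebook_json: str) -> str:
    """Concatenate the source of every code cell in a notebook (nbformat 4)."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # `source` may be stored as one string or as a list of lines
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(chunks)


# Hypothetical small notebook, reduced to the fields used here.
NOTEBOOK = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis"]},
        {"cell_type": "code", "source": ["import os\n", "total = 1 + 1"]},
        {"cell_type": "code", "source": "print(total)"},
    ],
    "nbformat": 4,
})

print(notebook_code(NOTEBOOK))
# import os
# total = 1 + 1
# print(total)
```

Once the code is in this form, the unused `import os` in the first code cell is as easy to detect as it would be in a .py file.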

Happy coding!

Resources

  • Pre-commit
  • nbqa - auto-formatting for Jupyter notebooks, highly recommended!!
