Python’s logging library is very powerful but generally under-utilised in data science projects.

Most developers default to using standard print statements to track important events in their applications or data pipelines.

It makes sense. ‘Print’ does the job: it’s easy to implement, requires no boilerplate code and there are no external libraries to understand.

But as the code base gets larger and you want to move the code into ‘production’, you can quickly run into issues due to the inflexibility of print statements. You also miss out on some great features of Python’s logging library that can help with debugging.

In principle, the logging library is simple to use, particularly for single scripts.

But in my experience, I found it difficult to clearly understand how to set up logging for more complex applications with multiple modules and files.

Admittedly, this might just be due to my chronic inability to read documentation before starting to use a library. But maybe you have had the same challenges, which is probably why you clicked on this article.

It turns out there is a simple way to set up logging for complex projects without lots of boilerplate code in each file. But I couldn’t find a single source that distilled the information in the context of data science projects. Hopefully, this post can provide just that.

Here is how I set up logging for my data science projects with minimal boilerplate code and a simple configuration file.

Why use logging instead of printing?

First of all, let’s discuss the argument for using Python’s logging library in your projects.

Logging is primarily for your benefit as the developer

Print/logging statements are for the developer’s benefit. Not the computer’s.

Logging statements help diagnose and audit events and issues related to the proper functioning of the application.

The easier it is for you to include/exclude relevant information in your log statements, the more efficient you can be in monitoring your application.

Print statements are inflexible

They can’t be ‘turned off’. If you want to stop printing a statement you have to change the source code to delete or comment out the line.

In a large code base, it can be easy to forget to remove all the random print statements you used for debugging.
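As a rough illustration of the difference, the sketch below runs the same function twice and only changes the logger’s level between runs. The logger name, the function and the in-memory buffer are made up for this example; in a real project the output would normally go to the terminal or a file.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("print_vs_logging_demo")
logger.propagate = False  # keep the demo isolated from any root handlers

# Write to an in-memory buffer so the result is easy to inspect.
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))


def load_rows():
    logger.debug("loaded 3 rows")  # detail that is only useful while debugging
    logger.info("load finished")
    return [1, 2, 3]


logger.setLevel(logging.DEBUG)  # development: show everything
load_rows()
verbose_output = buffer.getvalue()

# Empty the buffer before the second run.
buffer.seek(0)
buffer.truncate()

logger.setLevel(logging.INFO)  # production: same code, debug line suppressed
load_rows()
quiet_output = buffer.getvalue()
```

The debug line stays in the source code, but whether it is emitted is decided by a single setting rather than by editing or commenting out statements.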

Logging allows you to add context

Python’s logging library allows you to easily add metadata to your logs, such as timestamp, module location and severity level (DEBUG, INFO, ERROR etc.). This metadata is automatically added without having to hard code it into your statement.

The metadata is also structured to provide consistency throughout your project, which can make the logs much easier to read when debugging.

Send logs to different places and formats

Print statements send the output to the terminal. When you close your terminal session the print statements are lost forever.

The logging library allows you to send logs to different destinations, including a file, which is useful for keeping a record of the logs for future analysis.

You can also send the logs to multiple locations at the same time. This might be useful if you need logging for multiple use cases. For example, general debugging from the terminal output as well as recording of critical log events in a file for auditing purposes.
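A minimal sketch of that use case might look like the following. The logger name, the temporary directory and the audit.log file name are assumptions chosen for the example: every message goes to the terminal, while only CRITICAL messages are also written to the file.

```python
import logging
import os
import sys
import tempfile

logger = logging.getLogger("multi_destination_demo")  # hypothetical name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo isolated from any root handlers

# Destination 1: the terminal, for general debugging.
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.DEBUG)

# Destination 2: a file that only records critical events.
# A temporary directory is used here; a real project might use logs/ instead.
log_path = os.path.join(tempfile.mkdtemp(), "audit.log")
file_handler = logging.FileHandler(log_path)
file_handler.setLevel(logging.CRITICAL)

formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
for handler in (console_handler, file_handler):
    handler.setFormatter(formatter)
    logger.addHandler(handler)

logger.debug("starting pipeline")             # terminal only
logger.critical("model artefact missing")     # terminal and file

with open(log_path) as f:
    audit_lines = f.read().splitlines()
```

Each handler has its own level, so the same log call can be shown in one place and filtered out in another.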

Control behaviour via configuration

Logging can be controlled using a configuration file. Having a configuration file ensures consistency across the project and separation of config from code.

This also allows you to easily maintain different configurations depending on the environment (e.g. dev vs production) without needing to change any of the source code.

Logging 101

Before working through an example, there are three key concepts from the logging module to explain: loggers, formatters and handlers.

Logger

The object used to generate the logs is instantiated via:

import logging

logger = logging.getLogger(__name__)

The ‘logger’ object creates and controls logging statements in the project.

You can name the logger anything you want, but it is a good practice to instantiate a new logger for each module and use __name__ for the logger’s name (as demonstrated above).

This means that logger names track the package/module hierarchy, which helps developers quickly find where in the codebase the log was generated.
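To make the hierarchy concrete, here is a small sketch. It assumes the module would be imported as data_processing.processor (which depends on how the project is run), so the names are written out as strings instead of using __name__.

```python
import logging

# In src/data_processing/processor.py, __name__ would be
# "data_processing.processor" when imported from main.py (assumed layout).
package_logger = logging.getLogger("data_processing")
module_logger = logging.getLogger("data_processing.processor")

# Dotted names form a hierarchy: the module logger is a child of the package logger.
is_child = module_logger.parent is package_logger

# Asking for the same name again returns the same logger object,
# so there is no need to pass loggers around between functions.
same_object = logging.getLogger("data_processing.processor") is module_logger
```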

Formatters

Formatter objects determine the order, structure, and contents of the log message.

Every time you call one of the logger’s methods (e.g. logger.info()), a LogRecord is generated. A LogRecord object contains a number of attributes including when it was created, the module where it was created and the message itself.

We can define which attributes to include in the final log statement output and any formatting using the Formatter object.

For example:

# formatter definition
'%(asctime)s - %(name)s - %(levelname)s - %(message)s'

# example log output
2022-09-25 14:10:55,922 - __main__ - INFO - Program Started
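For illustration only, the sketch below builds a LogRecord by hand and passes it through a Formatter using the format string above. In normal use you would never create the record yourself; the logger does it for you. The file name main.py is a placeholder.

```python
import logging

formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

# Normally created by the logger; built manually here to show what the formatter does.
record = logging.LogRecord(
    name="__main__",
    level=logging.INFO,
    pathname="main.py",  # placeholder path for the example
    lineno=1,
    msg="Program Started",
    args=(),
    exc_info=None,
)

# The attributes appear in the order given in the format string:
# timestamp, logger name, level name, message.
line = formatter.format(record)
```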

Handlers

Handlers are responsible for sending the logs to different destinations.

Log messages can be sent to multiple locations. For example, to the terminal and to a file.

The most common handlers are StreamHandler, which sends log messages to a stream such as the terminal (sys.stderr by default), and FileHandler, which sends messages to a file.

The logging library also comes with a number of more powerful handlers in the logging.handlers module. For example, the RotatingFileHandler and TimedRotatingFileHandler save logs to files and automatically switch to a new file when the current one reaches a certain size or time limit.
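As a small sketch of size-based rotation, the example below uses a deliberately tiny limit so that rotation happens after only a few messages. The logger name, file name and limits are assumptions for the demo; real projects would use much larger values.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "pipeline.log")

# Roll over before the current file reaches 200 bytes and keep 2 old files.
handler = RotatingFileHandler(log_path, maxBytes=200, backupCount=2)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

logger = logging.getLogger("rotation_demo")  # hypothetical name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

for i in range(50):
    logger.info("processed batch %d", i)

handler.close()

# The newest messages are in pipeline.log; older ones are in the numbered backups.
log_files = sorted(os.listdir(log_dir))
```

Older backups beyond backupCount are deleted, so the total disk space used by the logs stays bounded.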

You can also define your own custom handlers if required.
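A custom handler is a subclass of logging.Handler that overrides emit(). The ListHandler class below is a made-up example that simply keeps the formatted messages in a list, which can be handy in tests.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical custom handler that stores formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # self.format() applies the formatter attached to this handler.
        self.messages.append(self.format(record))


logger = logging.getLogger("custom_handler_demo")  # hypothetical name
logger.setLevel(logging.INFO)
logger.propagate = False

list_handler = ListHandler()
list_handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(list_handler)

logger.info("rows processed: %d", 120)
logger.debug("not stored, because the logger level is INFO")
```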

Key Takeaways

  • Loggers are instantiated using logging.getLogger()
  • Use __name__ to automatically name your loggers
  • A logger needs a ‘formatter’ and ‘handler’ to specify the format and location of the log messages
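  • If no handler is defined, only messages of level WARNING and above are shown (via a built-in last-resort handler that writes to the terminal); INFO and DEBUG messages are not shown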

Common Python Project Structure

# common project layout

├── data                        <- directory for storing local data
├── config                      <- directory for storing configs
├── logs                        <- directory for storing logs
├── requirements.txt            
├── setup.py
└── src
   ├── main.py                <- main script
   ├── data_processing        <- module for data processing
   │  ├── __init__.py
   │  └── processor.py
   └── model_training         <- module for model_training
      ├── __init__.py
      └── trainer.py

The above shows a typical project layout for a data science project.

💻 The example project layout and code are available in the e4ds-snippets GitHub repo

We have a src directory with the source code for the application, as well as directories for storing data and configurations separately from the code.

We will use this as an example project for setting up logging.

The main entry point for the program is in the src/main.py file. The main program calls code from the src/data_processing and src/model_training modules in order to preprocess the data and train the model. We will use log messages from the relevant modules to record the progress of the pipeline.

You can set up logging either by writing Python code to define your loggers or by using a configuration file.

Let’s work through an example for both approaches.

Basic logging setup

We can set up a logger for the project that simply prints the log messages to the terminal. This is similar to how print statements work; however, we will enrich the messages with information from the LogRecord attributes.

[Figure: logging output on the terminal]

We initialise the logger and define the handler (StreamHandler) and the format of the messages in the main.py file. We only have to do this once in the main file and the settings are propagated throughout the project.

In each module where we want to use logging, we only need to import the logging library and instantiate the logger object at the top of the file.

That’s it.
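As a minimal sketch of the setup described above (not necessarily the exact code from the example repo), the snippet below squeezes both files into one so that it can be run on its own. The configure_logging helper and the process_data function are hypothetical names.

```python
import logging
import sys

# --- would live in src/data_processing/processor.py ---
# In the real module this would be logging.getLogger(__name__).
processor_logger = logging.getLogger("data_processing.processor")


def process_data(rows):
    processor_logger.info("Processing %d rows", len(rows))
    return [row * 2 for row in rows]


# --- would live in src/main.py ---
def configure_logging(stream=sys.stderr):
    """Attach a single StreamHandler to the root logger (hypothetical helper)."""
    root_logger = logging.getLogger()
    root_logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    root_logger.addHandler(handler)
    return handler


configure_logging()
logger = logging.getLogger("__main__")
logger.info("Program started")
processed = process_data([1, 2, 3])
logger.info("Program finished")
```

The module logger has no handler of its own. Its records travel up the hierarchy to the root logger, which is why configuring the root logger once in main.py is enough.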

Using a configuration file

The four or five lines of code at the top of the main.py file can be replaced by a configuration file.

My preferred approach for larger projects is to use configuration files, because they:

  • let you define and use different logging configurations for development and production environments
  • separate configuration from code, making it easier to reuse the source code elsewhere with different logging requirements
  • make it easy to add multiple loggers and formatters to the project without adding many more lines to the source code

We can change the code in the main.py file to load from a configuration file using logging.config.fileConfig.

I have created a function (setup_logging) which loads a configuration file depending on the value of an environment variable (e.g. dev or prod). This allows you to easily use a different configuration in development vs production without having to change any source code.
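A function along those lines could be sketched as follows. The environment variable name (ENV), the config directory and the logging.<env>.ini file naming are all assumptions made for this example, so adapt them to your own project.

```python
import logging
import logging.config
import os
from pathlib import Path

CONFIG_DIR = Path("config")  # assumed location of the config files


def get_config_path(env=None, config_dir=CONFIG_DIR):
    """Return the config file path for the given environment.

    Falls back to the ENV environment variable (assumed name), then to "dev".
    """
    env = env or os.getenv("ENV", "dev")
    return Path(config_dir) / f"logging.{env}.ini"


def setup_logging(env=None, config_dir=CONFIG_DIR):
    """Load the logging configuration for the current environment."""
    config_path = get_config_path(env, config_dir)
    # Check explicitly so that a wrong path gives a clear error message.
    if not config_path.is_file():
        raise FileNotFoundError(f"Logging config not found: {config_path}")
    # disable_existing_loggers=False keeps loggers that modules created at
    # import time (before this function ran) working.
    logging.config.fileConfig(str(config_path), disable_existing_loggers=False)
    return config_path
```

In main.py you would call setup_logging() once at start-up, and switch configuration by running the program with, for example, ENV=prod set in the environment.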

💻 The example project code is available in the e4ds-snippets GitHub repo

Example configuration file

In the configuration file we have defined two loggers: one which sends logs to the terminal and one which sends logs to a file.

More information about the logging configuration file format can be found in the logging documentation.
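To give a feel for the file format, here is a self-contained sketch that writes a small configuration to a temporary file and loads it with logging.config.fileConfig. This is an illustrative configuration, not the one from the example repo: it uses a single root logger with two handlers, and the handler names, levels and the logfilename default are assumptions.

```python
import logging
import logging.config
import tempfile
from pathlib import Path

# Contents of a hypothetical config/logging.dev.ini file.
CONFIG_TEXT = """\
[loggers]
keys=root

[handlers]
keys=consoleHandler,fileHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=consoleHandler,fileHandler

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)

[handler_fileHandler]
class=FileHandler
level=WARNING
formatter=simpleFormatter
args=('%(logfilename)s', 'a')

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
"""

# A temporary directory stands in for the project's config/ and logs/ folders.
work_dir = Path(tempfile.mkdtemp())
config_path = work_dir / "logging.dev.ini"
log_file = work_dir / "pipeline.log"
config_path.write_text(CONFIG_TEXT)

# The defaults dict fills in %(logfilename)s in the handler arguments.
logging.config.fileConfig(
    str(config_path),
    defaults={"logfilename": log_file.as_posix()},
    disable_existing_loggers=False,
)

logger = logging.getLogger("model_training.trainer")
logger.info("training started")               # terminal only
logger.warning("validation loss increased")   # terminal and file

file_contents = log_file.read_text()
```

Every handler and formatter must be listed under keys in its section and also have its own [handler_...] or [formatter_...] section, which is an easy place to make a typo.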

Debugging tips

When I first set up logging in my projects, I had quite a few issues where none of my logs appeared in the terminal or file outputs.

Here are a few tips:

Ensure you have specified all your handlers in the config

If there is a misspecified handler in your configuration file you might not see any logs printed in the terminal (or other destinations). Unfortunately, the logging library seems to fail silently and doesn’t give many indicators as to why your logging setup isn’t working as expected.

Ensure you have set the correct ‘level’ setting

For example: logger.setLevel(logging.DEBUG). The root logger’s default level is logging.WARNING, and the loggers you create inherit it unless a level is set on them or on one of their parents. This means only WARNING, ERROR and CRITICAL messages will be recorded. If your log messages use INFO or DEBUG you need to set the level explicitly or your messages will not show.
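A quick way to check which level actually applies is getEffectiveLevel(). The sketch below uses a made-up logger name and sets the root level to WARNING explicitly, so that the result does not depend on any earlier configuration.

```python
import logging

root_logger = logging.getLogger()
module_logger = logging.getLogger("level_demo.module")  # hypothetical name

# A newly created logger has no level of its own (NOTSET).
own_level = module_logger.level

# WARNING is the root logger's default; set explicitly here for a predictable demo.
root_logger.setLevel(logging.WARNING)

# With no level of its own, the logger inherits the root logger's level.
inherited_level = module_logger.getEffectiveLevel()
info_enabled_before = module_logger.isEnabledFor(logging.INFO)

# Setting a level on the logger itself overrides the inherited one.
module_logger.setLevel(logging.DEBUG)
info_enabled_after = module_logger.isEnabledFor(logging.INFO)
```

Handlers have their own level too, so a message has to pass both the logger’s level and the handler’s level before it is shown.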

Don’t get confused between ‘logging’ and ‘logger’

I’m embarrassed to admit it, but I have spent a long time in the past trying to work out why messages weren’t showing. It turns out I was using logging.info() instead of logger.info(). I thought I would include it here in case it isn’t just me who has typos. Worth checking. 🤦‍♂️
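The reason this matters is that logging.info() logs through the root logger, not through the logger you configured for your module. The sketch below shows one way this can hide messages; the RecordCollector class and the logger name are invented for the example.

```python
import logging


class RecordCollector(logging.Handler):
    """Hypothetical handler that remembers (logger name, level) of each record."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def emit(self, record):
        self.seen.append((record.name, record.levelname))


collector = RecordCollector()
root = logging.getLogger()
root.setLevel(logging.WARNING)  # the root logger's default level
root.addHandler(collector)

logger = logging.getLogger("model_training.trainer")
logger.setLevel(logging.INFO)

logger.info("sent via the module's logger")      # recorded
logging.info("sent via the logging module")      # dropped: root level is WARNING
logging.warning("sent via the logging module")   # recorded, but named 'root'

seen = collector.seen
root.removeHandler(collector)
```

Even when a message sent with logging.warning() does appear, it is labelled root, so you lose the module name that makes the log easy to trace.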
