…and how it improved the code I wrote going forwards.

One of the projects I worked on last year involved optimising the loading speed of a customer-facing analytics web application.

The existing codebase, written by a business analyst in a Jupyter notebook, utilised Voilà to turn the notebook into an interactive web application.

The application itself was relatively simple – a few tables and graphs with filters. But it involved multiple expensive API calls to fetch data, as well as some time to render the interactive widgets. This resulted in long waits for the application to render, leading to a poor user experience.

I came onto the project as an external consultant tasked with optimising code to reduce the initial loading time below a certain threshold defined by the business stakeholders.

This was one of the first times that I had come onto an already developed codebase and made significant changes. Here are the lessons I learned, some tips for debugging application performance and why I now write my own code differently as a result of the project.

Before you start 📚

Coming into a new codebase can be daunting. It is important to start on the right foot.

Take time to understand the context

It is very tempting to dive straight in and fiddle around with the code. But before you begin, take a step back. Make sure you understand the wider context first. For example:

  • What is the business use case for the application?

  • What optimisations have already been applied?

  • What are your constraints (e.g. limits on which external libraries you can use)?

  • What are the business reasons for using particular technology?

  • What is the final production environment (will a particular optimisation technique be feasible in production)?

Understanding the wider context of the project will enable you to make smarter decisions when designing the software application. Better to know at the start than to go down a rabbit hole in vain.

Engage with stakeholders and set expectations

Don't be afraid to engage business stakeholders and end users from the start

Seek to understand the underlying business requirements, in terms of both functionality and performance, before you explore solutions.

Software engineering is all about trade-offs. Understanding the most important factors for your business users and how these might change in the future will enable you to make better decisions.

In these meetings you can also set expectations. Your job is to help the client understand the consequences of their stated requirements. For example: Will improving the loading time require a reduction in functionality? Are the stakeholders aware of this?

Define standards and ways of working with other developers

If the application is still under development, or other teams depend on your application, you will need to collaborate with other developers. Spend the time upfront to define ownership and shared coding standards.

Look if there are already coding standards in place for the project (e.g. git workflow, code linting and formatting, testing, documentation etc.) and make sure you adhere to them. If there aren’t any, create standards in collaboration with the other contributors.

Shared coding standards improve collaboration and future code maintainability. Frameworks such as pre-commit can be a great way to define and automate compliance with the project standards.

Tackling the optimisation task 👨‍🔬

Don’t be afraid to refactor before trying to optimise

Refactoring is a great way to get to grips with a new codebase and can even improve performance. I’m not talking about large-scale changes at first. Start with simple techniques, such as improving variable names or extracting smaller functions from larger ones, as you get more familiar with the code. Martin Fowler calls this ‘comprehension refactoring’ in his book Refactoring.

But a note of caution ⚠️. Make sure there are unit tests in place before making significant changes. Proceed with care and put a safety net in place. Adding tests will also help later when you add optimisations as the tests will help verify the expected behaviour of the application hasn’t changed.
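A quick way to build that safety net is a characterisation test: capture the current output of a function before you touch it. A minimal sketch with pytest-style assertions, where `build_summary_table` is a hypothetical stand-in for one of the application's existing functions:

```python
# Characterisation test sketch. `build_summary_table` is a hypothetical
# stand-in for an existing function in the application being refactored.

def build_summary_table(records):
    """Stand-in for an existing application function."""
    return {r["region"]: r["sales"] for r in records}

def test_build_summary_table_keeps_behaviour():
    records = [
        {"region": "EU", "sales": 100},
        {"region": "US", "sales": 250},
    ]
    # Lock in the current output so a refactor cannot silently change it
    assert build_summary_table(records) == {"EU": 100, "US": 250}
```

Running this test before and after each refactoring step gives you fast feedback that behaviour is unchanged.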

Most performance bottlenecks are found in a very small part of the overall code, and they are often the product of workarounds layered on top of poorly designed code. Refactoring the code and making use of software design patterns reduce the likelihood of poor-performing code appearing in the first place.

If poor-performing code remains after refactoring, it should be much easier to change and optimise.

You can’t optimise what you can’t measure

Before you can optimise the code you should set up appropriate logging and monitoring techniques to record the performance of the code/application.

This can include simple logging or ‘print’ statements with timestamps for local development, but also setting up more advanced monitoring tools to evaluate performance in the production environment.

For example, if you will be deploying your application as a containerised application in AWS Fargate you could use AWS CloudWatch to collect metrics over a longer period of time.
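For local development, a small timing decorator built on the standard library is often enough to start recording performance. A minimal sketch (the `fetch_data` function is a made-up stand-in for one of the application's expensive calls):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def timed(func):
    """Log how long each call to `func` takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3fs", func.__name__, elapsed)
        return result
    return wrapper

@timed
def fetch_data():
    """Hypothetical stand-in for an expensive API call."""
    time.sleep(0.1)  # simulated latency
    return {"rows": 42}
```

Decorating the suspect functions this way gives you timestamped measurements in your logs with almost no changes to the application code.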

Find the bottlenecks

As mentioned above, poor application performance is highly likely to be caused by only a small part of the code.

Use Python profilers to help debug and find the biggest culprit to focus your attention on. Is the performance issue due to your own code or a call made from another library you are using?
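The standard library's cProfile is a good first profiler. A small sketch that wraps a (deliberately naive, made-up) function and returns the hottest entries as text:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive loop, so the profiler has something to find."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_top(func, *args, limit=5):
    """Run `func` under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()

report = profile_top(slow_sum, 100_000)
```

Sorting by cumulative time quickly shows whether the hotspot is in your own code or inside a library call further down the stack.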

Another good place to start is to identify whether your program is I/O or CPU bound:

  • If your program makes multiple calls to external APIs and has to wait around for the data, it is likely to be I/O bound - multithreading or caching might be your answer.

  • If your program does some heavy data calculations you are likely to be CPU bound and implementing multiprocessing could help.
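For the I/O-bound case, a thread pool lets the waits overlap instead of stacking up. A minimal sketch, where `fetch` is a hypothetical stand-in for one of the application's blocking API calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(endpoint):
    """Hypothetical stand-in for a blocking API call."""
    time.sleep(0.2)  # simulated network latency
    return f"data from {endpoint}"

endpoints = ["sales", "traffic", "conversions"]

# Sequentially these calls take ~0.6s; with a thread pool the
# waits overlap and the total drops to roughly the slowest call.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, endpoints))
```

`pool.map` preserves the order of the inputs, so the results line up with the endpoints even though the calls run concurrently.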

“Software is not magic!”

I read this quote somewhere in a blog post a while ago (sorry I can’t remember the original source) and it has stuck with me.

Don’t just assume a command is magic. Delve into the source code of the libraries you are using. Investigating what is going on under the hood really helps your understanding of the complete system and can help track down inefficiencies.

Contribute and engage with open source

If you suspect your performance issues are due to an external dependency like a Python library, don’t be afraid to ask questions and contribute to open forums.

Take the opportunity to learn from others who may have faced or solved a similar problem.

I spent weeks suspicious of a bug before engaging the maintainers of the library. It turned out they had solved a similar issue in another part of the project. Within a couple of days they provided a great explanation and opened a pull request to alleviate the issue.

Improving your development experience 👨‍💻

Invest time in creating a productive and automated development environment

It is worth spending time to set up an efficient development workspace which enables rapid iteration. For an optimisation task, you are likely to want to try a few different approaches and quickly see the effects of the changes on the application’s performance.

Optimising and automating the process to update and deploy your application will pay huge dividends and speed up experimentation. For example, optimising Dockerfiles, and using Makefiles and CI/CD pipelines to quickly redeploy changes. Automation also helps with reproducibility.

Ensure your dev environment is as close as possible to production

Make sure you are testing your optimisations in an environment which is equivalent to the final production environment. Otherwise you cannot be sure your optimisations will have the desired effect, or deliver the reported performance improvements, in production.

Testing on your local computer in a virtual environment is unlikely to be the same as deploying your application in a container on a cloud platform.

Carefully track external dependencies

Like any science experiment, you should try and control all possible variables. The performance of external libraries can vary surprisingly between releases.

Using tools like Poetry can help pin down dependencies and avoid unexpected surprises.

Learn how to use Git efficiently

For data scientists, learning Git can be daunting, and many teams do not use it to its full potential.

When optimising code it is very important to keep your commits clean and clearly defined. You want to be able to make small changes and see their impact, but also keep a clear record of changes to roll back to previous versions if the change doesn’t work.

Learn how to split up commits, rebase old commits to clean the history and use git worktrees to directly compare performance of two different versions at the same time.

Using Git efficiently is beneficial both for tracking your own work and for collaborators inspecting the code in the future.

I highly recommend reading Ryan Hodson’s comprehensive guide to Git. The Kindle version is free on Amazon!

Be flexible and adapt

There might be constraints on development, particularly working in an enterprise environment. This means you might not be able to access your favourite code editor, operating system or certain open source libraries.

It is important to be flexible and adapt to the constraints of your environment. Being familiar with a couple of different ways of working makes it easier to adjust to a new environment.

How I write my code differently now 🚀

Reading someone else’s code really makes you reflect on your own coding practices. As such, I have changed many of my coding practices since working on this project, mainly centred on readability for others and long-term maintainability.

tl;dr: Be explicit. Think about your future self and collaborators.

Follow consistent style formatting

Consistent style formatting across a project greatly aids readability. I was vaguely aware of PEP8 formatting rules but did not religiously follow them.

This is OK for your own projects, where you are still likely to follow similar formatting across files out of habit. But with multiple collaborators and no agreed framework, many different styles will be in use across the project, making it harder to read.

Since working on the shared codebase I always ensure my code is PEP8 compliant, using black as my auto-formatter. This makes my code much more consistent and easier to read. Code that is easier to read is easier to understand, which makes it easier to maintain.

You can use configuration files to define the code formatting guidelines for the project and automate/enforce formatting using frameworks such as pre-commit.

Type hinting

Python is dynamically typed, which means you don’t have to specify the variable type (e.g. str, int, list, dict etc.) before runtime. This is convenient, but it is less explicit when trying to understand what data structures are being passed around between functions.

However, there is an optional feature in Python called type hinting that lets you document the type of each variable. I never realised just how useful this feature was until I had to decipher the data structures being passed through a large codebase.

# type hinting example
def greeting(name: str) -> str:
    return 'Hello ' + name

When writing a bit of code it will be obvious to you what data structures are being passed around the application. But 6 months down the line, or to other collaborators, it will not be so clear. You will save so much time, head scratching and endless print statements later on by being explicit about the data structures passing through your functions.
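Type hints pay off most on nested structures. A small sketch (the `SalesRecord` dataclass and `total_by_region` function are made-up examples) showing how the signature alone documents the shapes flowing through a function:

```python
from dataclasses import dataclass

@dataclass
class SalesRecord:
    """Hypothetical record type used for illustration."""
    region: str
    sales: int

def total_by_region(records: list[SalesRecord]) -> dict[str, int]:
    """Aggregate sales per region; the hints document both shapes."""
    totals: dict[str, int] = {}
    for record in records:
        totals[record.region] = totals.get(record.region, 0) + record.sales
    return totals
```

Compare that with an unannotated `def total_by_region(records):` – six months later you would be reaching for print statements to rediscover what `records` actually contains.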

Happy Coding!

Further Reading