tl;dr

Use f-strings as a first choice, but consider str.format() and Jinja templating for interpolation of SQL queries

See a flowchart at the end of this post which summarises the decision process

Python String Formatting

String formatting (also known as string interpolation) is the process of inserting a custom string or variable into a predefined ‘template’ string.

In Python, there are four methods for formatting strings:

  • % operator
  • str.format()
  • f-strings
  • Templates

The fact there are four different ways to format strings is a little confusing given that Python’s own manifesto - the Zen of Python - states:

Zen of Python - “There should be one– and preferably only one –obvious way to do it."

If there should only be one obvious way to format strings, why are there four native methods for formatting strings inbuilt into Python?

Each of these methods has their own trade-offs and benefits when it comes to simplicity, flexibility and/or extensibility. But what are the differences, which one should you use for which purpose and why?

In the context of data science there are two common use cases for formatting strings:

  1. Logging (print statements)
  2. SQL queries

In this post, we will go through each of these use cases and describe which string formatting method might be most appropriate for your situation.

1. Logging (print statements)

In Data Science, string interpolation is particularly useful for logging; creating dynamic chart titles and printing statistics. For example, printing out the number of rows in your dataframe or printing important model training metrics to the console. This can be very useful for keeping track of key variables as they change or keeping track of how model training is progressing.

The three most common methods for string intepolation in logging are % operator, str.format() or f-strings. Additionally, there is the Template method, however, this is seldom used.

Let’s briefly discuss each of these methods.

% operator - ‘Old method’

We will start with the % operator method which was the original way to format strings in Python versions <2.7. This method is sometimes referred to as printf-style string formatting.

String placeholders are denoted by a % symbol, followed by a character or characters which specify the desired formatting.

Example:

rows = 10
columns = 4

# print string representation
print("My data has %s rows and %s columns" % (rows, columns))
My data has 10 rows and 4 columns

It is also possible to use named placeholders and supply a dictionary which can make the statement more readable.

data = {"rows": rows, "columns": columns}

# print with named placeholders
print("My data has %(rows)s rows and %(columns)s columns" % data)
My data has 10 rows and 4 columns

The % operator method is generally seen as a legacy method for string interpolation and should not be used in favour of the format or f-string methods described next.

Common grievances with the % operator method include:

Unless you are working with legacy code bases using a Python version less than 2.7, you should try and avoid using this method.

So, let’s quickly move on…

str.format() - ‘Newer method’

Since Python 3 (and now backported to Python 2.7), you can format strings by calling the .format() method on the string object.

The functionality is very similar to the previous % operator formatting, however, the string placeholders are denoted by curly brackets, {}, which can be more readable.

A full list of formatting functionality is available at pyformat.info which provides a great ‘cheat sheet’ for all the various ways to format a string (e.g. rounding, date time formatting, etc.) - I would highly recommend checking it out.

Example:

# print string representation
print("My data has {} rows and {} columns".format(rows, columns))
My data has 10 rows and 4 columns
# print with named placeholders
print(
    "My data has {rows} rows and {columns} columns".format(rows=rows, columns=columns)
)

# also equivalent to
print("My data has {rows} rows and {columns} columns".format(**data))
My data has 10 rows and 4 columns
My data has 10 rows and 4 columns

str.format() is an improvement on % as it is more readable, however, the syntax can be a bit verbose, particularly if you have a lot of variables to substitute. For example, if you are using named placeholders in the string, it is implicit which variables you wish to substitute. So why should you need to repeat yourself explicitly by calling the .format method after the string as well?

We can reduce some repetition and avoid typing out all the variable names twice by unpacking a dictionary within the .format() statement (e.g. .format(**data)), however, this is still an unnecessary overhead.

This is where our next method, f-strings, offers a further improvement.

f-strings - ‘Newest method’

Since Python 3.6, there is a third method called string literals or ‘f-strings’ which lets you use embedded Python expressions inside string constants.

This can be really useful as it removes some of the verbose syntax overhead of the previous methods, which reduces the amount of code you need to write.

With this method, you only need to precede the string with the letter f or F and specify the variables you wish to substitute in {}, exactly the same as before.

Example

print(f"My data has {rows} rows and {columns} columns")
My data has 10 rows and 4 columns

Note here how the syntax looks very similar to the str.format() method described above, however, we did not need to call the .format(**data) or .format(rows=rows, columns=columns) at the end. We simply added a f to the beginning of the statement.

Template

Finally, the Python programming language also comes with a standard library called string which has a useful method called Template.

Perhaps, the least common string formatting method, the Template syntax denotes a placeholder using $ followed by the variable name to replace:

from string import Template

temp = Template("There are $rows rows and $columns in the data")
temp.substitute(rows=rows, columns=columns)
'There are 10 rows and 4 in the data'

The Template method is not commonly used as it is much easier to take advantage of the other, more simple, methods previously described. However, it can be particularly useful for validating user inputs and protecting your application from malicious actors if you require user input (e.g. public facing interactive application, public dashboard, etc.).

Which method should you use?

That was a very brief intro to the main methods of string formatting in Python. I recommend checking out RealPython and pyformat for more detailed information on each method and the various different ways to customise the formatting.

For logging statements, generally, we can exclude the % and Template methods unless using legacy code bases (e.g. Python <2.7) or dealing with user inputs with the potential for malicious activity.

Therefore, the choice is normally between str.format() or f-strings.

For logging, use f-strings for most use cases.

The syntax is very easy to remember and is less verbose than the str.format() method which makes it easier to read. You can also include calculations and expressions within the string which can be useful for making on the fly calculations. For example:

input_list = [1.3, 4.98, 32, 5.32, 3.98, 6.1, 2.4, 10.4]

# calculate average of input list and round to 2 decimal places
print(f"The average value of the input list is {sum(input_list)/len(input_list):.2f}")
The average value of the input list is 8.31

However, there are a couple of cases where str.format() can be more practical. The main example being when you are using a dictionary as the input source for your substitution variables.

For example, if you want to pass a dictionary containing the configuration or metadata for a particular model into a string which logs the training to the console.

Using an f-string, you have to specify the name of the dictionary each time you want to access a key. This involves a lot of repeated typing. It also reduces the flexibility of your statement if you wanted to pass a dictionary with a different name into the statement. You can also get in a mess with single and double quotes when referencing dictionary keys inside the wider print statement which also uses quotes.

metadata = {"model": "xgboost", "model_dir": "models/", "data_dir": "data/"}

# interpolation using f-strings
print(
    f"Training {metadata['model']} model on data in the "
    f"'{metadata['data_dir']}' directory)..."
)
Training xgboost model on data in the 'data/' directory)...

A better and more flexible approach in this scenario would be to use the str.format() method and unpack the input dictionary containing the metadata.

print(
    "Training {model} model on data in the '{data_dir}' directory...".format(**metadata)
)
Training xgboost model on data in the 'data/' directory...

2. SQL Queries

The second major use case is string interpolation for SQL queries. This is probably the least trivial use case as there can be added complexity, especially if you want to generate long queries dynamically.

There are two general cases where you will be working with SQL queries in Python:

  1. ‘In-line’ in a Notebook or Python script
  2. Importing from a .sql file

Both scenarios can be treated in a similar way, because when you import from a .sql file you are essentially just reading a string.

It is common to deal with dynamic sql querys by developing a ‘base’ sql query with placeholders and substituting the placeholders with the required values for your particular analysis.

For example, if we wanted to get the daily value of orders for a particular city we might have a base query defined as follows:

base_sql_query = """
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = '{city}'
GROUP BY date
"""

We could then apply string formatting using the str.format() method and build the query for a particular city dynamically.

print(base_sql_query.format(city="London"))
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = 'London'
GROUP BY date

We could make this more generalisable by creating a function to build any query from a base query and input dictionary containing the variables we want to substitute into that query.

def build_query(variables: dict, base_sql_query: str = base_sql_query) -> str:
    return base_sql_query.format(**variables)


# repeat original query using the function
print(build_query(variables={"city": "London"}, base_sql_query=base_sql_query))
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = 'London'
GROUP BY date
# same function, different query and variables
city_and_date_base_sql_query = """
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = '{city}' AND date > '{start_date}'
GROUP BY date
"""

variables = {"city": "Cambridge", "start_date": "2021-04-01"}
print(build_query(variables, city_and_date_base_sql_query))
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = 'Cambridge' AND date > '2021-04-01'
GROUP BY date

Here we have been able to extend the initial base query by adding an additional filter (start_date) to the input dictionary.

Note that we could have used the f-string method, however, this is a use case where we are likely to be dealing with dictionary inputs so the str.format() method is preferable as it allows us to take advantage of unpacking many variables from a dictionary input at once. If using f-strings, we would have to specify the name of the input dictionary ahead of time in the query string, this is much less flexible as it couples the dictionary name to a specific query string and would require changing the query string code if the name of the dictionary was changed.

Unpacking variables in dictionaries using the str.format() method works fine for small queries where the structure of the query is ‘static’ - i.e. you always want to make the same substitutions and apply the same query logic each time.

However, what happens if we want to make a longer and more complex query? For example, we want a query which, depending on the situation, requires multiple filters or no filters at all or if we want to dynamically unpivot certain rows depending on their value.

With the current approach we have to specify a fairly rigid base query ahead of time which is inflexible to any change in the query logic.

Luckily, there is a fifth approach to string interpolation for SQL queries - Jinja templates.

Jinja

Jinja is a fast, expressive and extensible templating engine which allows us to incorporate simple logic such as ‘for loops’ and ‘if statements’ into our string expressions.

Jinja’s main application is for rendering HTML templates for web applications, however, it comes in handy for building SQL queries as well.

I won’t go into the syntax details of Jinja too much in this post (there is a lot of great documentation on Jinja’s website ), rather, just demonstrate how it is a very powerful templating engine which allows you to program simple loops and if statements into your strings.

Going back to the previous example, we can create the following Jinja template which will generalise to our needs.

jinja_base_sql_query = """
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE
    {%- for city in filter_cities %}
    city = '{{city}}'
    {% if not loop.last -%}
    OR
    {%- endif -%}
    {%- endfor %}
GROUP BY date
"""

This Jinja templated query includes a ‘for loop’ in the WHERE clause which will dynamically create a filter for each city provided in a list called filter_cities. The Jinja logical statements are denoted within {% %} braces.

In order to ‘render’ this template and create the new query, we import the Jinja library and specify the list of cities to filter by.

from jinja2 import Template

# filter one city
filter_cities = ["London"]
print(Template(jinja_base_sql_query).render(filter_cities=filter_cities))

print("*" * 50)

# filter three cities
filter_cities = ["London", "Cardiff", "Edinburgh"]
print(Template(jinja_base_sql_query).render(filter_cities=filter_cities))
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE
    city = 'London'
    
GROUP BY date

**************************************************

SELECT
    date,
    SUM(order_value)
FROM orders
WHERE
    city = 'London'
    OR
    city = 'Cardiff'
    OR
    city = 'Edinburgh'
    
GROUP BY date

We have improved from the previous examples as we now have the ability to filter by an arbitrary list of cities. This means we could filter by a very long list of cities very easily - imagine if we had to write a query with 20+ cities manually!

We can take this further by applying logic to the columns we want to select as well as the cities we want to filter by.

jinja_base_sql_query2 = """
SELECT
    date
    {%- for product in target_products %}
    , SUM(CASE WHEN product_name = '{{product}}' THEN order_value END) AS sum_{{product}}_value
    {%- endfor %}
FROM orders
{% if cities_filter -%}
WHERE
    {%- for city in cities_filter %}
    city = '{{city}}'
    {% if not loop.last -%}
    OR
    {%- endif -%}
    {%- endfor %}
{% endif -%}
GROUP BY date
"""  # noqa: E501
query_data = {
    "target_products": ["book", "pen", "paper"],
    "cities_filter": ["London", "Cardiff", "Edinburgh"],
}

print(Template(jinja_base_sql_query2).render(query_data))
SELECT
    date
    , SUM(CASE WHEN product_name = 'book' THEN order_value END) AS sum_book_value
    , SUM(CASE WHEN product_name = 'pen' THEN order_value END) AS sum_pen_value
    , SUM(CASE WHEN product_name = 'paper' THEN order_value END) AS sum_paper_value
FROM orders
WHERE
    city = 'London'
    OR
    city = 'Cardiff'
    OR
    city = 'Edinburgh'
    
GROUP BY date

Here we have pivoted the product_name column to get the daily value of three products we are most interested in and also applied some city filters. If we want to change the number of cities to filter or remove them completely, we don’t need to make any changes to the base query, only to the input dictionary (query_data).

# removed cities_filter key
query_data = {"target_products": ["book", "pen", "paper"]}

print(Template(jinja_base_sql_query2).render(query_data))
SELECT
    date
    , SUM(CASE WHEN product_name = 'book' THEN order_value END) AS sum_book_value
    , SUM(CASE WHEN product_name = 'pen' THEN order_value END) AS sum_pen_value
    , SUM(CASE WHEN product_name = 'paper' THEN order_value END) AS sum_paper_value
FROM orders
GROUP BY date

As you can see, by removing the cities_filter key from the query_data input dictionary we have completely removed the WHERE clause without having to make any changes to the original base query string.

These examples are slightly contrived, but I hope they demonstrate the power of Jinja templating for your SQL queries to make them more expressive and generalisable.

The great thing about Jinja templates is that they are portable. You could save the Jinja templates as a .sql file and they can be reused across multiple projects. An alternative would be to create your own custom Python function to build up the complex query string dynamically, however, you would have to transport that function around with the SQL file. With Jinja, you just need to import jinja2 and away you go.

This approach using Jinja, is used by common SQL pipeline tools such as dbt .

Key Takeaways

There are many different ways to format strings in Python. In data science you are most likely going to be requiring string interpolation for creating descriptive print statements (logging) or building dynamic SQL queries.

Which string formatting method you use will depend on your use case, however, my rule of thumb is:

  • Use f-strings as the first choice for print statements
  • Resort to str.format() if dealing with dictionaries as inputs (useful for building simple SQL queries)
  • Use Jinja templating for reproducible and generalisable SQL queries

I have summarised these rules in a flowchart:

python-string-formatting-flowchart

All code snippets in this blog post are available in the e4ds-snippets Github repository.

Happy coding!

Resources

Further Reading