It is common for objects in Google Cloud Storage (GCS) to be referenced by their URI, i.e. gs://bucket-name/file-path.
However, when loading objects using a client library (e.g. the Python SDK), you need to supply the bucket name and object name (file path) separately rather than directly using the URI.
Strangely, given how common it is for objects to be referenced by their URI, the GCS Python client library does not provide any convenience methods to automatically extract bucket and file name information from raw URIs.
Therefore, it is up to you to implement your own method to extract bucket and file names from raw URIs if you want to load objects using the Python SDK.
You can extract the bucket name and file path of an object in GCS from the URI using Python’s ‘split’ method or using regular expression lookups.
Below I will provide examples of each method and how to download files from Google Cloud Storage from a URI using Python.
But first, let’s go into a bit more background to explain why you might need to extract information from URIs when interacting with GCS.
Already know the background? Skip to the extraction methods
What is a Google Cloud Storage URI?
A URI is a ‘Uniform Resource Identifier’.
Therefore, the Google Cloud Storage URI is a unique reference to your object stored in GCS.
There is a surprising lack of documentation on Google’s website about Google Cloud Storage URIs. The only reference I can find is on the Cloud Storage Transfer Service documentation page.
Essentially, the Cloud Storage URI comprises a Google-specific prefix (gs://), the name of the bucket and the object name:
# structure of Google Cloud Storage URI
gs://bucket-name/my/object/name.file_extension
prefix = gs://
bucket = bucket-name
object_name = my/object/name.file_extension
Google Cloud Storage utilises a flat namespace which means there isn’t really the concept of ‘folders’.
However, to help with human readability, you can add ‘folders’ (delimited by /) to your object names to mimic file paths on a traditional file system.
This feature improves the Cloud console UI experience as you can navigate your objects as you would using a normal file explorer. It can also help with string manipulation tasks (e.g. filtering lists of objects).
The important thing to note is that the object name in GCS comprises the entire string after the bucket name, not just the final file name and extension.
For those more familiar with Amazon Web Services, gs://... is equivalent to s3://... for objects stored in S3.
Why is it necessary to extract bucket and file names from Google Cloud Storage URIs?
Unfortunately, you cannot download a file from GCS directly from the URI when using the Google Cloud Storage Python client.
Instead, you have to pass the bucket name to the client to create a bucket object, and then pass the object name to that bucket object to create the blob object.
If the input to your program or function is a GCS URI or list of URIs you will have to extract the bucket name and object name from the URI yourself.
Example: Download a file from Google Cloud Storage (GCS) using Python
An example of how to download a file in GCS from a bucket name and object name:
from pathlib import Path
from google.cloud import storage
bucket_name = "bucket-name"
object_name = "folder/file.txt"
client = storage.Client()
bucket = client.bucket(bucket_name)
blob = bucket.blob(object_name)
# download to memory
contents = blob.download_as_bytes()
# download to a local file (with same folder structure as blob storage)
# the local 'folder' must exist before the file can be written
Path(object_name).parent.mkdir(parents=True, exist_ok=True)
blob.download_to_filename(object_name)
Note: The download_as_string() method is commonly referenced in tutorials for downloading files from GCS. However, it is now deprecated in favour of the download_as_bytes() method.
So how can we extract the bucket name and object name from a raw URI in order to feed into the code above?
How to extract bucket name and file name from GCS URI?
Method 1: Python’s ‘split’ method
We can use standard Python string manipulation techniques, namely the inbuilt split and join methods, to extract information from the string.
uri = "gs://bucket-name/folder/file.txt"
# extract bucket name by splitting string by '/'
# take the 3rd item in the list (index position 2) which is the bucket name
bucket = uri.split("/")[2]
# extract file name by splitting string to remove gs:// prefix and bucket name
# rejoin to rebuild the file path
object_name = "/".join(uri.split("/")[3:])
>>> bucket
'bucket-name'
>>> object_name
'folder/file.txt'
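As a small variation on the code above (a sketch, not part of the original example), the optional maxsplit argument of split stops splitting once the bucket name has been separated, so the object name never needs to be rejoined. It assumes the input is a well-formed gs:// URI that contains an object name.

```python
uri = "gs://bucket-name/folder/file.txt"

# split at most 3 times: ['gs:', '', 'bucket-name', 'folder/file.txt']
# the object name (4th item) keeps its internal '/' delimiters
_, _, bucket, object_name = uri.split("/", 3)

print(bucket)       # bucket-name
print(object_name)  # folder/file.txt
```

If the URI has no object name (e.g. gs://bucket-name), the unpacking raises a ValueError, so this variation is only suitable when the input is known to be a full object URI.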
Method 2: Regular expression matching
A more elegant approach, in my opinion, is using regular expressions and matching groups to extract the information.
import re
uri = "gs://bucket-name/folder/file.txt"
# match a regular expression to extract the bucket and filename
matches = re.match(r"gs://(.*?)/(.*)", uri)
if matches:
    bucket, object_name = matches.groups()
>>> bucket
'bucket-name'
>>> object_name
'folder/file.txt'
Update
You can make the above code even more concise using the ‘Walrus’ operator syntax (:=). Check out my other article with an example.
This technique uses Python’s regular expressions module to match a regular expression against the input URI string. We then extract the matches (i.e. bucket and object_name) from the match object using the groups() method.
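For reference, a walrus-operator version of the same match could look like the sketch below. It assumes Python 3.8 or later and is an illustration rather than the exact code from the other article.

```python
import re

uri = "gs://bucket-name/folder/file.txt"

# assign the match object and test it in a single expression (Python 3.8+)
if matches := re.match(r"gs://(.*?)/(.*)", uri):
    bucket, object_name = matches.groups()
    print(bucket, object_name)  # bucket-name folder/file.txt
```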
Let’s look at the regular expression in more detail.
# regular expression
gs://(.*?)/(.*)
The regular expression contains two ‘matching groups’ signified by each set of brackets. The characters outside of the bracketed groups are used to validate the correct GCS URI pattern, but are ignored from the extraction results.
The first bracketed group matches the bucket name. The . matches any character and the * repeats it zero or more times. The ? makes the match non-greedy, which is necessary to capture only the characters up until the first / delimiter.
The second bracketed group matches the file path and file extension. The ? is not necessary this time as we are happy to include multiple / delimiters which correspond to ‘folders’ within the file path.
If the input string is not a valid GCS URI (i.e. doesn’t begin with gs://), re.match will return None. Therefore, I have added an ‘if’ statement to the Python code to check that the matches object is not None before trying to extract the groups.
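The pieces above can be wrapped into a small helper. This is a minimal sketch: the names parse_gcs_uri and blob_from_uri are hypothetical (they are not part of the google-cloud-storage library), and raising a ValueError for a non-matching string is just one possible design choice. The pattern is the same one used above; only the error handling is new.

```python
import re

# same pattern as above: group 1 = bucket name, group 2 = object name
GCS_URI_PATTERN = re.compile(r"gs://(.*?)/(.*)")


def parse_gcs_uri(uri):
    """Return (bucket_name, object_name) for a gs:// URI (hypothetical helper)."""
    matches = GCS_URI_PATTERN.match(uri)
    if matches is None:
        # fail loudly rather than carrying on with missing values
        raise ValueError(f"Not a valid GCS URI: {uri!r}")
    bucket_name, object_name = matches.groups()
    return bucket_name, object_name


def blob_from_uri(client, uri):
    """Build a blob object from a URI using the bucket()/blob() calls shown earlier.

    `client` is assumed to be a google.cloud.storage.Client instance.
    """
    bucket_name, object_name = parse_gcs_uri(uri)
    return client.bucket(bucket_name).blob(object_name)


print(parse_gcs_uri("gs://bucket-name/folder/file.txt"))
```

With a real client, blob_from_uri(storage.Client(), uri).download_as_bytes() would then download the object's contents in the same way as the earlier download example.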
The regular expression technique was inspired by an example in the Dataproc documentation.
Which method should you use?
Ultimately it doesn’t matter; both methods work fine. Use whichever method you feel most comfortable with.
If I were to give an opinion, I would recommend using regular expressions.
I believe regular expression matching comes with the following benefits:
- Improved readability: It is clear that the function is designed to extract (and only extract) two groups of information from a GCS URI.
- No ‘magic’ numbers: The use of magic numbers in the split method to record which index to keep is considered a software antipattern. The magic numbers can make it harder to interpret, particularly for developers less familiar with the structure of a GCS URI.
- Cleaner approach: The regular expression extracts both pieces of information in one go. With the split method, you need to re-join the string together after splitting it in order to rebuild the file path.
- Implicit data validation: only true GCS URIs will be matched. With the Python split method, many non-GCS strings (e.g. an https:// URL) can be passed to the function and there will still be an output from the function. This might not make sense and also could make the program harder to debug. With regex, the function will return None if the input is not a valid GCS URI.
- Additional flexibility: You could change the regular expression pattern to be more or less specific/complex as you need, while maintaining readability.
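To illustrate the validation point, here is a short comparison of both approaches on a string that is not a GCS URI. The https:// input is a made-up example; the extraction logic is the same as in Method 1 and Method 2 above.

```python
import re

not_a_gcs_uri = "https://example.com/folder/file.txt"

# split method: happily returns values, even though this is not a gs:// URI
bucket = not_a_gcs_uri.split("/")[2]
object_name = "/".join(not_a_gcs_uri.split("/")[3:])
print(bucket, object_name)  # example.com folder/file.txt

# regex method: no match, so the problem can be detected immediately
matches = re.match(r"gs://(.*?)/(.*)", not_a_gcs_uri)
print(matches)  # None
```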
Happy coding!