How We Approached Google Drive Backups in Our Workspace Organization
Most of you reading this blog are likely familiar with vendor lock-in - the situation where services are so tightly integrated that replacing one becomes nearly impossible. But have you ever considered the risk of being locked out of your data? Google pinky promises that it won't use your Workspace data for advertising or training purposes. However, due to legal obligations, Google is likely to scan your files for intellectual property (IP) infringement or CSAM. As with any automated system, false positives are bound to happen. If your company relies on cloud-stored data, being locked out for days (or even weeks!) due to a suspension is a risk you can't afford.
At SoftwareMill, we use Google Workspace as it provides a more intuitive and fluid collaboration experience than Microsoft 365. We are no exception when it comes to the risk of being locked out. A single misclassification or an automated flag could result in revoking our access. To mitigate this risk, we embarked on a journey to implement domain-wide Google Drive backups.
By the way, did you know that you can create new docs, sheets, and meetings by typing docs.new, sheet.new, or meet.new in your address bar? Quite an ingenious use of a TLD.
Current state
Before deciding to reinvent the wheel with a custom tool, we explored existing solutions:
- Google’s Data Export - Google Workspace customers can request a data export via the Data Export tool. It works very similarly to Google Takeout and exports data from a wide range of Google services, going far beyond just Google Drive. Once requested, your exported data is put into a Google-managed Cloud Storage bucket, from which you can copy it to your preferred destination within 60 days. The caveats are that you can request an export no more than once every 30 days and that the whole process is manual. Still, it is a good starting point for companies that can’t afford to implement their own backups.
- rclone - rclone is a well-established project that lets you interact with various remote storage systems, including Google Drive, as if they were local file systems. It handles Drive-specific features like shortcuts and exports from Google Docs, Sheets, and other tools pretty well. However, treating an object storage system like Google Drive as a traditional file system has limitations. One key drawback is that rclone cannot handle multiple files with the same name and location - it simply treats them as duplicates. At best, it can “deduplicate” them by deleting extras. This limitation arises because Google Drive functions as object storage without an actual folder structure; folders are simulated in the interface, and files are identified by unique IDs rather than by name and path. Another significant downside is that rclone’s configuration is set on a per-drive basis.
With no way to automate Google’s Data Export tool, and with the custom tooling we would have to write around rclone just to generate its configuration, we decided to develop our own backup solution.
What does it take to create a custom backup tool?
Listing drives
The very first step was to obtain a complete list of personal and shared drives. This can be done fairly simply, as Google provides both the Google Drive API and the Admin SDK Directory API. Shared drives have unique IDs that have to be obtained from the Drive API, while the IDs of personal drives match the users’ primary emails. Usually, an application that uses these APIs is meant to be used by end-users (think of an alternative UI for Drive, i.e. a fancy file manager), and you obtain permissions by redirecting the user to Google’s OAuth consent screen. In our scheduled backup case, the permissions had to be preassigned and domain-wide. Google thought of this: you can manage Domain Wide Delegation from the Admin Dashboard. This is well documented by Google support and in our application’s README.md. Permissions are called scopes, and the Client ID is the GCP service account’s Unique Name (a 21-character number, rather than the usually referenced email). The scopes required in our case were:
- https://www.googleapis.com/auth/admin.directory.user.readonly
- https://www.googleapis.com/auth/drive.readonly
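To make the setup concrete, here is a minimal sketch (not the exact code from our repository) of obtaining delegated credentials with those scopes and listing both personal and shared drives; the key file path and admin email are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = [
    "https://www.googleapis.com/auth/admin.directory.user.readonly",
    "https://www.googleapis.com/auth/drive.readonly",
]
SERVICE_ACCOUNT_FILE = "service-account.json"  # key of the delegated service account
ADMIN_EMAIL = "admin@example.com"              # Workspace user to impersonate

creds = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=SCOPES
).with_subject(ADMIN_EMAIL)

# Personal drives: every user's primary email doubles as their drive ID.
directory = build("admin", "directory_v1", credentials=creds)
users = directory.users().list(customer="my_customer", maxResults=500).execute()
personal_drive_ids = [u["primaryEmail"] for u in users.get("users", [])]

# Shared drives have their own IDs and are listed via the Drive API.
drive = build("drive", "v3", credentials=creds)
shared = drive.drives().list(pageSize=100, useDomainAdminAccess=True).execute()
shared_drive_ids = [d["id"] for d in shared.get("drives", [])]
```

Pagination is omitted here for brevity; both list calls return a nextPageToken for domains with more users or shared drives than fit in a single page.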
Fetching file list
Google Drive is object storage and files are identified by their IDs. To download a file, you first need to obtain its ID. Google’s Drive SDK provides a method for listing files, so it’s just a matter of iterating through paginated API responses. You can also specify which metadata fields you are interested in. Beyond id and name, we were interested in md5Checksum, mimeType, and exportLinks to decide whether the file was downloadable or had to be exported (discussed later), parents to build file paths, shortcutDetails to handle shortcuts, and permissions/permissionIds in case we ever had to restore the files.
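A sketch of what such a paginated listing can look like for a shared drive, assuming a Drive v3 client like the one built above (the fields string mirrors the metadata we just listed):

```python
def list_files(drive_service, drive_id):
    """Collect file metadata for a shared drive, page by page."""
    fields = ("nextPageToken, files(id, name, md5Checksum, mimeType, "
              "exportLinks, parents, shortcutDetails, permissionIds)")
    files, page_token = [], None
    while True:
        resp = drive_service.files().list(
            corpora="drive",
            driveId=drive_id,
            includeItemsFromAllDrives=True,
            supportsAllDrives=True,
            pageSize=1000,            # the maximum the API allows
            fields=fields,
            pageToken=page_token,
        ).execute()
        files.extend(resp.get("files", []))
        page_token = resp.get("nextPageToken")
        if not page_token:
            return files
```

For personal drives, you would instead impersonate the user via creds.with_subject(user_email) and drop the shared-drive-specific parameters.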
Permissions for files work differently depending on the drive type. For personal drives, the permissions field returns what you would expect - user email and access level. However, the permissions field is not available for files in shared drives; only the permissionIds field is. You then have to query another endpoint with the permission ID to obtain the actual permission with user email and access level. While you can query the list endpoint with a pageSize of 1000 to get a batch of files in a single response, making a thousand (or more) requests per batch to obtain permissions would be a nightmare. Our solution was to simply cache the responses and build a dictionary of known permissionIds.
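A minimal sketch of that cache, assuming the Drive v3 client from before; since permission IDs identify principals, a single lookup can be reused across files:

```python
permission_cache = {}  # permissionId -> {"emailAddress": ..., "role": ...}

def resolve_permission(drive_service, file_id, permission_id):
    """Fetch a permission once and reuse it for every file that references it."""
    if permission_id not in permission_cache:
        permission_cache[permission_id] = drive_service.permissions().get(
            fileId=file_id,
            permissionId=permission_id,
            supportsAllDrives=True,
            fields="id, emailAddress, role",
        ).execute()
    return permission_cache[permission_id]
```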
Remember when I said that folders inside Google Drive are simulated? Files are not aware of their own path; the parents field is the only path-related field. Since we wanted to preserve the folder structure in our backups, we needed to build the path for each file manually by traversing its parent tree. However, finding a file’s parent by its ID in the array returned by the Google API takes O(n) time. With some drives containing tens of thousands of files, path-building would be extremely time-consuming. Fortunately, there’s a simple solution: file IDs are unique, so we transformed the file array into a dictionary keyed by file ID. This lets us retrieve file information by ID in O(1) time, significantly speeding up the process.
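A sketch of that lookup trick; files is the list returned by the listing step, and files whose parent is the drive root simply terminate the walk:

```python
def build_paths(files):
    by_id = {f["id"]: f for f in files}   # O(1) parent lookup instead of scanning the list
    paths = {}
    for f in files:
        parts, current = [f["name"]], f
        while current.get("parents"):
            parent = by_id.get(current["parents"][0])
            if parent is None:            # parent is the drive root, not part of the listing
                break
            parts.append(parent["name"])
            current = parent
        paths[f["id"]] = "/".join(reversed(parts))
    return paths
```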
Downloading and exporting
Downloading binary files (the ones with an md5Checksum) is fairly straightforward - you can simply call the get_media method with the file ID and write the response to a file. However, to keep the memory footprint low, you might want to download the file in chunks.
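A sketch of such a chunked download using the SDK’s MediaIoBaseDownload helper; the chunk size is an arbitrary choice:

```python
from googleapiclient.http import MediaIoBaseDownload

def download_binary(drive_service, file_id, local_path, chunk_size=16 * 1024 * 1024):
    request = drive_service.files().get_media(fileId=file_id, supportsAllDrives=True)
    with open(local_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request, chunksize=chunk_size)
        done = False
        while not done:
            _status, done = downloader.next_chunk()  # pulls one chunk per call
```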
Non-binary files are a bit more complex. For shortcuts (identified by the application/vnd.google-apps.shortcut MIME type), we opted to create a .txt file with a path to the original source rather than download the original file multiple times (once per shortcut). Other Google-specific applications like Docs, Sheets, and Presentations can be exported to formats according to this table with the export_media method. That is, in theory. The export_media method has a 10MB limit, and a document with a handful of images will easily exceed it. We had to resort to the exportLinks field of the file. Typically it’s used in the web UI of Google Drive, and it’s just a link. This method is not part of Google’s SDK, so we had to write our own authentication process (thank god, the SDK exposes the bearer token) and our own chunker. The limits of exportLinks are undocumented, but so far we haven’t run into any issues with it. The only caveat is that it may be deprecated at any time, as it’s not part of the Drive v3 API.
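For illustration, a hedged sketch of streaming an export through such a link, reusing the bearer token from the delegated credentials; requests is an extra dependency and the chunk size is arbitrary:

```python
import requests
from google.auth.transport.requests import Request

def export_via_link(creds, export_url, local_path, chunk_size=16 * 1024 * 1024):
    if not creds.valid:
        creds.refresh(Request())                       # make sure the token is fresh
    headers = {"Authorization": f"Bearer {creds.token}"}
    with requests.get(export_url, headers=headers, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
```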
Oh, the lovely duplicates
Since files in Google Drive are identified by unique IDs, it was only a matter of time before we encountered files with the same name and path. This posed a significant challenge because our implementation used multiple threads to download files (and yours should, too!). In this setup, one thread could start writing to a local file just after another thread wrote a chunk of a different file with the same name, leading to overwrites and corrupted data. We solved this issue by creating a data structure with O(1) lookup time - a set that kept track of all locked (in-use) paths. Before downloading a file, we checked if its path was locked. If not, we locked the path, ensuring an uninterrupted download. If the path was already in use, we appended the file ID and a random number to the filename to avoid conflicts.
An important note: you must ensure that your data structures are thread-safe. Otherwise, the set could become corrupted, just like the files.
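Conceptually, the locking looks something like the sketch below: a plain set guarded by a lock so that the check and the insertion happen atomically across download threads (the suffix format is illustrative):

```python
import random
import threading

_locked_paths = set()
_paths_guard = threading.Lock()

def claim_path(path, file_id):
    """Return a local path no other thread is currently writing to."""
    with _paths_guard:
        if path not in _locked_paths:
            _locked_paths.add(path)
            return path
        # Same name and folder already in use: disambiguate with the ID and a random suffix.
        deduped = f"{path}.{file_id}.{random.randint(0, 9999)}"
        _locked_paths.add(deduped)
        return deduped
```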
Compressing
Once all the files were downloaded, the compression process was straightforward. We compared two compression algorithms: lz4 and pzstd.
For approximately 700GB of data:
- lz4 (which is always single-threaded) took about 114 minutes and produced a 599GB archive (compression ratio of 1.16x).
- pzstd (using 2 cores) completed the task in 88 minutes - 22% faster than lz4 - and produced a 587GB archive (compression ratio of 1.19x).
Since 700GB was just one of our drives, and a few hours of compute are cheaper than storing large amounts of data for several months, we chose pzstd.
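If you want to reproduce the comparison, the archive can be produced with a simple tar pipe. A sketch of how pzstd could be driven from Python; the paths and thread count are illustrative, and we assume pzstd’s -p flag for worker threads and “-” for reading from stdin:

```python
import subprocess

def compress(source_dir, archive_path, threads=2):
    # Stream a tar of the downloaded directory straight into pzstd.
    tar = subprocess.Popen(["tar", "-cf", "-", source_dir], stdout=subprocess.PIPE)
    with open(archive_path, "wb") as out:
        subprocess.run(["pzstd", f"-p{threads}", "-"],
                       stdin=tar.stdout, stdout=out, check=True)
    tar.stdout.close()
    tar.wait()
```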
Deployment and first gotcha
Since we wanted to store data in S3, deploying the application within AWS made sense to avoid data transfer costs. We chose to run the backup application as an ECS Task, scheduled periodically with CloudWatch. Given that the entire backup process took around 6 hours with multi-drive processing, Fargate was the ideal choice over a constantly running EC2 instance.
Our initial implementation required downloading all data first, which meant the container needed a scratch disk of at least a few terabytes. Fortunately, since January 2024, AWS has allowed the creation of an EBS volume during deployment (when calling the RunTask API method). This is configured via the volumeConfigurations parameter, alongside other settings like cluster and taskDefinition.
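For reference, a hedged sketch of what that call looks like with boto3 (cluster name, role ARN, subnet, and sizes are placeholders; the volume name must match a volume declared in the task definition):

```python
import boto3

ecs = boto3.client("ecs")
ecs.run_task(
    cluster="backup-cluster",
    taskDefinition="gdrive-backup",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
    volumeConfigurations=[
        {
            "name": "scratch",
            "managedEBSVolume": {
                "sizeInGiB": 2048,
                "volumeType": "gp3",
                "filesystemType": "ext4",
                "roleArn": "arn:aws:iam::123456789012:role/ecsInfrastructureRole",
            },
        }
    ],
)
```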
However, we faced an unexpected challenge: over a year later, the official Terraform aws_scheduler_schedule resource still didn’t support any volume configuration within the ecs_parameters field, even though it’s supposed to mirror the RunTask API.
This put us in a difficult situation: we either had to give up on IaC or work around the 200GB maximum ephemeral_storage configured at the task definition level. As an IaC-first company, giving up on Terraform was not an option. We scrambled and modified our code so that files would be uploaded to S3 and deleted from local storage as soon as they were downloaded. This reduced the disk space required at any given moment to just the files currently being handled.
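The gist of that change is small; a sketch, with the bucket and key layout as placeholders:

```python
import os
import boto3

s3 = boto3.client("s3")

def upload_and_discard(local_path, bucket, key):
    s3.upload_file(local_path, bucket, key)  # boto3 handles multipart uploads for large files
    os.remove(local_path)                    # free the scratch space immediately
```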
The code
We are open-sourcing our backup application code. It’s written in Python, supports black- and whitelisting of drives, parallel processing of drives, and multithreaded downloads. You can also configure whether to include Shared-with-me files. Currently, it exports Google-specific application files to the typical Microsoft Office formats, but with a little help from the community, it might soon support other formats. You can also choose whether to download all files and compress them or, just like us, upload files as soon as they are downloaded. Any errors are logged and uploaded to S3 along with the files.
You can find the repository here: softwaremill/org-gdrive-backup.
Conclusion
While Google Workspace offers a powerful and convenient collaboration environment, the risk of being locked out of essential data is a serious concern. Existing backup solutions, such as Google’s Data Export and rclone, have limitations that make them impractical for fully automated, frequent backups. By developing a custom backup tool, we gained greater control, efficiency, and resilience against potential disruptions. Our open-source solution provides a scalable and flexible approach to safeguarding Google Drive data, allowing organizations to mitigate risks and maintain uninterrupted access to critical information.