Cutting Costs by Over 90% by Replacing Google APIs With an LLM-Powered Microservice
We replaced interactions with two separate Google services with a single, small LLM-powered microservice using Gemini AI (also from Google). The result? Costs cut by over 90% and a more flexible solution, with just a couple of days' work.
Introduction
During a routine GCP billing review for our client, I paid special attention to two out-of-the-box APIs provided by Google that we use for language translation (Cloud Translation API) and sentiment analysis (Cloud Natural Language API). Because of recent changes to some of the business logic, as well as upcoming features we planned to implement over the summer, the cost of those services was expected to increase significantly, and I needed to estimate by how much to be better prepared for the product discussions with the so-called ‘business’.
We originally used two Google APIs in a simple workflow: translate user text to English and then run sentiment checks. The translation step existed only because the sentiment API didn’t support the text’s original language. If the text looked suspicious (offensive, profane, or unnaturally polished), we flagged it for a human review team.
Looking ahead, we planned to use the Translation API in many more ways, and the number of interactions with those services was about to grow dramatically. We wanted to enhance the UX with translations into 12 languages and keep the sentiment analysis. That meant calling the API thousands of times a day just to get the translations (we cache results and only refresh them when the source text changes).
While analysing current and future call volumes, we realized a single LLM-based service could replace both APIs. I needed to do a quick Proof of Concept (PoC) and check how much it would cost to get the same functionality from simple LLM calls through an API.
Business requirements for our new PoC microservice were simple:
- support 12 languages,
- support sentiment analysis (ideally as a single call instead of two),
- compare costs versus the existing setup.
Background & Previous Setup
The architecture before the migration was trivial. In one of our microservices, we received a message containing the text to be checked, called the Google Cloud Translation API to get the text in English, and then called the Cloud Natural Language API twice to get results for the two different sentiment analyses we were interested in.
With the new features, we needed to translate text into 11 languages, detect the source language when unknown, and continue running sentiment analysis.
That meant multiplying the number of requests to the APIs we already used, but that was not all. The new features we were introducing over the summer caused many more (hundreds of times more) pieces of text to be pre-translated. Overall, API usage was set to grow several hundredfold in the near future.
Before estimating how much we would pay with the current setup, or how much an LLM-based one would cost, we could run a simple test. We knew roughly how often we called the Google services, because we knew how many pieces of text we were sending through our Pub/Sub messaging system. If we introduced the new service quickly and it handled more or less the same number of messages, we could easily compare it to the old way of doing things. And this is precisely what we did: within one day, we created a very simple, one-file, Python-based microservice, written entirely with vibe-coding techniques and Cursor, released it on our Kubernetes cluster, and switched the API calls from Google to our microservice. The results (without exact numbers) can be seen in the Cost Comparison section later in the article.
Why We Considered an LLM-based Solution
Translating text or finding out whether a text contains swear words in a given language seems easy for an LLM. Even the first versions of ChatGPT impressed me with language translation tasks. These two tasks are perfect for LLMs. The only unknown factor was the cost. With new models becoming available almost every month, we have plenty of choices. It turned out that one of the cheapest models available on Gemini handled the job perfectly for us.
Estimating cost is often challenging, even with the simplest cloud services (regardless of the provider), and estimating LLM costs is no different. Pricing depends mainly on token volume and model complexity. This can, of course, be estimated roughly, but I believe that if you have a chance to build a small project and run it in production for a week or two to check the actual costs, that's the best option (though not always possible or viable).
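To show the shape of that rough estimate, here is the kind of back-of-envelope math we did before the PoC, as a minimal Scala sketch. Every number in it is a made-up placeholder, neither the client's real traffic nor actual Gemini pricing; only the structure of the calculation matters.
// Back-of-envelope LLM cost estimate. All numbers are placeholders,
// not real traffic figures and not current Gemini pricing.
object CostEstimate {
  val requestsPerDay        = 5000L  // hypothetical daily translation/sentiment calls
  val avgInputTokens        = 400L   // short texts: 1-2 paragraphs
  val avgOutputTokens       = 300L
  val inputPricePerMTokens  = 0.10   // USD per 1M input tokens (placeholder)
  val outputPricePerMTokens = 0.40   // USD per 1M output tokens (placeholder)

  val monthlyCostUsd: Double = {
    val days       = 30
    val inputCost  = requestsPerDay * days * avgInputTokens * inputPricePerMTokens / 1e6
    val outputCost = requestsPerDay * days * avgOutputTokens * outputPricePerMTokens / 1e6
    inputCost + outputCost
  }

  def main(args: Array[String]): Unit =
    println(f"Estimated monthly LLM cost: $$${monthlyCostUsd}%.2f")
}
Plugging in your own traffic and the current price card of whichever model you pick gives a first-order answer in minutes; running a small PoC in production then confirms (or corrects) it.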
So, we knew cost would be the main factor before we chose the final solution, but there were other important reasons to go with an LLM.
With an LLM, you gain a massive advantage: the ability to ask for and receive whatever response you want. This way, you can make your API as flexible as you wish, ask for multiple things at once, or, if you want to keep things as simple as possible, combine them into a single call to reduce the number of tokens used. There is also no need to pre-translate text for sentiment analysis; you can just ask the LLM to detect the language, translate a piece of text, detect swear words, etc., simply by changing your prompt.
At the end of the day, the LLM-based solution turned out to be much cheaper. The pieces of text we need to translate are usually not big (1-2 paragraphs, sometimes even a single word), so the token usage is not big either. We also tested a couple of the available models and found that one of the cheapest ones performs exceptionally well for our tasks.
Another factor is that with properly structured prompts, we can perform some of the tasks in a single request instead of the multiple requests we had to make with the separate Google APIs, which decreased overall latency in the production system.
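To illustrate that flexibility, here is a hedged sketch of the kind of combined prompt we mean: one call that detects the language, translates, and scores the text at once. The exact wording of our production prompts differs; this is only to show the idea.
// Sketch of a combined prompt: language detection, translation and a
// sentiment/profanity check in a single LLM call. Illustrative only;
// real prompts are tuned over many iterations.
def combinedPrompt(text: String, targetLang: String): String =
  s"""Analyze the following text and respond with ONLY valid JSON in this exact format:
     |{
     |  "detected_language": "<ISO 639-1 code>",
     |  "translation_$targetLang": "<text translated to $targetLang>",
     |  "sentiment": "<one of: positive, neutral, negative>",
     |  "contains_profanity": <true or false>
     |}
     |Text: "$text"
     |Do not include any text before or after the JSON.""".stripMargin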
It's important to note that we picked Gemini AI, but any popular LLM available now would probably do. I even investigated whether we could implement our functionality with Ollama running in-house on one of our servers, but the cost of running a powerful enough server 24/7 for even the smallest models exceeded the cost of calling the Gemini API, so this idea was scrapped (at least for now).
TL;DR
Why an LLM?
- Cost optimization: The short pieces of text we usually translate correspond to very low token usage.
- Flexibility: One prompt can combine detection, translation, and sentiment checks in a single call.
- Faster iteration: Output schemas (JSON/enums/scores) are defined by us, not by a third-party API.
- LLM provider choice: We used Gemini here but other mainstream LLMs would also work.
Designing the New Microservice
As mentioned, I created this microservice twice. The first version was a quick-and-dirty PoC, written entirely by the AI in Python. The interactions with the Gemini API are pretty simple; all you need is an API key. The second time, we rewrote the microservice in Scala so it adhered to the style of the other microservices in our codebase and ran on our cluster. Writing it in Python first seemed like a good idea to get a vibe-coded solution fast, but at the end of the day, the service is so simple that we could have skipped that step entirely, written it straight away in Scala, and saved some man-hours along the way. Well, you always learn.
We have one endpoint to detect language, two for translations (from one language to another, but also from one language to all the others at once), an endpoint to check for excessive superlatives, and another one specializing in swear words. /healthcheck and /api/languages are utility endpoints: the first for the Kubernetes healthcheck mechanism, and the second simply lists the languages we actually support.
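For readers curious how such endpoints look in code, here is a minimal tapir sketch of two of them. The case class and value names are illustrative, not our exact production definitions.
import io.circe.generic.auto._
import sttp.tapir._
import sttp.tapir.generic.auto._
import sttp.tapir.json.circe._

// Request/response shapes mirroring the JSON discussed below;
// names are illustrative, not the production ones.
final case class TranslateRequest(text: String, source_lang: Option[String], target_lang: String)
final case class TranslateResponse(detected_lang: Option[String], translated_text: String)

// POST /api/translate
val translateEndpoint: PublicEndpoint[TranslateRequest, String, TranslateResponse, Any] =
  endpoint.post
    .in("api" / "translate")
    .in(jsonBody[TranslateRequest])
    .errorOut(stringBody)
    .out(jsonBody[TranslateResponse])

// GET /healthcheck, used by the Kubernetes probes
val healthcheckEndpoint: PublicEndpoint[Unit, Unit, String, Any] =
  endpoint.get.in("healthcheck").out(stringBody)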
An interesting aspect of the new solution is that we could design the output of those APIs exactly as we wanted. We didn’t need to go through documentation to figure out why Google gives one score for a given text and a different one for another example. With an LLM, we can simply tell it what it should return, e.g., a score between 0 and 1, or maybe an enumeration like “good”/”bad”; the sky is the limit.
For example, I’ll show you how easy it is to handle a request for text translation. The /api/translate endpoint takes the following JSON request:
{
  "text": String,
  "source_lang": Option[String],
  "target_lang": String
}
Here, source_lang is optional. Once a request like that is received in our microservice, we just modify the prompt we are going to use against Gemini AI:
val prompt: String = req.source_lang match {
  case Some(src) =>
    s"""Translate the following text from ${LanguageUtils.nameFor(src)} (${src.code}) to ${LanguageUtils.nameFor(req.target_lang)} (${req.target_lang.code}).
Text: "${req.text}"
Provide only the translated text without any additional explanations or formatting."""
  case None =>
    s"""Detect the language of the following text and translate it to ${LanguageUtils.nameFor(req.target_lang)} (${req.target_lang.code}).
Text: "${req.text}"
First, detect the language and provide the language code, then provide the translation. Format your response as:
Detected language: [language_code]
Translation: [translated_text]"""
}
We explicitly ask it to either translate the text and return just the translated text when source_lang is defined, or detect the language automatically and return both the detected language and the translated text.
This simple example is perfect for showing off what we can do with LLMs and how flexible they are compared to using multiple different 3rd-party services exposed through REST APIs.
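Once the prompt is ready, it is sent to Gemini over HTTP with sttp. Below is a rough, simplified sketch of what such a call can look like against the public generateContent REST endpoint; the model name, backend choice, and response handling are assumptions for illustration, not our exact production code.
import sttp.client3._
import io.circe.Json
import io.circe.parser.parse

// Minimal sketch of a Gemini generateContent call via sttp (synchronous
// backend for brevity). Model name and error handling are simplified.
def callGemini(prompt: String, apiKey: String): Either[String, String] = {
  val backend = HttpURLConnectionBackend()
  val body = Json.obj(
    "contents" -> Json.arr(
      Json.obj("parts" -> Json.arr(Json.obj("text" -> Json.fromString(prompt))))
    )
  )
  val response = basicRequest
    .post(uri"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$apiKey")
    .header("Content-Type", "application/json")
    .body(body.noSpaces)
    .send(backend)

  for {
    raw  <- response.body
    json <- parse(raw).left.map(_.message)
    text <- json.hcursor
              .downField("candidates").downArray
              .downField("content").downField("parts").downArray
              .downField("text").as[String]
              .left.map(_.message)
  } yield text
}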
Another example, this time for superlatives detection, looks like the following:
val prompt = s"""Analyze the following English text for excessive use of superlatives (words like "best", "worst", "most", "least", "greatest", "amazing", "incredible", "fantastic", "terrible", "horrible", etc.).
Text: "${req.text}"
You must respond with ONLY valid JSON in this exact format:
{
"score": <number from 1-10 where 1=no superlatives, 10=excessive superlatives>,
"analysis": "<brief explanation of the score>",
"detected_superlatives": ["<list of detected superlative words/phrases>"]
}
Do not include any text before or after the JSON. Only the JSON object."""
This prompt assumes that the given text is in English (which matched the previous version of the functionality, so we could hot-swap it in production without changing much), but you can see how it works in general. We ask Gemini to return only the JSON, and we show it the expected structure and how the score should be calculated. Easy!
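On our side, the model's JSON reply is decoded straight into a case class, which is also where we catch the occasional malformed response. A simplified sketch (field names follow the prompt above; the case class and function names are mine):
import io.circe.generic.auto._
import io.circe.parser.decode

// Mirrors the JSON structure requested in the prompt above.
final case class SuperlativesResult(
  score: Int,
  analysis: String,
  detected_superlatives: List[String]
)

// Decode the raw model output; a Left here means the model ignored the
// "JSON only" instruction and we can retry or flag the text for review.
def parseSuperlatives(raw: String): Either[io.circe.Error, SuperlativesResult] =
  decode[SuperlativesResult](raw.trim)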
The solution was finally coded as a single controller exposing the aforementioned endpoints, plus an additional one for Kubernetes pod health checks. Everything is small, smooth, and fast with the http4s, tapir, and sttp open-source libraries, the latter used for the Gemini AI calls. Tapir and sttp are developed by the SoftwareMill team, and if you haven't heard of or used them before, I highly recommend checking them out; they are awesome little libraries for building REST-based microservices.
Once a piece of text is translated into the 11 other languages, we store the results in a relational database using doobie, so that we don’t need to translate the same text over and over again (with a simple mechanism detecting updates to the original text, of course).
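The caching itself is a plain table keyed by the source text (or a hash of it) and the target language. A minimal doobie sketch of the lookup and insert, with an assumed translations table and column names, looks roughly like this:
import cats.effect.IO
import doobie._
import doobie.implicits._

// Assumed schema: translations(source_hash, target_lang, translated_text)
def cachedTranslation(xa: Transactor[IO])(sourceHash: String, targetLang: String): IO[Option[String]] =
  sql"""SELECT translated_text
        FROM translations
        WHERE source_hash = $sourceHash AND target_lang = $targetLang"""
    .query[String]
    .option
    .transact(xa)

def storeTranslation(xa: Transactor[IO])(sourceHash: String, targetLang: String, translated: String): IO[Int] =
  sql"""INSERT INTO translations (source_hash, target_lang, translated_text)
        VALUES ($sourceHash, $targetLang, $translated)"""
    .update
    .run
    .transact(xa)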
Deployment & Scaling Strategy
One trade-off of moving to a Gemini-based LLM service is that we now run it ourselves. That’s fine for us, as we already have all of our microservices running nicely on the Kubernetes cluster, so adding one more does no harm. It turns out that a single pod with minimal resources would do the job, but for the sake of rolling updates, we require at least two pods of the same service running and serving requests.
The only extra work to ship it was to create a healthcheck endpoint which we could expose for use by the Kubernetes cluster itself.
Because the service is hosted in-house, we now have full GCP-native observability: we can easily see exactly what is happening with the service and how many resources it needs, observe the logs, add alerts, and react quickly, just like we do for all the other microservices we have deployed.
Cost Comparison: Before vs After
Due to client confidentiality, we can’t share absolute figures, only percentages. As you can guess by now from the chart, we reduced the cost of requests to the Google services by at least 90%. This, of course, is a rough estimate. On the one hand, we don’t include the (minor) extra cost of running the pods on our cluster; on the other hand, the difference would be much bigger if we compared today’s usage, now that we have introduced the new features with translations into 12 languages and the number of requests we make daily is hundreds of times bigger.
The red and blue lines are the translation and sentiment services provided by Google, and the orange one is the Gemini AI cost for doing the same thing. As you can see, on the first of July we removed all the calls to the previous two APIs completely.
This was, of course, a quick PoC and a way to prove (mainly to ourselves) that the LLM solution works at least as well as the out-of-the-box APIs provided by Google.
TL;DR
Once traffic moved from the two managed APIs to the LLM microservice, monthly spend dropped by >90% for this workload.
Early Results & Impact
After writing the first version of our PoC in Python, we saw that translation accuracy was great (at least in the languages we knew). For the other languages, some verification with 3rd-party translation agencies was needed, but overall, the translation part was a success.
The sentiment analysis part was an even bigger success and exceeded our expectations. The main reason was that with the Google API, we had to dig deep into the scoring we received for our data and figure out the exact threshold that was “good enough” for our needs. With the LLM solution, on the other hand, we could easily write multiple tests with real-world examples and modify the LLM prompt to behave as we wanted. On top of that, we could invent our own scoring output, which was much easier to use in our business logic.
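In practice, that tuning loop was just a growing suite of tests over real-world examples, asserting on the scoring scale we invented. A hedged sketch of what such a test can look like (munit here; the scoring function is a stub standing in for the LLM-backed call, not our real implementation):
// Example-driven tests used while tuning prompts: known texts go in,
// and we assert on our own 1-10 scoring scale.
class SuperlativesSpec extends munit.FunSuite {

  private val markers = Set("best", "amazing", "incredible", "greatest")

  // Stub standing in for the real LLM-backed call during prompt tuning.
  private def superlativesScore(text: String): Int = {
    val words = text.toLowerCase.replaceAll("[^a-z\\s]", "").split("\\s+")
    if (words.count(markers.contains) >= 2) 8 else 2
  }

  test("marketing-speak text gets a high superlatives score") {
    assert(superlativesScore("The best, most amazing and incredible product ever!") >= 7)
  }

  test("plain factual sentence scores low") {
    assert(superlativesScore("The package arrived on Tuesday.") <= 3)
  }
}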
Translation into multiple languages can take some time, so we decided to make it completely asynchronous with the messaging solution we were already using. Requests to translate a piece of text do not come through the REST API endpoint but are encapsulated in a Pub/Sub message, and once the translations are ready on the Gemini AI side, a new message is produced and sent back to the interested parties on our cluster.
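The message shapes are simple. Below is a sketch of the "result" side of that flow using the standard Google Pub/Sub Java client from Scala; the topic name, project id, and payload fields are placeholders for illustration, not our actual message contracts.
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}
import io.circe.generic.auto._
import io.circe.syntax._

// Illustrative message shapes for the async translation flow.
final case class TranslationRequested(textId: String, text: String, sourceLang: Option[String])
final case class TranslationCompleted(textId: String, translations: Map[String, String])

// Publish the completed translations back to interested services.
// Topic name and project id are placeholders.
def publishCompleted(projectId: String, result: TranslationCompleted): Unit = {
  val publisher = Publisher.newBuilder(TopicName.of(projectId, "translations-completed")).build()
  try {
    val message = PubsubMessage.newBuilder()
      .setData(ByteString.copyFromUtf8(result.asJson.noSpaces))
      .build()
    publisher.publish(message) // returns an ApiFuture[String] with the message id
    ()
  } finally {
    publisher.shutdown()
  }
}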
Overall, it was a small amount of work for the gains we got. We also proved to “the business” that AI integration is not difficult, and they can start thinking about new use cases to utilize more and more AI on their platform.
With the possibility of running an LLM in-house once we use it more, the additional advantage would be that the data stays with us, and we don’t need to send it to any 3rd-party services if we don’t want to. That is not crucial for this use case, but it can be for others.
TL;DR
- Translation quality: Strong across supported languages (spot-checked by humans when needed).
- Sentiment & policy: Easier to tune; no longer dependent on fixed vendor scoring.
- Latency: Fewer network hops by merging steps into single calls.
- Throughput: Async translation via Pub/Sub keeps UX responsive.
Lessons Learned
After the initial idea of using LLMs for our translations popped into our minds, we quickly realized that it was low-hanging fruit and something we could also use to show “the business” what an LLM can do for us and how easy and quick the integration can be.
There were many “lessons learned” points along the way, even though the whole implementation took just a few days.
First, vibe-coding a PoC is great, but remember not to overcomplicate things. The purpose of a PoC is to prove that something can be done and works. We unnecessarily tried to make it as fast as possible with Python at the beginning, whereas the whole solution was so easy that we could have coded it straight away in Scala, and no rewrite would have been needed. Of course, vibe-coding the PoC was a great idea, but at the end of the day, you still need to review what has been written and modify the code to meet the standards you want to keep in your codebase.
The prompts you write when interacting with LLMs do not always work on the first try as you would expect. They require some tuning, which can take multiple iterations.
With LLMs, instead of monitoring our old APIs' request count, we now have to monitor token usage, and there are new aspects the developer has to learn (context, the size of the data to be processed, and what needs to be returned).
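Concretely, the Gemini REST response carries a usageMetadata block with prompt and candidate token counts, so per-request cost tracking can be as simple as logging those numbers. A small sketch (the response field names follow the public API; the case classes and logging are my own simplification):
import io.circe.generic.auto._
import io.circe.parser.decode

// Subset of the generateContent response relevant for cost tracking.
final case class UsageMetadata(
  promptTokenCount: Int,
  candidatesTokenCount: Int,
  totalTokenCount: Int
)
final case class GeminiUsage(usageMetadata: UsageMetadata)

// Log token counts per request so spikes (e.g. HTML-heavy texts) show up quickly.
def logTokenUsage(rawResponse: String): Unit =
  decode[GeminiUsage](rawResponse).foreach { u =>
    println(
      s"tokens: prompt=${u.usageMetadata.promptTokenCount}, " +
        s"output=${u.usageMetadata.candidatesTokenCount}, total=${u.usageMetadata.totalTokenCount}"
    )
  }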
TL;DR
What went well:
- Simpler architecture → easier to maintain.
- Flexible prompt updates without code changes.
Challenges:
- Prompt tuning took multiple iterations.
- Unexpected token spikes when reviews contained HTML or weird formatting.
Tips:
- Always enforce JSON schema for LLM output.
- Monitor per-request cost in tokens.
- Start with low concurrency until confident in stability.
Future Plans
After the PoC was written, tested locally, and released to production as an additional service that we could use for specific data flows with other microservices, it was clear that the cost of running the LLM-based service would be significantly lower. This was an eye-opener for many of our team members, as we can now think, together with “the business”, about new ideas in a completely different way. Many people got excited, and we can surely expect new ideas to come up.
Lowering the existing costs for translation and sentiment analysis was just a small part of the overall success and impact that this small PoC had on our team and the company we work for.
Conclusion
Vibe-coding a PoC solution in a couple of days doesn’t seem like much work or anything to brag about, but the impact it had is bigger than we expected at the beginning. We’ve completely removed the usage of two of Google’s costly APIs. Even with the relatively small number of requests we were sending, we reduced the costs by over 90% while getting the same or better quality and a much more flexible solution.
If you are a developer, a CTO, or a manager, I encourage you to stop for a second and think about how LLMs can help you with the work you already do. It may turn out that you can make it ten times less expensive, move your software solution to an entirely new level, and slightly change the mindset of the people on your team, so that you can join the AI bandwagon and get the best out of it before your competition does.
If any of that rings a bell, we are here to help you.