What are language agents?
With applications like ChatGPT, Siri, and Alexa becoming more prevalent, we increasingly surround ourselves with technology we may not fully understand. This article aims to give you a glimpse into the field of language agents.
The structure of this blog post is as follows: first, we discuss the characteristics of language agents and how to spot them, and then we review a selection of language agent architectures.
How do you spot a language agent?
Clippy, the Microsoft Office assistant, is an example of a simple chatbot. Source: Figma.
To answer this question, we can compare a simple chatbot with a system built on a language agent:
- Simple chatbots (like Clippy from Microsoft Office) are just Q&A systems (not yet agents): users can only ask a question, and the system provides an answer (most often a predefined one).
- OpenAI’s ChatGPT (powered by GPT-4) is a clear example of a system using a language agent: it accepts multimodal input and offers image generation and text-to-speech features.
Simply put, if you can ask a system to do more than one very specific task (a "chatbot++"), there is most likely some kind of language agent working behind the scenes.
Known examples of language agents
In this section, we will go over a few examples of systems using language agents.
ChatGPT
In its latest form, ChatGPT handles much more than text-based tasks. It can:
- Run a basic analysis of a dataset you provide via the Data Analyst product.
- Generate images from text via DALL·E.
- Generate audio from a given text via OpenAI’s TTS model.
- Accept audio or image input and answer questions about it via Whisper or GPT-4V.
- Generate code for the requested functionality.
- And probably much more.
Smartphone/Home Assistants
You already know tools like Siri, Google Assistant (powered by Gemini), and Alexa, which use language agents behind the scenes. They perform multiple tasks, such as playing music, checking the weather, making appointments, or calling people. Even though the input to these systems is speech, they first convert it to text (using speech-to-text models) and then apply various NLP models.
Devin from Cognition
Devin, a software development agent that recently gained fame, is also a great example of a language agent. Judging from the published demos, the tool can complete software tasks end-to-end while mimicking human processes: gathering requirements, designing the architecture, planning the project, coding, and testing the solution.
In response, other agentic solutions were introduced in this area, such as OpenDevin, MetaGPT, and GitHub Copilot Workspace.
Vapi - voice AI assistant
Vapi is a company offering end-to-end solutions for building automated speech agents, such as sales agents, that combine text messaging via email or WhatsApp with real-life speech capabilities while using the LLM inference engine of your choice (e.g., Perplexity or Groq). This allows you to create your own automated sales agent or handle inbound support for your company without needing extra staff for these tasks.
Computer Games
On the research side, the authors of a paper called Voyager used agentic workflows to better simulate non-playable characters (NPCs) in Minecraft and make them more human-like. This was achieved by embedding human-like mechanisms into their reasoning, such as motor and episodic memory, reflection, and even autonomous learning of new complex actions.
Agentic architectures
Agentic architectures aim to solve many current problems with AI, such as multimodal Q&A, vague questions, multi-document answers, or reasoning mistakes.
Multi-modality architecture
Here, we have the typical structure of a language agent. It starts with the user’s instruction, which the backend incorporates into a more elaborate prompt containing the available tools (a limited set of actions), examples, and the task itself. Given this prompt, the agent decides which tools to use to execute the user’s task, and the designated tools produce and modify the output so that it is consistent with the prompt.
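The flow described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three tools, the prompt wording, and `fake_llm` (a keyword-based stand-in for a real model call) are all hypothetical and not taken from any specific product.

```python
from typing import Callable, Dict

# Hypothetical tools: each one is a plain function from text to text.
def generate_image(prompt: str) -> str:
    return f"<image for: {prompt}>"

def text_to_speech(text: str) -> str:
    return f"<audio for: {text}>"

def answer_text(question: str) -> str:
    return f"<text answer to: {question}>"

# The limited set of actions the agent is allowed to take.
TOOLS: Dict[str, Callable[[str], str]] = {
    "generate_image": generate_image,
    "text_to_speech": text_to_speech,
    "answer_text": answer_text,
}

def build_prompt(instruction: str) -> str:
    """Wrap the user instruction with the tool list, as a backend might."""
    tool_list = "\n".join(f"- {name}" for name in TOOLS)
    return (
        "You can use exactly one of these tools:\n"
        f"{tool_list}\n"
        "Reply with the tool name only.\n"
        f"Task: {instruction}"
    )

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: picks a tool by keyword (assumption)."""
    task = prompt.rsplit("Task:", 1)[1].lower()
    if "draw" in task or "image" in task:
        return "generate_image"
    if "read aloud" in task:
        return "text_to_speech"
    return "answer_text"

def run_agent(instruction: str) -> str:
    """Build the prompt, let the model choose a tool, then run that tool."""
    choice = fake_llm(build_prompt(instruction))
    tool = TOOLS.get(choice, answer_text)  # unknown choice: fall back to text
    return tool(instruction)

print(run_agent("Draw a cat wearing a hat"))
```

In a real system, `fake_llm` would be replaced by a call to an actual model, and the tool output would typically be fed back to the model for a final, user-facing response.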
Query translation and planning
Apart from merging tasks from different modalities, agents can also answer vague questions, like the one in the graphic below, in steps. Instead of answering one hard question directly, the agent decomposes it into simpler, clearer tasks, solves each of them, and composes the final answer from the intermediate results.
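As a rough sketch of this plan-then-answer idea, the snippet below splits a vague question into two simple sub-questions, answers each one, and combines the results. Everything here is an assumption made for illustration: the example question, the hard-coded `plan` function (which an LLM would normally produce), and the made-up prices in `TOY_FACTS`.

```python
from typing import Dict, List

# Toy knowledge source with made-up values, standing in for search or a database.
TOY_FACTS: Dict[str, int] = {
    "What is the monthly price of the Basic plan?": 10,
    "What is the monthly price of the Pro plan?": 25,
}

def plan(question: str) -> List[str]:
    """Stand-in for an LLM planner: the decomposition is hard-coded (assumption)."""
    return [
        "What is the monthly price of the Basic plan?",
        "What is the monthly price of the Pro plan?",
    ]

def answer_simple(sub_question: str) -> int:
    """Answer one clear sub-question by direct lookup."""
    return TOY_FACTS[sub_question]

def compose(results: Dict[str, int]) -> str:
    """Combine the intermediate results into the final answer."""
    cheapest = min(results, key=results.get)
    name = "Basic" if "Basic" in cheapest else "Pro"
    return f"The {name} plan is cheaper at {results[cheapest]} per month."

def answer_vague(question: str) -> str:
    results = {q: answer_simple(q) for q in plan(question)}
    return compose(results)

print(answer_vague("Which plan should I pick if I want to pay less?"))
```

The design point is the separation of roles: one step produces the plan, another answers each simple question, and a last one synthesizes the reply, so each piece can be inspected or replaced on its own.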
Metadata filtering and routing
Agents can also work as metadata filters, selecting the relevant data or databases based on metadata describing what they cover. In the example above, the user asks about the best pizza restaurant in Poland. By understanding that the question relates to the country of Poland, the agent can keep only the data stores that relate to cities in Poland, like Warsaw or Gdansk, and filter out irrelevant ones, like the New York guide.
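The pizza example can be sketched as follows. This is a simplified illustration, assuming each data store carries a `country` tag and that `extract_country` (here a naive keyword match) stands in for an LLM that extracts metadata from the question. The store names follow the example in the text; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataStore:
    name: str
    country: str  # metadata tag attached to the store

STORES = [
    DataStore("Warsaw guide", "Poland"),
    DataStore("Gdansk guide", "Poland"),
    DataStore("New York guide", "USA"),
]

KNOWN_COUNTRIES = {"poland": "Poland", "usa": "USA"}

def extract_country(question: str) -> Optional[str]:
    """Stand-in for LLM-based metadata extraction (assumption)."""
    for word in question.lower().replace("?", " ").split():
        if word in KNOWN_COUNTRIES:
            return KNOWN_COUNTRIES[word]
    return None

def route(question: str) -> List[DataStore]:
    """Keep only the stores whose metadata matches the question."""
    country = extract_country(question)
    if country is None:
        return STORES  # no usable metadata: search everywhere
    return [store for store in STORES if store.country == country]

selected = route("What is the best pizza restaurant in Poland?")
print([store.name for store in selected])
```

Retrieval then runs only against the selected stores, which keeps irrelevant sources such as the New York guide out of the context.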
Corrective RAG Agent
Another interesting agent architecture is self-correcting (corrective) RAG. As seen above, it has a few steps:
- Like standard RAG, it retrieves candidate documents from the data store.
- The agent grades the relevance of each document to the question.
- Documents graded as irrelevant are filtered out.
- Once enough relevant documents have been gathered, a candidate answer is generated.
- The final step consists of two checks: whether the answer is grounded in the provided documents, and whether it actually answers the original question.
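The steps above can be sketched as a small pipeline. This is a toy under stated assumptions, not a reference implementation: the three documents are invented, the grader uses simple word overlap where a real agent would call an LLM, and `generate` just quotes the top document. When no relevant document is found, this sketch gives up, whereas a fuller system might rewrite the query or fetch more documents.

```python
from typing import List, Set

# Invented toy corpus standing in for a real data store.
DOCS = [
    "Language agents pick tools to complete a task.",
    "Pizza dough needs flour, water, yeast and salt.",
    "Agents can keep long-term memory about the user.",
]
STOP_WORDS = {"a", "the", "to", "do", "what", "can", "about"}

def tokens(text: str) -> Set[str]:
    return {w.strip(".,?!").lower() for w in text.split()} - STOP_WORDS

def retrieve(question: str, k: int = 3) -> List[str]:
    """Step 1: fetch candidates (here simply the first k documents)."""
    return DOCS[:k]

def grade(question: str, doc: str) -> bool:
    """Step 2: toy relevance grader based on shared words (an LLM in practice)."""
    return bool(tokens(question) & tokens(doc))

def generate(question: str, docs: List[str]) -> str:
    """Step 4: stub generator that quotes the top relevant document."""
    return docs[0]

def is_grounded(answer: str, docs: List[str]) -> bool:
    """Step 5a: is the answer supported by the provided documents?"""
    return any(answer in doc for doc in docs)

def corrective_rag(question: str) -> str:
    candidates = retrieve(question)
    relevant = [d for d in candidates if grade(question, d)]  # step 3: filter
    if not relevant:
        return "I could not find relevant documents."
    answer = generate(question, relevant)
    # Step 5b reuses the grader to check the answer against the question.
    if is_grounded(answer, relevant) and grade(question, answer):
        return answer
    return "I am not confident in the answer."

print(corrective_rag("What do language agents pick?"))
```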
Long-term memory of agents
Last but not least, agents can use a mechanism for keeping long-term memory about the user, which may be important for providing better context for answers. To do so, the agent extracts important information from the user's questions and answers and stores it in a knowledge database. Similarly, when the environment changes (e.g., the user says they are no longer vegan), every stored passage relating to that information should be updated accordingly.
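A minimal sketch of such a memory is shown below, assuming facts are stored as key-value pairs so that a newer statement overwrites the stale one. The `UserMemory` class and the phrase-matching `extract_fact` function are hypothetical stand-ins. In practice, an LLM would do the extraction and a database would hold the facts.

```python
from typing import Dict, Optional, Tuple

class UserMemory:
    """Tiny in-memory knowledge store keyed by fact name."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        self._facts[key] = value  # a newer value replaces the stale one

    def recall(self, key: str) -> Optional[str]:
        return self._facts.get(key)

    def as_context(self) -> str:
        """Render the stored facts as text to prepend to a prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self._facts.items())

def extract_fact(message: str) -> Optional[Tuple[str, str]]:
    """Stand-in for LLM-based extraction: matches two fixed phrases (assumption)."""
    text = message.lower()
    if "no longer a vegan" in text:
        return ("diet", "not vegan")
    if "i am a vegan" in text:
        return ("diet", "vegan")
    return None

def update_memory(memory: UserMemory, message: str) -> None:
    fact = extract_fact(message)
    if fact is not None:
        memory.remember(*fact)

memory = UserMemory()
update_memory(memory, "I am a vegan, any dinner ideas?")
update_memory(memory, "Actually, I am no longer a vegan.")
print(memory.as_context())
```

Keying facts by name is what makes the update safe: the old "vegan" entry cannot linger next to the new one and contradict it in later prompts.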
Conclusion
In this blog post, we have covered what language agents are and where to spot them. Most importantly, we have explored a selection of their architectures, which can help build chatbot systems that are more robust to common chatbot problems.
We are experienced in implementing language agent workflows, so if you need one, do not hesitate to contact us.
Reviewed by Kamil Rzechowski