Reference Architecture for Website Chat Agents

This article serves as a reference architecture for an embedded web chat AI agent that can help visitors effectively find relevant content on a website.

A web chat AI agent offers a way to integrate conversational AI into an existing website. This gives the user a way of searching the site using conversational language and receiving responses grounded in actual data and documentation from the organization's business domain. Web chat AI agents typically provide either a UI component (usually a pop-up in the lower right corner) or a dedicated page where the user can interact with the system.

For example, a website for a parts manufacturer might use an AI agent to help users find an exact part by its description while a community organization's website could help connect families to events of interest or support resources related to special challenges faced by that family.

When the user types something into the chat system, it is relayed to a large language model (LLM) along with the conversation history and a guiding prompt. The LLM's response is shown to the user. Such interactions are now commonplace thanks to products like ChatGPT and the LLMs behind them, such as GPT-4o.

While LLMs offer conversational capabilities in a variety of languages, they are not trained on your specific business or its current data. As a result, they're unlikely to represent your business well to the user, and they're especially unlikely to provide factual, accurate information about it. To address this, we need to provide additional data and context to the LLM through the process of Retrieval Augmented Generation (RAG).

The rest of this article outlines a high-level reference architecture to accomplish this goal.

AI Chat Agent System Architecture

Let's talk about the components that go into a RAG chat agent.

RAG Chat Agent Reference Architecture

An AI Chat agent consists of the following key parts:

  • A Web Application that hosts the rest of your content and houses the front end components of your AI Agent.
  • A Web API that the Web Application calls with chat messages. The Web API does the work needed to generate a response and returns it to the Web Application.
  • A File Storage container containing documents that the AI agent should consider as source information.
  • An Indexer (not pictured) that generates embeddings from portions of these data files and stores them in the Vector Database.
  • A Vector Database that stores the embeddings from the source documents and allows searching documents by their similarity to a search string or document.
  • One or more Large Language Models (LLMs) that can generate chat responses from conversations and relevant document information.

Let's see how these pieces connect with each other.

A Sequence Diagram illustrating the interaction between components

At a high level, these components communicate in the following pattern (a code sketch of the full flow follows the list):

  1. The web application makes an API call to the Web API using REST or some other web communication technology. This request contains the chat message and usually contains the chat history unless this history is also stored in the Web API.
  2. The Web API optionally uses an embedding model to generate an embedding representing the conversation's contents. Note: this is not necessary in all solutions, as some vector databases can generate these embeddings automatically at search time.
  3. The embedding model generates and returns a large array of numbers representing the content of the chat. This is typically called a vector.
  4. The Web API calls the vector database with information on what to search for. This typically requires a vector, but in some systems may work with text. This step often defines the maximum number of results to return or the minimum amount of similarity needed for something to show up in the search results.
  5. The vector database returns information on any matching documents or portions of documents. This information typically includes a similarity score, a URL to the document, and relevant text from the document.
  6. The Web API takes its system prompt (textual instructions on how to behave), the chat history, and any relevant documents that came back from the vector search and sends them to a chat completion large language model.
  7. The LLM returns one or more responses to the query.
  8. The final response is returned to the web application, which then displays it to the user.
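
To make these steps concrete, here is a minimal TypeScript sketch of what the Web API's handler could look like. The endpoint URLs, payload shapes, and thresholds are illustrative assumptions rather than any specific vendor's API.

    // Hypothetical Web API handler; endpoint URLs and payload shapes are assumptions.
    interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

    async function handleChat(history: ChatMessage[], userMessage: string): Promise<string> {
      const messages: ChatMessage[] = [...history, { role: "user", content: userMessage }];

      // Steps 2-3: optionally generate an embedding (vector) for the latest message.
      const embeddingRes = await fetch("https://example.com/embeddings", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ input: userMessage }),
      });
      const { vector } = await embeddingRes.json();

      // Steps 4-5: search the vector database for the most similar document chunks.
      const searchRes = await fetch("https://example.com/vector-search", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ vector, top: 3, minScore: 0.75 }),
      });
      const { matches } = await searchRes.json();

      // Step 6: combine the system prompt, retrieved text, and history, then call the LLM.
      const systemPrompt =
        "You are a helpful assistant for example.com. Answer only from the provided context.\n\n" +
        matches.map((m: { url: string; text: string }) => `Source: ${m.url}\n${m.text}`).join("\n\n");
      const completionRes = await fetch("https://example.com/chat-completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages: [{ role: "system", content: systemPrompt }, ...messages] }),
      });

      // Steps 7-8: return the LLM's reply so the web application can display it.
      const { reply } = await completionRes.json();
      return reply;
    }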

Platform Specifics

Let's look at some implementation specifics for these various components.

Front End Technologies

The majority of web front ends use JavaScript as their programming language due to JavaScript's ubiquity. Plain JavaScript or a JavaScript framework like React, Vue.js, or Angular is a common choice for implementing the front end of a RAG chat agent.

Alternative implementations might take advantage of WebAssembly (WASM) or technologies like Blazor or web components.

Additionally, if you are building an application for a non-web platform, such as a desktop or mobile application, your front end will usually use the most common technologies on that platform.
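
Whichever front-end technology you choose, its core responsibility is the same: capture the user's message, post it to the Web API, and render the reply. A minimal browser-side sketch in TypeScript might look like the following; the /api/chat endpoint and payload shape are assumptions.

    // Hypothetical chat widget wiring; the /api/chat endpoint and payload shape are assumptions.
    const historyEl = document.getElementById("chat-history")!;
    const inputEl = document.getElementById("chat-input") as HTMLInputElement;

    async function sendMessage(): Promise<void> {
      const text = inputEl.value.trim();
      if (!text) return;

      appendBubble("user", text);
      inputEl.value = "";

      const res = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ message: text }),
      });
      const { reply } = await res.json();
      appendBubble("assistant", reply);
    }

    function appendBubble(role: string, text: string): void {
      const bubble = document.createElement("div");
      bubble.className = `chat-bubble ${role}`;
      bubble.textContent = text;
      historyEl.appendChild(bubble);
    }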

Web Communication

The majority of web communication between modern applications relies on REST with JSON as the content body for requests and responses. However, alternatives exist, including SOAP and gRPC.

If you find yourself in a performance-critical scenario and want live responses between client and server, you may want to consider a technology that can take advantage of WebSockets for live updates over active connections. This is made more manageable by frameworks like SignalR (or Socket.IO in the JavaScript ecosystem) that manage these connections for you. However, an active connection carries more complexity and server-side performance concerns that may not make it worth the investment unless you expect frequent back-and-forth interactions or want to stream responses back to the client as they're generated.
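
As a rough sketch of what a streaming client could look like with SignalR's JavaScript client, consider the following; the hub URL and method names are assumptions specific to this example.

    // Sketch of a SignalR client that renders tokens as the server streams them.
    // The hub URL ("/chathub") and method names are assumptions.
    import * as signalR from "@microsoft/signalr";

    async function startChat(): Promise<void> {
      const connection = new signalR.HubConnectionBuilder()
        .withUrl("/chathub")
        .withAutomaticReconnect()
        .build();

      // The server pushes partial tokens as the LLM generates them.
      connection.on("ReceiveToken", (token: string) => {
        document.getElementById("current-reply")!.textContent += token;
      });

      await connection.start();
      await connection.invoke("SendMessage", "Do you carry replacement filters?");
    }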

Cloud Components

In our reference architecture we define the web application, web API, vector database, file storage, indexer, and LLM components.

The exact specifics of these technologies will vary from cloud provider to cloud provider, but this section offers a high-level list of components you should consider based on each cloud provider:

Azure

When working with Azure, you might consider the following services for this architecture:

  • Web Application may involve hosting in an Azure App Service or in a Static Web App with a Blob Storage container containing the relevant files. App Service should be used if your application requires complex logic and server-side processing, while Static Web Apps are appropriate for solutions consisting only of static files like HTML, CSS, and JavaScript.
  • Web API likely involves an Azure App Service hosting some form of API logic. You may also want to consider Azure API Management to give additional flexibility around this API. When working with Azure API Management, you can sometimes replace the Azure App Service with individual Azure Functions or Azure Logic Apps.
  • Vector Database and Indexer can be accomplished using Azure AI Search as an indexer and vector storage solution. Alternatively, you could store embeddings in Cosmos DB if you have more complex needs.
  • File Storage is best accomplished using Blob Storage and an Azure Storage Account. Typically your Azure AI Search resource is configured to search one or more containers in blob storage.
  • Large Language Models are easiest to get started with on Azure via the Azure OpenAI Service and the chat completion models it hosts. Alternatively, the model catalog in Azure Machine Learning Studio has a wide variety of models you can work with and deploy.
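
As an example of that last point, calling a chat completion model deployed in Azure OpenAI Service amounts to a single REST request. The resource name, deployment name, and api-version below are placeholders; substitute the values from your own deployment.

    // Sketch of an Azure OpenAI chat completion call. The resource name,
    // deployment name, and api-version are placeholders for your own values.
    async function askAzureOpenAI(question: string): Promise<string> {
      const endpoint =
        "https://my-resource.openai.azure.com/openai/deployments/my-gpt-deployment" +
        "/chat/completions?api-version=2024-02-01";

      const res = await fetch(endpoint, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "api-key": process.env.AZURE_OPENAI_KEY ?? "",
        },
        body: JSON.stringify({
          messages: [
            { role: "system", content: "You help visitors find content on example.com." },
            { role: "user", content: question },
          ],
        }),
      });

      const data = await res.json();
      return data.choices[0].message.content;
    }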

AWS

Amazon Web Services follows the same basic structure as Azure, but the technologies change:

  • Web Applications would use AWS Elastic Beanstalk for dynamic sites or Amazon S3 and Amazon CloudFront for static content.
  • Web API could also involve AWS Lambda and Amazon API Gateway.
  • Vector Database might involve Amazon Kendra or a backing store like DynamoDB or Amazon RDS.
  • File Storage will typically involve Amazon S3, which can be searched and indexed through Amazon OpenSearch Service (formerly Amazon Elasticsearch Service).
  • Large Language Models typically involve Amazon Bedrock or models deployed through Amazon SageMaker.

Google Cloud

Like AWS, Google also supports these capabilities, but its services differ:

  • Web Application could involve Google App Engine or Google Cloud Storage with Firebase Hosting.
  • Web API might also involve Cloud Functions or Google Cloud Endpoints.
  • Vector Database options include Bigtable or Google Cloud Search.
  • File Storage involves Google Cloud Storage and potentially Document AI from Google Cloud's AI APIs.
  • Large Language Models are available through Google Vertex AI, which offers LLM capabilities you might want to consider.

The details of each cloud provider will be unique to that provider and its technologies, but at a high level this architecture is supported by most major cloud providers.

Other Concerns

When building a conversational AI agent with RAG capabilities, there are a few other things you should keep in mind.

Security

Any time you expose an LLM to users through a chat interface, there's a chance that a percentage of those users will attack the AI system through prompt injection attacks. These attacks typically aim to give the agent new instructions on how it should operate, or to discover more about the prompt and data available to the agent.

While some attacks may be benign, such as trying to get the AI to say silly things or agree to ridiculous requests, others could constitute more serious threats.

Major AI systems do get attacked by some very creative people who are curious about how they work. These people will try to get access to your system prompt, which gives textual instructions as to how the agent should behave. If attackers realize your system uses RAG to query for additional data, they may also try to exploit that to see if they can retrieve sensitive information.

While there are ways of detecting and mitigating prompt injection attacks, it's best to treat everything in your system prompt as text that may eventually be leaked to the public. This means avoiding any sensitive or compromising instructions in your prompt, such as "Try to discourage the customer from buying cheaper products".
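
For example, a reasonable system prompt contains only instructions you would be comfortable seeing quoted publicly; the wording below is purely illustrative.

    // Illustrative system prompt: nothing here would embarrass the organization
    // if a prompt injection attack surfaced it verbatim.
    const systemPrompt = `
    You are a helpful assistant for the Example Co. website.
    Answer questions using only the reference documents provided with each request.
    If the documents do not contain the answer, say so and suggest contacting support.
    Do not speculate about pricing, availability, or policies not covered in the documents.
    `;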

Additionally, when offering RAG as a way to augment data, the retrieved data should not give users access to sensitive information they shouldn't otherwise be able to see. This means keeping any sensitive information out of the index your RAG agent searches and restricting your data to public-facing documents such as a public knowledge base or training manuals.

Performance and Quality

Agentic chat systems, like a website chat agent using RAG, perform better than systems based on LLMs alone, but they still have limitations.

Like any AI or machine-learning system (or any human agent), an AI solution is going to make mistakes a certain percentage of the time. This may be 1 in every 1,000 requests or 1 in every 10, but at some point your system will likely give an answer that's different from the one you wish it gave.

This can be mitigated by refining your system prompt to account for the weak area your AI agent missed or by providing additional documents to be indexed and added to the searchable content the RAG agent looks at. Additionally, adding test cases around known weak areas can help shore up deficiencies in your AI systems.

There is a danger that your AI system will make mistakes your organization never becomes aware of. There are two major ways of handling this problem:

  1. Provide users with a "thumbs up" / "thumbs down" button that they can click to flag interactions for review. Negative interactions can be logged to a table or trigger an email so that your team can review them and determine how best to handle similar cases in the future (a brief sketch of this approach follows the list).

  2. Automatically log all interactions with the AI system, potentially anonymizing or redacting any user information. These logs can then be reviewed either by a human or by being fed into an AI system for automatic grading. In the latter case, the AI agent would have its own prompt and access to an LLM and could use it to analyze the interaction and flag potential issues for human review.
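
A minimal sketch of the first approach might record flagged interactions in whatever store your team already reviews; the record shape and storage call below are assumptions.

    // Sketch of handling "thumbs down" feedback. The record shape and the
    // review-queue placeholder are illustrative assumptions.
    interface FeedbackRecord {
      conversationId: string;
      rating: "up" | "down";
      userMessage: string;
      agentReply: string;
      timestamp: string;
    }

    async function recordFeedback(record: FeedbackRecord): Promise<void> {
      // Only negative interactions need follow-up in this sketch.
      if (record.rating === "down") {
        await saveToReviewQueue(record);
      }
    }

    async function saveToReviewQueue(record: FeedbackRecord): Promise<void> {
      // Placeholder: write to your database, queue, or ticketing system,
      // or trigger an email to the team that reviews flagged conversations.
      console.log("Flagged for review:", JSON.stringify(record));
    }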

In either case, your organization and its legal team should be mindful of data retention and storage of user interactions with your AI system, as users must be made aware that their interactions are being logged for review.

Scalability

As software systems grow in utilization, there are often growing pains involved. RAG chat agents are no different in this regard.

Most hosted LLMs bill you based on the number of tokens you consume with your requests, where tokens are typically fragments of words. This means that every time someone chats with the system, the request consumes a certain number of tokens from the text being sent to the LLM as well as the text it generates. This billed amount is typically very small, but it can add up over time, particularly when you send the entire conversation history with every request.
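
As a rough sketch with hypothetical numbers (real per-token prices vary by model and provider), resending the full history means the billed token count grows with roughly the square of the conversation length:

    // Hypothetical cost estimate; real per-token prices vary by model and provider.
    function estimateConversationCost(turns: number, tokensPerTurn: number, pricePer1kTokens: number): number {
      let billedTokens = 0;
      for (let turn = 1; turn <= turns; turn++) {
        // Each request resends the entire history accumulated so far.
        billedTokens += turn * tokensPerTurn;
      }
      return (billedTokens / 1000) * pricePer1kTokens;
    }

    // Example: 20 turns of ~200 tokens each at a hypothetical $0.01 per 1,000 tokens
    // bills about 42,000 tokens, or roughly $0.42 for one conversation.
    // estimateConversationCost(20, 200, 0.01) === 0.42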

There are some strategies you can use to reduce your bill, and options such as reserving a certain amount of capacity or hosting your own LLM can help with pricing at larger scales, but understand in general that as usage grows your costs will grow.

Additionally, in shared environments like Azure, you may have a quota on the maximum number of tokens per minute that you can use. You may not hit these quota limits at low scale, but as you grow they'll become a problem, and you'll need to increase that quota, switch to a different model, reserve a certain token capacity, or host your own LLM to address this.

Model Replacement

Anyone who has followed AI over the last few years can attest to the large volume of new models and model versions that come out every year, with each new model advertising better metrics than the last.

It's important to understand that the LLM you select during prototyping might not be the LLM you wind up deploying to production at initial release. Furthermore, the LLM you initially release may be replaced several times during the lifespan of your application as new models become available and old models are retired.

This is part of why I do not typically recommend that organizations spend time and money fine-tuning a model. Instead, focus on RAG, your system prompt, and the context in which the model is used, because organizations should plan on periodically replacing the core LLM their AI systems rely on.

You may find yourself moving models for a variety of reasons including:

  • A desire for higher quality output from your model
  • Needing a model that produces responses faster
  • Trying to control costs and understanding that a new model is cheaper to use than your current one
  • Your old model is marked as deprecated and will be retired by the organization hosting it on a specific date

For these reasons, it's important to have an identified core set of tasks or test cases for any model you use and a consistent way of evaluating the performance of each model. Being able to automate this testing process is important for the agility of any development team as this will give you confidence when making changes to your model, your system prompt, or how your RAG integration operates.
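
A minimal sketch of such an automated check might loop over known test cases and verify that each response mentions what it should. The askAgent helper and the specific checks below are assumptions; real evaluation often also uses an LLM as a grader.

    // Sketch of a lightweight regression suite for the chat agent.
    // askAgent is a hypothetical helper that calls your Web API.
    interface AgentTestCase {
      prompt: string;
      mustMention: string[];
    }

    const testCases: AgentTestCase[] = [
      { prompt: "What is your return policy?", mustMention: ["30 days"] },
      { prompt: "Do you ship internationally?", mustMention: ["shipping"] },
    ];

    async function runAgentTests(askAgent: (prompt: string) => Promise<string>): Promise<void> {
      for (const testCase of testCases) {
        const reply = (await askAgent(testCase.prompt)).toLowerCase();
        const missing = testCase.mustMention.filter(term => !reply.includes(term.toLowerCase()));
        console.log(missing.length === 0
          ? `PASS: ${testCase.prompt}`
          : `FAIL: ${testCase.prompt} (missing: ${missing.join(", ")})`);
      }
    }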

Opportunities for expansion

There are a few key opportunities to expand the capabilities of RAG AI chat applications if you view your system's capabilities as insufficient for your desired user experience.

You may determine that you want a RAG solution, but that the system should not make the same information available to every user. Under such a system, you may need to add role-based access to different pieces of information or information sources and then conditionally enable access to those data sources based on the identity and role of your user.
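
One way to sketch this is to tag each indexed document chunk with the roles allowed to see it and filter the vector search by the current user's roles. The field names and search endpoint below are assumptions.

    // Sketch of role-based filtering of RAG searches. The endpoint, field
    // names, and filter format are illustrative assumptions.
    async function searchForUser(vector: number[], userRoles: string[]) {
      const res = await fetch("https://example.com/vector-search", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          vector,
          top: 5,
          // Only return chunks whose allowedRoles overlap with the user's roles.
          filter: { allowedRoles: userRoles },
        }),
      });
      return res.json();
    }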

If you find you need additional data sources beyond a single search capability, you are likely moving past RAG and into AI orchestration. In an AI orchestration system, an AI agent has multiple potential RAG sources it can pull information from and must decide when to use each data source. For example, an AI orchestration system might search the knowledge base for some information while other queries might be better suited to running pre-defined SQL queries against a read replica of a SQL database. Common AI orchestration solutions include Semantic Kernel and LangChain.
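
Orchestration frameworks differ in their details, but the underlying idea can be sketched without one: decide which data source fits the question, then call the matching retriever. The classification and retrieval helpers below are hypothetical and supplied by the caller.

    // Framework-agnostic sketch of routing between multiple RAG sources.
    // The classifier and retrievers are hypothetical helpers supplied by the caller.
    type Source = "knowledgeBase" | "ordersDatabase";

    async function retrieveContext(
      question: string,
      classifyQuestion: (q: string) => Promise<Source>,
      searchKnowledgeBase: (q: string) => Promise<string>,
      queryOrdersDatabase: (q: string) => Promise<string>,
    ): Promise<string> {
      const source = await classifyQuestion(question);
      return source === "knowledgeBase"
        ? searchKnowledgeBase(question)    // semantic search over indexed documents
        : queryOrdersDatabase(question);   // pre-defined SQL against a read replica
    }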

If you find yourself using an AI orchestration solution and needing particularly complex problem-solving capabilities, you may need to introduce a planner that can define a series of RAG steps needed to produce comprehensive answers on user queries.

If you find your system prompt gets too complex and multi-faceted, you may benefit from taking an agentic approach and splitting your system into multiple agents that can work together to solve complex problems by focusing on one facet of the problem per agent.

If you want users to be able to have persistent conversations with your agent and return to those conversations later, you could potentially store conversation history either in the browser's local storage or on the server and require users to log in. Alternatively, you could store and vectorize individual chat messages from a user and store them in a vector database as conversational memory. This is significantly more complex, but would allow users to ask it things like "What were we talking about last week?" and get meaningful responses.
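
The simpler of these options, persisting the history in the browser, can be sketched in a few lines; the storage key below is arbitrary.

    // Sketch of persisting conversation history in the browser's local storage.
    interface ChatMessage { role: string; content: string; }

    const STORAGE_KEY = "chat-history";

    function loadHistory(): ChatMessage[] {
      const saved = localStorage.getItem(STORAGE_KEY);
      return saved ? (JSON.parse(saved) as ChatMessage[]) : [];
    }

    function saveHistory(history: ChatMessage[]): void {
      localStorage.setItem(STORAGE_KEY, JSON.stringify(history));
    }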

Conclusion

AI systems are powerful and can offer a compelling experience to your users while helping people find the content or services that are most relevant to them. However, these systems take work and refinement, and they benefit from experience with their technologies. This article should give you some good ideas of the common architectures and approaches involved in a conversational AI system. If you'd like some help bringing such a system to life, please contact us; Leading EDJE is happy to talk with you about your needs and plans and see whether further collaboration makes sense.
