
An LLM Evaluation Framework for AI Systems Performance
One of the challenges of AI systems development is ensuring that your system performs well not just when it is initially released, but as it grows and is deployed to the world. While AI prototyping projects are fun and exciting, eventually systems need to make it to the real world and evolve over time.
These evolutions can come in the following forms:
- Changing the system prompt to try to improve performance or resolve issues
- Replacing the text completion or embedding model used by your system
- Adding new tools for AI systems to call in function-calling scenarios. This is particularly relevant when working with tooling like Semantic Kernel or Model Context Protocol (MCP)
- Changing the data that is accessible to models for Retrieval-Augmented Generation (RAG). This often happens naturally over time as new data is added.
Regardless of the cause of change, organizations need repeatable and effective means for evaluating how their conversational AI systems respond to common contexts. This is where Microsoft.Extensions.AI.Evaluation comes in as an open-source library that helps you gather and compare different metrics related to your AI systems.
In this article we'll explore this LLM testing framework and cover the evaluation metrics of Equivalence, Groundedness, Fluency, Relevance, Coherence, Retrieval, and Completeness using C# code in a small .NET application.
Chat History and Context
Our code in this article uses OpenAI to generate chat completions and then Microsoft.Extensions.AI.Evaluation to grade its results on standardized metrics.
Note: Microsoft.Extensions.AI and Microsoft.Extensions.AI.Evaluation can work with model providers other than OpenAI, including local or network-based models such as Ollama models or services like LM Studio that expose an OpenAI-compatible API.
These metrics are produced by sending a chat session to OpenAI for grading along with a list of evaluators to run against that interaction. Because of this, we'll need to connect to OpenAI not just to get our chat completions but to get the evaluation metrics for the interaction as well.
We can connect to OpenAI by providing an API key and optionally an endpoint (if you're targeting a custom deployment of OpenAI). We'll do this using the OpenAIClient and the IChatClient interface defined in the Microsoft.Extensions.AI NuGet package:
OpenAIClientOptions options = new()
{
    Endpoint = new Uri(settings.OpenAIEndpoint)
};
ApiKeyCredential key = new ApiKeyCredential(settings.OpenAIKey);
IChatClient chatClient = new OpenAIClient(key, options)
    .GetChatClient(settings.TextModelName)
    .AsIChatClient();
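If you'd rather point at a local OpenAI-compatible server such as Ollama or LM Studio, as mentioned in the note above, the same pattern applies. Here's a minimal sketch assuming Ollama's default OpenAI-compatible endpoint and a locally pulled model name; adjust both to match your setup:

// A sketch for a local OpenAI-compatible server (assumes Ollama's default endpoint and a locally pulled llama3 model)
OpenAIClientOptions localOptions = new()
{
    Endpoint = new Uri("http://localhost:11434/v1") // Ollama's OpenAI-compatible endpoint
};
ApiKeyCredential localKey = new ApiKeyCredential("ollama"); // Local servers typically ignore the key, but the client still requires one
IChatClient localChatClient = new OpenAIClient(localKey, localOptions)
    .GetChatClient("llama3")
    .AsIChatClient();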
Building a conversation history
Both evaluation and chat completions require a conversation history in order to function. We'll simulate one by building a list of ChatMessage objects and populating it with a short interaction:
const string greeting = "How can I help you today?";
console.MarkupLineInterpolated($"[cyan]AI[/]: {greeting}");

const string userText = "Is today after May 1st? If so, tell me what the next month will be.";
console.MarkupLineInterpolated($"[yellow]User[/]: {userText}");

// Simulated RAG context that gets appended to the system prompt
string ragContext = "The current date is May 27th";

List<ChatMessage> messages = [
    new(ChatRole.System, $"{settings.SystemPrompt} {ragContext}"),
    new(ChatRole.Assistant, greeting),
    new(ChatRole.User, userText)
];
Our history is composed of a system prompt, an assistant greeting, and a simulated message from the user. Normally the user would type the message themselves, but since our goal in this article is to evaluate a simple interaction, this message is hardcoded here instead.
The system prompt is typically one of the things you'll want to try tweaking the most, so it's worth sharing the simple prompt used in this example:
You are a chatbot designed to help the user with simple questions. Keep your answers to a single sentence.
While the evaluation metrics we'll collect deal with the entire AI system, in reality one of the main things you'll use metrics for is tweaking your system prompt to improve its performance for targeted scenarios.
Also note that I am simulating Retrieval-Augmented Generation (RAG) in this example by injecting a ragContext variable that contains a simple string including the current date. Our main focus in this article is not RAG, but RAG will be relevant for the Retrieval metric later on. If you're curious about RAG, I recommend checking out my recent article on RAG with Kernel Memory.
Getting Chat Completions
Chat completions are relatively straightforward to retrieve once we have our chat history, and involve a simple call to the IChatClient:
ChatResponse responses = await chatClient.GetResponseAsync(messages);
foreach (var response in responses.Messages)
{
    console.MarkupLineInterpolated($"[cyan]AI[/]: {response.Text}");
}
This calls out to OpenAI with the simulated history, gets a response, and displays the resulting messages to the user.
Note: I'm using Spectre.Console as a formatting library to make this sample application easier to read. You can see this here with the MarkupLineInterpolated call, though this could just as easily have been a Console.WriteLine call.
Our sample scenario for this article is a short interaction where the user asks the AI system for the date, shown here:
AI: How can I help you today?
User: Is today after May 1st? If so, tell me what the next month will be.
AI: Yes, today is after May 1st, and the next month will be June.
Since the date in this interaction is May 27th, this response is factually correct, though evaluation results will flag that it isn't quite as complete as it could be, since the AI doesn't specify the current date in its response.
With our IChatClient ready and our chat completions available, we're now ready to get evaluation results.
Evaluating chat completions
Microsoft.Extensions.AI.Evaluation allows you to specify one or more evaluator objects to grade the performance of your AI system for a sample interaction.
Here's a simple example using a CoherenceEvaluator which makes sure the AI system's response is coherent and readable:
IEvaluator evaluator = new CoherenceEvaluator();
ChatConfiguration chatConfig = new(chatClient);
EvaluationResult evalResult = await evaluator.EvaluateAsync(messages, responses, chatConfig);
This produces an EvaluationResult object that contains a single metric for the coherence of your AI system. This metric will be a NumericMetric with a numeric Value property ranging from 1 to 5, with 1 being poor and 5 being nearly perfect. Each metric will also include a Reason property containing the justification the LLM provided for this rating. This helps you understand what might be lacking about responses that scored below a 5 and can be good for reporting as well.
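If you only care about a single metric, you don't need to loop over the full result. Here's a small sketch, assuming the Get<T> accessor and the CoherenceMetricName constant are available in your version of the library:

// A minimal sketch, assuming Get<T> and CoherenceEvaluator.CoherenceMetricName exist in your library version
NumericMetric coherence = evalResult.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
console.MarkupLineInterpolated($"Coherence: {coherence.Value:F1} - {coherence.Reason}");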
Evaluating multiple metrics
Most of the time you'll want to look at not just a single metric, but many different metrics.
The following example illustrates this by using a CompositeEvaluator composed of multiple component evaluators:
// Set up evaluators
IEvaluator evaluator = new CompositeEvaluator(
    new CoherenceEvaluator(),
    new CompletenessEvaluator(),
    new FluencyEvaluator(),
    new GroundednessEvaluator(),
    new RelevanceEvaluator(),
    new RelevanceTruthAndCompletenessEvaluator(),
    new EquivalenceEvaluator(),
    new RetrievalEvaluator()
);

// Provide context to evaluators that need it
List<EvaluationContext> context = [
    new RetrievalEvaluatorContext("The current date is May 27th"),
    new CompletenessEvaluatorContext("Today is May 27th and the next month is June"),
    new EquivalenceEvaluatorContext("The current date is May 27th, which is after May 1st and before June."),
    new GroundednessEvaluatorContext("May 27th is after May 1st. June is the month immediately following May.")
];

// Evaluate the response
ChatConfiguration chatConfig = new(chatClient);
EvaluationResult evalResult = await evaluator.EvaluateAsync(messages, responses, chatConfig, context);
You likely noticed that we're now defining a context collection of EvaluationContext objects and providing it to the EvaluateAsync call.
These context objects are needed for some of the more advanced evaluators. While some of the evaluators can work on their own, a few of them need you to provide additional details on what information should have been retrieved, what an ideal response to the request would look like, and what information absolutely needs to be in the response.
Failing to provide these context objects will not cause errors but does result in missing metric values for the relevant evaluators.
Displaying evaluation results
Evaluation metrics are helpful, but can be cumbersome to look at manually. Thankfully, they're not too hard to display in a table using Spectre.Console.
The following C# code loops over each metric in an EvaluationResult, adds it to a Table, and displays it to the console:
Table table = new Table().Title("Evaluation Results");
table.AddColumns("Metric", "Value", "Reason");
foreach (var kvp in evalResult.Metrics)
{
    EvaluationMetric metric = kvp.Value;
    string reason = metric.Reason ?? "No Reason Provided";
    string value = metric.ToString() ?? "No Value";
    if (metric is NumericMetric num)
    {
        double? numValue = num.Value;
        if (numValue.HasValue)
        {
            value = numValue.Value.ToString("F1");
        }
        else // Possibly missing a Context entry
        {
            value = "No value";
        }
    }
    table.AddRow(kvp.Key, value, reason);
}
console.Write(table);
Running this produces a nicely formatted table listing each metric, its score, and the reason the LLM provided.
Having a nicely formatted metrics display makes it easy to capture AI system performance and communicate it with others.
Microsoft.Extensions.AI.Evaluation also includes some great HTML and JSON reporting capabilities as well as the ability to examine multiple iterations and scenarios in the same evaluation run. These are beyond the scope of this article, but I plan to cover them in a future article.
Now that we've seen how to pull evaluation metrics, let's discuss what each of these AI system evaluation metrics means.
AI Systems Metrics
Let's cover the currently supported AI metrics available in Microsoft.Extensions.AI.Evaluation.
Equivalence
Equivalence verifies that the system's response roughly matches the sample response provided in the EquivalenceEvaluatorContext. Put simply, it's a way of measuring that the system's response was close to what we'd expect it to be.
Groundedness
Groundedness checks to make sure we're using the relevant facts in responding to the user. This helps make sure that the LLM isn't providing an answer that's wildly different from what we'd expect it to offer. This can be helpful for organizations that have specific lists of points they want to cover or organization-specific terms or definitions that may differ slightly from publicly available information.
Fluency
Fluency checks grammatical correctness and adherence to syntactical and structural rules. In short, it evaluates if the system is producing something that appears to be valid English or something that is complete gibberish.
Note: I suspect that Fluency will work with other languages that your LLM supports, but I have not investigated this at the time of writing this article.
Coherence
Coherence is a readability check to ensure the response is easy to read and flows well. While Fluency could be considered a grammar checker, Coherence could be considered an editor helping optimize your outputs for readability.
Retrieval
Retrieval grades the effectiveness of RAG systems at providing relevant context to your AI system for responding to the query. This is set via a RetrievalEvaluatorContext object.
Low retrieval scores may indicate that you have gaps in your content where your system is unable to find relevant information. Alternatively, you might have relevant content, but it's not organized in such a way that your embedding model and indexes are able to effectively retrieve it.
Completeness
Completeness checks to make sure that the response to the user covers all of the major points that the sample response you provided to CompletenessEvaluatorContext
contains.
In our example our system's response was correct, but incomplete as it did not include the current date in its output.
RTC Evaluators (Relevance, Truth, Completeness)
Microsoft also offers a newer and more experimental RelevanceTruthAndCompletenessEvaluator (RTC) that bundles together the effects of RelevanceEvaluator, GroundednessEvaluator, and CompletenessEvaluator into a single evaluator that does not require any context.
This RTC evaluator is still in preview and may change significantly or go away, but it has some core advantages over using the three evaluators individually:
- It's faster to make a single evaluation request versus making three separate requests
- This consumes fewer tokens, resulting in lower costs when evaluating your system with token-based LLMs that charge based on usage
- It does not require you to provide additional context to the evaluator, making this evaluator easier to use
Once the RTC evaluator leaves preview, it may be a good choice for teams who only want to work with a single evaluator for cost or performance reasons.
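Because the RTC evaluator follows the same pattern as the other evaluators, using it on its own is a small change from the earlier examples. Here's a short sketch reusing the messages, responses, and chatClient from before:

// Using the RTC evaluator on its own - no EvaluationContext objects are required
IEvaluator rtcEvaluator = new RelevanceTruthAndCompletenessEvaluator();
ChatConfiguration chatConfig = new(chatClient);
EvaluationResult rtcResult = await rtcEvaluator.EvaluateAsync(messages, responses, chatConfig);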
Next steps
In this article we saw how you can gather a variety of metrics related to the performance of AI systems using a few lines of C# code.
I'd encourage you to check out this article's source code on GitHub and play around with it yourself using your own API key and endpoint. Microsoft also has numerous samples related to the library that are worth checking out.
This capability unlocks a number of interesting paths forward for organizations including:
- Creating integration tests that fail if metrics are below a given threshold for key interactions (see the sketch after this list)
- Experimenting with different system prompts in a manner similar to A/B testing, but using an LLM evaluator as a referee
- Integrating these tests into an MLOps workflow or CI/CD pipeline to ensure you don't ship suddenly degraded AI systems
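To illustrate that first point, here's a rough sketch of what a threshold-based integration test might look like using xUnit. The BuildChatClientAndMessages helper, the metric accessor, and the threshold of 4 are assumptions for illustration rather than anything prescribed by the library:

// A hypothetical xUnit integration test - a sketch, not a prescribed pattern
// Assumes a helper (BuildChatClientAndMessages) that wires up the same chatClient and messages shown earlier
[Fact]
public async Task KeyInteraction_ShouldStayCoherent()
{
    (IChatClient chatClient, List<ChatMessage> messages) = BuildChatClientAndMessages();
    ChatResponse response = await chatClient.GetResponseAsync(messages);

    IEvaluator evaluator = new CoherenceEvaluator();
    EvaluationResult result = await evaluator.EvaluateAsync(messages, response, new ChatConfiguration(chatClient));

    // Fail the test if coherence drops below an arbitrary threshold of 4 out of 5
    NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
    Assert.True(coherence.Value >= 4, $"Coherence was {coherence.Value}: {coherence.Reason}");
}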
As someone who is very cost-averse, having an automated framework for evaluating the performance of LLM-based systems and being able to provide my own models (including Ollama models) is a critical capability and unlocks many different workflows I wouldn't ordinarily have access to.
While this article covers the evaluation capabilities of Microsoft.Extensions.AI.Evaluation, it barely scratches the surface of its capabilities. Stay tuned for part two of this article which will delve into its reporting options and more advanced A/B testing capabilities.