Creating Evaluation Criteria and Datasets for your LLM App

Seya
26 min read · Apr 29, 2024


When creating features using LLMs in products, the most important things are the evaluation criteria and the dataset.

This may seem obvious to some, but I personally did not give it enough attention and ended up making ad-hoc adjustments to prompts.

However, in reality, it is quite challenging to prepare the necessary evaluation criteria and datasets for unique prompts used in products from the get-go. This has sparked my interest in how to rapidly create these components.

In this article, I will verbalize why these are so important and consider what kind of process would be good for creating them.

Introduction: What is needed to release a feature using LLMs

Since LLMs are just a “how”, the way of thinking should not differ from regular product development: a feature can be released once it is confirmed that it meets its requirements.

In other words, there should be feature requirements in the form of “given this kind of data” and “the LLM should return this kind of output”. These expected behaviors are defined as evaluation criteria, and to verify that these criteria are met, inputs aligned with the criteria are needed, which are constructed as a dataset.

For those who have not worked with LLMs, it may be difficult to imagine, so I will provide a concrete example. For the following prompt, I envision the evaluation criteria to be something like this:

▼ Prompt

You are an excellent travel assistant AI.
Please provide as much specific and useful information as possible about
sightseeing in {place_name}.

However, please refrain from expressing personal opinions.

▼ Evaluation criteria

- Does it cover the main tourist spots in {place_name}?
- Does it specifically explain the features and highlights of each tourist spot?
- Does it provide practical information such as access methods and required time?
- Does it avoid expressing subjective opinions?
- Does it maintain a bright and polite tone?

And for the dataset, I envision a list of pairs of prompt variable parts and expected outputs like this:

[
  {
    "input": {
      "place_name": "Kamakura"
    },
    "output": "Kamakura is a popular tourist destination in Kanagawa Prefecture. Here are some of the highlights:\n\n1. Tsurugaoka Hachimangu Shrine: A shrine associated with Minamoto no Yoritomo. The main hall and the stage are designated as national treasures.\n2. Kotoku-in Temple: Famous for the \"Kamakura Daibutsu\" (Great Buddha of Kamakura). The 11.3m tall bronze seated statue of Amitabha Buddha is impressive.\n3. Hase-dera Temple: The principal image, the eleven-headed Kannon Bodhisattva, is the largest wooden statue in Japan at 9.18m tall.\n4. Kamakura Museum of Literature: Introduces literary figures associated with Kamakura. Enjoy their works in historical buildings.\n5. Enoshima: A power spot floating in the sea accessible via the Benten Bridge. The view from the observation lighthouse is spectacular.\n\nAccess is about 1 hour from Tokyo Station on the JR Yokosuka Line. Taking the Enoshima Electric Railway or bus from Enoshima Station to tour each spot is recommended. If you have one full day, you can visit the main spots. Wear comfortable shoes and dress appropriately for the season to enjoy sightseeing."
  },
  {
    "input": {
      "place_name": "Hakone"
    },
    "output": "Hakone is a popular tourist destination in Kanagawa Prefecture. Here are some of the highlights:\n\n1. Hakone Shrine: A shrine with over 1200 years of history. The Torii of Peace and Kuzuryu Shrine are famous.\n2. Lake Ashi: Enjoy the lake scenery from sightseeing boats and a ropeway. The pirate ship is also popular.\n3. Owakudani: A volcanic area at the center of Mt. Hakone. Experience the volcanic fumes and sulfuric smell.\n4. The Hakone Open-Air Museum: Outdoor sculptures are scattered in a lush forest. Works by Picasso are also present.\n5. Odawara Castle: A castle associated with the Hojo clan, a major daimyo of the Kanto region. The keep offers a panoramic view of Sagami Bay.\n\nAccess is about 35 minutes from Tokyo Station to Odawara Station by shinkansen. From there, taking the Hakone Tozan Railway to Yumoto Station and touring each spot is recommended. The Hakone Free Pass is convenient as it offers unlimited rides on transportation. Don't forget to bring a change of clothes and towels if you plan to take a hot spring bath."
  }
]

Example outputs are not mandatory when tuning, but they are useful in various ways:

  • Can be applied to few-shot learning
  • Enables concrete image sharing of requirements among stakeholders
  • Can be used for similarity evaluation using embeddings (see the sketch below)
  • May be used for automatic prompt improvement with frameworks like DSPy

So my opinion is that it’s better to prepare them.
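
As one concrete illustration of the embedding-similarity point above, here is a minimal sketch. It assumes the OpenAI Python SDK and the text-embedding-3-small model purely as an example; any embedding provider works the same way, and the expected/actual texts below are abbreviated placeholders.

# Minimal sketch: score an actual LLM output against the expected output from
# the dataset via cosine similarity of their embeddings.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Any embedding model works; text-embedding-3-small is only an example.
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

expected = "Kamakura is a popular tourist destination in Kanagawa Prefecture. ..."  # from the dataset
actual = "Kamakura, in Kanagawa Prefecture, offers Tsurugaoka Hachimangu, the Great Buddha, ..."  # produced by your prompt

score = cosine_similarity(embed(expected), embed(actual))
print(f"similarity: {score:.3f}")  # how high is "high enough" is a threshold you decide per use case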

Now, what happens if you don’t create these evaluation criteria and datasets?

  • The behavior of the LLM depends on the qualitative judgment of the person doing the tuning, so it may not stably exhibit the desired behavior for the product.
  • Tuning itself tends to become an ad-hoc response to issues that arise.
  • Without past evaluation criteria, it becomes difficult to modify the prompt later (such as when changing the model) because it’s unclear what the added text was meant to protect, and there’s a possibility of introducing regressions in aspects that weren’t noticed.
  • And without these past assets, the dataset has to be recreated from scratch each time.

These phenomena occur, or rather, I experienced them myself… Therefore, both in the short and long term, it is necessary to steadily accumulate evaluation criteria and datasets as assets.

This article considers how to create these by:

  • Thinking about why evaluating LLM apps is so difficult in the first place -> because I believe those properties also shape the process.
  • Organizing the “how” for conducting evaluations -> because it is hard to consider specifics without understanding the available means.
  • Then finally considering the process.

Why is evaluating LLM-based apps so difficult?

First, I thought that verbalizing “why evaluating LLMs in product development is difficult” would help reveal “what kind of properties they have that make it difficult, and therefore how they should be advanced”.

To state the conclusion, I think the following two reasons are responsible:

  • The “breadth” of input and output for LLMs is too wide.
  • Evaluation criteria and datasets specific to our own product requirements do not exist in public

The “breadth” of input and output for LLMs is too wide

A characteristic of LLMs is their ability to handle unstructured data in the form of natural language as input, allowing users to freely enter questions and requests. (Although I think many parts of the prompt will be fixed when used within a product.) While this high degree of freedom is a strength of LLMs, it is also a factor that makes evaluation difficult.

For example, suppose an LLM is used for conversational purposes. Users may sometimes input very enthusiastically, while at other times they may only send brief messages to the AI.

Due to this diversity, it is extremely difficult to prepare comprehensive test cases. No matter how many variations are considered, it is impossible to fully cover the inputs that actual users would make.

Not only is there diversity in inputs, but the high uncertainty of outputs is another difficulty. LLMs behave probabilistically, so even for the same input, different outputs may be generated each time.

Of course, the output can be made somewhat fixed by lowering parameters like temperature, but this “fluctuation” can also be seen as creativity, so depending on the use case, it may sacrifice quality.

Due to these properties, it is nearly impossible to comprehensively cover all aspects in advance (or even after the release), so the evaluation perspectives need to be explored.

I’m sure many of you have experienced the feeling of “wow, this type of disappointing output also occurs…” while engaging in dialogue with the LLM during prompt tuning.

This “breadth” of inputs and outputs is the reason we use LLMs, and also the reason evaluations are difficult.

Evaluation criteria and datasets specific to our own product requirements do not exist

The above was about LLM evaluation in general, but evaluating LLMs within products means checking “whether the prompt created by a specific input + prompt template can handle the task well”.

Our products have unique requirements, and those requirements create the unique value of each product. HuggingFace, GitHub, and other platforms have many publicly available datasets and benchmark codes, but there are very few cases where they can be used as-is.

Therefore, we need to prepare all the evaluation criteria and datasets for prompts used within our products every time.

However, I believe that if we can accumulate past datasets, evaluation criteria, and evaluation execution methods as “assets” to a certain extent, we will be able to handle new problems relatively easily.

Organizing the “How”

Next, while I would like to consider how to create evaluation criteria and datasets based on the above properties, I think it would be difficult to consider the process in a concrete way without knowing the specifics of the “how”, so I will first try to organize these.

To start, as an overall picture, there is a diagram I really like in the LangSmith evaluation video series, so I will quote it.

Landscape of personalized evals

Source: Why Evals Matter | LangSmith Evaluations — Part 1

It introduces four main components, and while I will explain the individual elements later, it also provides decomposed diagrams for each one.

  1. Dataset
  2. Evaluator
  3. Task
  4. Applying Evals

In this article, I will treat the flow up to release as the primary scope, and I will consider prompts only, without covering RAG, non-machine-learning code, and so on, regardless of whether the Task consists of a single prompt or a combination.

Therefore, here I will organize:

  1. How to create evaluation criteria
  2. How to conduct evaluations (Evaluator)
  3. How to create datasets (Dataset)

How to create evaluation criteria

While “evaluation” of LLM outputs is a single term, evaluation criteria have various purposes. Still, I think they can be classified to some extent, and first, they can be broadly divided into “product requirements” and “guardrails”.

Product requirements
These are the roles of the prompt verbalized in light of the product requirements.

These can be further divided into positive and negative requirements.

Positive — “I want answers like this to be returned”
For example, in the case of a travel assistant app,

  • “I want it to comprehensively present the main tourist spots of the destination”
  • “I want it to specifically explain the appeal and highlights of each tourist spot”
  • “I also want it to provide practical information such as access methods and required time”

Negative — “I don’t want answers like this to appear”
For example, again in the case of a travel assistant app,

  • “I don’t want it to express subjective opinions or evaluations”
  • “I don’t want the information to be outdated or inaccurate”
  • “I don’t want it to provide irrelevant information unrelated to the intent of the question”

Guardrails
While slightly similar to the negative items mentioned above, these refer to the overall quality that you want to maintain regardless of the product requirements, such as:

  • Does it contain personal information?
  • Does it contain toxic language?
  • etc.

Quite a few of these are things you want to enforce in any product, and various generic libraries are available for them.

In particular, there’s a site/library called Guardrails Hub, which is a collection of such validation functions.
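
As a toy illustration of what such a validation function can look like (this is a hand-rolled sketch for illustration, not the Guardrails Hub API), a guardrail can be as simple as a function that flags suspicious patterns:

import re

# Naive guardrail sketch: flag outputs that look like they contain personal
# information or blocklisted words. A real product would use a proper library
# or classifier; the patterns below are illustrative placeholders.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{2,4}-\d{2,4}-\d{3,4}\b")
BLOCKLIST = {"idiot", "stupid"}  # stand-in for a real toxicity check

def check_guardrails(output: str) -> dict:
    violations = []
    if EMAIL_RE.search(output) or PHONE_RE.search(output):
        violations.append("possible_personal_information")
    if any(word in output.lower() for word in BLOCKLIST):
        violations.append("toxic_language")
    return {"passed": not violations, "violations": violations}

print(check_guardrails("Contact me at foo@example.com"))
# {'passed': False, 'violations': ['possible_personal_information']}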

Non-functional requirements
Also, while the main topic this time is evaluation and datasets for LLM outputs themselves, in reality, prompts also have non-functional requirements such as token count and latency. These also affect the actual experience and may even be related to the architecture in some cases, so it is important to be aware of them early on.
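
A minimal sketch of keeping an eye on those numbers, assuming tiktoken for token counting; the encoding name and the call_llm helper are assumptions, so use whatever matches your model and client.

import time
import tiktoken

# Sketch: record prompt/output token counts and wall-clock latency per call.
enc = tiktoken.get_encoding("cl100k_base")  # assumption: pick the encoding for your model

def measure(prompt: str, call_llm) -> dict:
    start = time.perf_counter()
    output = call_llm(prompt)  # call_llm is your own wrapper around the LLM API (hypothetical)
    latency = time.perf_counter() - start
    return {
        "prompt_tokens": len(enc.encode(prompt)),
        "output_tokens": len(enc.encode(output)),
        "latency_seconds": round(latency, 2),
    }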

So far, we have looked at the classification of evaluations, but intuitively, I feel that they can be neatly organized by applying the Kano model.

Source: Kano model

Guardrails are “must-have quality”, negatives are “one-dimensional quality”, and positives span “one-dimensional quality” and “attractive quality”. The idea is to first ensure the former at a minimum, and then gradually satisfy the latter quality by increasing data and evaluation.

Next, having created these evaluation criteria, I would like to look at how to actually operate them, the evaluation methods.

How to conduct evaluations

Once you have evaluation criteria, the next issue is how to specifically conduct the evaluations. There are two things to consider: “what will be the concrete output format of the evaluation” and “who will do it and how”.

What will be the concrete output format of the evaluation?
The concrete output of the evaluation is obtained by running the LLM on each item in the dataset and applying the evaluation criteria to the results.

Ideally, the evaluation results should be expressed numerically. For example, there is a method of scoring on a scale of 0 to 1. This makes it easy to compare the results of multiple prompts or settings and track changes over time.

However, depending on the evaluation criteria, quantification may be difficult. In such cases, truth values (True/False) may be used to express whether the evaluation criteria are met. For example, criteria such as “does the output contain specific keywords” are better suited to be expressed as truth values.

Also, evaluation results are typically expressed not only as results for individual items but also as aggregate values for the entire dataset, for example the mean or median of the numeric scores, or the percentage of items whose truth value is True. This allows for an overall judgment of the performance of the prompt or the entire system.
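
For example, a minimal sketch of turning per-example results into dataset-level numbers (the field names here are placeholders):

from statistics import mean, median

# Per-example evaluation results: a numeric score plus a boolean criterion.
results = [
    {"score": 0.8, "no_subjective_opinions": True},
    {"score": 0.5, "no_subjective_opinions": True},
    {"score": 0.9, "no_subjective_opinions": False},
]

summary = {
    "mean_score": mean(r["score"] for r in results),
    "median_score": median(r["score"] for r in results),
    "no_subjective_opinions_rate": sum(r["no_subjective_opinions"] for r in results) / len(results),
}
print(summary)  # e.g. {'mean_score': 0.73..., 'median_score': 0.8, 'no_subjective_opinions_rate': 0.66...}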

As an example, the following is the result screen when an evaluation is executed in LangSmith.

Who conducts the evaluations?
There are three main options:

  • Humans
  • LLMs (LLM-as-a-judge)
  • Programmatically

The choice is mainly made by weighing money💰, annotation quality, and time.

Humans

This is the most expensive and time-consuming option, but it gives the highest confidence in annotation quality. I think in most cases, people involved in product development within the company will initially do this.

Or sometimes people are hired to do annotations. I don’t have experience hiring people myself, so I don’t have insights on where it’s best to hire them, but I once learned from the book below that requests can be made through a service called Amazon Mechanical Turk.

Regardless of how you request annotations, an important thing to recognize is “whether that person is suitable for annotating that task”.

This is clear when creating a domain-specific product. For example, when using LLMs in a medical service, someone like me, who has spent their whole career programming, would clearly be unsuitable for annotating.

Even if deep domain knowledge is not required to that extent, there will be many cases where the people in the company or those you ask for annotations are not exactly the same as the users or cannot imagine the same issues.

LLMs aside, cooperating with domain experts to explore better product requirements and designs is common in product development, and the same applies to annotation.

The level of expertise required will vary depending on what requirements the LLM is used for, but I think the key is whether you can build a system in which people capable of producing high-quality annotations actually do so, ideally providing reasons along with their annotations so that those reasons can feed back into the evaluation criteria.

LLMs (LLM-as-a-judge)

LLM-as-a-judge is also a method that has been used quite frequently in recent times. As the name suggests, LLMs are asked to perform evaluations and give a score between 0 and 1, or make a binary judgment of Yes or No. For example, it is done with a prompt like the following:

Question:
{question}

Answer:
{answer}

Please evaluate the above question and answer pair on a scale of 0 to 1 based on the following perspective. Do not output anything other than the numerical value.

- Does the answer accurately capture the intent of the question?

The simplest way to evaluate is as shown above, and this method is often referred to when the phrase “LLM-as-a-Judge” is used. This is called a score-based evaluation. There are also various other methods such as:

  • Probability-based evaluation: Calculate the probability (generation likelihood) of generating the target output and use it as the evaluation score.
  • Pairwise evaluation: Compare two output results for one task.

So even if it’s all called “LLM-as-a-Judge”, there are various methods, so it’s good to consider them according to the purpose and objective.
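
A minimal sketch of the score-based variant, assuming the OpenAI Python SDK and a model name chosen purely as an example; the judge prompt is the one shown above.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Question:
{question}

Answer:
{answer}

Please evaluate the above question and answer pair on a scale of 0 to 1 based on the following perspective. Do not output anything other than the numerical value.

- Does the answer accurately capture the intent of the question?"""

def judge(question: str, answer: str) -> float:
    # Score-based LLM-as-a-judge: ask the model for a single number and parse it.
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return float(response.choices[0].message.content.strip())

print(judge("What should I see in Kamakura?", "Tsurugaoka Hachimangu and the Great Buddha are must-sees."))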

One thing to note about LLM-as-a-Judge is that LLMs have biases of their own, such as the following:

Positional bias

  • LLMs tend to favor answers in certain positions (mainly favoring the first and last).
  • The paper below proposes having the LLM judge twice, swapping the order of the two answers, and only declaring one answer superior if it wins in both orderings (sketched in code after this list).

Redundancy bias

  • LLMs tend to favor long and redundant answers even if they are inferior in clarity, quality, and accuracy.

Self-aggrandizement bias

  • LLMs may favor answers they generated themselves.
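
A minimal sketch of the position-swap mitigation mentioned above, where judge_pair is a hypothetical helper that shows two answers to the judge LLM and returns "A" or "B":

# Judge the pair twice with the order swapped, and only declare a winner
# if both rounds agree; otherwise treat the comparison as a tie.
def robust_pairwise_judgement(question: str, answer_1: str, answer_2: str, judge_pair) -> str:
    first = judge_pair(question, answer_1, answer_2)   # answer_1 shown first
    second = judge_pair(question, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"  # preferred in both orderings
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"           # the judge contradicted itself, so don't trust either run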

Also, while cheaper than humans, LLMs still cost money to some extent, so it’s not realistic to blindly have LLMs evaluate everything and perform evaluations with a variety of inputs.

I think it would be good to operate in a way such as:

  • Start creating evaluation prompts early in the evaluation iteration loop, and use them as a rough reference for finding errors while gradually improving their quality.
  • If it seems like an evaluation criterion that would be versatile across various prompts, take the time to evaluate the evaluation prompt itself and nurture it into a reliable one before reusing it.

By the way, the question of “can the evaluation prompt really evaluate correctly?” will of course come up. I was also curious about how everyone handles this, so I did some research, and it seems that, to put it bluntly, most people simply start with whatever evaluation text they think is appropriate.

For example, you start with a prompt like the following:

The following text introduces recommended spots for a travel destination. Please evaluate on a scale of 1 to 5 whether this text provides useful information for the reader. Consider 5 to be the most useful and 1 to be not useful at all.

Evaluation perspective:
- Are the features and highlights of the spot specifically explained?

Text to be evaluated:
'''
{The text to be evaluated goes here}
'''

Evaluation score:

Starting from a prompt like this, you actually have various texts evaluated. Then, you check the evaluation results against human intuition to see if the evaluations are being made adequately.

In that process, you check whether errors are actually being detected, whether there are many false positives, and so on, and adjust the prompt or add/modify evaluation perspectives, gradually improving the quality of the evaluation.

While evaluation serves the major purpose of final quality assurance, it also aims to speed up the development cycle by assisting in error detection during experimentation. In light of that purpose, it may be better to start with the accuracy of the evaluation itself as a rough guide rather than being overly particular about it from the beginning.

Also, if it is an evaluation item that is likely to appear in various places for that product, I think it may be a good idea to invest a certain amount of time to conduct an “evaluation of the evaluation”.

Programming

This is the cheapest and least time-consuming method, and if the evaluation item can be evaluated this way, it should be actively adopted.

Simple things like judging the length of the text, checking whether specific strings are included, or using small machine learning models such as sentiment analysis are what I have in mind.
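
For instance, a couple of such checks might look like this (the length limit and keyword list are placeholders):

# Programmatic evaluators: cheap, deterministic checks that need no LLM call.
def within_length(output: str, max_chars: int = 2000) -> bool:
    return len(output) <= max_chars

def mentions_required_keywords(output: str, keywords: list[str]) -> bool:
    return all(kw in output for kw in keywords)

output = "Kamakura is a popular tourist destination in Kanagawa Prefecture..."
print(within_length(output))                                       # True
print(mentions_required_keywords(output, ["Kamakura", "access"]))  # False: "access" is not mentioned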

Humans are of course expensive, and LLMs are also quite expensive, so it may be a good flow to start with LLMs that are easy to try, and if you feel it is a promising evaluation item, consider if it can be replaced programmatically.

How to create datasets

Next, let’s look at how to create the datasets necessary for conducting evaluations.

  • Humans
  • Have LLMs etc. create them (Synthetic Dataset)
  • Collect from actual execution logs

Also, as a premise, regardless of the method used, the purpose of creating the dataset is for evaluation, so it is important to consciously include good examples that meet the evaluation criteria and bad examples that get caught by the guardrails.

Humans

Of course, having humans create them will be the first candidate to come to mind.

Manually prepare multiple patterns of typical inputs that meet the product requirements, and create the expected responses for each. I think it’s good to create at least the first few based on the requirements, to have a common understanding among developers and product managers of “this kind of output is desirable”.

However, for evaluation, it is desirable to test with multiple inputs for the same evaluation perspective. Creating a sufficient amount of data manually to achieve this is quite a laborious task, so this is where the next option of generation by LLMs comes into consideration.

Have LLMs etc. create them (Synthetic Dataset)

This is where LLMs come into play again for dataset creation.

I have also been trying this recently. Just passing the prompt string and saying “please create various variations of data” doesn’t seem to produce a very good dataset (it may output data with only a few words changed), but if you provide the evaluation criteria, or ask the LLM to think about them first and then output data to test them, it produces data of a quality that is usable for evaluation to a decent extent.

At this point, presenting a few typical examples created by humans in the style of few-shot learning is also effective in preventing the format from deviating too much (although I sometimes feel it gets pulled along a bit too much).

Example of a data generation prompt

▼ Prompt for generating evaluation criteria

given prompt below, this prompt is aimed to {purpose_of_prompt}  
list what perspectives are necessary to evaluate this prompt in addition to the initial evaluation perspectives.

<evaluation_perspectives>
{evaluation_perspectives}
</evaluation_perspectives>

<prompt>
{prompt}
</prompt>

<examples>
{examples}
</examples>

▼ Prompt for generating data

now create 3 examples of input and expected output that can test each evaluation perspective.

write only list of examples JSON
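
As a rough sketch of how those two prompts can be wired together (the client, model name, and placeholder values are assumptions, and JSON parsing may need a retry in practice):

import json
from openai import OpenAI

client = OpenAI()  # any chat-completion client works; OpenAI is just an example

# The two prompts shown above, as templates. The values passed in below are
# abbreviated stand-ins for your real prompt, perspectives, and human-written examples.
CRITERIA_PROMPT = (
    "given prompt below, this prompt is aimed to {purpose_of_prompt}\n"
    "list what perspectives are necessary to evaluate this prompt "
    "in addition to the initial evaluation perspectives.\n\n"
    "<evaluation_perspectives>\n{evaluation_perspectives}\n</evaluation_perspectives>\n\n"
    "<prompt>\n{prompt}\n</prompt>\n\n"
    "<examples>\n{examples}\n</examples>"
)
DATA_PROMPT = (
    "now create 3 examples of input and expected output that can test "
    "each evaluation perspective.\n\nwrite only list of examples JSON"
)

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

messages = [{"role": "user", "content": CRITERIA_PROMPT.format(
    purpose_of_prompt="provide useful sightseeing information for a given place",
    evaluation_perspectives="- Does it cover the main tourist spots in {place_name}?",
    prompt="You are an excellent travel assistant AI. ...",               # your real prompt
    examples='[{"input": {"place_name": "Kamakura"}, "output": "..."}]',  # human-written examples
)}]
messages.append({"role": "assistant", "content": ask(messages)})  # generated evaluation perspectives
messages.append({"role": "user", "content": DATA_PROMPT})
dataset = json.loads(ask(messages))  # may fail if the model wraps the JSON in prose; retry or strip as needed
print(f"generated {len(dataset)} examples")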

In addition to creating with prompts, you can also consider having the LLM think of inputs while repeatedly running it many times through a CLI or actual screen in combination with E2E testing tools.

I think this is a method that is particularly useful during the initial prompt tuning and evaluation when there isn’t much data available.

Also, in the field of creating Instruction datasets for model tuning, the use of Synthetic Data is being further explored, and methods such as augmenting from seed examples and introducing mutations have been proposed.

Source: Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

These methods target training data rather than evaluation data, but I think the ideas can be applied directly to creating datasets for prompts as well. For example, I believe a variation-expanding approach like Evol-Instruct could extend the evaluation perspectives to include ones that developers could not recognize in advance.

Collect from actual execution logs

And the last is “collecting data from actual logs”. This includes logs accumulated from internal testing as well as from actual users using the product.

By collecting the input-output pairs of actual users, you can cover a wide range of cases that developers may not have anticipated. Especially after releasing the product, operating it for a certain period of time and accumulating data will allow you to build a dataset that is more in line with reality.

However, when collecting data from logs, careful consideration must be given to user privacy. Utmost care should be taken in handling personal information, and appropriate anonymization measures should be applied.

The collected data will be annotated manually or using LLMs as described earlier, in light of the evaluation criteria. Regardless of how the annotation is done, it will be an important initiative to create a pipeline for “selecting targets (ideally tied to user feedback)”, “automatically adding to the dataset when annotated”, etc.
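
A minimal sketch of the “automatically adding to the dataset when annotated” step, assuming the LangSmith Python SDK; the dataset name and the shape of the annotated log records are assumptions.

from langsmith import Client

client = Client()

# Hypothetical annotated log records selected from real traffic: each carries the
# prompt inputs, the model output, and the annotator's verdict.
annotated_logs = [
    {"inputs": {"place_name": "Nikko"}, "output": "Nikko is famous for Toshogu Shrine ...", "approved": True},
    {"inputs": {"place_name": "Atami"}, "output": "I think Atami is the best place ever!", "approved": False},
]

for log in annotated_logs:
    if log["approved"]:  # only add examples that passed annotation
        client.create_example(
            inputs=log["inputs"],
            outputs={"output": log["output"]},
            dataset_name="travel-assistant-eval",  # dataset name is an assumption
        )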

Also, as a major premise, in order to be able to do this, it is assumed that logging tools like LangSmith are introduced. These tools often come with various support features such as annotation queues, feedback functions, and automatic scoring functions as mentioned earlier, so it would be good to keep such aspects in mind when making technology selections.

To reiterate, it is basically impossible for just the people involved in development within the company to cover all the evaluation perspectives and create a complete dataset, so using this raw data for continuous improvement is an extremely important initiative.

How to manage evaluation criteria and datasets

So far, we have looked at the specific “how”, but let’s also look at how evaluation criteria and datasets should be managed more concretely.

I was trying to keep the content of this article somewhat abstract without depending on specific technologies, but I think it’s better to have a concrete image, so here I will introduce a case study using LangSmith.

LangSmith has evaluation and dataset features.

In the dataset feature, you can manage Examples of inputs and outputs for a specific prompt as shown below. By the way, LangSmith also has a logging feature (this is actually the main feature of LangSmith), and you can add to this dataset by simply clicking “Add to Dataset” from the log, which is convenient when reinforcing the dataset from testing and real user interactions.

Evaluations are executed using code with the LangSmith SDK. I think it would be good for such code to be maintained like test code.

from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate

# Define dataset: these are your test cases
dataset_name = "langgraph-test"

def predict(inputs: dict) -> dict:
    # Execute the LLM here and return the result.
    # `inputs` contains one row from the dataset, so this predict function is
    # executed once for every Example in the dataset.
    response = ...  # e.g. call your prompt / chain with `inputs`
    return {"output": response}

# Evaluation function. `evaluators` takes a list, so multiple evaluation functions can be defined.
# Return {"key": "label name shown on the result screen", "score": a number or True/False}.
def must_mention(run: Run, example: Example) -> dict:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return {"key": "must_mention", "score": score}

experiment_results = evaluate(
    predict,
    data=dataset_name,
    evaluators=[must_mention],
)

Executing this code runs the LLM and evaluation, and the results can be viewed from the “Experiments” tab on the dataset page shown earlier. (In the screenshot the run is throwing an error and the content is empty, but normally the actual execution results are included in the output.)

▼ Experiment result list screen

▼ Experiment details

By the way, one of the selling points of LangSmith is that it allows you to view traces of LangGraphs and other chains tied to evaluations, not just single prompts.

For example, the following is a LangGraph trace, which makes it easy to adopt a workflow of going back and forth between running the entire system in LangGraph and finding individual problems so you can unit-test only those parts.

This is so powerful that I’m starting to feel it may be best to actively get locked into the LangChain ecosystem.

However, while I introduced LangSmith as an example this time, the important part is that “datasets, evaluation criteria, and their execution code” are maintained as common assets, and as long as that is ensured, I think other technologies are fine as well.

Process to create evaluation metrics and datasets

So far, we have looked at the specific “how”. While I would like to say “just combine these and you’re good to go!”, as mentioned earlier in “why evaluating LLM-based features is difficult”, it is extremely challenging to define the evaluation criteria that should be the goal all at once.

Therefore, I would like to ponder how to create evaluation criteria and datasets “in stages”.

To reiterate, until the judgment to release a feature, the goal is to “have evaluation criteria and the associated evaluation dataset, execute the LLM on that evaluation dataset, and confirm that it exceeds the predefined threshold”.

So, first, let’s consider creating evaluation criteria and a dataset as quickly as possible, but as I have repeatedly mentioned, the process of creating these has to be exploratory, so it will be accompanied by the premise of ad-hoc prompt tuning.

Roughly speaking, I think the flow will be something like this:

  • Create an initial version of evaluation criteria and dataset
  • Architecture design
  • Manually run the overall and individual prompts to some extent
  • Create datasets and evaluation criteria for individual prompts (unit testing)
  • Check the entire system (integration testing)

Create an initial version of evaluation criteria and dataset

First, manually derive evaluation criteria from the requirements.

If the requirements are quite specific, that’s a different story, but often it’s easier to broaden ideas while actually running the system, so rather than trying to create them rigorously here, create them with the mindset of creating data for PoC testing.

Initial architecture design

For the feature you want to create, it’s rare that a single prompt is sufficient, and there will be some business logic, multiple prompts linked together, or branches inserted. This phase is where you get the feeling of “yeah, this seems like it’ll work”.

However, systems involving machine learning are known to often fall into “PoC poverty”, where “what was created as a trial turns out to be more disappointing than imagined when verified later”.

Below is a quote from a Japanese article:

A common case is to first create a trial version as a PoC (Proof of Concept), and if it goes well, move on to the actual project, but this also has problems. That is, there are examples of “PoC poverty” being seen everywhere. As a characteristic of inductive software, it cannot guarantee 100% correctness. The client can point out later that “the accuracy is insufficient” or “the capability is inadequate”.

This is more of a mindset issue rather than a technical one, but I feel it’s important to recognize that there is an unimaginably large gap between the developer’s qualitative “seems like it’ll work” and the actual production quality, and to diligently work on evaluations without letting your guard down.

Manually run the overall and individual prompts

Once a certain framework is in place, rather than immediately creating rigorous evaluation criteria and datasets, I think it’s good to engage with the system to some extent as a preliminary step to deepen the understanding of evaluation perspectives and also to increase the data in the process.

First, run through the entire flow and check the behavior of each prompt and whether the coordination between prompts is working well. In doing so, trying various input patterns will reveal issues such as the robustness of the system and handling of edge cases.

Also, for each prompt, create test cases from the perspective of the evaluation criteria and thoroughly check whether it is behaving as expected. Through this manual testing, the excessiveness or insufficiency of the evaluation criteria and the variations needed in the dataset will become apparent.

Create datasets and evaluation criteria for individual prompts (Unit testing)

Based on the insights gained from manual testing, refine the evaluation criteria for each prompt. The evaluation criteria should be as quantitative and measurable as possible, but I think there will be some that can only be judged qualitatively to some extent.

On top of that, expand the dataset to meet the evaluation criteria. As mentioned earlier, it is essential to efficiently collect high-quality data while utilizing LLMs and user logs (initially from people within the company).

Once the dataset is ready, tune the prompts using it. Perform the tuning until the numerical values of the evaluation criteria increase or until you feel there are no more issues when looking at it qualitatively, but in this process, you may also want to update the evaluation criteria, dataset, and evaluation methods as appropriate.

However, depending on the importance of the task and the likelihood of changes, strict evaluation criteria and datasets may not always be necessary, for example for minor tasks or for parts that are likely to change significantly at the prototype stage.

That being said, it is important to leave even a small amount of data and evaluation perspectives as assets that can be referred to later. By saving them not only for checking on hand but also in a designated place within the team, they can be utilized for subsequent development and improvement.

Creating evaluation criteria and datasets comes with a certain cost, but from the perspective of long-term quality improvement and maintainability, I recommend doing it to an appropriate extent.

Check the entire system (Integration Test)

Once the quality of individual prompts has improved, check the operation of the entire system again. It is not uncommon for new issues to arise when combined, even if there are no problems individually. There are various ways to view the overall results, such as using logging tools if they allow you to view the entire process at once, or in some cases, interacting with the UI or creating dedicated tools.

I think the tuning flow will involve going back and forth between this “testing through the entire system <=> unit testing”.

When new problems stop appearing, define a release threshold based on the evaluation criteria and dataset you have built, and once the results exceed it, you can finally start preparing for release.
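
That release judgment can even be encoded as a simple gate run in CI; here is a minimal sketch, where the metric names and threshold values are placeholders.

# Release gate sketch: fail if aggregate evaluation results fall below agreed thresholds.
THRESHOLDS = {"mean_score": 0.8, "guardrail_pass_rate": 1.0}  # placeholder values

def ready_for_release(summary: dict) -> bool:
    return all(summary[name] >= minimum for name, minimum in THRESHOLDS.items())

summary = {"mean_score": 0.83, "guardrail_pass_rate": 1.0}  # e.g. aggregated from the latest experiment run
assert ready_for_release(summary), "Evaluation results below release threshold; do not ship."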

However, continuous monitoring and improvement are essential even after the release. It will be necessary to continuously update the evaluation criteria and dataset while reflecting actual user feedback.

Conclusion

So, in this way, I have considered how to create evaluation criteria and datasets. I would be happy if you could comment on how you are doing it in your own workplaces.

Somehow, I have a feeling that the maintenance of evaluation criteria and datasets for prompts will tend to be neglected, just like frontend and E2E testing, even though it is known that “it’s better to write them”.

This is because prompt tuning is more conducive to feeling progress by constantly changing prompts in response to the challenges at hand, rather than diligently creating surrounding elements like evaluations and datasets.

However, even if it may feel like progress is slower in the short term, properly establishing appropriate evaluation criteria and datasets will lead to improved productivity and quality in the long run.

Having evaluation criteria allows for objectively measuring the quality of prompts and clarifies points for improvement, and a good dataset enhances the generalization performance of prompts, enabling stable output for various inputs.

However, even if the importance of evaluation criteria and datasets is understood, actually creating and maintaining them is not easy. In fact, there may be cases where creating them does not pay off depending on the situation, so flexibility is necessary. On the other hand, I feel that efforts to raise awareness of their importance and create an environment that lowers the hurdles for creation are also needed. For example, preparing templates for evaluation criteria and developing tools to assist in dataset creation.

With that said, I personally would like to work on the following in the future. I would like to share information on these as well.

  • Building an evaluation, dataset, and experiment management environment
  • Synthetic Data for prompt examples
  • Automatic prompt tuning from evaluation criteria and datasets

That’s all for this article, thank you for reading!!👋

Reference literature

Evaluation of LLM Apps

Synthetic Dataset
