Evaluations

What is an Evaluation?

Evaluations are examples or statements that judge the quality or performance of your LLM outputs. In other words, evaluations (evals for short) are statements that tell your LLM application whether it is performing the way its human wants it to, for example, "The information extracted is factual and nothing incorrect was kept."

You can evaluate a number of factors, such as relevance, hallucination, or whether the LLM is being mean.

With evals, when you adjust your prompts (or Empromptu adjusts them automatically), you will know whether the new prompt is more or less performant.

There are 4 ways to define your evaluations in Empromptu:

  1. Standard templates from Empromptu

  2. Manually defined

  3. End User Confirmed

  4. Data generated

Standard Evaluation Templates from Empromptu

If you do not have any evaluation statements (yet), you can use Empromptu's UI to select an example set.

Using the UI:

  1. Head to Empromptu's evaluations page

  2. Click add a new evaluation

  3. From the Standard tab, select a template eval set.

In Code:
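The docs do not show a separate API call for pulling a standard template, so the snippet below is only a minimal sketch: it assumes you copy the template's statements into the same evals list that prompt_registry.register_prompt accepts (documented in full under Manually defined below). The eval names and texts here are illustrative placeholders, not Empromptu's actual template contents.

# Minimal sketch (assumption): supply template-style eval statements through the
# same 'evals' list that register_prompt accepts. Names and texts below are
# illustrative placeholders, not Empromptu's actual template contents.
standard_evals = [
    {'eval_name': 'relevance', 'eval_text': 'The output is relevant to the user request.'},
    {'eval_name': 'no_hallucination', 'eval_text': 'The output contains no fabricated information.'},
]

my_task = {
    'prompt_family_name': 'summary_for_langchain',
    'prompts': [
        {'prompt_name': 'prompt_1', 'prompt_text': prompt_text_1, 'model': 'gpt-4o-mini'},
    ],
    'evals': standard_evals,
}
prompt_registry.register_prompt(my_task)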

Manually defined

Using the UI:

  1. Head to Empromptu's evaluations page

  2. Click add a new evaluation

  3. On the Custom tab, write your eval statements by filling in:

    1. A name

    2. A description of what is to be evaluated

    3. The Evaluation Criteria, which are your eval statements

    4. The Expected Output: what the output is supposed to look like

    5. The model you want to use

    6. The temperature

In Code:

  1. Name the task [a string]; optionally, define an embedding model name

  2. Name the prompt or prompts [prompt_name] and provide the prompt text [prompt_text]

  3. Define what must be true as a string; these are your eval statements [eval_text]

  4. Set a temperature [optional; the default is 0]

  5. Define a model

my_task = {
    'prompt_family_name': 'summary_for_langchain',
    'embedding_model': 'small-3',  # OPTIONAL
    'prompts': [
        {
            'prompt_name': 'prompt_1',
            'prompt_text': prompt_text_1,  # e.g. "Do thing X to this text: {{scraped_text}}"
            'temperature': 0.6,  # OPTIONAL
            'model': 'gpt-4o-mini',
        },
        # ... add any additional prompt variants here
    ],
    'evals': [  # The short statements that will be used to grade the prompt's accuracy
        {
            'eval_name': 'extracted_truth',
            'eval_text': 'The information extracted is factual and nothing incorrect was kept.'
        },
        {
            'eval_name': 'extracted_completely',
            'eval_text': 'The information was extracted completely and nothing important is missing.'
        },
    ]
}
prompt_registry.register_prompt(my_task)

End User Confirmed

If you are using a human-in-the-loop model, or if your end users are evaluating results, you can send those results to Empromptu to improve your system based on your end users' feedback.

Using the UI:

  1. Head to Empromptu's evaluations page

  2. Click add a new evaluation

  3. On the Feedback tab:

    1. Name your Evals

In Code:

  1. Find your session key from your observability or analytics provider

  2. After the line where you defined your prompt registry, paste the following lines to send your user-confirmed evaluations as the prompt evaluator.

# Find this line of code where you defined your prompts:
# input_data = {"prompt_text": ModelUtils.random_article()}
# Paste these lines after the line above
my_thread_key = <your_thread_key_UID>
prompt_registry.new_thread(thread_key=my_thread_key)
  3. Find where you ingest your user-scored evals

  4. Import Empromptu's prompt registry into that file

  5. Paste this line

prompt_registry.annotate_data(my_thread_key, {'user_score': <the_user_score>})
  6. Celebrate!
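
Putting the feedback flow together, here is a minimal end-to-end sketch that combines the two calls above. The thread key and user score values are hypothetical placeholders for whatever your analytics provider and UI actually supply; prompt_registry is assumed to be the same registry you defined earlier.

# Minimal sketch of the end-user feedback flow (placeholder values are hypothetical).
# Assumes prompt_registry is already defined and your prompts are registered,
# as shown in the Manually defined section above.
my_thread_key = "session-1234"  # hypothetical session key from your observability/analytics provider
prompt_registry.new_thread(thread_key=my_thread_key)

# ... your application runs and the end user scores the result ...

user_score = 1  # hypothetical: 1 means the user confirmed the output was good
prompt_registry.annotate_data(my_thread_key, {'user_score': user_score})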

Data generated

(coming soon)

