Evaluating Responses

ChainForge makes it simple to systematically evaluate LLM responses: all you have to do is describe how to 'score' each response. That's it: no need to think about other metadata, or which LLM generated the response. ChainForge keeps track of all of that information for you.

To score responses, you can do a simple check, write code, or prompt an LLM Scorer with a scoring rubric. The choice depends on your use case, and you may decide to branch into multiple evaluations.

Note

ChainForge keeps track of metadata like what LLM generated the response or the values of template variables that filled the prompt which generated the response. You can also use this metadata in your scoring functions, both for code evaluators and for LLM scorers.

Code Evaluators

You can write code in Python or JavaScript to score responses, using the respective Evaluator node. Python Evaluators are only available in the locally installed version of ChainForge, and execute Python code within your local Python environment.

An Evaluator node looks like:

python-eval-node

js-eval-node

To use either code evaluator, you define a function evaluate that takes a single ResponseInfo object (described below) and returns a score, which may be a boolean (true or false), numeric, or categorical value (a string). For instance, here is a simple evaluation function returning how long the LLM's response was:

Python:

def evaluate(response):
  return len(response.text)

JavaScript:

function evaluate(response) {
  return response.text.length;
}

Note that you may define other functions to use; the only requirement is that you have a function named evaluate that takes a single response.
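
For instance, here is a sketch of an evaluator that defines a helper function and returns a categorical score; the helper's name (count_sentences) and the length thresholds are illustrative, not part of ChainForge:

# Hypothetical helper: a rough sentence count based on splitting at periods.
def count_sentences(text):
  return len([s for s in text.split('.') if s.strip()])

def evaluate(response):
  # Return a categorical score describing how long-winded the response is.
  n = count_sentences(response.text)
  if n <= 2:
    return "short"
  elif n <= 6:
    return "medium"
  else:
    return "long"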

Warning

Only run code that you trust, as code is evaluated on your machine. In particular, we caution against running eval() on code generated by LLMs unless you sandbox the application. In the future, we are looking into adding a sandboxing toggle for Python evaluators, so that you can easily sandbox evaluation of code. If you are interested in this use case, please raise an Issue on our GitHub.

ResponseInfo objects

evaluate functions take a single argument, a ResponseInfo object representing an LLM response. ResponseInfo objects have a number of properties, defined below:

  • response.text (string): The text of the response.
  • response.prompt (string): The text of the prompt used to query the LLM.
  • response.llm (string): The nickname of the queried model (its name in the model list).
  • response.var (dict): A dictionary of variable-value pairs, indexed by the names of the variables that filled in the prompt template(s) used to generate the final prompt. For instance, if {country} is a template variable, then response.var['country'] equals the value that was used to fill in the prompt which led to this response (e.g., Canada).
  • response.meta (dict): A dictionary of metavariable-value pairs. Metavariables are 'carried alongside' the explicit variables used to generate a prompt, and currently only refer to columns in Tabular Data. For instance, if {book} is a template variable filled by one column of a Tabular Data node, and there is also a column named Author, then response.meta['Author'] gives the author associated with the book that filled the prompt for this response.
  • response.asMarkDownAST() (mistune AST in Python; markdown-it AST in JavaScript): Parses the text of the response as Markdown and returns an abstract syntax tree (AST). For Python evaluators, the Markdown is parsed with the mistune package; for JavaScript, the parser is the markdown-it library.
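
As a sketch of how these properties can be used together (assuming a {country} prompt variable, as in the example above), an evaluator might check whether each response mentions the country that filled its prompt:

def evaluate(response):
  # The value of the {country} template variable that filled this response's prompt.
  country = response.var['country']
  # Score: does the response text mention the country it was prompted about?
  return country in response.text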

Practical examples of usage may be found on the Use Cases page, or by loading Example Flows in the application.

Debugging Output

Both Python and JavaScript evaluators allow you to print to a console. This is especially helpful for debugging code or double-checking the format of response objects. Simply use print (in Python) or console.log (in JavaScript) to print outputs:

Evaluating Output
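
For example, a quick way to see what an evaluator receives is to print a few properties of the response object; this sketch prints the model nickname and the prompt variables before returning a simple length score:

def evaluate(response):
  # Print the model nickname and the filled prompt variables to the console.
  print(response.llm)
  print(response.var)
  # Fall back to a simple length score.
  return len(response.text)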

Importing outside libraries (Python only)

The Python evaluator has the benefit of access to third-party libraries via import statements. Any packages installed in the environment you are running ChainForge from will be available. Simply add import statements at the top of your evaluation code:

import numpy as np

def evaluate(resp):
  # Check whether the square of the template variable 'number' appears in the response text.
  try:
    return str(np.power(int(resp.var['number']), 2)) in resp.text
  except Exception:
    # If 'number' is missing or not an integer, score the response as failing.
    return False

This code uses the numpy package to help evaluate whether the square of a template variable, number, appears in the LLM response.


LLM Scorers

Sometimes, it is not easy to write code to score each response. For instance:

  • What is the sentiment of a paragraph? (positive/neutral/negative)
  • Was the LLM's response apologetic in tone? (true/false)
  • How professional did the response sound? (1 to 5)

For such cases, you can use an LLM to 'score' each response. Simply add an LLM scorer and write a scoring prompt:

LLM-Scorer

We recommend that you explicitly specify the output format (e.g., true or false) and ask the LLM to only reply with its classification, nothing else. Scores are not guaranteed to stick to your output format, so you may need to try multiple variations of scoring prompts until you find one that is consistent. GPT-4 at temperature 0 is the default scorer.
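
For instance, a scoring prompt for the sentiment question above might read (adjust the wording to your own criterion):

Classify the sentiment of the text below as 'positive', 'neutral', or 'negative'.
Only reply with the classification, nothing else.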

Danger

The scores of LLMs are inherently imprecise, and can sometimes vary drastically with seemingly minor changes to your prompt. Always use a strong LLM to score responses and double-check the scores by clicking "Inspect Results". In the future, we are looking into ways to make human verification (cross-checking) of LLM scores easier.

Using implicit template variables in scorers

Sometimes, it may be useful to pass extra information to LLM scorers, such as when you are comparing against a ground truth. For instance, consider a situation where you are using a prompt template with a variable {writing_style} ('poem', 'text message', or 'formal letter') and you want to validate that the LLM's output was really in that style. You could do:

Respond with 'true' if the text below is in the style of a {#writing_style}, 
'false' if not. Only reply with the classification, nothing else. 

The hashtag # indicates that ChainForge should look for a variable or metavariable in each LLM response with that name. Since prompts were generated by filling a template with values for {writing_style}, such values are accessible later in a chain via {#writing_style}. For more details on implicit template variables, see Prompt Templating.

Comparing across scorers

LLM Scorer nodes are a convenient wrapper for scoring responses, but sometimes you want to compare scores across multiple LLMs. You can use prompt chaining to accomplish this:

Comparing LLM scorers

The table view of the Response Inspector is a nice way to spot the differences:

Comparing LLM scorers: response table

Prompt chaining does not automatically score responses, but it can give you a sense of which model generates the best scores, which you can then use in an LLM Scorer. We hope to add more features in the future to aggregate scores, or to extend the LLM Scorer to allow querying multiple LLMs.


Simple Evaluator

Conduct a simple boolean check on LLM responses against some text, a variable value, or a metavariable value, without needing to write code.

simple-eval-node

Operations include contains, starts with, ends with, equals, and appears in. You can choose to compare against the response as-is or the response in lowercase.

You might check if a response starts with LOL, for instance. Or, you can use the dropdown to check against a variable or metavariable associated with each response:

simple-eval-node-meta

For instance, here we check whether the LLM can solve math equations by comparing against the value of the Expected column in the Tabular Data node:

simple-eval-grouth-truth


Multi-Evaluator

Looking to evaluate responses across multiple criteria at once? The Multi-Evaluator node lets you define multiple criteria for a set of responses, mixing code-based and LLM-based evaluators:

multi-eval-node

The Table View of the Response Inspector can display scores on a per-criterion basis:

multi-eval-table

The code and LLM evaluators inside a Multi-Eval have exactly the same features as their node counterparts.

Multi-Evaluators are in beta, and don't yet include all features. In particular, there is no "output" handle to a Multi-Eval node, since Vis Nodes do not yet support criteria-level plotting. We will also be adding an AI-assisted eval generation interface, EvalGen, in the coming weeks.