We are continuously developing new features in ChainForge to better support our users needs. For example, we have connected popular benchmarks like HumanEval and OpenAI evals, so that anyone can run them across multiple models and cache responses with a (really nice) interface, versus writing tons of boilerplate code in Jupyter notebooks.
Release Notes are published on our GitHub page alongside every major update to ChainForge. Below, we have included some of the most recent changelogs from version 0.2 onwards.
v0.2.5: Chat Turns, LLM Scorers
We're excited to release two new nodes: Chat Turns and LLM Scorers. These nodes came from feedback during user sessions: - Some users wanted to first tell chat models 'how to act', and then wanted to put their real prompt in the second turn. - Some users wanted a quicker, cheaper way to 'evaluate' responses and visualize results.
We describe these new nodes below, as well as a few quality-of-life improvements.
🗣️ Chat Turn nodes
Chat models are all the rage (in fact, they are so important that OpenAI announced it would no longer support plain-old text generation models going forward.) Yet strikingly, very few prompt engineering tools let you evaluate LLM outputs beyond a prompt.
Now with Chat Turn nodes, you can continue conversations beyond a single prompt. In fact, you can:
Continue multiple conversations simultaneously across multiple LLMs
Just connect the Chat Turn to your initial Prompt Node, and voilà:
Here, I've first prompted four chat models: GPT3.5, GPT-4, Claude-2, and PaLM with the question: "What was the first {game} game?". Then I ask a follow-up question, "What was the second?" By default, Chat Turns continue the conversation with all LLMs that were used before, allowing you to follow-up on LLM responses in parallel. (You can also toggle that off, if you want to query different models --more details below).
Template chat messages, just like prompts
You can do everything you can with Chat Turns that you could with Prompt Nodes, including prompt templating and adding input variables. For instance, here's a prompt template as a follow-up message:
Note In fact, Chat Turns are merely modified Prompt Nodes, and use the underlying
PromptNode
class.
Start a conversation with one LLM, and continue it with a different LLM
Chat Turns include a toggle of whether you'd like to continue chatting with the same LLMs, or query different ones, passing chat context to the new models. With this, you can start a conversation with one LLM and continue it with another (or several):
Supported chat models
Simple in concept, chat turns were the result of 2 weeks' work, revising many parts of the ChainForge backend to store and carry chat context. Chat history is automatically translated to the appropriate format for a number of providers:
- OpenAI chat models
- Anthropic models (Claude)
- Google PaLM2 chat
- HuggingFace (you need to set 'Model Type' in Settings to 'chat', and choose a Conversation model or custom endpoint. Currently there's only one chat model listed in ChainForge dropdown: microsoft/DialoGPT
. Go to the HuggingFace site to find more!)
Warning If you use a non-chat, text completions model like GPT-2, chat turns will still function, but the chat context won't be passed into the text completions model.
Let us know what you think!
🤖 LLM Scorer nodes
More commonly called "LLM evaluators", LLM scorer nodes allow you to use an LLM to 'grade'/score outputs of other LLMs:
Although ChainForge supported this functionality before via prompt chaining, it was not straightforward and required an additional chain to a code evaluator node for postprocessing. You can now connect the output of the scorer directly to a Vis Node to plot outputs. For instance, here's GPT-4 scoring whether different LLM responses apologized for a mistake:
Note that LLM scores are finicky --if one score isn't in the right format (true/false), visualization nodes won't work properly, because they'll think the outputs are notof boolean type but categorical. We'll work on improving this, but, for now, enjoy LLM scorers!
❗ Why we're not calling LLM scorers 'LLM evaluators'
We thought long and hard about what to call LLMs that score outputs of other LLMs. Ultimately, using LLMs to score outputs is helpful, and can save time when it's hard to write code to achieve the same effect. However, LLMs are imperfect. Although the AI community currently uses the term 'LLM evaluator,' we ultimately decided not to use that term, for a few reasons: 1. LLM scores should not be blindly trusted. They are helpful if you already have a sense of what you're looking for, and want to grade hundreds of responses and don't care about picture-perfect accuracy. This is especially true after playing with LLM scorer nodes for a while and finding that small tweaks to the scoring prompt can result in vast differences in results. 2. Evaluators, like 'graders' or 'annotations,' is a term that has connotations with humans (i.e. human evaluator). We want to avoid anthropomorphizing LLMs, which contributes to peoples' over-trust in them. 'Scorers' still has human connotations, but arguably less so, and less authoritative ones than 'evaluator'. 3. Evaluators is a term in ChainForge that refers to programs that score responses. Calling LLM scorers 'evaluators' loosely equates them with programmatic evaluators, suggesting they carry the same authority. Although code can be wrong or incorrect, the scoring process for code is inspectable and auditable --not so with LLMs.
Fundamentally, then, we disagree with the positions taken by projects like LangChain, which tend to emphasize LLM scorers as the go-to solution for evaluation. We believe this is a massive mistake that tends to mislead people and causing them to over-trust AI outputs, including ML researchers at MIT. In choosing the term Scorers, we aim to --at the very least --distance ourselves from such positions.
Other changes
-
Inspecting true/false scored responses (in Evaluators or LLM scorers) will now show false in red, to easily eyeball failure cases:
-
In Response Inspectors, the term "Hierarchy" has been replaced with "Grouped List". Grouped Lists are again the default.
- In table view of the response inspector, you can now choose what variable to use for columns. With this method you can compare across prompt templates or indeed anything of interest:
Future Work
Chat Turns opened up a whole new can of worms, both for the UI, and for evaluation. Some open questions are: * How can we display Chat History in response inspectors? Right now, you'll only see the latest response from the LLM. There's more design work to do such that you can view the chat context of specific responses. * Should there be a Chat History node so you can predefine/preset chat histories to test on, without needing to query an LLM?
We hope to prioritize such features based on user feedback. If you use Chat Turns or LLM Scorers, let us know what you think --open an Issue or start a Discussion! 👍
v0.2.1.2: Table view, Response Inspectors keep state
There's two minor, but important quality-of-life improvements in this release.
Table view
Now in response inspectors, you can elect to see a table, rather than a hierarchical grouping of prompt variables:
Columns are prompt variables, followed by LLMs. We might add the ability to change columns in the future, if there's interest.
Persistent state in response inspectors
Response inspectors' state will, to an extent, persist across runs. For instance, say you were inspecting a specific response grouping:
Imagine you now close the inspector window, delete one of the models and then increase num generations per prompt to 2. You will now see:
Right where you left off, with the updated responses. It also keeps track if you've selected Table view, and retains the view you last selected.
Specify hostname and port (v0.2.1.3)
I've added --host
and --port
flags when you're running ChainForge locally. You can specify what hostname and port to run it on like so:
chainforge serve --host 0.0.0.0 --port 3400
The front-end app also knows you're running it from Flask (locally) regardless of what the hostname and port is.
v0.2.1: Prompt previews, Toggleable prompt variables, Anthropic Claude2
We've made several quality-of-life improvements from 0.2 to this release.
Prompt previews
You can now inspect what generated prompts will be sent off to LLMs. For a quick glance, simply hover over the 'list' icon on Prompt Nodes:
For full inspection, just click the button to bring up a popup inspector.
Thanks to Issue https://github.com/ianarawjo/ChainForge/issues/90 raised by @profplum700 !
Ability To Enable/Disable Prompt Variables in Text Fields Without Deleting Them
You can now enable/disable prompt variables selectively:
https://github.com/ianarawjo/ChainForge/assets/5251713/92f9c869-8201-43d0-a4a5-8aee7524319e
Thanks to Issue https://github.com/ianarawjo/ChainForge/issues/93 raised by @profplum700 !
Anthropic model Claude-2
We've also added the newest Claude model, Claude-2. All prior models remain supported; however, strangely, Claude-1 and 100k context models have disappeared from the Anthropic API documentation. So, if you are using earlier Claude models, just know that they may stop working at some future point.
Bug fixes
There have also been numerous bug fixes, including: - braces { and } inside Tabular Data tables are now escaped by default when data is pulled from the nodes, so that they are never treated as prompt templates - escaping template braces { and } now removes the escape slash when generating prompts for models - outputs of Prompt Nodes, when chained into other Prompt Nodes, now escape the braces in LLM responses by default. Note that whenever prompts are generated, the escaped braces are cleaned up to just { and }. In response inspectors, input variables will appear with escaped braces, as input variables in ChainForge may themselves be templates.
Future Goals
We've been running pilot studies internally at Harvard HCI and getting some informal feedback.
- One point that keeps coming up echoes Issue https://github.com/ianarawjo/ChainForge/issues/56 , raised by @jjordanbaird : the ability to keep chat context and evaluate multiple chatbot turns. We are thinking to implement this as a Chat Turn Node
, where optionally, one can provide "past conversation" context as input. The overall structure will be similar to Prompt Nodes, except that only Chat Models will be available. See https://github.com/ianarawjo/ChainForge/issues/56 for more details.
- Another issue we're aware of is the need for better documentation on what you can do with ChainForge, particularly on the rather unique feature of chaining prompt templates together.
As always, if you have any feedback or comments, open an Issue or start a Discussion.
v0.2: App logic now runs in browser, HuggingFace Models, JavaScript Evaluators, Comment Nodes
The entire backend has been rewritten in TypeScript 🥷🧑💻️
Thousands of lines of Python code, comprising nearly the entire backend, has been rewritten in TypeScript. The mechanism for generating prompt permutations, querying LLMs and cache'ing responses is performed now in the front-end (entirely in the browser). Tests were added in jest
to ensure the outputs of the TypeScript functions performed the same as their original Python versions. There are additional performance and maintainability benefits to adding static type checking. We've also added ample docstrings, which should help devs looking to get involved.
Functionally, you should not experience any difference (expect maybe a slight speed boost).
Javascript Evaluator Nodes 🧩
Because the application logic has moved to the browser, we added JavaScript evaluator nodes. These let you write evaluation functions in JavaScript, and function the same as Python evaluators.
Here is a side-by-side comparison of JavaScript and Python evaluator nodes, showing semantically equivalent code and the in-node support for displaying console.log and print output:
When you are running ChainForge on localhost
, you can still use Python evaluator nodes, which will execute on your local Flask server (the Python backend) as before. JavaScript evaluators run entirely in the browser (specifically, eval
sandboxed inside an iframe
).
HuggingFace Models 🤗
We added support for querying text generation models hosted on the HuggingFace Inference API. For instance, here is falcon.7b.instruct, an open-source model:
For HF models, there is a 250 token limit. This can sometimes be rather limiting, so we've added a "number of continuations" setting to help with that. You can set it to > 0 to feed the response back into the API for text completions models, which will generate longer completions, for up to 1500 tokens.
We also support HF Inference Endpoints for text generation models. Simply put the API call URL in the custom_model
field of the settings window.
Comment Nodes ✏️
You can write comments about your evaluation using a comment node:
'Browser unsupported' error 💢
If you load ChainForge on a mobile device or unsupported browser, it will now display an error message:
This helps for our public release. If you'd like ChainForge to support more browsers, open an Issue or (better yet) make a Pull Request.
Fun example
Finally, I wanted to share a fun practical example: an evaluation to check if the LLM reveals a secret key. This evaluation, including all API calls and JavaScript evaluation code, was run entirely in the browser:
Questions, concerns?
Open an Issue or start a Discussion!