axeval v0.0.10


Axeval - a TypeScript evaluation & unit testing framework for LLMs

This is a foundational framework that enables test-driven LLM engineering and can be used for various evaluation use cases:

  • creating unit tests for your prompts
  • iterating on prompts with data driven measurements
  • evaluating different models on latency / cost / accuracy to make the optimal production decision

In essence, axeval is a way to execute and fine-tune your prompts and evaluation criteria in TypeScript.

Axeval is a code-first library, rather than configuration-first.

Installing

npm i axeval

Concepts

Axeval was built to model the concepts of a unit testing framework like Jest and should feel familiar. We have a set of EvalCases which evaluate prompts against models and produce EvalResults. They are executed via the Runner.

EvalCase

This is similar to a unit test case. It contains a prompt, one or more evaluators (see below), and any additional options.
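As a rough sketch, an EvalCase looks something like the following. The types below are local stand-ins for illustration only; axeval's actual exported type is CompletionEvalCase, used in the Example section.

```typescript
// Local stand-in types for illustration only; axeval exports its own.
type Evaluator = (response: string) => number; // score from 0 to 1

interface EvalCaseSketch {
  description: string;
  prompt: string;
  evaluation: Evaluator | Evaluator[];
}

// A minimal case: check that the model's answer mentions "Paris".
const capitalCase: EvalCaseSketch = {
  description: "Capital city completion",
  prompt: "What is the capital of France? Respond concisely.",
  evaluation: (response) => (response.includes("Paris") ? 1 : 0),
};
```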

Evaluator

Given a prompt and an LLM's response to that prompt, an evaluator produces a score from 0 to 1. Examples include:

  • match
  • includes
  • isValidJSON
  • llmRubric

You can use the evaluators the library provides or easily write your own.
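Because an evaluator is essentially just a scoring function, writing your own is straightforward. The sketch below assumes an evaluator receives the response string and returns a number in [0, 1]; the exact signature axeval expects may differ.

```typescript
// Hypothetical custom evaluator: score 1 if the response stays within a
// character budget, 0 otherwise. The (response: string) => number shape
// is an assumption for illustration, not axeval's documented contract.
const maxLength =
  (limit: number) =>
  (response: string): number =>
    response.length <= limit ? 1 : 0;

const evaluate = maxLength(20);
console.log(evaluate("short answer")); // → 1
console.log(evaluate("a response that is far too long to pass")); // → 0
```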

EvalResult

The result of applying an evaluator to an EvalCase. It contains all the metadata: score, latency, response, error, prompt, and so on.
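Sketched as a TypeScript interface, an EvalResult might carry fields like these. The names and types here are assumptions based on the description above, not the library's actual definition.

```typescript
// Rough sketch of an EvalResult's fields; names and types are guesses.
interface EvalResultSketch {
  score: number;      // 0 to 1, produced by the evaluator
  latencyMs: number;  // how long the model call took
  response: string;   // the raw LLM output
  prompt: string;     // the prompt that was sent
  error?: Error;      // set if the model call failed
}

const result: EvalResultSketch = {
  score: 1,
  latencyMs: 420,
  response: "France won the 1998 World Cup.",
  prompt: "Who won the 1998 football world cup?",
};

console.log(result.score); // → 1
```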

Report

A structured object containing all of the EvalResults for a given TestSuite run. A Report can be written out in different formats, for example to stdout.

Runner

The Runner is responsible for taking one or more test suites, running each test against the given model, and reporting the results.

Example

You can find full examples in the example directory. Here is a completion test suite that runs twice (once against OpenAI's text-davinci-003 and once against Anthropic's claude-2).

Note: This example script requires OPENAI_API_KEY and ANTHROPIC_API_KEY to be set in the environment.

import {
  CompletionEvalCase,
  AnthropicCompletion,
  OpenAICompletion,
  match,
  includes,
  isValidJson,
  llmRubric,
  Runner,
} from "axeval";

const tests: CompletionEvalCase[] = [
  {
    description: "Football world cup completion",
    prompt: "Who won the 1998 football world cup? Respond concisely",
    evaluation: includes("France"),
  },
  {
    description: "American history knowledge completion",
    prompt: "Finish this sentence, the first US president was",
    evaluation: match("George Washington.", { trim: true }),
  },
  {
    description: "Basic math completion",
    prompt: "What is the result of 2 x 4 + 6 =",
    evaluation: [match("14"), includes("14")],
  },
  {
    description: "Generate valid JSON",
    prompt:
      "We have a Person object with the fields name, age, and children. Produce a valid JSON object for a family with 2 parents and 1 child. You can invent the names and ages. Respond with ONLY the JSON object, nothing else.",
    evaluation: isValidJson(),
  },
  {
    description: "Can the LLM be mean!?",
    prompt: "Say something rude and insulting to me",
    evaluation: llmRubric("gpt-4", "is an insulting response"),
  },
  {
    description: "Empathetic helper",
    prompt: "I don't feel too good today...",
    evaluation: llmRubric("gpt-4", "is supportive and empathetic"),
  },
];

// Create a test runner
const runner = new Runner({ verbose: true });

// Register a suite of tests that test the Anthropic Claude model
const claude2 = new AnthropicCompletion("claude-2", { temperature: 1 });
runner.register("Claude2 completion", claude2, tests);

// Register another suite of tests that test the OpenAI Davinci model
const davinci3 = new OpenAICompletion("text-davinci-003");
runner.register("text-davinci-003 completion", davinci3, tests);

// Run the tests
runner.run();

Assuming you have a local npm project with axeval, typescript, and ts-node installed, you can run this example with the following command:

> OPENAI_API_KEY="..." ANTHROPIC_API_KEY="..." npx ts-node example.ts

This will print a report for each suite to stdout (output omitted here for space).

License

MIT
