hbllmutils.testing.base

Binary testing utilities for language model evaluation.

This module provides a small framework for executing binary tests on large language models, where each test yields a pass/fail result. It offers simple data structures for representing the results of individual tests and aggregated statistics for repeated runs. A base class is also provided to simplify the implementation of concrete tests.

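The module contains the following main components:
  • BinaryTestResult – The result of a single binary test.

  • MultiBinaryTestResult – Aggregated statistics over multiple binary test results.

  • BinaryTest – Base class for implementing binary tests.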

Typical usage involves subclassing BinaryTest and implementing BinaryTest._single_test() to define the test logic. The BinaryTest.test() method can then execute the test once or multiple times to produce statistics.

Example:

>>> class AlwaysPassTest(BinaryTest):
...     def _single_test(self, model, **params):
...         return BinaryTestResult(passed=True, content="ok")
...
>>> test = AlwaysPassTest()
>>> result = test.test(model="my-llm", n=3, silent=True)
>>> result.passed_ratio
1.0

Note

Aggregate statistics require a non-empty list of tests. Passing an empty list to MultiBinaryTestResult raises ZeroDivisionError.
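One way to avoid this error is to check for an empty list before aggregating. The following is a minimal sketch, not part of this module: `aggregate_or_none` is a hypothetical helper, and its `aggregate` argument stands in for whatever builds the aggregated result.

```python
from typing import Callable, List, Optional, TypeVar

R = TypeVar("R")  # type of a single result
A = TypeVar("A")  # type of the aggregated result


def aggregate_or_none(
    results: List[R],
    aggregate: Callable[[List[R]], A],
) -> Optional[A]:
    """Hypothetical helper: aggregate only when there is something to aggregate."""
    if not results:
        # Nothing was run, so there are no ratios to compute.
        return None
    return aggregate(results)
```

With this module, the call would presumably look like `aggregate_or_none(results, lambda r: MultiBinaryTestResult(tests=r))`, with the caller handling the `None` case explicitly.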

BinaryTestResult

class hbllmutils.testing.base.BinaryTestResult(passed: bool, content: str)

Data class representing the result of a single binary test.

Parameters:
  • passed (bool) – Whether the test passed or failed.

  • content (str) – The content or output produced during the test.

Example:

>>> BinaryTestResult(passed=True, content="response text")
BinaryTestResult(passed=True, content='response text')

MultiBinaryTestResult

class hbllmutils.testing.base.MultiBinaryTestResult(tests: List[BinaryTestResult], total_count: int = 0, passed_count: int = 0, passed_ratio: float = 0, failed_count: int = 0, failed_ratio: float = 0)

Data class representing aggregated results from multiple binary tests.

This class automatically calculates statistics about the test results, including total count, passed/failed counts, and their ratios.

Parameters:
  • tests (List[BinaryTestResult]) – List of individual binary test results.

  • total_count (int) – Total number of tests (automatically calculated).

  • passed_count (int) – Number of tests that passed (automatically calculated).

  • passed_ratio (float) – Ratio of tests that passed (automatically calculated).

  • failed_count (int) – Number of tests that failed (automatically calculated).

  • failed_ratio (float) – Ratio of tests that failed (automatically calculated).

Raises:

ZeroDivisionError – If tests is an empty list.

Example:

>>> results = [
...     BinaryTestResult(passed=True, content="test1"),
...     BinaryTestResult(passed=False, content="test2"),
... ]
>>> multi_result = MultiBinaryTestResult(tests=results)
>>> multi_result.passed_ratio
0.5

__post_init__() → None

Post-initialization method that calculates test statistics.

This method is automatically called after the dataclass is initialized. It computes the total count, passed/failed counts, and their ratios based on the provided test results.

Raises:

ZeroDivisionError – If tests is an empty list.
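The computation described above can be sketched with a small stand-in dataclass. This is an illustration under stated assumptions, not the library's source: `ResultSketch` and `MultiResultSketch` are hypothetical names that mirror the documented fields, and the sketch assumes the derived fields are simply overwritten from `tests`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResultSketch:
    """Stand-in for BinaryTestResult (hypothetical, not the library class)."""
    passed: bool
    content: str


@dataclass
class MultiResultSketch:
    """Stand-in for MultiBinaryTestResult (hypothetical, not the library class)."""
    tests: List[ResultSketch]
    total_count: int = 0
    passed_count: int = 0
    passed_ratio: float = 0
    failed_count: int = 0
    failed_ratio: float = 0

    def __post_init__(self) -> None:
        # Assumption of this sketch: derived fields are recomputed from
        # `tests`, so any values supplied by the caller are overwritten.
        self.total_count = len(self.tests)
        self.passed_count = sum(1 for t in self.tests if t.passed)
        self.failed_count = self.total_count - self.passed_count
        # Dividing by total_count raises ZeroDivisionError for an empty list.
        self.passed_ratio = self.passed_count / self.total_count
        self.failed_ratio = self.failed_count / self.total_count


summary = MultiResultSketch(
    tests=[ResultSketch(True, "test1"), ResultSketch(False, "test2")]
)
print(summary.total_count, summary.passed_count, summary.passed_ratio)
```

Because the ratios divide by `total_count`, an empty `tests` list fails in `__post_init__`, which matches the documented ZeroDivisionError.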

BinaryTest

class hbllmutils.testing.base.BinaryTest

Base class for implementing binary tests on language models.

This class provides a framework for running tests that have a pass/fail outcome. Tests can be run once or multiple times to gather statistics. Subclasses should implement the _single_test() method to define the specific test logic.

Variables:

__desc_name__ (Optional[str]) – Optional descriptive name for the test, used in progress bars.

Example:

>>> class MyBinaryTest(BinaryTest):
...     def _single_test(self, model, **params):
...         return BinaryTestResult(passed=True, content="ok")
...
>>> test = MyBinaryTest()
>>> result = test.test(model="my-llm", n=1, silent=True)
>>> result.passed
True

test(model: str | LLMModel, n: int = 1, silent: bool = False, **params: Any) → BinaryTestResult | MultiBinaryTestResult

Run the binary test one or multiple times on the given model.

If n == 1, runs a single test and returns a BinaryTestResult. If n > 1, runs multiple tests and returns a MultiBinaryTestResult with aggregated statistics.

Parameters:
  • model (LLMModelTyping) – The language model to test. Can be a model instance or a model identifier.

  • n (int) – Number of times to run the test, defaults to 1.

  • silent (bool) – If True, suppresses the progress bar, defaults to False.

  • params (dict) – Additional parameters to pass to the test.

Returns:

Single test result if n == 1, otherwise aggregated results.

Return type:

Union[BinaryTestResult, MultiBinaryTestResult]

Example:

>>> test = MyBinaryTest()  # Assuming a BinaryTest subclass that passed 8 of these 10 runs
>>> result = test.test(model="my-llm", n=10, silent=True)
>>> print(f"Pass rate: {result.passed_ratio}")
Pass rate: 0.8
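Because the return type depends on n, code that calls test() with a variable n may want to normalize the result. The following is a minimal sketch: `pass_ratio` is a hypothetical helper, and `SingleSketch` and `MultiSketch` are stand-ins that carry only the documented `passed` and `passed_ratio` attributes.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class SingleSketch:
    """Stand-in for BinaryTestResult (hypothetical)."""
    passed: bool
    content: str


@dataclass
class MultiSketch:
    """Stand-in for MultiBinaryTestResult (hypothetical, ratio only)."""
    passed_ratio: float


def pass_ratio(result: Union[SingleSketch, MultiSketch]) -> float:
    """Return a pass ratio in [0, 1] for either kind of result."""
    # Aggregated results expose passed_ratio directly.
    if hasattr(result, "passed_ratio"):
        return float(result.passed_ratio)
    # A single result only exposes passed, so map it to 1.0 or 0.0.
    return 1.0 if result.passed else 0.0


print(pass_ratio(SingleSketch(passed=True, content="ok")))
print(pass_ratio(MultiSketch(passed_ratio=0.8)))
```

Checking for the attribute instead of the concrete class keeps the helper usable with the real result classes, as long as they expose the attributes documented above.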