hbllmutils.testing.base
Binary testing utilities for language model evaluation.
This module provides a small framework for executing binary tests on large language models, where each test yields a pass/fail result. It offers simple data structures for representing the results of individual tests and aggregated statistics for repeated runs. A base class is also provided to simplify the implementation of concrete tests.
The module contains the following main components:
- BinaryTestResult: Stores the outcome of a single binary test
- MultiBinaryTestResult: Aggregates multiple test results and statistics
- BinaryTest: Base class for implementing binary tests
Typical usage involves subclassing BinaryTest and implementing
BinaryTest._single_test() to define the test logic. The BinaryTest.test()
method can then execute the test once or multiple times to produce statistics.
Example:
>>> class AlwaysPassTest(BinaryTest):
...     def _single_test(self, model, **params):
...         return BinaryTestResult(passed=True, content="ok")
...
>>> test = AlwaysPassTest()
>>> result = test.test(model="my-llm", n=3, silent=True)
>>> result.passed_ratio
1.0
Note
Aggregate statistics require a non-empty list of tests. Passing an empty
list to MultiBinaryTestResult raises a ZeroDivisionError, because the pass
and fail ratios divide by the number of tests.
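One way to avoid the error described in the note is to skip aggregation when there are no results. The sketch below is a generic guard, not part of hbllmutils: `aggregate_if_any` and `passed_ratio` are hypothetical names, and `passed_ratio` is only a stand-in that divides by the list length in the same way the documented ratios do.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def aggregate_if_any(
    results: Sequence[T], aggregate: Callable[[Sequence[T]], R]
) -> Optional[R]:
    """Call aggregate(results) only when there is at least one result."""
    if not results:
        return None  # nothing to aggregate, so no division takes place
    return aggregate(results)


def passed_ratio(flags: Sequence[bool]) -> float:
    # Hypothetical stand-in aggregator: divides by len(flags), so it would
    # raise ZeroDivisionError if it were ever given an empty sequence.
    return sum(flags) / len(flags)


empty = aggregate_if_any([], passed_ratio)            # guard returns None
half = aggregate_if_any([True, False], passed_ratio)  # one pass out of two
```

With the real class, the second argument could be something like `lambda r: MultiBinaryTestResult(tests=r)`, so callers receive `None` instead of an exception when no tests were run.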
BinaryTestResult
- class hbllmutils.testing.base.BinaryTestResult(passed: bool, content: str)[source]
Data class representing the result of a single binary test.
- Parameters:
passed (bool) – Whether the test passed or failed.
content (str) – The content or output produced during the test.
Example:
>>> BinaryTestResult(passed=True, content="response text")
BinaryTestResult(passed=True, content='response text')
MultiBinaryTestResult
- class hbllmutils.testing.base.MultiBinaryTestResult(tests: List[BinaryTestResult], total_count: int = 0, passed_count: int = 0, passed_ratio: float = 0, failed_count: int = 0, failed_ratio: float = 0)[source]
Data class representing aggregated results from multiple binary tests.
This class automatically calculates statistics about the test results, including total count, passed/failed counts, and their ratios.
- Parameters:
tests (List[BinaryTestResult]) – List of individual binary test results.
total_count (int) – Total number of tests (automatically calculated).
passed_count (int) – Number of tests that passed (automatically calculated).
passed_ratio (float) – Ratio of tests that passed (automatically calculated).
failed_count (int) – Number of tests that failed (automatically calculated).
failed_ratio (float) – Ratio of tests that failed (automatically calculated).
- Raises:
ZeroDivisionError – If tests is an empty list.
Example:
>>> results = [
...     BinaryTestResult(passed=True, content="test1"),
...     BinaryTestResult(passed=False, content="test2"),
... ]
>>> multi_result = MultiBinaryTestResult(tests=results)
>>> multi_result.passed_ratio
0.5
- __post_init__() → None[source]
Post-initialization method that calculates test statistics.
This method is automatically called after the dataclass is initialized. It computes the total count, passed/failed counts, and their ratios based on the provided test results.
- Raises:
ZeroDivisionError – If tests is an empty list.
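The calculation described for `__post_init__()` can be approximated as follows. This is a minimal sketch based only on the documented fields and behaviour, using stand-in dataclasses defined locally; it is not the library's source, and the actual implementation may differ in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BinaryTestResult:
    # Stand-in mirroring the two documented fields; not imported from hbllmutils.
    passed: bool
    content: str


@dataclass
class MultiBinaryTestResult:
    # Stand-in: the derived fields default to 0 and are overwritten in
    # __post_init__, matching the documented signature.
    tests: List[BinaryTestResult]
    total_count: int = 0
    passed_count: int = 0
    passed_ratio: float = 0
    failed_count: int = 0
    failed_ratio: float = 0

    def __post_init__(self) -> None:
        self.total_count = len(self.tests)
        self.passed_count = sum(1 for t in self.tests if t.passed)
        self.failed_count = self.total_count - self.passed_count
        # Dividing by total_count is what raises ZeroDivisionError
        # when tests is an empty list.
        self.passed_ratio = self.passed_count / self.total_count
        self.failed_ratio = self.failed_count / self.total_count


multi = MultiBinaryTestResult(
    tests=[
        BinaryTestResult(passed=True, content="test1"),
        BinaryTestResult(passed=False, content="test2"),
    ]
)
```

Because the derived fields are recomputed from `tests`, any values passed explicitly for them are overwritten in this sketch.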
BinaryTest
- class hbllmutils.testing.base.BinaryTest[source]
Base class for implementing binary tests on language models.
This class provides a framework for running tests that have a pass/fail outcome. Tests can be run once or multiple times to gather statistics. Subclasses should implement the _single_test() method to define the specific test logic.
- Variables:
__desc_name__ (Optional[str]) – Optional descriptive name for the test, used in progress bars.
Example:
>>> class MyBinaryTest(BinaryTest):
...     def _single_test(self, model, **params):
...         return BinaryTestResult(passed=True, content="ok")
...
>>> test = MyBinaryTest()
>>> result = test.test(model="my-llm", n=1, silent=True)
>>> result.passed
True
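A slightly more realistic `_single_test()` might inspect the model's reply rather than always passing. The sketch below runs without hbllmutils installed, so it uses a local stand-in for `BinaryTestResult` and does not subclass `BinaryTest`. It also assumes, purely for illustration, that the model is a plain callable mapping a prompt string to a reply string; the real `LLMModel` interface is not described in this section. `ContainsKeywordTest` and `fake_model` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class BinaryTestResult:
    # Stand-in with the two documented fields; not imported from hbllmutils.
    passed: bool
    content: str


class ContainsKeywordTest:
    # In real use this would subclass BinaryTest; kept standalone so the
    # sketch runs as-is.
    __desc_name__ = "contains-keyword"

    def __init__(self, prompt: str, keyword: str) -> None:
        self.prompt = prompt
        self.keyword = keyword

    def _single_test(
        self, model: Callable[[str], str], **params: Any
    ) -> BinaryTestResult:
        # ASSUMPTION: the model is a callable from prompt to reply text.
        reply = model(self.prompt)
        # Pass when the keyword appears in the reply, ignoring case.
        passed = self.keyword.lower() in reply.lower()
        return BinaryTestResult(passed=passed, content=reply)


def fake_model(prompt: str) -> str:
    # Hypothetical canned model used only for this demonstration.
    return "The capital of France is Paris."


check = ContainsKeywordTest("What is the capital of France?", "paris")
outcome = check._single_test(fake_model)
```

Storing the raw reply in `content` keeps the evidence for each pass/fail decision available for later inspection.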
- test(model: str | LLMModel, n: int = 1, silent: bool = False, **params: Any) → BinaryTestResult | MultiBinaryTestResult[source]
Run the binary test one or multiple times on the given model.
If n == 1, runs a single test and returns a BinaryTestResult. If n > 1, runs multiple tests and returns a MultiBinaryTestResult with aggregated statistics.
- Parameters:
model (LLMModelTyping) – The language model to test. Can be a model instance or a model identifier.
n (int) – Number of times to run the test, defaults to 1.
silent (bool) – If True, suppresses the progress bar, defaults to False.
params (dict) – Additional parameters to pass to the test.
- Returns:
Single test result if n == 1, otherwise aggregated results.
- Return type:
BinaryTestResult | MultiBinaryTestResult
Example:
>>> test = MyBinaryTest()  # Assuming MyBinaryTest is a subclass
>>> result = test.test(model="my-llm", n=10, silent=True)
>>> print(f"Pass rate: {result.passed_ratio}")
Pass rate: 0.8
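Because `test()` returns a different type depending on `n`, calling code that accepts either case may want to normalize the result. The helper below is one possible approach and is not part of hbllmutils: `pass_rate` is a hypothetical name, and the demonstration uses `SimpleNamespace` objects as stand-ins carrying only the documented attributes.

```python
from types import SimpleNamespace
from typing import Any


def pass_rate(result: Any) -> float:
    """Return a pass rate in [0, 1] for either documented result type."""
    # Aggregated results expose passed_ratio; single results expose passed.
    if hasattr(result, "passed_ratio"):
        return float(result.passed_ratio)
    return 1.0 if result.passed else 0.0


# Stand-ins shaped like the documented result objects (attributes only).
single_pass = SimpleNamespace(passed=True, content="ok")
single_fail = SimpleNamespace(passed=False, content="no")
aggregated = SimpleNamespace(passed_ratio=0.8, failed_ratio=0.2)

rates = [pass_rate(single_pass), pass_rate(single_fail), pass_rate(aggregated)]
```

The helper checks for the `passed_ratio` attribute instead of using `isinstance`, which keeps the sketch runnable without importing the library; with the real classes an `isinstance` check against `MultiBinaryTestResult` would be the more explicit choice.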