hbllmutils.response.datamodel

Data model-based LLM task utilities.

This module provides functionality for creating and managing LLM tasks that parse and validate responses against structured data models. It supports both Pydantic models and dataclasses, with automatic prompt generation and validation capabilities.

The module serves as a bridge between LLM outputs and structured data validation, enabling type-safe parsing of LLM responses into well-defined data structures. It handles the complete workflow from prompt generation to response validation, with built-in retry mechanisms for handling parsing failures.

Key Features:

Structured data validation using Pydantic or dataclasses
Automatic format prompt generation from data model schemas
Sample-based learning support for few-shot prompting
Retry mechanism for failed validations with configurable attempts
JSON parsing and validation with error recovery
Support for related data models to provide context
Customizable parsing and serialization functions

Main Components:

DataModelLLMTask - Core task class for model-based validation
create_datamodel_task() - Factory function for creating configured tasks

Architecture:

The module follows a layered architecture:

Task Layer: DataModelLLMTask handles the high-level workflow
Prompt Generation: Automatic creation of format instructions
Parsing Layer: Extraction and validation of structured data
Retry Logic: Automatic retry on validation failures

Performance Considerations:

Format prompts are cached using LRU cache to avoid regeneration
Sample serialization is performed once during task creation
Validation functions can be customized for optimal performance
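
As a rough illustration of the caching point above, the sketch below memoizes a prompt builder with functools.lru_cache. The function build_format_prompt, the Point class, and the prompt text are hypothetical; the module's real prompt generation is not shown in this documentation and may work differently.

```python
from dataclasses import dataclass, fields
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the prompt is actually rebuilt


@dataclass
class Point:
    x: int
    y: int


@lru_cache(maxsize=128)
def build_format_prompt(datamodel_class: type) -> str:
    """Hypothetical prompt builder; classes are hashable, so they work as cache keys."""
    CALLS["count"] += 1
    lines = [
        f"- {f.name}: {getattr(f.type, '__name__', f.type)}"
        for f in fields(datamodel_class)
    ]
    header = f"Return a JSON object for {datamodel_class.__name__} with fields:"
    return header + "\n" + "\n".join(lines)


first = build_format_prompt(Point)
second = build_format_prompt(Point)  # served from the cache, no rebuild
```

The second call returns the cached string, so the builder body runs only once per data model class.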

Note

This module requires either Pydantic BaseModel or Python dataclasses for data model definitions. Custom types require explicit parsing functions.

Warning

Large numbers of samples or complex data models may result in very long prompts, which could impact token usage and response time.

Example:

>>> from pydantic import BaseModel
>>> from hbllmutils.model import load_llm_model
>>> from hbllmutils.response import create_datamodel_task
>>>
>>> class Person(BaseModel):
...     gender: str  # male or female
...     age: int
...     hair_color: str  # use hex color
...     skin_color: str  # use readable color
...     appearance_desc: str  # a one-line description of this person's appearance
>>>
>>> model = load_llm_model('gpt-4o')
>>> print(f"Loaded Model: {model}")
>>>
>>> task = create_datamodel_task(
...     model=model,
...     datamodel_class=Person,
...     task_requirements="""
... You are a bot to tell me the information of a celebrity.
...
... I will give you his/her name, and you should tell me about his/her appearance information.
...
...     """,
...     samples=[
...         # European female
...         ("Taylor Swift", Person(
...             gender="female",
...             age=34,
...             hair_color="#F5DEB3",  # blonde
...             skin_color="fair",
...             appearance_desc="Tall blonde singer with blue eyes, known for her elegant and graceful appearance"
...         )),
...
...         # African male
...         ("Will Smith", Person(
...             gender="male",
...             age=55,
...             hair_color="#2F1B14",  # dark brown
...             skin_color="dark brown",
...             appearance_desc="Charismatic actor with a bright smile, athletic build and confident demeanor"
...         )),
...     ]
... )
>>> print(task.ask_then_parse('Jackie Chan'))
gender='male' age=69 hair_color='#1C1C1C' skin_color='light brown' appearance_desc='Martial arts action star with a lively personality, known for his agile physique and distinctive smile'
>>> print(task.ask_then_parse('Donald Trump'))
gender='male' age=77 hair_color='#FFD700' skin_color='light' appearance_desc='Notable public figure known for his distinct hairstyle and fair complexion, often seen in formal suits'
>>> print(task.ask_then_parse('Tohsaka Rin'))
gender='female' age=17 hair_color='#2F1B14' skin_color='fair' appearance_desc='A young woman with twin-tailed brown hair and aqua eyes, usually seen wearing a red sweater and black skirt, exuding both elegance and a strong-willed demeanor'

DataModelLLMTask

class hbllmutils.response.datamodel.DataModelLLMTask(model: str | LLMModel, history: LLMHistory, fn_parse_and_validate: Callable[[Any], Any], default_max_retries: int = 5)

A specialized LLM task that parses and validates responses against a data model.

This class extends ParsableLLMTask to provide structured data validation using a custom parsing and validation function. It handles the complete workflow of sending prompts to an LLM, receiving responses, and validating them against a predefined data model structure.

The class is designed to work with any data model that can be validated through a callable function, making it flexible enough to support Pydantic models, dataclasses, or custom validation logic.

The workflow consists of:

Sending a request to the LLM with conversation history
Receiving the raw text response
Extracting code blocks from the response
Parsing the extracted code as JSON
Validating the parsed data against the data model
Retrying on validation failure (up to max_retries times)
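
The steps above can be sketched with the standard library alone. This is not the class's actual implementation: the helper names extract_code and ask_then_parse_sketch, the fence-matching regex, the caught exception types, and the choice to treat max_retries as the total number of attempts are all assumptions made for illustration.

```python
import json
import re
from typing import Any, Callable, Optional

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"(?:\w+)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the whole text if none is found."""
    match = CODE_BLOCK.search(text)
    return match.group(1) if match else text


def ask_then_parse_sketch(ask: Callable[[], str],
                          validate: Callable[[Any], Any],
                          max_retries: int = 5) -> Any:
    """Ask, extract, parse as JSON, validate; repeat on failure."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        raw = ask()  # each attempt is a fresh request
        try:
            return validate(json.loads(extract_code(raw)))
        except (ValueError, TypeError, KeyError) as err:  # JSONDecodeError is a ValueError
            last_error = err
    raise RuntimeError(f"no valid response after {max_retries} attempts") from last_error


def validate_person(data: Any) -> dict:
    """Stand-in validator: accepts a dict with an integer age, raises otherwise."""
    if not isinstance(data, dict) or not isinstance(data.get("age"), int):
        raise ValueError("age must be an integer")
    return data


# Simulated replies: the first is not JSON, the second contains a fenced JSON object.
replies = iter([
    "Sorry, I cannot do that.",
    FENCE + 'json\n{"name": "John", "age": 30}\n' + FENCE,
])
result = ask_then_parse_sketch(lambda: next(replies), validate_person)
```

Here the first simulated reply fails JSON parsing and triggers a second attempt, which succeeds.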

Parameters:

model (LLMModelTyping) – The LLM model to use for generating responses.
history (LLMHistory) – The conversation history to maintain context.
fn_parse_and_validate (Callable[[Any], Any]) – Function to parse and validate the response data. Should accept the parsed JSON data and return a validated instance of the data model. Must raise an exception on validation failure.
default_max_retries (int) – Maximum number of retries for failed attempts, defaults to 5.

Variables:

_fn_parse_and_validate (Callable[[Any], Any]) – The validation function used for parsing responses.

Note

The validation function should raise an exception on invalid data to trigger the retry mechanism. The exception type should match the __exceptions__ class variable defined in ParsableLLMTask.
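
A minimal sketch of a validation function that follows this raise-on-failure contract for a dataclass is shown below. The function name parse_person is hypothetical, and the use of ValueError is an assumption: this page does not list which exception types __exceptions__ contains.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Person:
    name: str
    age: int


def parse_person(data: Any) -> Person:
    """Build a Person from parsed JSON, raising ValueError on any mismatch."""
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    missing = {"name", "age"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["name"], str):
        raise ValueError("name must be a string")
    # bool is a subclass of int, so reject it explicitly
    if not isinstance(data["age"], int) or isinstance(data["age"], bool):
        raise ValueError("age must be an integer")
    return Person(name=data["name"], age=data["age"])
```

A function of this shape could be passed as fn_parse_and_validate, provided the exception it raises is one the task treats as retryable.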

Warning

Each retry sends a new request to the LLM, which may incur additional API costs. Set appropriate max_retries values based on your use case and budget.

Example:

>>> from pydantic import BaseModel
>>> from hbllmutils.model import load_llm_model
>>> from hbllmutils.history import LLMHistory
>>> from hbllmutils.response.datamodel import DataModelLLMTask
>>>
>>> class MyModel(BaseModel):
...     name: str
...     age: int
>>>
>>> model = load_llm_model('gpt-4')
>>> history = LLMHistory().with_system_prompt("Extract person info")
>>> task = DataModelLLMTask(
...     model=model,
...     history=history,
...     fn_parse_and_validate=MyModel.model_validate
... )
>>> result = task.ask_then_parse("Extract info: John is 30 years old")
>>> isinstance(result, MyModel)
True
>>> result.name
'John'
>>> result.age
30

__init__(model: str | LLMModel, history: LLMHistory, fn_parse_and_validate: Callable[[Any], Any], default_max_retries: int = 5)

Initialize a DataModelLLMTask instance.

Sets up the task with a model, conversation history, and validation function. The validation function will be called on each response to ensure it conforms to the expected data model structure.

Parameters:

model (LLMModelTyping) – The LLM model to use for generating responses. Can be a model name string, an LLMModel instance, or None for the default model.
history (LLMHistory) – The conversation history to maintain context. Should include system prompts and any previous conversation turns.
fn_parse_and_validate (Callable[[Any], Any]) – Function to parse and validate the response data. Should accept the parsed JSON data and return a validated instance of the data model. Must raise an exception (matching __exceptions__) on failure.
default_max_retries (int) – Maximum number of retries for failed attempts, defaults to 5. Must be a positive integer.

Raises:

ValueError – If default_max_retries is not a positive integer.

Example:

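>>> from pydantic import BaseModel
>>> from hbllmutils.history import LLMHistory
>>> from hbllmutils.response.datamodel import DataModelLLMTask
>>>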
>>> class Person(BaseModel):
...     name: str
...     age: int
>>>
>>> task = DataModelLLMTask(
...     model='gpt-4',
...     history=LLMHistory(),
...     fn_parse_and_validate=Person.model_validate,
...     default_max_retries=3
... )

create_datamodel_task

hbllmutils.response.datamodel.create_datamodel_task(model: str | LLMModel, datamodel_class: type, task_requirements: str, samples: List[Tuple[str, Any]] | None = None, related_datamodel_classes: List[type] | None = None, prompt_generation_model: str | LLMModel | None = None, fn_parse_and_validate: Callable[[Any], Any] | None = None, fn_dump_json: Callable[[Any], Any] | None = None) → DataModelLLMTask

Create a DataModelLLMTask with configured prompts and validation.

This factory function sets up a complete LLM task that:

Generates format prompts based on the data model structure
Configures task requirements describing the expected behavior
Sets up parsing and validation logic for response processing
Optionally includes sample inputs and outputs for few-shot learning
Handles related data models to provide additional context

The function automatically handles Pydantic BaseModel and dataclass types, providing default parsing and serialization functions. For custom types, you can provide your own parsing and serialization functions.

The generated task uses a structured prompt that includes:

Requirements Section: Description of what the task should accomplish
Samples Section (optional): Input-output examples for few-shot learning
Output Guide Section: Format instructions generated from the data model
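
To make the three-section layout concrete, here is a small sketch that assembles such a prompt. The helper build_system_prompt, the heading text, and the sample formatting are illustrative assumptions; the wording the library actually emits is not documented on this page.

```python
import json
import textwrap
from typing import List, Optional, Tuple


def build_system_prompt(requirements: str, format_guide: str,
                        samples: Optional[List[Tuple[str, dict]]] = None) -> str:
    """Join the three sections; the heading text is illustrative only."""
    parts = ["# Requirements", textwrap.dedent(requirements).strip()]
    if samples:  # the samples section is optional
        parts.append("# Samples")
        for index, (sample_input, sample_output) in enumerate(samples, start=1):
            parts.append(f"Sample {index} input:\n{sample_input}")
            parts.append(f"Sample {index} output:\n{json.dumps(sample_output, indent=2)}")
    parts += ["# Output Guide", format_guide.strip()]
    return "\n\n".join(parts)


prompt = build_system_prompt(
    requirements="""
        Extract person information from the given text.
    """,
    format_guide='Reply with a JSON object: {"name": str, "age": int}',
    samples=[("John Doe, 30", {"name": "John Doe", "age": 30})],
)
```

Printing prompt shows the requirements first, then the numbered samples, then the output guide.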

The complete system prompt is printed to stdout for debugging and verification purposes before the task is created.

Parameters:

model (LLMModelTyping) – The LLM model to use for the main task. Can be a model name string, an LLMModel instance, or None for the default model.
datamodel_class (type) – The data model class that defines the expected output structure. Must be a Pydantic BaseModel subclass or dataclass, unless custom parsing/serialization functions are provided.
task_requirements (str) – Description of what the task should accomplish. This text is included in the system prompt to guide the LLM’s behavior. It may span multiple lines and is dedented automatically.
samples (Optional[List[Tuple[str, Any]]]) – Optional list of (input, output) tuples to provide as examples for few-shot learning. Each tuple contains a sample input string and the corresponding data model instance. Defaults to None.
related_datamodel_classes (Optional[List[type]]) – Optional list of related data model classes for context. These models are included in the format prompt to provide additional structural information. Defaults to None.
prompt_generation_model (Optional[LLMModelTyping]) – Optional separate model for prompt generation. If None, the main model is used. This is useful when a more capable model should generate the format prompt. Defaults to None.
fn_parse_and_validate (Optional[Callable[[Any], Any]]) – Optional custom parsing and validation function. Should accept parsed JSON data and return a validated instance. If None, uses the default for Pydantic BaseModel (model_validate). Defaults to None.
fn_dump_json (Optional[Callable[[Any], Any]]) – Optional custom function to convert data model instances to JSON-serializable dicts. Used for serializing samples. If None, uses the default for Pydantic BaseModel (model_dump) or dataclass (dataclasses.asdict). Defaults to None.
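
For a type that is neither a Pydantic model nor a dataclass, both custom functions would need to be supplied. The sketch below shows one possible pair for a hypothetical Color class; the class, the function names, and the choice of ValueError are assumptions for illustration and are not part of the library.

```python
import json
from typing import Any


class Color:
    """A plain class: neither a Pydantic model nor a dataclass."""

    def __init__(self, red: int, green: int, blue: int) -> None:
        self.red, self.green, self.blue = red, green, blue


def parse_color(data: Any) -> Color:
    """Candidate fn_parse_and_validate: parsed JSON in, Color out, ValueError on bad data."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    try:
        channels = [data[key] for key in ("red", "green", "blue")]
    except KeyError as err:
        raise ValueError(f"missing channel: {err}") from err
    if not all(isinstance(c, int) and not isinstance(c, bool) and 0 <= c <= 255
               for c in channels):
        raise ValueError("channels must be integers in 0..255")
    return Color(*channels)


def dump_color(color: Color) -> dict:
    """Candidate fn_dump_json: Color in, JSON-serializable dict out."""
    return {"red": color.red, "green": color.green, "blue": color.blue}


# Round trip: instance -> dict -> JSON text -> parsed JSON -> instance
round_trip = parse_color(json.loads(json.dumps(dump_color(Color(255, 128, 0)))))
```

These two functions would then be passed as fn_parse_and_validate=parse_color and fn_dump_json=dump_color, with sample outputs given as Color instances.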

Returns:

A configured DataModelLLMTask instance ready for use.

Return type:

DataModelLLMTask

Raises:

ValueError – If datamodel_class is not a Pydantic BaseModel subclass and fn_parse_and_validate is not provided.
ValueError – If samples are provided but datamodel_class is not a Pydantic BaseModel or dataclass and fn_dump_json is not provided.

Note

The function prints the generated system prompt to stdout for debugging purposes. This can be useful for understanding what instructions are being sent to the LLM and for verifying the prompt structure.

Warning

Large numbers of samples or complex data models may result in very long prompts, which could impact token usage, response time, and API costs. Monitor your prompt lengths and adjust accordingly.

Example:

>>> from pydantic import BaseModel
>>> from hbllmutils.model import load_llm_model
>>> from hbllmutils.response import create_datamodel_task
>>>
>>> class Person(BaseModel):
...     name: str
...     age: int
...     occupation: str
>>>
>>> model = load_llm_model('gpt-4')
>>> task = create_datamodel_task(
...     model=model,
...     datamodel_class=Person,
...     task_requirements="""
...         Extract person information from the given text.
...         Parse the name, age, and occupation if available.
...     """,
...     samples=[
...         ("John Doe, 30, software engineer",
...          Person(name="John Doe", age=30, occupation="software engineer")),
...         ("Alice Smith is 25 and works as a teacher",
...          Person(name="Alice Smith", age=25, occupation="teacher")),
...     ]
... )
>>> result = task.ask_then_parse("Bob Johnson, age 35, doctor")
>>> isinstance(result, Person)
True
>>> result.name
'Bob Johnson'
>>> result.age
35
>>> result.occupation
'doctor'