hbllmutils.response.datamodel

Data model-based LLM task module.

This module provides functionality for creating and managing LLM tasks that parse and validate responses against structured data models. It supports both Pydantic models and dataclasses, with automatic prompt generation and validation capabilities.

The module serves as a bridge between LLM outputs and structured data validation, enabling type-safe parsing of LLM responses into well-defined data structures. It handles the complete workflow from prompt generation to response validation, with built-in retry mechanisms for handling parsing failures.

Key Features:
  • Structured data validation using Pydantic or dataclasses

  • Automatic format prompt generation from data model schemas

  • Sample-based learning support for few-shot prompting

  • Retry mechanism for failed validations with configurable attempts

  • JSON parsing and validation with error recovery

  • Support for related data models to provide context

  • Customizable parsing and serialization functions

Main Components:
  • DataModelLLMTask - Core task class for model-based validation

  • create_datamodel_task() - Factory function for creating configured tasks

  • _get_format_prompt() - Generate format prompts from data models

  • _ask_for_format_prompt() - Cached prompt generation

Architecture:

The module follows a layered architecture:

  1. Task Layer: DataModelLLMTask handles the high-level workflow

  2. Prompt Generation: Automatic creation of format instructions

  3. Parsing Layer: Extraction and validation of structured data

  4. Retry Logic: Automatic retry on validation failures
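
The four layers above can be pictured with a small stand-alone sketch. Nothing here is hbllmutils code: `build_prompt`, `parse_point`, `make_fake_llm`, and `run_task` are hypothetical names, the prompt wording is invented, and the LLM is a stub that returns one malformed reply before a valid one.

```python
import json
from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int


def build_prompt(requirements: str) -> str:
    # Prompt generation layer: task text plus format instructions.
    return f"{requirements}\nReply with a JSON object with integer fields x and y."


def parse_point(text: str) -> Point:
    # Parsing layer: decode JSON, then validate structure and field types.
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    point = Point(**data)  # TypeError on missing or unexpected keys
    if not isinstance(point.x, int) or not isinstance(point.y, int):
        raise ValueError("x and y must be integers")
    return point


def make_fake_llm():
    # Stub LLM (assumption): the first reply is malformed, the second is valid.
    replies = iter(["not json at all", '{"x": 1, "y": 2}'])
    return lambda prompt: next(replies)


def run_task(llm, requirements: str, max_retries: int = 5) -> Point:
    # Task layer plus retry logic: ask, parse, and ask again on failure.
    prompt = build_prompt(requirements)
    last_error = None
    for _ in range(max_retries):
        try:
            return parse_point(llm(prompt))
        except (ValueError, TypeError) as err:
            last_error = err
    raise RuntimeError(f"no valid response after {max_retries} attempts") from last_error


result = run_task(make_fake_llm(), "Give me a point.")
print(result)  # prints Point(x=1, y=2)
```

The first stub reply fails in the parsing layer, so the loop asks again and the second reply is returned as a validated `Point`.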

Performance Considerations:
  • Format prompts are cached using LRU cache to avoid regeneration

  • Sample serialization is performed once during task creation

  • Validation functions can be customized for optimal performance
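
The caching point above can be illustrated with `functools.lru_cache` from the standard library. This is only a sketch: `generate_format_prompt` is a hypothetical stand-in for an expensive prompt-generation call, and keying the cache on the class object is an assumption rather than a description of how `_ask_for_format_prompt()` is implemented.

```python
from dataclasses import dataclass, fields
from functools import lru_cache

call_count = 0  # counts how often the expensive body actually runs


@dataclass
class Person:
    name: str
    age: int


@lru_cache(maxsize=128)
def generate_format_prompt(datamodel_class: type) -> str:
    # Hypothetical expensive step; in practice this might involve an LLM call.
    global call_count
    call_count += 1
    lines = [
        f"- {f.name}: {getattr(f.type, '__name__', str(f.type))}"
        for f in fields(datamodel_class)
    ]
    return "Return a JSON object with these fields:\n" + "\n".join(lines)


first = generate_format_prompt(Person)
second = generate_format_prompt(Person)  # served from the cache
print(call_count)  # prints 1
```

Because classes are hashable, the second call is answered from the cache and the function body runs only once.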

Note

This module requires either Pydantic BaseModel or Python dataclasses for data model definitions. Custom types require explicit parsing functions.
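
As a rough illustration of what explicit parsing functions for a custom type might look like, the sketch below defines a plain class together with a parse function and a dump function. `Temperature`, `parse_temperature`, and `dump_temperature` are hypothetical names; such functions would be passed through the `fn_parse_and_validate` and `fn_dump_json` parameters documented for `create_datamodel_task()`.

```python
from typing import Any, Dict


class Temperature:
    """A plain class that is neither a Pydantic model nor a dataclass."""

    def __init__(self, celsius: float):
        self.celsius = celsius


def parse_temperature(data: Any) -> Temperature:
    # Takes already-decoded JSON and returns a validated instance.
    if not isinstance(data, dict) or "celsius" not in data:
        raise ValueError("expected an object with a 'celsius' field")
    value = data["celsius"]
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("'celsius' must be a number")
    if value < -273.15:
        raise ValueError("'celsius' is below absolute zero")
    return Temperature(float(value))


def dump_temperature(obj: Temperature) -> Dict[str, Any]:
    # Converts an instance back into a JSON-serializable dict.
    return {"celsius": obj.celsius}


reading = parse_temperature({"celsius": 21})
print(dump_temperature(reading))  # prints {'celsius': 21.0}
```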

Warning

Large numbers of samples or complex data models may result in very long prompts, which could impact token usage and response time.
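
One way to keep prompt size in check is to cap the samples before passing them to the task. The sketch below uses a character budget as a crude proxy for tokens; `sample_chars` and `trim_samples` are hypothetical helpers, not part of this module, and the sample outputs are assumed to be JSON-serializable dicts.

```python
import json
from typing import Any, List, Tuple

Sample = Tuple[str, Any]


def sample_chars(sample: Sample) -> int:
    # Approximate cost of one sample: input text plus serialized output.
    text, output = sample
    return len(text) + len(json.dumps(output, ensure_ascii=False))


def trim_samples(samples: List[Sample], max_chars: int) -> List[Sample]:
    # Keep samples in order until the character budget would be exceeded.
    kept, used = [], 0
    for sample in samples:
        cost = sample_chars(sample)
        if used + cost > max_chars:
            break
        kept.append(sample)
        used += cost
    return kept


samples = [("a", {"k": 1}), ("bb", {"k": 22}), ("ccc", {"k": 333})]
print(len(trim_samples(samples, max_chars=20)))  # prints 2
```

A real token count depends on the model's tokenizer, so the character budget should be treated as an upper-bound heuristic only.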

Example:

>>> from pydantic import BaseModel
>>> from hbllmutils.model import load_llm_model
>>> from hbllmutils.response import create_datamodel_task
>>> 
>>> class Person(BaseModel):
...     gender: str  # male or female
...     age: int
...     hair_color: str  # use hex color
...     skin_color: str  # use readable color
...     appearance_desc: str  # one-line description of this person's appearance
>>> 
>>> model = load_llm_model('gpt-4o')
>>> print(f"Loaded Model: {model}")
>>> 
>>> task = create_datamodel_task(
...     model=model,
...     datamodel_class=Person,
...     task_requirements="""
... You are a bot that reports information about a celebrity.
... 
... I will give you a celebrity's name, and you should describe that person's appearance.
... 
...     """,
...     samples=[
...         # Sample 1: female singer
...         ("Taylor Swift", Person(
...             gender="female",
...             age=34,
...             hair_color="#F5DEB3",  # blonde
...             skin_color="fair",
...             appearance_desc="Tall blonde singer with blue eyes, known for her elegant and graceful appearance"
...         )),
... 
...         # Sample 2: male actor
...         ("Will Smith", Person(
...             gender="male",
...             age=55,
...             hair_color="#2F1B14",  # dark brown
...             skin_color="dark brown",
...             appearance_desc="Charismatic actor with a bright smile, athletic build and confident demeanor"
...         )),
...     ]
... )
>>> print(task.ask_then_parse('Jackie Chan'))
gender='male' age=69 hair_color='#1C1C1C' skin_color='light brown' appearance_desc='Martial arts action star with a lively personality, known for his agile physique and distinctive smile'
>>> print(task.ask_then_parse('Donald Trump'))
gender='male' age=77 hair_color='#FFD700' skin_color='light' appearance_desc='Notable public figure known for his distinct hairstyle and fair complexion, often seen in formal suits'
>>> print(task.ask_then_parse('Tohsaka Rin'))
gender='female' age=17 hair_color='#2F1B14' skin_color='fair' appearance_desc='A young woman with twin-tailed brown hair and aqua eyes, usually seen wearing a red sweater and black skirt, exuding both elegance and a strong-willed demeanor'

DataModelLLMTask

class hbllmutils.response.datamodel.DataModelLLMTask(model: LLMModel, history: LLMHistory, fn_parse_and_validate: Callable[[Any], Any], default_max_retries: int = 5)

A specialized LLM task that parses and validates responses against a data model.

This class extends ParsableLLMTask to provide structured data validation using a custom parsing and validation function. It handles the complete workflow of sending prompts to an LLM, receiving responses, and validating them against a predefined data model structure.

The class is designed to work with any data model that can be validated through a callable function, making it flexible enough to support Pydantic models, dataclasses, or custom validation logic.

Parameters:
  • model (LLMModel) – The LLM model to use for generating responses.

  • history (LLMHistory) – The conversation history to maintain context.

  • fn_parse_and_validate (Callable[[Any], Any]) – Function to parse and validate the response data. Should accept the parsed JSON data and return a validated instance of the data model.

  • default_max_retries (int) – Maximum number of retries for failed attempts, defaults to 5.

Variables:

_fn_parse_and_validate (Callable[[Any], Any]) – The validation function used for parsing responses.

Note

The validation function should raise an exception on invalid data to trigger the retry mechanism. The exception type should match the __exceptions__ class variable defined in ParsableLLMTask.
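
The sketch below shows one way to write a validation function so that every failure surfaces as a single exception type. It assumes, purely for illustration, that `ValueError` is among the retryable exceptions; the actual contents of `__exceptions__` are not stated here, and `validate_user` is a hypothetical function.

```python
from typing import Any, Dict


def validate_user(data: Any) -> Dict[str, Any]:
    # Normalize lookup failures into ValueError so that one exception
    # type covers every kind of invalid input (assumed to be retryable).
    try:
        name = data["name"]
        age = data["age"]
    except (KeyError, TypeError) as err:
        raise ValueError(f"missing or malformed field: {err}") from err
    if not isinstance(name, str) or not name:
        raise ValueError("'name' must be a non-empty string")
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or age < 0:
        raise ValueError("'age' must be a non-negative integer")
    return {"name": name, "age": age}


print(validate_user({"name": "John", "age": 30}))  # prints {'name': 'John', 'age': 30}
```

Returning `None` or a partially filled object on bad input would hide the failure, whereas raising gives the caller a clear signal to ask the model again.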

Example:

>>> from pydantic import BaseModel
>>> class MyModel(BaseModel):
...     name: str
...     age: int
>>> # my_model and my_history stand for an existing LLMModel and LLMHistory
>>> task = DataModelLLMTask(
...     model=my_model,
...     history=my_history,
...     fn_parse_and_validate=MyModel.model_validate
... )
>>> result = task.ask_then_parse("Extract info: John is 30 years old")
>>> isinstance(result, MyModel)
True
>>> result.name
'John'
>>> result.age
30

__init__(model: LLMModel, history: LLMHistory, fn_parse_and_validate: Callable[[Any], Any], default_max_retries: int = 5)

Initialize a DataModelLLMTask instance.

Parameters:
  • model (LLMModel) – The LLM model to use for generating responses.

  • history (LLMHistory) – The conversation history to maintain context.

  • fn_parse_and_validate (Callable[[Any], Any]) – Function to parse and validate the response data. Should accept the parsed JSON data and return a validated instance of the data model.

  • default_max_retries (int) – Maximum number of retries for failed attempts, defaults to 5.

Example:

>>> task = DataModelLLMTask(
...     model=my_model,
...     history=my_history,
...     fn_parse_and_validate=MyModel.model_validate
... )

create_datamodel_task

hbllmutils.response.datamodel.create_datamodel_task(model: str | LLMModel, datamodel_class: type, task_requirements: str, samples: List[Tuple[str, Any]] | None = None, related_datamodel_classes: List[type] | None = None, prompt_generation_model: str | LLMModel | None = None, fn_parse_and_validate: Callable[[Any], Any] | None = None, fn_dump_json: Callable[[Any], Any] | None = None) → DataModelLLMTask

Create a DataModelLLMTask with configured prompts and validation.

This factory function sets up a complete LLM task that:

  • Generates format prompts based on the data model

  • Configures task requirements

  • Sets up parsing and validation logic

  • Optionally includes sample inputs and outputs for reference

The function automatically handles Pydantic BaseModel and dataclass types, providing default parsing and serialization functions. For custom types, you can provide your own parsing and serialization functions.

The generated task uses a structured prompt that includes:

  1. Task requirements describing what the LLM should do

  2. Optional samples showing input-output examples

  3. Format guide explaining the expected output structure
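
The three-part structure above could be assembled along the following lines. This is a sketch under stated assumptions: `build_system_prompt` is a hypothetical helper, the section headings are invented, and the real prompt text produced by this module may differ. Sample outputs are assumed to be JSON-serializable dicts already.

```python
import json
from typing import Any, List, Optional, Tuple


def build_system_prompt(
    task_requirements: str,
    format_guide: str,
    samples: Optional[List[Tuple[str, Any]]] = None,
) -> str:
    # Section 1: what the LLM should do.
    sections = ["# Requirements\n\n" + task_requirements.strip()]
    # Section 2 (optional): input-output examples for few-shot prompting.
    if samples:
        parts = []
        for i, (text, output) in enumerate(samples, start=1):
            dumped = json.dumps(output, ensure_ascii=False, indent=2)
            parts.append(f"## Sample {i}\n\nInput:\n{text}\n\nOutput:\n{dumped}")
        sections.append("# Samples\n\n" + "\n\n".join(parts))
    # Section 3: the expected output structure.
    sections.append("# Output format\n\n" + format_guide.strip())
    return "\n\n".join(sections)


prompt = build_system_prompt(
    task_requirements="Extract user information from the text",
    format_guide="Return a JSON object with fields name (string) and age (integer).",
    samples=[("John Doe, age 30", {"name": "John Doe", "age": 30})],
)
print(prompt)
```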

Parameters:
  • model (LLMModelTyping) – The LLM model to use for the main task.

  • datamodel_class (type) – The data model class that defines the expected output structure.

  • task_requirements (str) – Description of what the task should accomplish.

  • samples (Optional[List[Tuple[str, Any]]]) – Optional list of (input, output) tuples to provide as examples, defaults to None.

  • related_datamodel_classes (Optional[List[type]]) – Optional list of related data model classes for context, defaults to None.

  • prompt_generation_model (Optional[LLMModelTyping]) – Optional separate model for prompt generation, defaults to None (uses main model).

  • fn_parse_and_validate (Optional[Callable[[Any], Any]]) – Optional custom parsing and validation function, defaults to None.

  • fn_dump_json (Optional[Callable[[Any], Any]]) – Optional custom function to convert data model instances to JSON-serializable dicts, defaults to None.

Returns:

A configured DataModelLLMTask instance.

Return type:

DataModelLLMTask

Raises:
  • ValueError – If datamodel_class is not a pydantic BaseModel subclass and fn_parse_and_validate is not provided.

  • ValueError – If samples are provided but datamodel_class is not a pydantic BaseModel or dataclass and fn_dump_json is not provided.
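
The second condition can be pictured as a small resolution step that picks a default serializer or gives up. The sketch below is an assumption about how such a step might look, not the module's code: `resolve_dump_fn` is a hypothetical helper, and Pydantic models are detected by duck typing on `model_dump` (a Pydantic v2 method) so that the example runs with the standard library alone.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Callable, Optional


def resolve_dump_fn(
    datamodel_class: type,
    fn_dump_json: Optional[Callable[[Any], Any]] = None,
) -> Callable[[Any], Any]:
    # An explicitly supplied function always wins.
    if fn_dump_json is not None:
        return fn_dump_json
    # Dataclasses have a standard-library serializer.
    if is_dataclass(datamodel_class):
        return asdict
    # Duck-typed check for Pydantic v2 style models (assumption).
    if callable(getattr(datamodel_class, "model_dump", None)):
        return lambda obj: obj.model_dump()
    raise ValueError(
        f"cannot serialize samples of {datamodel_class!r}; provide fn_dump_json"
    )


@dataclass
class City:
    name: str
    population: int


dump = resolve_dump_fn(City)
print(dump(City(name="Oslo", population=700000)))
# prints {'name': 'Oslo', 'population': 700000}
```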

Note

The function prints the generated system prompt to stdout for debugging purposes. This can be useful for understanding what instructions are being sent to the LLM.
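
If the printed prompt is unwanted in logs or tests, standard-library output redirection can capture it. In the sketch below, `noisy_factory` is a hypothetical stand-in for any call that prints while building its result; the same `redirect_stdout` pattern would wrap a real `create_datamodel_task()` call.

```python
import io
from contextlib import redirect_stdout


def noisy_factory() -> str:
    # Hypothetical stand-in for a factory that prints its system prompt.
    print("System prompt: ...")
    return "task"


buffer = io.StringIO()
with redirect_stdout(buffer):
    task = noisy_factory()  # anything printed here goes into the buffer

captured = buffer.getvalue()
print(len(captured.splitlines()))  # prints 1
```

The captured text stays available in `captured`, so it can still be inspected or written to a log when debugging.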

Example:

>>> from pydantic import BaseModel
>>> class MyModel(BaseModel):
...     name: str
...     age: int
>>> task = create_datamodel_task(
...     model=my_llm_model,
...     datamodel_class=MyModel,
...     task_requirements="Extract user information from the text",
...     samples=[
...         ("John Doe, age 30", MyModel(name="John Doe", age=30)),
...     ],
...     related_datamodel_classes=[AddressModel]  # AddressModel: another data model defined elsewhere
... )
>>> result = task.ask_then_parse("Jane Smith is 25 years old")
>>> isinstance(result, MyModel)
True
>>> result.name
'Jane Smith'
>>> result.age
25