Judgement Processor

The Judgement Processor module evaluates model outputs with configurable judge models. It provides two entry points: judge_responses, which evaluates model responses, and judge_images, which evaluates images.

Quick Start

judge_responses

The judge_responses function evaluates the responses in every data file under data_folder, using the judge models listed in async_judge_model.

Definition:

judge_responses(
    data_folder: str,
    async_judge_model: List[str],
    target_models: List[str],
    judge_type: str,
    response_key: List[str] = ['responses'],
    judge_key: str = 'judge',
    response_extension: str = '_responses',
    judge_extension: str = '_judge',
    reverse_choice: bool = False
) -> None

Parameters:

data_folder (str) Path to the folder containing JSON files to process
async_judge_model (List[str]) List of asynchronous judge models
target_models (List[str]) Combined list of target asynchronous and synchronous models
judge_type (str) Type of judge ('llm', 'vlm', 'toxicity', etc.)
response_key (List[str], optional) List of keys to look for in the responses
judge_key (str, optional) Key to store judge results
response_extension (str, optional) Extension for response files
judge_extension (str, optional) Extension for judge result files
reverse_choice (bool, optional) Whether to reverse choices in mappings
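
The response_extension and judge_extension defaults suggest that a response file and its judge result file share a base name and differ only in suffix. The following is a minimal sketch of that naming convention, assuming the suffix is attached to the file stem (for example task_responses.json and task_judge.json). The helper judge_path_for is hypothetical and is not part of trusteval; the library's actual file handling may differ.

```python
from pathlib import Path


def judge_path_for(
    response_file: str,
    response_extension: str = "_responses",
    judge_extension: str = "_judge",
) -> Path:
    """Hypothetical helper: derive a judge-result path from a response-file path.

    Assumption: the suffix sits at the end of the file stem, before '.json'.
    """
    path = Path(response_file)
    if not path.stem.endswith(response_extension):
        raise ValueError(f"{path.name} does not end with {response_extension!r}")
    # Drop the response suffix from the stem and append the judge suffix.
    base = path.stem[: -len(response_extension)]
    return path.with_name(base + judge_extension + path.suffix)


print(judge_path_for("path/to/data/task_responses.json").as_posix())
```

Under this assumption, changing either suffix only changes which files are read and where results are written; the base name stays the same.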

Examples:

For LLM Usage:

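import asyncio
import trusteval

async def main():
    # judge_responses is awaited, so it has to run inside an event loop
    await trusteval.judge_responses(
        data_folder='path/to/data',
        async_judge_model=['model1', 'model2'],
        target_models=['model3'],
        judge_type='llm',
        response_key=['responses'],
        judge_key='judge'
    )

asyncio.run(main())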

For VLM Usage:

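import asyncio
import trusteval

async def main():
    # Same call as the LLM example; only judge_type changes
    await trusteval.judge_responses(
        data_folder='path/to/data',
        async_judge_model=['model1', 'model2'],
        target_models=['model3'],
        judge_type='vlm',
        response_key=['responses'],
        judge_key='judge'
    )

asyncio.run(main())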

judge_images

The judge_images function evaluates the image data files under base_dir for the given aspect, using the models listed in target_models.

Definition:

judge_images(
    base_dir: str,
    aspect: str,
    handler_type: str = 'api',
    target_models: List[str] = None
) -> None

Parameters:

base_dir (str) Base directory for data and output
aspect (str) Evaluation aspect ('robustness', 'fairness', etc.)
handler_type (str, optional) Type of handler ('api' or 'local')
target_models (List[str], optional) List of model names to evaluate

Example Usage:

import trusteval

trusteval.judge_images(
    base_dir='path/to/base_dir',
    aspect='robustness_t2i',
    handler_type='api',
    target_models=['model1', 'model2']
)

Classes

JudgeProcessor

The JudgeProcessor class processes responses from different models, handling both asynchronous and synchronous services.

Parameters:

data_folder (str) Path to the folder containing JSON files to process
async_judge_model (List[str]) List of asynchronous judge models
response_key (List[str], optional) List of keys to look for in the responses
judge_key (str, optional) Key to store judge results
target_models (List[str]) Combined list of target asynchronous and synchronous models
response_extension (str, optional) Extension for response files
judge_extension (str, optional) Extension for judge result files
judge_type (str) Type of judge ('llm', 'vlm', 'toxicity', etc.)
reverse_choice (bool, optional) Whether to reverse choices in mappings

Functions

get_response

Definition:

get_response(
    task_config: Dict[str, Any],
    data_path: str,
    max_concurrent_tasks: int = 30
) -> None

Parameters:

task_config (Dict[str, Any]) Configuration for the current task
data_path (str) Path to the data file
max_concurrent_tasks (int, optional) Maximum number of concurrent tasks
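
The max_concurrent_tasks parameter implies that work is issued concurrently up to a fixed limit. The following is a minimal sketch of one common way to enforce such a limit, using asyncio.Semaphore. It is not taken from the trusteval source; fake_request and run_bounded are hypothetical names used only for illustration, and the sleep stands in for a real model call.

```python
import asyncio


async def fake_request(i: int, state: dict) -> int:
    # Hypothetical stand-in for a model call; records how many run at once.
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulate network latency
    state["active"] -= 1
    return i * 2


async def run_bounded(n_items: int, max_concurrent_tasks: int = 30):
    # The semaphore admits at most max_concurrent_tasks coroutines at a time.
    semaphore = asyncio.Semaphore(max_concurrent_tasks)
    state = {"active": 0, "peak": 0}

    async def guarded(i: int) -> int:
        async with semaphore:
            return await fake_request(i, state)

    # gather preserves input order regardless of completion order.
    results = await asyncio.gather(*(guarded(i) for i in range(n_items)))
    return results, state["peak"]


results, peak = asyncio.run(run_bounded(n_items=10, max_concurrent_tasks=3))
print(results, peak)
```

A lower limit trades throughput for fewer simultaneous requests, which can help when a judge model's API enforces rate limits.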

toxicity

Definition:

toxicity(
    data_path: str,
    response_key: List[str]
) -> None

Parameters:

data_path (str) Path to the data file
response_key (List[str]) Key(s) to extract responses from