Judgement Processor

The Judgement Processor module evaluates model outputs using one or more judge models. It exposes two entry points: judge_responses, which evaluates text responses, and judge_images, which evaluates images.

Quick Start

judge_responses

The judge_responses function processes all data files in a specified directory to evaluate responses using specified models.

Definition:

judge_responses(
    data_folder: str,
    async_judge_model: List[str],
    target_models: List[str],
    judge_type: str,
    response_key: List[str] = ['responses'],
    judge_key: str = 'judge',
    response_extension: str = '_responses',
    judge_extension: str = '_judge',
    reverse_choice: bool = False
) -> None

Parameters:

  • data_folder (str): Path to the folder containing the JSON files to process

  • async_judge_model (List[str]): List of asynchronous judge models

  • target_models (List[str]): Combined list of target asynchronous and synchronous models

  • judge_type (str): Type of judge ('llm', 'vlm', 'toxicity', etc.)

  • response_key (List[str], optional): List of keys to look for in the responses. Default: ['responses']

  • judge_key (str, optional): Key under which judge results are stored. Default: 'judge'

  • response_extension (str, optional): Extension for response files. Default: '_responses'

  • judge_extension (str, optional): Extension for judge result files. Default: '_judge'

  • reverse_choice (bool, optional): Whether to reverse choices in mappings. Default: False
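The response_extension and judge_extension parameters suggest that judge results are written to a file named after the response file. The exact naming scheme is not stated here, so the following is only a minimal sketch under the assumption that the extension is a suffix of the file stem (for example, qa_responses.json becomes qa_judge.json). The helper judge_path_for is hypothetical and not part of trusteval.

```python
from pathlib import Path


def judge_path_for(
    response_path: str,
    response_extension: str = "_responses",
    judge_extension: str = "_judge",
) -> Path:
    """Map '<name>_responses.json' to '<name>_judge.json'.

    Assumption: the extensions are suffixes of the file stem. This is an
    illustrative guess at the naming scheme, not trusteval's documented behavior.
    """
    path = Path(response_path)
    stem = path.stem
    if not stem.endswith(response_extension):
        raise ValueError(f"{path.name} does not end with {response_extension!r}")
    # Swap the response suffix for the judge suffix, keeping the file type.
    new_stem = stem.removesuffix(response_extension) + judge_extension
    return path.with_name(new_stem + path.suffix)


print(judge_path_for("path/to/data/qa_responses.json"))
```

Note that str.removesuffix requires Python 3.9 or later.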

Examples:

For LLM Usage:

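import asyncio
import trusteval

async def main() -> None:
    # judge_responses is awaited, so it has to run inside an event loop.
    await trusteval.judge_responses(
        data_folder='path/to/data',
        async_judge_model=['model1', 'model2'],  # models acting as judges
        target_models=['model3'],                # models whose responses are judged
        judge_type='llm',
        response_key=['responses'],
        judge_key='judge'
    )

asyncio.run(main())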

For VLM Usage:

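import asyncio
import trusteval

async def main() -> None:
    # judge_responses is awaited, so it has to run inside an event loop.
    await trusteval.judge_responses(
        data_folder='path/to/data',
        async_judge_model=['model1', 'model2'],  # models acting as judges
        target_models=['model3'],                # models whose responses are judged
        judge_type='vlm',
        response_key=['responses'],
        judge_key='judge',
    )

asyncio.run(main())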

judge_images

The judge_images function processes all image data files in a specified directory to evaluate images using specified models.

Definition:

judge_images(
    base_dir: str,
    aspect: str,
    handler_type: str = 'api',
    target_models: List[str] = None
) -> None

Parameters:

  • base_dir (str): Base directory for data and output

  • aspect (str): Evaluation aspect ('robustness', 'fairness', etc.)

  • handler_type (str, optional): Type of handler ('api' or 'local'). Default: 'api'

  • target_models (List[str], optional): List of model names to evaluate. Default: None

Example Usage:

import trusteval

trusteval.judge_images(
    base_dir='path/to/base_dir',
    aspect='robustness_t2i',
    handler_type='api',
    target_models=['model1', 'model2']
)

Classes

JudgeProcessor

The JudgeProcessor class processes responses from different models, handling both asynchronous and synchronous services.

Parameters:

  • data_folder (str): Path to the folder containing the JSON files to process

  • async_judge_model (List[str]): List of asynchronous judge models

  • response_key (List[str], optional): List of keys to look for in the responses

  • judge_key (str, optional): Key under which judge results are stored

  • target_models (List[str]): Combined list of target asynchronous and synchronous models

  • response_extension (str, optional): Extension for response files

  • judge_extension (str, optional): Extension for judge result files

  • judge_type (str): Type of judge ('llm', 'vlm', 'toxicity', etc.)

  • reverse_choice (bool, optional): Whether to reverse choices in mappings
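These parameters are the same ones that judge_responses takes, so a single settings dictionary can serve both. The sketch below is one hedged way to assemble it; build_judge_kwargs is a hypothetical helper, not part of trusteval, and the default values are copied from the judge_responses signature above on the assumption that JudgeProcessor uses the same ones.

```python
from typing import Any, Dict, List


def build_judge_kwargs(
    data_folder: str,
    async_judge_model: List[str],
    target_models: List[str],
    judge_type: str,
    **overrides: Any,
) -> Dict[str, Any]:
    """Collect the documented parameters into one keyword-argument dict."""
    kwargs: Dict[str, Any] = {
        "data_folder": data_folder,
        "async_judge_model": async_judge_model,
        "target_models": target_models,
        "judge_type": judge_type,
        # Defaults taken from the judge_responses signature (assumed shared).
        "response_key": ["responses"],
        "judge_key": "judge",
        "response_extension": "_responses",
        "judge_extension": "_judge",
        "reverse_choice": False,
    }
    # Reject names that are not documented parameters, to catch typos early.
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


settings = build_judge_kwargs(
    "path/to/data", ["model1", "model2"], ["model3"], "llm", judge_key="verdict"
)
print(settings["judge_key"], settings["response_key"])
```

If JudgeProcessor accepts these names as keyword arguments, the dictionary could be unpacked into its constructor with **settings; the constructor signature itself is not shown in this section, so treat that as an assumption.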

Functions

get_response

Definition:

get_response(
    task_config: Dict[str, Any],
    data_path: str,
    max_concurrent_tasks: int = 30
) -> None

Parameters:

  • task_config (Dict[str, Any]): Configuration for the current task

  • data_path (str): Path to the data file

  • max_concurrent_tasks (int, optional): Maximum number of concurrent tasks. Default: 30
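How get_response enforces max_concurrent_tasks is not described here. A common way to cap the number of in-flight coroutines in Python is asyncio.Semaphore, and the sketch below illustrates that general technique only. The names run_bounded and demo are hypothetical and are not part of trusteval.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple, TypeVar

T = TypeVar("T")


async def run_bounded(
    jobs: List[Callable[[], Awaitable[T]]],
    max_concurrent_tasks: int = 30,
) -> List[T]:
    """Run all jobs, with at most max_concurrent_tasks active at once."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        # A job only starts once a semaphore slot is free.
        async with semaphore:
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return list(await asyncio.gather(*(guarded(job) for job in jobs)))


async def demo() -> Tuple[List[int], int]:
    """Run ten dummy jobs with a limit of three and record peak concurrency."""
    active = 0
    peak = 0

    async def job(i: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for a model API call
        active -= 1
        return i

    jobs = [lambda i=i: job(i) for i in range(10)]
    results = await run_bounded(jobs, max_concurrent_tasks=3)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

Each job is passed as a zero-argument callable rather than a coroutine object, so no coroutine is created until the semaphore admits it.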

toxicity

Definition:

toxicity(
    data_path: str,
    response_key: List[str]
) -> None

Parameters:

  • data_path (str): Path to the data file

  • response_key (List[str]): Key(s) to extract responses from
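Because response_key is a list, more than one field of a record can hold responses. The layout of the data files is not specified here, so the following sketch assumes each record is a JSON object whose response fields sit at the top level. extract_responses is a hypothetical helper, not part of trusteval, and the sample record is invented for illustration.

```python
import json
from typing import Any, Dict, List


def extract_responses(record: Dict[str, Any], response_key: List[str]) -> Dict[str, Any]:
    """Return the value stored under each requested key, skipping absent keys."""
    return {key: record[key] for key in response_key if key in record}


# Assumed record layout, for illustration only.
record = json.loads('{"prompt": "Say hello", "responses": {"model3": "Hello!"}}')
print(extract_responses(record, ["responses", "enhanced_responses"]))
```

Skipping absent keys instead of raising keeps the sketch tolerant of records that carry only some of the requested fields; a stricter variant could raise KeyError.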