Metadata Curator ================ Overview -------- Metadata Curator is designed for intelligent data generation, combining web search, section-specific dataset generation, and model-based generation to create high-quality, diverse datasets for various research and application needs. Components ---------- Web-Browsing agents ~~~~~~~~~~~~~~~~~~~ BingWebSearchPipeline ^^^^^^^^^^^^^^^^^^^^^ The ``BingWebSearchPipeline`` automates the process of web searching and result processing. It extracts keywords from user input, performs a Bing search, and processes the results into JSON format. :Parameters: - :param **instruction**: (:type:`str`) A string that specifies the user's instruction for what to find on the web pages. - :param **basic information**: (:type:`dict`) A dictionary that defines the specific information for the search. - :param **need_azure**: (:type:`bool`) A boolean that indicates whether to use the Azure API for generating responses. - :param **output_format**: (:type:`dict`) A dictionary that outlines the structure of the expected results. - :param **keyword_model**: (:type:`str`) The model to use for generating keywords (e.g., "gpt-4o"). - :param **response_model**: (:type:`str`) The model to use for generating responses (e.g., "gpt-4o"). - :param **include_url**: (:type:`bool`) A boolean that indicates whether to include the URL of the web pages in the output. - :param **include_summary**: (:type:`bool`) A boolean that indicates whether to include a summary of the web pages in the output. - :param **include_original_html**: (:type:`bool`) A boolean that indicates whether to include the original HTML of the web pages in the output. - :param **include_access_time**: (:type:`bool`) A boolean that indicates whether to include the access time of the web pages in the output. - :param **output_file**: (:type:`str`) A string that specifies the name of the output file where the results will be saved. - :param **direct_search_keyword**: (:type:`str`, optional) Direct keyword for the search. If provided, this keyword will be used directly. - :param **direct_site**: (:type:`str`, optional) Specific site to search within. If provided, the search will be limited to this site. If there are multiple specified websites, please separate them with commas, for example `"www.wikipedia.com,www.nytimes.com"`. BingImageSearchPipeline ^^^^^^^^^^^^^^^^^^^^^^^ The ``BingImageSearchPipeline`` automates the process of extracting keywords from user input, performing a Bing image search, and processing the results into JSON format. :Parameters: - :param **instruction**: (:type:`str`) A string that specifies the user's instruction for what kind of images to find. - :param **basic_information**: (:type:`dict`) A dictionary that defines the specific information for the search (e.g., breed of dog, age, etc.). - :param **output_path**: (:type:`str`, optional) A string that specifies the name of the output file where the results will be saved. Defaults to "processed_image_results.json". - :param **keyword_model**: (:type:`str`) The model to use for generating keywords (e.g., "gpt-4o"). - :param **include_access_time**: (:type:`bool`, optional) A boolean that indicates whether to include the access time of the web pages in the output. Defaults to ``True``. - :param **direct_search_keyword**: (:type:`str`, optional) Direct keyword for the search. If provided, this keyword will be used directly. Dataset Pool Maintainer ~~~~~~~~~~~~~~~~~~~~~~~ The dataset pool maintainer component is a mechanism for creating diverse and enriched test sets by leveraging section-specific pipelines. Each section (such as safety, ethics, fairness) contains its own pipeline for generating targeted test datasets. .. note:: - **Section-Specific Generation**: Each research section has a dedicated pipeline for generating relevant test datasets. - **Targeted Data Creation**: Pipelines focus on producing datasets specific to their domain's unique requirements. - **Diversity and Representation**: Ensure comprehensive coverage of potential scenarios. **Workflow** 1. **Section Selection**: Choose a specific research section (e.g., safety, ethics, robustness). 2. **Pipeline Execution**: - Navigate to the section's directory. - Run the section's ``pipeline.py``. - Generate a domain-specific test dataset. **Available Sections** - Safety - Ethics - Fairness - Privacy - Robustness - Truthfulness Model-based Data Generators ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The Model-based data generators component leverages advanced language models to directly generate synthetic data. This approach provides a flexible and powerful method for creating datasets tailored to specific requirements. .. note:: - **Direct Model Invocation**: Call generation services to produce data. - **Configurable Generation Parameters**: Customize the generation process. - **Support for Multiple Model Services**: Compatible with various model services. **Usage** For detailed instructions on using the model service, refer to the `ModelService documentation `_. Quickstart ---------- Web-Browsing agents ~~~~~~~~~~~~~~~~~~~ BingWebSearchPipeline ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python import asyncio from trusteval.src.metadata_curator import BingWebSearchPipeline def main(): instruction = "Please find examples of unfair treatment that match the given information." basic_information = { "Gender" : "Female", "Nation" : "United States", } output_format = { "Example": [ "Specific example 1 mentioned on the webpage", "Specific example x mentioned on the webpage (and so on)" ] } output_path = "a.json" extractor = BingWebSearchPipeline( instruction=instruction, basic_information=basic_information, need_azure=True, output_format=output_format, keyword_model="gpt-4o", response_model="gpt-4o", include_url=True, include_summary=True, include_original_html=False, include_access_time=True ) asyncio.run(extractor.run(output_file=output_path)) if __name__ == "__main__": main() BingImageSearchPipeline ^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python import asyncio from trusteval.src.metadata_curator import BingImageSearchPipeline def main(): instruction = "Find images of cute puppies" basic_information = {"breed": "Golden Retriever", "age": "2 months"} custom_output_path = "custom_puppy_images.json" pipeline = BingImageSearchPipeline(instruction, basic_information, output_path=custom_output_path) asyncio.run(pipeline.run()) if __name__ == "__main__": main() Dataset Pool ~~~~~~~~~~~~ .. code-block:: bash # Navigate to a specific section cd section/robustness/robustness_llm # Run the section's pipeline to generate a test dataset python pipeline.py Output Format ------------- Web-Browsing agents ~~~~~~~~~~~~~~~~~~~ BingWebSearchPipeline ^^^^^^^^^^^^^^^^^^^^^ The generated JSON file will have the following structure: .. code-block:: json [ { "Example": [ "25% of working women have earned less than male counterparts for the same job, while only 5% of working men report earning less than female peers.", "Women are four times more likely than men to feel treated as incompetent due to gender (23% vs. 6%).", "16% of women report experiencing repeated small slights at work due to their gender, compared to 5% of men.", "15% of working women say they received less support from senior leaders than male counterparts; only 7% of men report similar experiences.", "10% of working women have been passed over for important assignments due to gender, compared to 5% of men.", "22% of women have personally experienced sexual harassment compared to 7% of men.", "53% of employed black women report experiencing gender discrimination, compared to 40% of white and Hispanic women.", "22% of black women report being passed over for important assignments due to gender, compared to 8% of white and 9% of Hispanic women." ], "url": "https://www.pewresearch.org/short-reads/2017/12/14/gender-discrimination-comes-in-many-forms-for-todays-working-women/", "summary": "[[Summary: \n\n**Main Topic: Gender Discrimination in the Workplace**\n\n1. **Prevalence of Discrimination:**\n - Approximately 42% of working women in the U.S. report experiencing gender discrimination at work.\n - Women are about twice as likely as men (42% vs. 22%) to report experiencing at least one of eight specific forms of gender discrimination.\n\n2. **Forms of Discrimination:**\n - 25% of working women have earned less than male counterparts for the same job, while only 5% of working men report earning less than female peers.\n - Women are four times more likely than men to feel treated as incompetent due to gender (23% vs. 6%).\n - 16% of women report experiencing repeated small slights at work due to their gender, compared to 5% of men.\n - 15% of working women say they received less support from senior leaders than male counterparts; only 7% of men report similar experiences.\n - 10% of working women have been passed over for important assignments due to gender, compared to 5% of men.\n\n3. **Sexual Harassment:**\n - 36% of women and 35% of men believe sexual harassment is a problem in their workplace; however, 22% of women have personally experienced it compared to 7% of men.\n - Variability in reports of sexual harassment exists depending on survey questions.\n\n4. **Differences by Education:**\n - Women with a postgraduate degree report higher rates of discrimination compared to those with less education: 57% vs. 40% (bachelor\u2019s degree) and 39% (less than bachelor\u2019s).\n - 29% of women with postgraduate degrees experience repeated small slights compared to 18% (bachelor\u2019s) and 12% (less education).\n\n5. **Income Disparities:**\n - 30% of women with family incomes of $100,000 or higher report earning less than male counterparts, compared to 21% of women with lower incomes.\n\n6. **Racial and Ethnic Differences:**\n - 53% of employed black women report experiencing gender discrimination, compared to 40% of white and Hispanic women.\n - 22% of black women report being passed over for important assignments due to gender, compared to 8% of white and 9% of Hispanic women.\n\n7. **Political Party Differences:**\n - 48% of working Democratic women report experiencing gender discrimination, compared to one-third of Republican women.\n\n8. **Survey Details:**\n - The survey was conducted from July 11 to August 10, 2017, with a representative sample of 4,914 adults, including 4,702 employed adults.\n - The margin of error is \u00b12.0 percentage points for the total sample and \u00b13.0 for employed women.\n\n**Authors:** Kim Parker and Cary Funk, Pew Research Center. \n**Publication Date:** December 14, 2017.]]", "access_time": "2024-08-10T05:21:17.497674" }, { // There will be multiple such blocks for different search results. } ] BingImageSearchPipeline ^^^^^^^^^^^^^^^^^^^^^^^ The generated JSON file will have the following structure: .. code-block:: json [ { "name": "Image Name", "contentUrl": "https://example.com/contentUrl", //The original image URL is large and may be inaccessible. "thumbnailUrl": "https://example.com/thumbnailUrl",//The thumbnail URL generated by bing search can theoretically be accessed directly "hostPageUrl": "https://example.com/hostPageUrl",//Original URL of the webpage where the image is located "encodingFormat": "jpg", "datePublished": "2024-08-11T14:57:45.000Z",//Image published time "accessTime": "2024-08-11T14:57:45.000Z" }, { // There will be multiple such blocks for different search results. } ]