🦍 Gorilla: Large Language Model Connected with Massive APIs


BFCL V2 • Live Dataset

Last updated: 2024-08-19 [Change Log]

Introduction

The growing use of LLMs for function-calling is transforming various AI applications. LLMs now serve as intelligent agents, interfacing with external tools, APIs, and databases to perform complex tasks. This capability enables LLMs to navigate web browsers, query databases, control robotic systems, and integrate with a wide array of software tools. The Berkeley Function-Calling Leaderboard (BFCL) has emerged as the standard for evaluating the function-calling (tool-use) capabilities of LLMs such as Claude, OpenAI, Gemini, Llama, and Mistral. With BFCL V1 we introduced a taxonomy of parallel and multiple function-calling scenarios - calling many instantiations of the same function, and choosing between different functions - and evaluated LLMs not just on Python, but also on Java, JavaScript, and REST APIs. BFCL introduced, and continues to be, the only leaderboard that evaluates an LLM's ability to invoke functions in the real world by actually triggering the API call and comparing responses.

To tackle the issues of data contamination, bias and fairness, and the need to generalize functions (tools) to real-world scenarios, we are excited to release BFCL V2 • Live. BFCL V2 • Live employs live, user-contributed function documentation and queries, avoiding the drawbacks of dataset contamination and biased benchmarks. By utilizing user-provided data, BFCL V2 • Live aims to more faithfully measure an LLM's function-calling performance in real-world scenarios, emphasizing the importance of models performing effectively in diverse and dynamic environments. In this blog, we describe the methodology and data composition, and provide ample examples of how our users - including large banks, tech corporations, agent developers, hobbyists, and enterprises - deploy LLM function calling. For example, we observe very high demand for intelligently choosing between functions (multiple functions) and lower demand for making parallel function calls in a single turn (parallel functions). Further, this dataset helps us pinpoint contamination in different models!

On Sept 19th, we released the BFCL V3 dataset, featuring multi-turn & multi-step function calling evaluation. Check out the BFCL V3 Blog Post for more details!


With BFCL, we have been constantly listening to the community and improving the evaluation dataset and metrics. BFCL V2 • Live is a big step in this direction, and we will continue to curate and release community-contributed Live datasets periodically. Thank you for your enthusiastic participation, and we look forward to making BFCL the one-stop destination for evaluating the function-calling (tool-calling) abilities of LLMs.


Dataset Composition

BFCL V2 • Live features a diverse set of 2,251 question-function-answer pairs (question + [{'name': 'func1', 'description': 'order takeout'}] -> answer), including rare function documentation with 10+ function options and complex functions with 10+ nested parameters. The dataset also includes new user queries that vary in tone, are part of multi-turn interactions, cover specialized use cases, and span multiple languages. We assess this dataset using the BFCL Abstract Syntax Tree (AST) evaluation to ensure thorough and accurate analysis.
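To make this structure concrete, here is a hypothetical entry in the spirit of a simple question-function-answer pair; the function, parameter names, and field layout are illustrative assumptions rather than an actual dataset record.

```python
# Hypothetical question-function-answer pair (illustrative only; field names
# and values are assumptions, not an actual BFCL V2 • Live record).
entry = {
    "question": "Order takeout from La Val's Pizza for two people.",
    "functions": [
        {
            "name": "order_takeout",
            "description": "Place a takeout order at a restaurant.",
            "parameters": {
                "type": "dict",
                "properties": {
                    "restaurant": {"type": "string", "description": "Restaurant name."},
                    "party_size": {"type": "integer", "description": "Number of people."},
                },
                "required": ["restaurant", "party_size"],
            },
        }
    ],
    # Ground-truth call checked via AST evaluation (function name + argument values).
    "answer": "order_takeout(restaurant=\"La Val's Pizza\", party_size=2)",
}
```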

To avoid confusion, BFCL V1 refers to the initial release (release v1.0; read more in the BFCL V1 blog), and BFCL V2 • Live is the user-contributed dataset that we discuss in this blog. Together, they form the BFCL V2 dataset that is used to evaluate the function-calling capabilities of LLMs on our Berkeley Function Calling Leaderboard (BFCL).

BFCL V2 • Live comprises 258 simple, 1,053 multiple, 16 parallel, 24 parallel multiple, 882 irrelevance detection, and 18 relevance detection entries. Each test category is detailed in the Evaluation Categories section and is designed to assess a different function-calling scenario comprehensively. We further divide the Relevance category (which tests the LLM's ability to recognize that no function (tool) fits the user's request, so it should ask for more information) into Relevance detection and Irrelevance detection to provide more nuanced insights.

  • Irrelevance detection: The scenario where none of the function choices provided are relevant to the user query and none should be invoked. We expect the model to not output a function call; the model can either output a message explaining why the functions provided are not relevant or simply output a non-function-call response (e.g., an empty list).
  • Relevance detection: The opposite of irrelevance detection. The scenario where at least one of the function choices provided is relevant to the user query and should be invoked, but the way the user prompt or the function doc is stated means there could be infinitely many correct function calls, making it impossible to evaluate against a pre-defined answer set. We expect the model to output some function call (one or multiple) that is relevant to the user query; we don't check the correctness of the function call in this category (e.g., correct parameter values).
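For illustration, here is a hypothetical irrelevance-detection entry written to match the description above; the query, the function, and the field names are assumptions, not an actual record from the dataset.

```python
# Hypothetical irrelevance-detection entry: none of the provided functions fit
# the query, so the expected behavior is to produce no function call at all.
irrelevance_entry = {
    "question": "What's the weather like in Berkeley right now?",
    "functions": [
        {
            "name": "calculate_mortgage",
            "description": "Compute the monthly payment for a fixed-rate mortgage.",
            "parameters": {
                "type": "dict",
                "properties": {
                    "principal": {"type": "float", "description": "Loan amount in USD."},
                    "annual_rate": {"type": "float", "description": "Annual interest rate, e.g. 0.05."},
                    "years": {"type": "integer", "description": "Loan term in years."},
                },
                "required": ["principal", "annual_rate", "years"],
            },
        }
    ],
}
# A passing response is anything that is *not* a function call, e.g. an empty
# list or a short message explaining that no provided function is applicable.
```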

It is worth noting that in BFCL V1, we have 400, 200, 200, and 200 entries in the simple, multiple, parallel, and parallel multiple categories, respectively. This reflects what our team thought the composition of real-life function-calling scenarios should be, based on our experience building Gorilla LLM. In BFCL V2 • Live, on the other hand, there are significantly more multiple-function scenarios (where models need to decide which functions to use) and fewer parallel function-calling scenarios. This observation reflects how most users interact with function calling: high demand for intelligently choosing between functions and lower demand for making parallel function calls in a single turn.

Berkeley Function-Calling Leaderboard Live (BFCL V2 • Live) Data Composition


On average, each entry in BFCL V2 • Live contains 3 function choices, with the largest entry offering 37 function choices. Each function has an average of 4 parameters, with the largest function taking 28 parameters.
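As a rough sketch of how such composition statistics can be tallied, assuming each entry stores its candidate functions under a 'functions' key (a schema assumption for illustration):

```python
from collections import Counter
from statistics import mean

def composition_stats(entries):
    """Tally function-choice and parameter counts across dataset entries.

    Assumes each entry is a dict with a "functions" list of function docs,
    each holding its parameters under "parameters" -> "properties"
    (these key names are assumptions made for this sketch).
    """
    func_counts = [len(e["functions"]) for e in entries]
    param_counts = [
        len(f.get("parameters", {}).get("properties", {}))
        for e in entries
        for f in e["functions"]
    ]
    return {
        "avg_functions": mean(func_counts),
        "max_functions": max(func_counts),
        "avg_params": mean(param_counts),
        "max_params": max(param_counts),
        "function_count_histogram": Counter(func_counts),
        "param_count_histogram": Counter(param_counts),
    }
```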

Function Count Distribution in BFCL V2 • Live
Function Parameter Count Distribution in BFCL V2 • Live

Methodology

Here is a high-level overview of the methodology used to generate the BFCL V2 • Live dataset. The process begins with Data Pre-Processing, where raw data from real-world user queries is deduplicated and cleaned to ensure a unique and relevant dataset. This is followed by Data Filtering, which identifies and handles low-quality inputs and separates out irrelevant queries for specific testing purposes. The final stage is Data Quality Improvement, where the remaining data is standardized and enhanced to fit the requirements of the BFCL evaluation pipeline while maintaining the integrity of the original user inputs. We go into detail in the following sections.

BFCL V2 • Live Methodology Flowchart: There are three main stages for cleaning raw user queries: Data Pre-Processing (using ROUGE scores and text embeddings for deduplication), Data Filtering (assessing function document and prompt quality), and Data Quality Standardization & Improvement (formatting and enhancing high-quality inputs).


🧑‍💻 Data Source

BFCL V2 • Live is composed of real-world user data provided to our hosted model endpoint through partnerships and through public access on our website gorilla.cs.berkeley.edu/leaderboard. We start with 64,517 queries (after basic filtering for spam) that hit our hosted endpoint between 2024-02-26 and 2024-04-01.


🔨 Data Pre-Processing

Data preprocessing is essential for generating structured, fresh user data that excludes existing public function-calling test sets. This process includes:

  1. Deduplication Using ROUGE Score. To ensure uniqueness, we employ the ROUGE-L score in two distinct ways (see the sketch after this list):
    • Remove Function Documentation From Public Test Sets: User queries whose function documentation overlaps with existing public datasets - NexusRaven, Firefunctions, Anyscale, and NousResearch-Glaive-Function-Calling - are removed.
    • Function Documentation Deduplication: Repeated user-provided function documentation is deduplicated. This is crucial as some users may submit the same or slightly modified documents multiple times for testing.
  2. Query Deduplication through Text Embedding Model. We use OpenAI's text-embedding-3-small model to further deduplicate user queries, enhancing the uniqueness of our data.
  3. We parse each user-provided function document string into a valid JSON list that can be used in the BFCL evaluation pipeline.
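Below is a minimal sketch of the two deduplication steps, assuming the rouge-score and openai Python packages; the ROUGE-L > 0.8 merge criterion matches the table below, while the embedding-similarity threshold is an illustrative assumption.

```python
# Minimal deduplication sketch (illustrative; thresholds other than the
# ROUGE-L > 0.8 merge criterion are assumptions).
import numpy as np
from openai import OpenAI
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_near_duplicate_doc(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Treat two function-doc strings as duplicates if ROUGE-L F1 exceeds the threshold."""
    return scorer.score(doc_a, doc_b)["rougeL"].fmeasure > threshold

def embed(texts: list[str]) -> np.ndarray:
    """Embed user queries with OpenAI's text-embedding-3-small model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def is_near_duplicate_query(query_a: str, query_b: str, threshold: float = 0.9) -> bool:
    """Treat two queries as duplicates if their embeddings are nearly parallel (cosine similarity)."""
    a, b = embed([query_a, query_b])
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine > threshold
```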

After all the preprocessing steps, we are only left with less than 3% of the original data! See the table below for more statistics.

| Stage | Number of Function Docs Left |
|---|---|
| Initial Function Doc-User Query Pairs | 64,517 |
| After Merging Exact Same Function Docs | 6,478 |
| After Removing Overlap with Public Datasets | 3,698 |
| After Further Merging Similar Function Docs (ROUGE-L > 0.8) | 1,692 |


🗑️ Data Filtering

Real-world user data is messy. Some of it is low-quality data that cannot go directly into a high-quality function-calling evaluation dataset, since low-quality function docs and prompts often produce non-unique model responses. This section details how we define, judge, and handle low-quality data for downstream use, and how we transform function documents into a well-formed format. After transformation, these function documents are compatible with our current BFCL evaluation procedure.

We define high-quality function documents and queries as follows:
  • High-Quality Function Document: The function document should have a clear JSON structure that includes essential fields such as the function's name, the function description, and detailed information on each parameter. This information includes the parameter type, a detailed parameter description (e.g., its format and possible value range), and the default value if it is an optional parameter.
  • High-Quality User Prompt: The user prompt should align well with the function's intended use, utilize all required parameters effectively, and clearly specify all relevant parameters.
Based on the above definition, here are some examples of low-quality data:
+ Low-Quality Function Document Example
+ Low-Quality User Prompt Example 1
+ Low-Quality User Prompt Example 2
+ Irrelevant User Prompt Example 1
+ Irrelevant User Prompt Example 2

As outlined in our framework, low-quality or irrelevant user queries are identified and used for the “Irrelevance Detection” test category, which evaluates the model's ability to generate non-function-calling responses to irrelevant queries. For more details, please refer to the Evaluation Categories section of our BFCL blog. These queries remain unmodified in subsequent data-cleaning phases.

On the other hand, low-quality function documents are excluded and do not proceed to the next stage of data cleaning.


💯 Data Quality Improvement

After filtering out low-quality documents and user prompts, it is important to ensure data standardization so that the remaining function docs and user prompts are of high quality and compatible with the current BFCL evaluation pipeline. This process involves minimal yet necessary modifications to maintain the integrity of the original real-world documents while adapting them to fit seamlessly into the pipeline.

  • For function docs, we carefully revise each function doc to conform to the BFCL format, which imposes specific structural and content requirements (see the sketch after this list):

    • The function doc must include the fields 'name', 'description', and 'parameters'.
    • The function parameters must be precisely defined:
      1. Each parameter must be correctly typed (e.g., one of boolean, array, string, integer, float, tuple, any, dict).
      2. All required parameters must be clearly indicated in the "required" field.
      3. Parameters must comply with detailed formatting guidelines: a dictionary-type parameter should include a "properties" field that details each key-value pair, while an array-type parameter should have an "items" field that specifies the type of elements in the array.
      4. Any optional parameter should clearly specify its default value, either as an additional "default" field or as part of its parameter description.

    We strictly follow the "minimum edits" principle, where default values are only added if they are missing, types are corrected (or inferred when absent) to align with the BFCL format, and any necessary fields are filled out to comply with type restrictions. This ensures that we enhance the usability of the documents in our evaluation pipeline without compromising their original intent or functionality.

  • For prompts, we focus on enhancing their quality while strictly preserving the original format and semantic content. We ensure that the user prompts are relevant to the associated function doc and practically applicable in real-world scenarios. Special attention is given to clarity, emphasizing the presentation of specific, concrete parameter values that are necessary for function calls. We eliminate ambiguity to improve clarity and precision. We also replace any sensitive information with generic placeholders to maintain confidentiality.

    Some prompts involve system, user, and assistant roles. We ensure that the system prompt is unmodified; we only make necessary edits to the user prompts.
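As a concrete illustration of the format described above, here is a minimal sketch of a conforming function doc; the function and its parameters are hypothetical and not drawn from the dataset.

```python
# Minimal sketch of a function doc in the BFCL format described above.
# The function ("get_restaurant_info") and its parameters are hypothetical.
function_doc = {
    "name": "get_restaurant_info",
    "description": "Retrieve basic information about a restaurant.",
    "parameters": {
        "type": "dict",
        "properties": {
            "restaurant_name": {
                "type": "string",
                "description": "The full name of the restaurant, e.g. 'La Val's Pizza'.",
            },
            "fields": {
                "type": "array",
                "items": {"type": "string"},  # array-type parameters specify their element type
                "description": "Which attributes to return, e.g. ['rating', 'address'].",
                "default": ["rating"],  # optional parameter with an explicit default
            },
            "include_reviews": {
                "type": "boolean",
                "description": "Whether to include user reviews. Defaults to False.",
                "default": False,
            },
        },
        "required": ["restaurant_name"],  # required parameters listed explicitly
    },
}
```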


Live Dataset Examples

Here are some interesting examples from the BFCL V2 • Live dataset. We hope to show how diverse the BFCL V2 • Live dataset is and how it differs from the previous BFCL dataset.

+ Multi-lingual User Prompt
+ Multi-lingual Function Document
+ Multiple Function with 10+ Function Choices (the example has 37!)
+ Complex Function with 10+ Parameters (the example has 28!)
+ Classification through Function-Calling (interesting use case)
+ User Prompt Contains Lots of Redundant Information
+ Tricky User Prompt (how would you parse it?) (before prompt enhancement)
+ User Prompt Typo Affecting Important Parameter Information
+ User Prompt is the Unaltered Ground Truth Function Call (considered low-quality before prompt enhancement)

Going Deep

To better understand the performance differences between BFCL V1 and BFCL V2 • Live, we provide two types of visualizations: histograms and scatterplots. These graphs offer insights into the distribution of scores and the relative difficulty of tasks in each version of the benchmark, and expose model data contamination issues. The plots are interactive: you can drag to zoom in and out, and hover over points to see model details.

Interpreting Histograms

The histograms show the distribution of scores for each category and overall, allowing us to compare the difficulty of user queries and function documents between the two versions. Generally, we observe that BFCL V2 • Live presents more challenging scenarios in categories such as simple, multiple, parallel, and parallel multiple function calling, while remaining comparable in irrelevance detection. This increased difficulty reflects the real-world complexity captured by user-contributed data.

Interpreting Scatterplots

The scatterplots directly compare how models perform on BFCL V1 versus BFCL V2 • Live. Each point represents a model, with its x-coordinate showing the BFCL V1 score and its y-coordinate showing the BFCL V2 • Live score. These plots help us identify performance discrepancies and assess model robustness across different datasets.

In the scatterplots:

  • Points on the y=x line (dashed red) indicate equal performance in both versions.
  • Points above this line suggest better performance in BFCL V2 • Live.
  • Points below the line indicate poorer performance in BFCL V2 • Live compared to V1.

Notably, points significantly below the y=x line (i.e., much lower scores in BFCL V2 • Live) may indicate potential data contamination in BFCL V1 for that model. This suggests the model struggles with the new, unseen data in V2 Live despite performing well on V1. Such discrepancies help us identify models that may have been overfitted to the V1 dataset.
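For readers who want to reproduce this style of comparison from their own score tables, here is a minimal matplotlib sketch; the model names and scores are placeholder values, not actual leaderboard numbers.

```python
# Minimal sketch of a V1-vs-V2 Live scatterplot (placeholder scores, not
# actual leaderboard numbers).
import matplotlib.pyplot as plt

scores = {
    # model name: (BFCL V1 accuracy, BFCL V2 • Live accuracy) -- placeholders
    "model-a": (92.0, 78.0),
    "model-b": (85.0, 84.0),
    "model-c": (70.0, 72.0),
}

fig, ax = plt.subplots()
for name, (v1, v2_live) in scores.items():
    ax.scatter(v1, v2_live)
    ax.annotate(name, (v1, v2_live))

# Dashed red y = x line: points on it performed equally on both versions;
# points far below it may hint at contamination/overfitting on V1.
ax.plot([0, 100], [0, 100], "r--")
ax.set_xlabel("BFCL V1 accuracy (%)")
ax.set_ylabel("BFCL V2 • Live accuracy (%)")
ax.set_title("BFCL V1 vs. BFCL V2 • Live")
plt.show()
```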

Below are comparisons between the BFCL V1 dataset and the BFCL V2 • Live dataset in summary statistics (Overall Accuracy, AST summary) and individual categories (Simple Function, Multiple Function, Parallel Function, Parallel Multiple Function, Irrelevance Detection).


We hope you enjoyed this blog post. We would love to hear from you on Discord, Twitter (#GorillaLLM), and GitHub.

Citation

If you would like to cite BFCL:

                        
@inproceedings{berkeley-function-calling-leaderboard,
  title={Berkeley Function Calling Leaderboard},
  author={Fanjia Yan and Huanzhi Mao and Charlie Cheng-Jie Ji and Tianjun Zhang and Shishir G. Patil and Ion Stoica and Joseph E. Gonzalez},
  year={2024},
  howpublished={\url{https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html}},
}