The Berkeley Function Calling Leaderboard V3 (also called the Berkeley Tool Calling Leaderboard V3) evaluates how accurately LLMs can call functions (also known as tools). The leaderboard is built from real-world data and is updated periodically. For more information on the evaluation dataset and methodology, please refer to our blog posts: BFCL-v1, which introduced AST as an evaluation metric; BFCL-v2, which introduced enterprise and OSS-contributed functions; and BFCL-v3, which introduced multi-turn interactions. Check out the code and data.
FC = native support for function/tool calling.
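To make the distinction concrete, here is a minimal sketch of what native function calling looks like with the OpenAI Python SDK; the model name, tool schema, and prompt are illustrative placeholders, not part of the leaderboard:

```python
# Minimal sketch of native function calling (FC) via the OpenAI Python SDK.
# The model name, tool schema, and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Berkeley?"}],
    tools=tools,
)

# A model with native FC support returns structured tool calls rather than
# free-form text that must be parsed.
print(response.choices[0].message.tool_calls)
```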
Cost is an estimate of the cost per 1,000 function calls, in USD. Latency is measured in seconds. For open-source models, cost and latency are measured when serving with vLLM on 8 V100 GPUs. The formula can be found in the blog.
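As a rough sketch of how such an estimate could be computed for a self-hosted model (the hourly rate and the exact formula here are assumptions, not the leaderboard's published numbers; see the blog for the authoritative version):

```python
# Hypothetical sketch of a per-1,000-call cost estimate for a self-hosted
# model. The hourly rate is a placeholder, not the leaderboard's actual
# figure; see the blog for the real formula.
def estimated_cost_per_1000_calls(total_inference_seconds: float,
                                  num_calls: int,
                                  machine_hourly_rate_usd: float = 20.0) -> float:
    """Estimate the USD cost of 1,000 function calls, assuming a flat
    hourly rate for the 8x V100 machine serving the model via vLLM."""
    hours = total_inference_seconds / 3600
    return machine_hourly_rate_usd * hours / num_calls * 1000
```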
AST Summary is the unweighted average of the four test categories under AST Evaluation. Exec Summary is the unweighted average of the four test categories under Exec Evaluation. Overall Accuracy is the unweighted average of all the sub-categories.
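For example, each summary metric is a plain unweighted mean over its sub-categories; the category names and scores below are made up for illustration:

```python
# Illustrative computation of the summary metrics; category names and
# scores are placeholders, not actual leaderboard values.
ast_scores = {"simple": 0.90, "multiple": 0.85,
              "parallel": 0.80, "parallel_multiple": 0.75}
exec_scores = {"simple": 0.88, "multiple": 0.82,
               "parallel": 0.79, "parallel_multiple": 0.70}

ast_summary = sum(ast_scores.values()) / len(ast_scores)      # 0.825
exec_summary = sum(exec_scores.values()) / len(exec_scores)   # 0.7975
```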
Click on a column header to sort. If you would like to add your model or contribute test cases, please contact us via Discord.
The following chart compares the models on a few metrics. You can select and deselect which models to compare. More information on each metric can be found in the blog.
This interactive treemap shows the distribution of error types across models. The size of each block represents the number of errors encountered by that model. Errors are categorized hierarchically, encompassing both top-level errors (e.g., Value Errors) and more specific subsidiary errors (e.g., Invalid String Format). Hover over or click on a block to see a detailed breakdown of error types per model. For more information on how these errors are identified and addressed, refer to our evaluation metrics and insights blog (coming soon).
In this function calling demo, you can enter a prompt and a function definition and see the output. There are two outputs (and two output boxes accordingly): one in the actual code format (top) and one in the OpenAI-compatible format (bottom). Note that the OpenAI-compatible output is only available when the actual code output has valid syntax and can be parsed. We also provide a few examples to try out, to get a sense of the input and output formats.
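To give a sense of the two formats side by side (the function name and values here are placeholders, not actual demo output):

```python
# Illustrative example of the two output formats shown in the demo; the
# function name and arguments are placeholders, not actual demo output.

# Top box: the actual code format, i.e. the call as executable code.
raw_output = 'get_weather(city="Berkeley")'

# Bottom box: the OpenAI-compatible format, produced only when the raw
# output above parses successfully.
openai_compatible_output = {
    "name": "get_weather",
    "arguments": '{"city": "Berkeley"}',
}
```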
@misc{berkeley-function-calling-leaderboard,
  title        = {Berkeley Function Calling Leaderboard},
  author       = {Fanjia Yan and Huanzhi Mao and Charlie Cheng-Jie Ji and Tianjun Zhang and Shishir G. Patil and Ion Stoica and Joseph E. Gonzalez},
  howpublished = {\url{https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html}},
  year         = {2024},
}