Berkeley Function-Calling Leaderboard

Leaderboard

The Berkeley Function Calling Leaderboard (also called the Berkeley Tool Calling Leaderboard) evaluates LLMs' ability to call functions (also known as tools) accurately. The leaderboard is built on real-world data and is updated periodically. For more information on the evaluation dataset and methodology, please refer to our blog post and code release.

Last updated: 2024-04-27 [Change Log]
Leaderboard columns:
- Rank
- Overall Acc
- Model
- Cost ($)
- Latency (s): Mean, SD, P95
- Abstract Syntax Tree (AST) Evaluation: Simple Function, Multiple Functions, Parallel Functions, Parallel Multiple (AST Summary)
- Evaluation by Executing APIs: Simple Function, Multiple Functions, Parallel Functions, Parallel Multiple (Exec Summary)
- Relevance Detection
- Organization
- License

FC = native support for function/tool calling.

Cost is an estimate of the cost per 1,000 function calls, in USD. Latency is measured in seconds. For open-source models, cost and latency are measured when serving with vLLM on 8 V100 GPUs. The formula can be found in the blog.
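
The exact formula is in the blog post; the sketch below is only a rough, hypothetical illustration of how a per-1,000-calls cost estimate for an API-priced model could be computed. The token counts and per-token prices are invented placeholders, not real figures.

# Hypothetical sketch of a per-1,000-calls cost estimate.
# The exact formula lives in the blog post; the values used in
# the example call are illustrative placeholders only.

def estimate_cost_per_1k_calls(
    input_tokens_per_call: float,
    output_tokens_per_call: float,
    price_per_1m_input: float,   # USD per 1M input tokens
    price_per_1m_output: float,  # USD per 1M output tokens
) -> float:
    cost_per_call = (
        input_tokens_per_call * price_per_1m_input / 1_000_000
        + output_tokens_per_call * price_per_1m_output / 1_000_000
    )
    return cost_per_call * 1000

# Example: 500 input tokens and 80 output tokens per call at
# $10 / $30 per 1M tokens comes to roughly $7.40 per 1,000 calls.
print(estimate_cost_per_1k_calls(500, 80, 10.0, 30.0))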

AST Summary is the unweighted average of the four test categories under AST Evaluation. Exec Summary is the unweighted average of the four test categories under Exec Evaluation.
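
For concreteness, the summary columns are plain arithmetic means over their four sub-categories; a minimal sketch, with made-up per-category accuracies:

# Minimal sketch: AST Summary as the unweighted mean of the four
# AST test categories. Accuracy values here are invented examples.
ast_scores = {
    "simple_function": 0.92,
    "multiple_functions": 0.88,
    "parallel_functions": 0.81,
    "parallel_multiple": 0.76,
}
ast_summary = sum(ast_scores.values()) / len(ast_scores)
print(f"AST Summary: {ast_summary:.4f}")  # 0.8425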

Click on a column header to sort. If you would like to add your model or contribute test cases, please contact us via Discord.

Wagon Wheel

The following chart compares the models across several metrics. You can select and deselect which models to compare. More information on each metric can be found in the blog.

Error Type Analysis

This interactive treemap shows the distribution of error types across different models. The size of each block represents the number of errors encountered by that model. Errors are categorized hierarchically, encompassing both top-level errors (e.g., Value Errors) and more specific subsidiary errors (e.g., Invalid String Format). You can hover over and click on each block to see the detailed breakdown of error types for different models. For more information on how these errors are identified and addressed, refer to our evaluation metrics and insight blog (coming soon).
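
To illustrate the hierarchy the treemap encodes, error counts can be viewed as a two-level mapping from top-level error type to specific sub-error; the categories and counts in this sketch are hypothetical:

# Hypothetical two-level error breakdown of the kind the treemap
# renders: top-level error type -> specific sub-error -> count.
errors = {
    "Value Error": {
        "Invalid String Format": 12,
        "Out-of-Range Number": 5,
    },
    "Missing Parameter": {
        "Required Argument Omitted": 9,
    },
}

# Block sizes in the treemap correspond to these totals.
for top_level, sub_errors in errors.items():
    total = sum(sub_errors.values())
    print(f"{top_level}: {total} errors")
    for name, count in sub_errors.items():
        print(f"  {name}: {count}")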

Function Calling Demo

In this function-calling demo, you can enter a prompt and a function definition and see the model's output. There are two output boxes: the top one shows the output in raw code format, and the bottom one shows it in the OpenAI-compatible format. Note that the OpenAI-compatible output is only available if the raw code output has valid syntax and can be parsed. We also provide a few examples to try out and get a sense of the input and output formats, as sketched below.
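
As a hypothetical illustration of the two formats (the function name and arguments below are invented), the raw code output is a plain call expression, while the OpenAI-compatible output wraps the same call as a tool-call object with JSON-encoded arguments:

import json

# Hypothetical example of the same call in both formats.
# Top box: the raw code-format output, a plain call expression.
raw_output = 'get_weather(city="Berkeley", unit="celsius")'

# Bottom box: the OpenAI-compatible format, produced only when the
# raw output parses as valid syntax.
openai_compatible = {
    "name": "get_weather",
    "arguments": json.dumps({"city": "Berkeley", "unit": "celsius"}),
}
print(raw_output)
print(json.dumps(openai_compatible, indent=2))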


Contact Us

Citation

                
@misc{berkeley-function-calling-leaderboard,
    title={Berkeley Function Calling Leaderboard},
    author={Fanjia Yan and Huanzhi Mao and Charlie Cheng-Jie Ji and Tianjun Zhang and Shishir G. Patil and Ion Stoica and Joseph E. Gonzalez},
    howpublished={\url{https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html}},
    year={2024},
}