Berkeley Function-Calling Leaderboard

Leaderboard

The Berkeley Function Calling Leaderboard (also called the Berkeley Tool Calling Leaderboard) evaluates LLMs' ability to call functions (also known as tools) accurately. The leaderboard is built on real-world data and is updated periodically. For more information on the evaluation dataset and methodology, please refer to our blog post and code release.

Last updated: 2024-04-27 [Change Log]
Leaderboard columns:
- Rank
- Overall Acc
- Model
- Cost ($)
- Latency (s): Mean, SD, P95
- Abstract Syntax Tree (AST) Evaluation: Simple Function, Multiple Functions, Parallel Functions, Parallel Multiple (AST Summary)
- Evaluation by Executing APIs: Simple Function, Multiple Functions, Parallel Functions, Parallel Multiple (Exec Summary)
- Relevance Detection
- Organization
- License

FC = native support for function/tool calling.

Cost is an estimate of the cost per 1,000 function calls, in USD. Latency is measured in seconds. For open-source models, cost and latency are measured when serving with vLLM on 8 V100 GPUs. The formula can be found in the blog.
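
The exact formula is in the blog post; the sketch below is only a rough, hypothetical illustration of how a per-1,000-calls cost estimate for an API-priced model could be computed. The token counts and per-token prices are invented placeholders, not real figures.

# Hypothetical sketch of a per-1,000-calls cost estimate.
# The exact formula lives in the blog post; the values used in
# the example call are illustrative placeholders only.

def estimate_cost_per_1k_calls(
    input_tokens_per_call: float,
    output_tokens_per_call: float,
    price_per_1m_input: float,   # USD per 1M input tokens
    price_per_1m_output: float,  # USD per 1M output tokens
) -> float:
    cost_per_call = (
        input_tokens_per_call * price_per_1m_input / 1_000_000
        + output_tokens_per_call * price_per_1m_output / 1_000_000
    )
    return cost_per_call * 1000

# Example: 500 input tokens and 80 output tokens per call at
# $10 / $30 per 1M tokens comes to roughly $7.40 per 1,000 calls.
print(estimate_cost_per_1k_calls(500, 80, 10.0, 30.0))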

AST Summary is the unweighted average of the four test categories under AST Evaluation. Exec Summary is the unweighted average of the four test categories under Exec Evaluation.
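
For concreteness, the summary columns are plain arithmetic means over their four sub-categories; a minimal sketch, with made-up per-category accuracies:

# Minimal sketch: AST Summary as the unweighted mean of the four
# AST test categories. Accuracy values here are invented examples.
ast_scores = {
    "simple_function": 0.92,
    "multiple_functions": 0.88,
    "parallel_functions": 0.81,
    "parallel_multiple": 0.76,
}
ast_summary = sum(ast_scores.values()) / len(ast_scores)
print(f"AST Summary: {ast_summary:.4f}")  # 0.8425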

Click on a column header to sort. If you would like to add your model or contribute test cases, please contact us via Discord.

Wagon Wheel

The following chart compares the models across several metrics. You can select and deselect which models to compare. More information on each metric can be found in the blog.

Error Type Analysis

This interactive treemap shows the distribution of error types across different models. The size of each block represents the number of errors encountered by that model. Errors are categorized hierarchically, encompassing both top-level errors (e.g., Value Errors) and more specific subsidiary errors (e.g., Invalid String Format). You can hover over and click on each block to see the detailed breakdown of error types for different models. For more information on how these errors are identified and addressed, refer to our evaluation metrics and insight blog (coming soon).
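
To illustrate the hierarchy the treemap encodes, error counts can be viewed as a two-level mapping from top-level error type to specific sub-error; the categories and counts in this sketch are hypothetical:

# Hypothetical two-level error breakdown of the kind the treemap
# renders: top-level error type -> specific sub-error -> count.
errors = {
    "Value Error": {
        "Invalid String Format": 12,
        "Out-of-Range Number": 5,
    },
    "Missing Parameter": {
        "Required Argument Omitted": 9,
    },
}

# Block sizes in the treemap correspond to these totals.
for top_level, sub_errors in errors.items():
    total = sum(sub_errors.values())
    print(f"{top_level}: {total} errors")
    for name, count in sub_errors.items():
        print(f"  {name}: {count}")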

Function Calling Demo

In this function-calling demo, you can enter a prompt and a function definition and see the model's output. There are two output boxes: the top one shows the output in raw code format, and the bottom one shows it in the OpenAI-compatible format. Note that the OpenAI-compatible output is only available if the raw code output has valid syntax and can be parsed. We also provide a few examples to try out and get a sense of the input and output formats, as sketched below.
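
As a hypothetical illustration of the two formats (the function name and arguments below are invented), the raw code output is a plain call expression, while the OpenAI-compatible output wraps the same call as a tool-call object with JSON-encoded arguments:

import json

# Hypothetical example of the same call in both formats.
# Top box: the raw code-format output, a plain call expression.
raw_output = 'get_weather(city="Berkeley", unit="celsius")'

# Bottom box: the OpenAI-compatible format, produced only when the
# raw output parses as valid syntax.
openai_compatible = {
    "name": "get_weather",
    "arguments": json.dumps({"city": "Berkeley", "unit": "celsius"}),
}
print(raw_output)
print(json.dumps(openai_compatible, indent=2))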


Contact Us

Citation

                
@misc{berkeley-function-calling-leaderboard,
    title={Berkeley Function Calling Leaderboard},
    author={Fanjia Yan and Huanzhi Mao and Charlie Cheng-Jie Ji and Tianjun Zhang and Shishir G. Patil and Ion Stoica and Joseph E. Gonzalez},
    howpublished={\url{https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html}},
    year={2024},
}