🦍 Gorilla: Large Language Model Connected with Massive APIs

Blog 5: How to Use Gorilla

How to Use Gorilla: A Step-by-Step Walkthrough

How to Use Gorilla Image

Gorilla returns the right API calls for your specific task, enabling LLMs to interact with the world. In this blog post, learn about the different ways you can use Gorilla directly, or integrate it into your workflow.

Gorilla LLM is a rapidly growing open-source project that connects LLMs to APIs, accurately returning API calls for a specific task provided by the user in natural language. Imagine ChatGPT with the ability to generate accurate CLI commands, set up meetings on Google Calendar, and order a pizza for you. These are all online tasks that can be accomplished through API calls, and Gorilla automatically generates relevant calls given a task. With this functionality, Gorilla effectively has the ability to take actions in the world by interacting with thousands of services and tools to accomplish a wide variety of user-defined tasks. Gorilla usage falls into 3 primary categories:

  1. Ready-to-Use Product: Gorilla CLI
  2. Developer Building Blocks:
    1. Gorilla Inference
    2. Gorilla Open Functions
    3. Integrating Gorilla With Third Party Libraries
    4. Train Your Own Gorilla
  3. Gorilla Community:
    1. Gorilla API Zoo Index
    2. Berkeley Function Calling Leaderboard
This blog post outlines these ways in a step-by-step manner and links Google Colab playground environments for you to quickly experiment with Gorilla! To find out what makes Gorilla so good at interacting with APIs, delve deeper into the novel method used to train Gorilla and check out this blog post on retrieval-aware training (RAT) and our newly released work on retrieval-aware finetuning which generalizes high performance RAG systems beyond APIs.

  1. Gorilla CLI: Gorilla in Your Command Line 💻
  2. Gorilla powered CLI. Get started with pip install gorilla-cli.

    Gorilla CLI simplifies command line interactions by generating candidate commands for your desired task. Instead of needing to recall verbose commands for various API's, Gorilla CLI streamlines this process into a simple command: gorilla “your query”. Moreover, we prioritize confidentiality and full user control, only executing commands with explicit permission from the user and never collecting output data.

    1. Get started with pip install:
      pip install gorilla-cli
    2. Using gorilla CLI to write to files:
      • Command:
        $ gorilla “generate 100 random characters into a file called text.txt”
      • Output: Using arrow keys, toggle between commands and press Enter
        » cat /dev/urandom | env LC_ALL=C tr -dc 'a-zA-Z0-9' | head -c 100 > test.txt
        echo $(head /dev/urandom | LC_CTYPE=C tr -dc 'a-zA-Z0-9' | dd bs=100 count=1) > test.txt
        dd if=/dev/urandom bs=1 count=100 of=test.txt
    3. Other Usage Examples:
      • Command:
        $ gorilla “list all my GCP instances”
      • Output:
        » gcloud compute instances list --format="table(name,zone,status)"
        gcloud compute instances list --format table
        gcloud compute instances list --format="table(name, zone, machineType, status)"

  3. Gorilla Inference 🦍
  4. Note: If you would like to try out zero-shot Gorilla in a playground-like environment check out this Google Colab.
    It is plug-and-play with OpenAI's chat completion API:
    OpenAI Chat Completion Compliance

    Gorilla powered CLI. Our hosted Gorilla models are compliant with OpenAI's chat completion API, and can be used as a drop-in replacement! You just need to point openai.api_base to our endpoints, and pick the right model.

    Some of our Gorilla models , were trained on data from HuggingFace's Model Hub, Pytorch Hub, and TensorFlow Hub. Out of the box, it returns API calls for your specific prompt based on its training data (and we've now introduced a way to finetune Gorilla on your own set of APIs but more on that in later sections 😊). Running inference with gorilla is now supported on both hosted endpoints and locally! Inference with Gorilla can be done in 3 primary modalities:
    1. CLI inference for Single and Batched Prompt Inference. (Not to be confused with Gorilla-CLI, the tool).
    2. Local Inference with Quantized Models
    3. Inference with Hosted Endpoint on Replicate

    CLI Inference: Install dependencies, download model, prompt!

    conda create -n gorilla python=3.10
    conda activate gorilla
    pip install -r requirements.txt

    At this point you have multiple options as to which Gorilla Model you would like to use. In order to comply with the Llama license, we have released the Gorilla delta weights, which need to be merged with Llama. If you would like to use models out of the box, you can directly download gorilla-mpt-7b-hf-v0 and gorilla-falcon-7b-hf-v0 from Hugging Face. Run the model with:
    python3 serve/gorilla_falcon_cli.py --model-path path/to/gorilla-falcon-7b-hf-v0
    To run inference with batched prompts (multiple prompts in one batch), create a json file following this format with all of your prompts. Get results for your set of prompts by running:
    python3 gorilla_eval.py --model-path path/to/gorilla-falcon-7b-hf-v0 --question-file path/to/questions.jsonl ----answer-file path/to/answers.jsonl

    Local Inference With Quantized Models

    To support running Gorilla locally, we have released quantized versions of the llama, falcon, and mpt-based versions of Gorilla. The easiest way to run Gorilla locally with a clean UI is through text-generation-webui.

    1. Clone text-generation-webui:
      git clone https://github.com/oobabooga/text-generation-webui.git
    2. Go to cloned folder and install dependencies:
      cd text-generation-webui
      pip install -r requirements.txt
    3. Run the following command to start the UI:
    4. Open a browser and paste the following url:

    From here, navigate to the “Model” tab, select your desired quantized model by pasting the name from this Hugging Face repo. For example in the first text field under “Model,” one could paste “gorilla-llm/gorilla-7b-hf-v1-gguf” and then specify the specific model as “gorilla-7b-hf-v1-q3_K_M.gguf.” After loading the model, navigate to the “Chat” tab and start prompting!
    If you want to see an in depth walk through of how quantization with llama.cpp is performed under the hood, check out this Google Colab.

    Inference with Hosted Endpoint on Replicate

    Rather than hitting our UC Berkeley hosted endpoint, for those wanting to run fast, private, and secure inference, Replicate is an alternative scalable platform to run and deploy ML models. Useful for private applications, and running Gorilla for personal use-cases, Gorilla can now be turned into a production-grade service following the instructions below!

    1. Install Cog (Replicate's tool to containerize/deploy Gorilla):
      sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
      sudo chmod +x /usr/local/bin/cog
    2. Utilize the config file here to use Gorilla with Cog
    3. Set up predict.py according to this specification
    4. Run the following command to build a Docker image:
      cog build -t your_image_name_here
    5. Log in to Replicate and push your model to Replicate's registry:
      cog login
      cog push r8.im/your_username_here/your_model_name_here
    6. The Gorilla model should now be visible on Replicate's website. To run inference, install Replicate's python client library, authenticate your API token, and start generating outputs!
      pip install replicate
      export REPLICATE_API_TOKEN=your_token_here
      output = replicate.run("your_username_here/your_model_name_here:model_version_here", input={"user_query": your_query_here})

  5. Gorilla OpenFunctions: Executable API Calls ⚔️
  6. NEW: We've now released Gorilla Open Functions v2! Checkout this blog post for an in-depth analysis on the capabilities of the latest iteration of Gorilla Open Functions!
    To quickly test out the newest Open Functions model, you can interactively use this demo!

    Gorilla Open Functions returns executable API calls given a specific user query and API documentation. It acts as an open source version of OpenAI's function calling except with chat completion. One key difference is that OpenAI will repeatedly prompt users if arguments are missing in the query but Gorilla Open Functions utilizes chat completion to accurately infer and fill in these missing parameters. Open Functions v2 boasts the best functional calling performance amongst all open source models and is on par with GPT-4! v2 introduces 5 new features including support for more argument data types across different languages, parallel and multiple function calling, function relevance detection, and an improved ability to format RESTful API calls.

    Using the latest Gorilla Open Functions is simple: Install the necessary dependencies, provide a query and a set of functions, and get a response.

    1. Install dependencies:
      pip install openai==0.28.1
    2. Provide a query and set of functions (example below):
          query = "Call me an Uber ride type \"Plus\" in Berkeley at zip code 94704 in 10 minutes"
          function_documentation = {  
              "name": "Uber Carpool",
              "api_name": "uber.ride",
              "description": "Find suitable ride for customers given the location, type of ride, 
                  and the amount of time the customer is willing to wait as parameters",
                      "name": "loc", 
                      "description": "location of the starting place of the uber ride"
                          "enum": ["plus", "comfort", "black"], 
                          "description": "types of uber ride user is ordering"
                          "name": "time", 
                          "description": "the amount of time in minutes the customer is willing to wait"
    3. Define a response function (example below):
          def get_gorilla_response(prompt="Call me an Uber ride type \"Plus\" in Berkeley at zip code 94704 in 10 minutes", 
                                   model="gorilla-openfunctions-v2", functions=[]):
              openai.api_key = "EMPTY"
              openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"
                  completion = openai.ChatCompletion.create(
                  messages=[{"role": "user", "content": prompt}],
                  return completion.choices[0].message.content
              except Exception as e:
                  print(e, model, prompt)
    4. Return relevant API call:
      get_gorilla_response(query, functions=functions)
    5. Output Example:
      uber.ride(loc="berkeley", type="plus", time=10)

    Also, checkout the following tutorials for running and self hosting OpenFunctions locally! Rather than relying on our UC Berkeley hosted endpoints, the tutorials below go over local

    • Run OpenFunctions Locally!
      The link above provides a walkthrough of the quantization process of the Gorilla OpenFunctions model and demonstrates how you can run fast, local inference with the model on Google Colab's free T4 GPU!
    • Self-Hosting OpenFunctions
      The link above provides a clear walkthrough on how to set up serving Gorilla OpenFunctions for your enterprise use-cases. Self-hosting the model in this manner allows for control over your sensitive enterprise data, resulting in secure, private inference!

  7. Integrating Gorilla with Third Party Libraries 🦜
  8. For a self contained walkthrough on how to integrate Gorilla with Langchain check out this Google Colab.

    Integrate Gorilla with Langchain for easy deployment in less than 10 lines of code. Install dependencies, create a langchain agent and start prompting!

    1. Install dependencies:
      pip install transformers[sentencepiece] datasets langchain_openai &> /dev/null
    2. Define Langchain Chat Agent:
      from langchain_openai import ChatOpenAI
      chat_model = ChatOpenAI(
    3. Prompt:
      example = chat_model.invoke("I want to translate from English to Chinese")

  9. Train Your Own Gorilla 🔨🔧
  10. Want to use Gorilla on your own set of APIs? Whether it's utilizing Gorilla to help your company run more efficiently by training it on internal APIs or playing around with personal applications built on top of Gorilla, its now easy to train your own Gorilla LLM! We currently have two ways finetune Gorilla:

    1. text-generation-webui: With this method of finetuning, its possible to train full-precision versions of Gorilla (llama, falcon, and mpt base) on your data with a clean chat interface to interact with your model. However, for quantized models, this method does not support .gguf file formats, which is the file format for the official k-quantized gorilla models. As a result, if you would like to finetune quantized models with this method, we will be using GPTQ quantized gorilla from TheBloke.
    2. Llama.cpp built-in finetuning: This method allows you to finetune Gorilla on our official a href="https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-gguf/tree/main">llama-based k-quantized models. Unfortunately this method doesn't support finetuning the falcon-based or mpt-based quantized versions of Gorilla. The steps are straightforward, but we are working to make the pipeline for this process self-contained and more streamlined so stay-tuned for that!
    Simply format your API dataset, pick a method of finetuning and follow the steps below to LoRA finetune Gorilla:
    1. Finetuning with text-generation-webui:
      1. git clone https://github.com/oobabooga/text-generation-webui.git
      2. Follow text-generation webui to run the application locally
        1. Go to cloned folder
        2. cd loras
        3. Format your APIs according to this template and upload your desired API's (stored as a .txt file) into the loras directory
        4. pip install -r requirements.txt
        5. ./start_macos.sh and it will open a localhost interface
      3. Open a browser and go to url
      4. Navigate to the "Model" tab
      5. Load this GPTQ quantized version of Gorilla>. Note that text-generation-webui method DOES NOT support .gguf models for LoRA finetuning, which is why we use GPTQ quantized version.
        1. If you would like to finetune the full precision models, you can pick any of the models WITHOUT the gguf or ggml suffix tag in this Hugging Face Repo.
        2. To use the llama-based model, we released delta weights to follow compliance rules, so first follow these instructions to merge with base llama.
      6. In the dropdown menu under the Models tab, click the LoRA dropdown menu to add your LoRA.
      7. Click "Apply Lora" Note that this step might take some time.
      8. Navigate to the Chat tab and start chatting with your fine-tuned model!
        1. If you would like to speed up inference and have a GPU available on your laptop, increasing n-gpu-layers on the left side of the "Model" tab accelerates inference.
    2. Finetuning with Llama.cpp:
      To finetune llama-based quantized models, we can directly utilize finetuning functionality in llama.cpp in accordance with the steps below:
      1. Format your APIs according to this template and store as a .txt file
      2. Download your desired k-quantized model from here to use as the base model (note only llama based models are supported). Make sure you are installing a quantized model (.gguf file format).
      3. Run llama.cpp/finetune --model-base the-model-you-downloaded.gguf --train-data your-training-data.txt
        1. Refer to llama.cpp documentation for any extra flags you would like to include.
      4. Run the following command for inference:
        llama.cpp/main --model the-model-you-downloaded.gguf --lora ggml-lora-LATEST-f32.gguf --prompt "YOUR_PROMPT_HERE"

  11. Gorilla API Zoo Index 📖
  12. The API Zoo is an open sourced index containing API documentation that can be used by LLMs to increase tool-use capabilities via API calls. Anyone can upload to the index, simply follow the instructions here to upload your API!

  13. Berkeley Function Calling Leaderboard 🏆
  14. The Gorilla Team has now released the Berkeley Function Calling Leaderboard (BFCL)! This is the first comprehensive dataset for evaluating executable function calls across different languages for LLM function calls. The dataset considers function callings of various forms, different function calling scenarios, and the executability of function call. Specifically, BFCL includes evaluating across Python, Java, JavaScript, REST APIs, and SQL across a wide variety of real-world function calling scenarios. OpenFunctions v2 currently boasts the best performance amongst all open-source models and is on-par with GPT-4! For an interactive visualization to see how different models perform on this data set, checkout the leaderboard here!

    Berkeley Function-Calling Leaderboard (BFCL) Wagon Chart Detailed analysis using Berkeley Function-Calling Leaderboard (BFCL) Wagon Chart


    This blog post went over the numerous methods to utilize and experiment with Gorilla for your own use cases including running local inference, finetuning Gorilla on your own set of APIs, utilizing Gorilla OpenFunctions, the API Zoo. Gorilla enables some super cool applications like Gorilla-Spotlight! Enabling LLMs to utilize tools like APIs opens up the door for agents that can concretely interact with the world. The possibilities for applications are endless. Be sure to check out our Discord community for updates and any questions as well as our official repo on Github.