This plugin provides C# bindings and a high-level Unity integration for the Gemma.cpp library, allowing you to run Google's Gemma models directly within your Unity projects.
This is not an officially supported Google product.
This plugin depends upon the following Unity Package Manager (UPM) package:
- UniTask >= 2.5.10
  This is used by `GemmaManager` to handle asynchronous operations and ensure callbacks (like `onTokenReceived`) can safely interact with the Unity API from background threads.
Note that if this package is added to a Unity project using UPM, Unity should automatically fetch UniTask based on the dependency information.
- Add Package: Add this repository's URL to the Unity Package Manager in your Unity project.
- Create Settings: Create a `GemmaManagerSettings` ScriptableObject asset (Assets -> Create -> GemmaCpp -> Gemma Manager Settings). Configure it with the paths to your downloaded Gemma model weights and tokenizer file, and the desired model parameters (Model Type, Weight Format, Temperature).
- Add Component: Add the `GemmaManager` component to a GameObject in your scene.
- Assign Settings: Drag your created `GemmaManagerSettings` asset onto the `Settings` field of the `GemmaManager` component in the Inspector.
- Use in Script: Get a reference to the `GemmaManager` component in your scripts and call its methods (see API Reference below).
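For instance, a minimal sketch of the last step, assuming the package's types live in a `GemmaCpp` namespace (adjust the `using` to whatever the package actually declares) and that the model has finished loading before the request is made:

```csharp
using Cysharp.Threading.Tasks;
using GemmaCpp; // assumed namespace - adjust to match the package
using UnityEngine;

public class GemmaQuickStart : MonoBehaviour
{
    [SerializeField] private GemmaManager gemma; // assign the GemmaManager component in the Inspector

    private async void Start()
    {
        // Assumes GemmaManager has already loaded the model; in practice you may
        // want to wait on a ready flag or defer the first request.
        string reply = await gemma.GenerateResponseAsync("Hello! Introduce yourself in one sentence.");
        Debug.Log($"Gemma: {reply}");
    }
}
```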
A Gemma instance is created by passing in the information that would ordinarily be passed to Gemma.cpp on the command line.
Gemma gemma = new Gemma(
    "tokenizer.spm",    // tokenizer path
    "gemma3-4b",        // model flag
    "4b-it-sfp.sbs",    // weights file path
    "sfp",              // weights format, usually "sfp"
    384                 // maximum number of tokens generated per turn
);
gemma.SetMultiturn(true);
gemma.SetTemperature(1.0f);
gemma.EnableLogging(true);

Note: This pattern relies on multiturn conversation history. Ensure that `gemma.SetMultiturn(true)` has been called.
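Continuing this example, a minimal sketch of running a turn and then releasing the native context (`Generate` with a callback and `Dispose()` are documented in the Gemma API reference below). With the low-level bindings the callback runs on the generation thread, so avoid touching Unity APIs inside it:

```csharp
// Stream tokens as they are produced; return true to keep generating.
string reply = gemma.Generate("Hello there!", token =>
{
    System.Console.Write(token); // runs on the generation thread - do not call Unity APIs here
    return true;
}, 384);

// When completely done with the model, release the native context.
gemma.Dispose();
```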
Managing conversations with multiple Non-Player Characters (NPCs) can be achieved by using Gemma.cpp's conversation context feature.
In brief, Gemma.cpp keeps track of a number of ConversationData structs that each contain a KV cache and an absolute position (i.e. the point in the cache up to which tokens have been generated).
// gemma.cpp - gemma/bindings/context.h
struct ConversationData {
  std::unique_ptr<KVCache> kv_cache;
  size_t abs_pos = 0;
  ConversationData(const ModelConfig& model_config, size_t prefill_tbatch_size);
};

In this way, each `ConversationData` represents a conversation that the user is having with Gemma, with strict boundaries: what happens in conversation A does not affect conversation B. By using the various conversation-related methods in Gemma.cpp, it is possible to switch between conversations appropriately, allowing one Gemma model instance to power multiple NPCs.
Each conversation is stored in a key-value store, where the key is a string, and the value is a ConversationData struct. The base conversation is "default".
The user wants to talk to an NPC, "Elara Smith", with the unique string identifier npc_elara.
// create conversation if it doesn't yet exist
bool conversation_existed = gemma.HasConversation("npc_elara");
if (!conversation_existed) {
    gemma.CreateConversation("npc_elara");
}
// switch to the conversation
gemma.SwitchConversation("npc_elara");
// make sure that multiturn is on
gemma.SetMultiturn(true);
// prewarm the conversation by sending the NPC's biography as a turn
if (!conversation_existed) {
    var elara_bio = "Your name is Elara Smith, a wise old elf living in the Whispering Woods. You are knowledgeable about ancient runes but wary of strangers.";
    gemma.Generate(elara_bio, 256); // we won't display the inference result, so keep maxLength short
}
// ready to talk to the NPC!

As Gemma is set up to operate in multiturn mode, when talking to the model the user should only send the current user input as the prompt (e.g., "Can you tell me about the Rune of Binding?"). It is not necessary to prepend the conversation history; the model automatically uses the history stored within the active conversation context's KV cache.
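For example, a follow-up turn in the active `npc_elara` conversation passes only the new player input:

```csharp
string reply = gemma.Generate("Can you tell me about the Rune of Binding?");
```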
It may be useful to perform prewarming well in advance of interacting with an NPC. For this purpose, `GemmaManager` provides the `Prewarm()` method, which takes a `Dictionary<string, string>` mapping conversation names to initial prompts.
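A sketch of prewarming two NPCs up front; the conversation names and biographies are illustrative, and `gemmaManager` is assumed to be a reference to the `GemmaManager` component:

```csharp
// Inside an async method on a MonoBehaviour (requires UnityEngine and Cysharp.Threading.Tasks).
var npcBios = new System.Collections.Generic.Dictionary<string, string>
{
    ["npc_elara"] = "Your name is Elara Smith, a wise old elf living in the Whispering Woods.",
    ["npc_guard"] = "You are a gruff city guard who answers questions curtly.",
};

// The optional status callback reports progress per conversation
// (Pending, InProgress, Done, Skipped, Failed).
await gemmaManager.Prewarm(npcBios, (conversationName, state) =>
{
    Debug.Log($"Prewarm {conversationName}: {state}");
});
```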
GemmaManager is a convenience MonoBehaviour class that provides a high-level interface for interacting with Gemma within Unity.
After the model has been loaded in `Start()`, use `GenerateResponseAsync()` to perform inference. This method takes a callback as one of its parameters, which is marshalled to the main thread via UniTask so it can safely use Unity engine APIs (e.g. for updating text boxes and the like).
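For instance, tokens can be streamed into a UI text element as they arrive; the sketch below lives in a MonoBehaviour with a `GemmaManager` reference, and the field names are illustrative:

```csharp
// Requires using Cysharp.Threading.Tasks, UnityEngine, and UnityEngine.UI.
[SerializeField] private GemmaManager gemma;
[SerializeField] private UnityEngine.UI.Text outputText;

public async UniTask Ask(string prompt)
{
    outputText.text = "";
    string full = await gemma.GenerateResponseAsync(prompt, token =>
    {
        // Invoked on the main thread via UniTask, so updating UI here is safe.
        outputText.text += token;
        return true; // return false to stop generation early
    });
    Debug.Log($"Complete response: {full}");
}
```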
The GemmaManager MonoBehaviour provides a high-level interface for interacting with the Gemma model within Unity.
Configuration (Inspector)
- Settings: (Required) Assign a `GemmaManagerSettings` ScriptableObject containing model paths and parameters.
- Verbose Logging: Enable detailed logging from both the C# wrapper and the underlying native library.
Methods
- `async UniTask Prewarm(Dictionary<string, string> conversations, PrewarmStatusCallback statusCallback = null)`
  - Asynchronously prewarms the specified conversation contexts before they are actively used.
  - `conversations`: A dictionary whose keys are the unique names of the conversations to prewarm and whose values are the initial prompt to send to each respective conversation.
  - `statusCallback`: (Optional) A delegate of type `GemmaManager.PrewarmStatusCallback(string conversationName, PrewarmState state)` that receives status updates during prewarming: the name of the conversation being processed and its subsequent `PrewarmState` (e.g., Pending, InProgress, Done, Skipped, Failed).
  - For each entry, it ensures the conversation exists (creating it if necessary), switches to it, and generates an initial response using the provided prompt. This helps reduce latency on the first interaction.
- `async UniTask<string> GenerateResponseAsync(string prompt, Gemma.TokenCallback onTokenReceived = null)`
  - Generates a text response based on the input `prompt`.
  - Runs the generation on a background thread.
  - `onTokenReceived`: Optional callback delegate (`bool TokenCallback(string token)`) that receives generated tokens one by one. Callbacks are executed on the main thread via UniTask, allowing safe interaction with Unity APIs (e.g., updating UI elements). Return `true` from the callback to continue generation, `false` to stop early.
  - Returns the complete generated response string.
- `async UniTask<string> GenerateMultimodalResponseAsync(string prompt, RenderTexture renderTexture, Gemma.TokenCallback onTokenReceived = null)`
  - Generates a text response based on the input `prompt` and an image provided as a `RenderTexture` (see the example after this method list).
  - The `RenderTexture` is automatically converted to the required format on the main thread.
  - Runs the generation on a background thread.
  - `onTokenReceived`: Optional callback, same behavior as in `GenerateResponseAsync`. This is the appropriate place to update UI based on incoming tokens.
  - Returns the complete generated response string.
- `void ResetCurrentConversation()`
  - Resets the history of the current conversation context in the Gemma model.
- `bool CreateConversation(string conversationName)`
  - Creates a new, named conversation context. Returns `true` on success.
- `bool SwitchConversation(string conversationName)`
  - Switches the active context to a previously created conversation. Returns `true` if the conversation exists and was switched to.
- `bool DeleteConversation(string conversationName)`
  - Deletes a named conversation context. Returns `true` on success.
- `bool HasConversation(string conversationName)`
  - Checks if a conversation with the given name exists. Returns `true` if it exists.
- `string GetCurrentConversation()`
  - Gets the name of the currently active conversation context.
  - Returns the name as a string. Returns null or empty if no conversation is active or an error occurs.
 
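As an example of the multimodal call, here is a sketch that asks Gemma about whatever a camera is rendering into a `RenderTexture`; the field names are illustrative:

```csharp
// Inside a MonoBehaviour (requires using Cysharp.Threading.Tasks and UnityEngine).
[SerializeField] private GemmaManager gemma;
[SerializeField] private RenderTexture cameraTexture; // e.g. a camera's target texture

public async UniTask<string> DescribeScene()
{
    return await gemma.GenerateMultimodalResponseAsync(
        "Describe what you see in this image.",
        cameraTexture,
        token => { Debug.Log(token); return true; }); // stream tokens to the console
}
```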
This ScriptableObject holds the configuration for the Gemma model used by GemmaManager.
- Tokenizer Path: Filesystem path to the Gemma tokenizer file (`tokenizer.spm`).
- Weights Path: Filesystem path to the Gemma model weights file (e.g., an `.sbs` file).
- Model Flag: The model type string (e.g., `gemma3-4b`).
- Weight Format: The format of the weights file (e.g., `sfp`).
- Temperature: Sampling temperature for generation (e.g., 0.9).
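For orientation only, a sketch of how such a settings asset could be defined; the class name, field names, menu path, and defaults below are assumptions for illustration, not the package's actual implementation:

```csharp
using UnityEngine;

// Illustrative sketch only - the real GemmaManagerSettings shipped with the package may differ.
[CreateAssetMenu(menuName = "GemmaCpp/Gemma Manager Settings")]
public class ExampleGemmaManagerSettings : ScriptableObject
{
    public string tokenizerPath;            // e.g. a path ending in "tokenizer.spm"
    public string weightsPath;              // e.g. a path to a ".sbs" weights file
    public string modelFlag = "gemma3-4b";  // model type string
    public string weightFormat = "sfp";     // weights format
    [Range(0f, 2f)] public float temperature = 0.9f;
}
```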
This class provides direct C# bindings to the native gemma.dll functions. It is used internally by GemmaManager but can be used directly for more advanced scenarios.
Constructor
- `Gemma(string tokenizerPath, string modelType, string weightsPath, string weightType, int maxLength = 8192)`
  - Creates and initializes a native Gemma context. Throws `GemmaException` on failure.
Methods
- `string Generate(string prompt, int maxLength = 4096)`
  - Generates text based on the input `prompt`, up to a maximum of `maxLength` tokens.
- `string Generate(string prompt, TokenCallback callback, int maxLength = 4096)`
  - Generates text based on the input `prompt`, up to a maximum of `maxLength` tokens. The `TokenCallback` (if provided) is executed directly on the generation thread. Return `true` to continue, `false` to stop.
- `string GenerateMultimodal(string prompt, float[] imageData, int imageWidth, int imageHeight, int maxLength = 4096)`
  - Generates text based on a text input `prompt` together with an image, up to a maximum of `maxLength` tokens. `imageData` must be a flat array of RGB float values (0.0-1.0); a conversion sketch appears after this method list.
- `string GenerateMultimodal(string prompt, float[] imageData, int imageWidth, int imageHeight, TokenCallback callback, int maxLength = 4096)`
  - Generates text with image input. The `TokenCallback` (if provided) is executed directly on the generation thread. Return `true` to continue, `false` to stop.
- `int CountTokens(string text)`
  - Counts the number of tokens in the given text according to the loaded tokenizer.
- `void SetMultiturn(bool enable)`
  - Enables or disables multiturn conversation history.
- `void SetTemperature(float temperature)`
  - Sets the sampling temperature.
- `void SetTopK(int topK)`
  - Sets the top-K sampling value.
- `void SetDeterministic(bool deterministic)`
  - Enables or disables deterministic sampling (uses seed 0).
- `void ResetConversation()`
  - Resets the history of the currently active conversation context.
- `bool CreateConversation(string conversationName)`
  - Creates a named conversation context.
- `bool SwitchConversation(string conversationName)`
  - Switches the active conversation context.
- `bool DeleteConversation(string conversationName)`
  - Deletes a named conversation context.
- `bool HasConversation(string conversationName)`
  - Checks if a named conversation exists.
- `string GetCurrentConversation()`
  - Gets the name of the currently active conversation context as a string. Returns `null` on error.
- `void EnableLogging(bool enable = true)`
  - Enables or disables log messages from the native library via a callback to `Debug.WriteLine`.
- `void Dispose()`
  - Releases the native Gemma context and associated resources. Must be called when finished. Implements `IDisposable`.
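As a sketch of preparing `imageData` for the `GenerateMultimodal` overloads above, a readable `Texture2D` can be flattened into normalized RGB floats as shown below. Whether the model expects rows top-to-bottom or bottom-to-top is an assumption to verify (Unity's `GetPixels` returns rows bottom-to-top):

```csharp
using UnityEngine;

public static class GemmaImageUtil
{
    // Flatten a readable Texture2D into interleaved RGB floats in the 0.0-1.0 range.
    public static float[] ToImageData(Texture2D texture)
    {
        Color[] pixels = texture.GetPixels(); // rows ordered bottom-to-top
        var data = new float[pixels.Length * 3];
        for (int i = 0; i < pixels.Length; i++)
        {
            data[i * 3 + 0] = pixels[i].r;
            data[i * 3 + 1] = pixels[i].g;
            data[i * 3 + 2] = pixels[i].b;
        }
        return data;
    }
}

// Usage with an existing Gemma instance and a readable Texture2D:
// float[] imageData = GemmaImageUtil.ToImageData(myTexture);
// string answer = gemma.GenerateMultimodal("What is in this image?", imageData, myTexture.width, myTexture.height);
```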
Exceptions
- `GemmaException`: Thrown for errors during Gemma operations (e.g., initialization failure, generation failure).
- `DllNotFoundException`: Thrown if the `gemma.dll` (or platform equivalent) cannot be loaded.
- `ObjectDisposedException`: Thrown if methods are called after `Dispose()` has been called.