Gemma.cpp API Server
This is an HTTP API server for gemma.cpp that implements the Google API protocol, so you can call Gemma models through REST endpoints that use the Google API request and response format.
Features
- API-compatible: Implements Google API endpoints
- Unified client/server: Single codebase supports both local and public API modes
- Text generation: Support for generateContent endpoint
- Streaming support: Server-Sent Events (SSE) for streamGenerateContent
- Model management: Support for /v1beta/models endpoint
- Session management: Maintains conversation context with KV cache
- JSON responses: All responses in Google API format
- Error handling: Proper HTTP status codes and error messages
Building
The API server is built alongside the main gemma.cpp project:
# Configure the build
cmake -B build -DCMAKE_BUILD_TYPE=Release
# Build the API server and client
cmake --build build --target gemma_api_server gemma_api_client -j 8
The binaries will be created at:
- build/gemma_api_server - Local API server
- build/gemma_api_client - Unified client for both local and public APIs
Usage
Starting the Local API Server
./build/gemma_api_server \
  --tokenizer path/to/tokenizer.spm \
  --weights path/to/model.sbs \
  --port 8080
Required arguments:
- --tokenizer: Path to the tokenizer file (.spm)
- --weights: Path to the model weights file (.sbs)
Optional arguments:
- --port: Port to listen on (default: 8080)
- --model: Model name for API endpoints (default: gemma3-4b)
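When the server is started from a script rather than a shell, the arguments above can be assembled programmatically. The following is a minimal sketch that uses only the flags listed above; the helper name `server_command` and its `binary` parameter are hypothetical and not part of gemma.cpp.

```python
from typing import List, Optional


def server_command(tokenizer: str, weights: str,
                   port: Optional[int] = None,
                   model: Optional[str] = None,
                   binary: str = "./build/gemma_api_server") -> List[str]:
    """Build the argument list for gemma_api_server (hypothetical helper)."""
    # Required arguments come first.
    cmd = [binary, "--tokenizer", tokenizer, "--weights", weights]
    # Optional arguments are omitted so the server falls back to its defaults.
    if port is not None:
        cmd += ["--port", str(port)]
    if model is not None:
        cmd += ["--model", model]
    return cmd


cmd = server_command("path/to/tokenizer.spm", "path/to/model.sbs", port=8080)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.Popen(cmd)` to run the server as a child process.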
Using the Unified Client
With Local Server
# Interactive chat with local server
./build/gemma_api_client --interactive 1 --host localhost --port 8080
# Single prompt with local server
./build/gemma_api_client --prompt "Hello, how are you?"
With Public Google API
# Set API key and use public API
export GOOGLE_API_KEY="your-api-key-here"
./build/gemma_api_client --interactive 1
# Or pass API key directly
./build/gemma_api_client --api_key "your-api-key" --interactive 1
API Endpoints
The server implements Google API endpoints:
1. Generate Content - POST /v1beta/models/gemma3-4b:generateContent
Generate a response for given content (non-streaming).
Request:
{
  "contents": [
    {
      "parts": [
        {"text": "Why is the sky blue?"}
      ]
    }
  ],
  "generationConfig": {
    "temperature": 0.9,
    "topK": 1,
    "maxOutputTokens": 1024
  }
}
Response:
{
  "candidates": [
    {
      "content": {
        "parts": [
          {"text": "The sky appears blue because..."}
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0
    }
  ],
  "promptFeedback": {
    "safetyRatings": []
  },
  "usageMetadata": {
    "promptTokenCount": 5,
    "candidatesTokenCount": 25,
    "totalTokenCount": 30
  }
}
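A client usually needs only the generated text and the token counts from this response. The sketch below walks the structure shown above; `extract_text` is a hypothetical helper name, and joining multiple parts into one string is an assumption about how a caller would want them combined.

```python
def extract_text(result: dict) -> str:
    """Join the text parts of the first candidate; return '' if there is none."""
    candidates = result.get("candidates") or []
    if not candidates:
        return ""
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


# Sample body copied from the response shown above.
sample = {
    "candidates": [
        {
            "content": {
                "parts": [{"text": "The sky appears blue because..."}],
                "role": "model",
            },
            "finishReason": "STOP",
            "index": 0,
        }
    ],
    "promptFeedback": {"safetyRatings": []},
    "usageMetadata": {
        "promptTokenCount": 5,
        "candidatesTokenCount": 25,
        "totalTokenCount": 30,
    },
}

print(extract_text(sample))
print(sample["usageMetadata"]["totalTokenCount"])
```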
2. Stream Generate Content - POST /v1beta/models/gemma3-4b:streamGenerateContent
Generate a response with Server-Sent Events (SSE) streaming.
Request: Same as above
Response: Stream of SSE events:
data: {"candidates":[{"content":{"parts":[{"text":"The"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}
data: {"candidates":[{"content":{"parts":[{"text":" sky"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}
data: [DONE]
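The stream above can be consumed line by line. The following is a minimal sketch, assuming each event arrives as a single `data:` line and that `[DONE]` marks the end of the stream, as in the sample; `iter_sse_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator, Union


def iter_sse_text(lines: Iterable[Union[str, bytes]]) -> Iterator[str]:
    """Yield the text fragments carried by a sequence of SSE lines."""
    for raw in lines:
        # HTTP libraries may hand back bytes; normalize to str.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel from the sample above
        event = json.loads(payload)
        for candidate in event.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                yield part.get("text", "")


# Events copied from the sample stream above.
sample_lines = [
    'data: {"candidates":[{"content":{"parts":[{"text":"The"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}',
    "",
    'data: {"candidates":[{"content":{"parts":[{"text":" sky"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}',
    "",
    "data: [DONE]",
]

print("".join(iter_sse_text(sample_lines)))
```

With the `requests` library, the lines could come from `requests.post(url, json=body, stream=True).iter_lines()`.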
3. List Models - GET /v1beta/models
List available models.
Response:
{
  "models": [
    {
      "name": "models/gemma3-4b",
      "displayName": "Gemma3 4B",
      "description": "Gemma3 4B model running locally"
    }
  ]
}
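The `name` field returned here carries a `models/` prefix, while the generate endpoints take the bare model ID followed by a method name. A minimal sketch of turning one into the other, based only on the URL shapes shown in this document (`endpoint_url` is a hypothetical helper):

```python
def endpoint_url(base_url: str, model_name: str, method: str) -> str:
    """Build e.g. http://host:port/v1beta/models/<id>:<method>."""
    prefix = "models/"
    # Accept both "models/gemma3-4b" and "gemma3-4b".
    model_id = model_name[len(prefix):] if model_name.startswith(prefix) else model_name
    return f"{base_url.rstrip('/')}/v1beta/models/{model_id}:{method}"


# Listing copied from the response shown above.
listing = {
    "models": [
        {
            "name": "models/gemma3-4b",
            "displayName": "Gemma3 4B",
            "description": "Gemma3 4B model running locally",
        }
    ]
}

first = listing["models"][0]["name"]
print(endpoint_url("http://localhost:8080", first, "generateContent"))
```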
Example Usage
Using curl with Local Server
# Generate content (non-streaming)
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Hello, how are you?"}]}],
    "generationConfig": {"temperature": 0.9, "topK": 1, "maxOutputTokens": 1024}
  }'

# Stream generate content (SSE)
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:streamGenerateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Tell me a story"}]}],
    "generationConfig": {"temperature": 0.9, "topK": 1, "maxOutputTokens": 1024}
  }'

# List models
curl http://localhost:8080/v1beta/models
Multi-turn Conversation with curl
# First message
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"parts": [{"text": "Hi, my name is Alice"}]}
    ]
  }'

# Follow-up message with conversation history
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"parts": [{"text": "Hi, my name is Alice"}]},
      {"parts": [{"text": "Hello Alice! Nice to meet you."}]},
      {"parts": [{"text": "What is my name?"}]}
    ]
  }'
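The follow-up request above resends the whole exchange in order, so a client has to accumulate turns as the conversation proceeds. A minimal sketch of that bookkeeping, mirroring the request shape in the curl examples (the `build_request` helper is hypothetical):

```python
import json
from typing import Dict, List


def build_request(turns: List[str]) -> Dict:
    """Wrap an ordered list of messages in the 'contents' structure shown above."""
    return {"contents": [{"parts": [{"text": text}]} for text in turns]}


history: List[str] = []

# First message
history.append("Hi, my name is Alice")
first_request = build_request(history)

# Record the model's reply, then add the follow-up question
history.append("Hello Alice! Nice to meet you.")
history.append("What is my name?")
second_request = build_request(history)

print(json.dumps(second_request, indent=2))
```

Each request body can then be posted to the generateContent endpoint exactly as in the curl examples.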
Using Python
import requests

# Generate content
response = requests.post(
    'http://localhost:8080/v1beta/models/gemma3-4b:generateContent',
    json={
        'contents': [{'parts': [{'text': 'Explain quantum computing in simple terms'}]}],
        'generationConfig': {
            'temperature': 0.9,
            'topK': 1,
            'maxOutputTokens': 1024
        }
    }
)
response.raise_for_status()  # raise on HTTP error status instead of parsing an error body
result = response.json()

# Print the first candidate's text, if the server returned one
if 'candidates' in result and result['candidates']:
    text = result['candidates'][0]['content']['parts'][0]['text']
    print(text)
Configuration Options
The Google API supports various generation configuration options:
- temperature: Controls randomness (0.0 to 2.0, default: 1.0)
- topK: Top-K sampling parameter (default: 1)
- maxOutputTokens: Maximum number of tokens to generate (default: 8192)
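These options can be checked on the client before a request is sent. The sketch below uses the defaults and the temperature range listed above; the lower bounds on `topK` and `maxOutputTokens` are assumptions rather than documented limits, and `make_generation_config` is a hypothetical helper.

```python
def make_generation_config(temperature: float = 1.0,
                           top_k: int = 1,
                           max_output_tokens: int = 8192) -> dict:
    """Return a generationConfig dict after basic range checks."""
    # Range taken from the option list above.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    # Assumed lower bounds; the server may accept or reject other values.
    if top_k < 1:
        raise ValueError("topK must be at least 1")
    if max_output_tokens < 1:
        raise ValueError("maxOutputTokens must be at least 1")
    return {
        "temperature": temperature,
        "topK": top_k,
        "maxOutputTokens": max_output_tokens,
    }


config = make_generation_config(temperature=0.9, max_output_tokens=1024)
print(config)
```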
Key Features
- Unified Implementation: Same codebase handles both local server and public API
- Session Management: Maintains conversation context using KV cache
- Streaming Support: Real-time token generation via Server-Sent Events
- Error Handling: Comprehensive error responses and HTTP status codes
- Memory Efficient: Optimized token processing and caching
Compatibility
This implementation is compatible with:
- Google API format and endpoints
- Standard HTTP clients (curl, browsers, Python requests, etc.)
- Server-Sent Events (SSE) for streaming responses
- JSON request/response format