docs: add canonical tooling corpus (147 files) from Google/HF/frameworks
Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied), notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization: flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,512 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
#
|
||||
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastModel
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-26B-A4B-it",
|
||||
dtype = None, # None for auto detection
|
||||
max_seq_length = 8192, # Choose any for long context!
|
||||
load_in_4bit = True, # 4 bit quantization to reduce memory
|
||||
full_finetuning = False, # [NEW!] We have full finetuning now!
|
||||
# token = "YOUR_HF_TOKEN", # HF Token for gated models
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can process Text, Vision and Audio!
|
||||
#
|
||||
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
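# To make the three sampling settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is an illustration under simplifying assumptions, not how `model.generate` is implemented; `sample_token` and `toy_logits` are hypothetical names introduced only for this example.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Sample one token id from raw logits using temperature, top-k, then top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most likely token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalised cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id] / mass
        if cumulative >= top_p:
            break

    # Draw one id from the survivors, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(sample_token(toy_logits, top_k=3, top_p=0.95))
```

# With `temperature = 1.0` the logits are left unchanged, so only the `top_k` and `top_p` cut-offs shape the distribution.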
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
from transformers import TextStreamer
|
||||
# Helper function for inference
|
||||
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        use_cache = True,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
|
||||
|
||||
|
||||
# # Gemma 4 can see images!
|
||||
#
|
||||
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "image", "image" : sloth_link },
|
||||
{ "type": "text", "text" : "Which films does this animal feature in?" }
|
||||
]
|
||||
}]
|
||||
# You might have to wait 1 minute for Unsloth's auto compiler
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# Let's make a poem about sloths!
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{ "type" : "text",
|
||||
"text" : "Write a poem about sloths." }]
|
||||
}]
|
||||
do_gemma_4_inference(messages)
|
||||
|
||||
|
||||
# # Let's finetune Gemma 4!
|
||||
#
|
||||
# For now you can select whether to finetune the vision layers, the language layers, or both. The audio part can also be finetuned, and we're working to make it selectable as well!
|
||||
|
||||
# We now add LoRA adapters so we only need to update a small amount of parameters!
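# As a rough illustration of why LoRA updates so few parameters, the sketch below counts the trainable parameters LoRA adds to a single linear layer. The layer size is a hypothetical value chosen for the arithmetic, not a real Gemma 4 dimension, and `lora_param_count` is a helper introduced only here.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # LoRA learns B (d_out x r) and A (r x d_in) instead of a full update.
    return r * (d_in + d_out)

# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 4096, 4096, 8
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full update: {full:,} params")
print(f"LoRA r={r}:  {lora:,} params ({100 * lora / full:.2f}% of full)")
```

# Doubling `r` doubles the adapter size, which is why a larger rank can fit the data better but also overfit more easily.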
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
model = FastModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = False, # Turn off for just text!
|
||||
finetune_language_layers = True, # Should leave on!
|
||||
finetune_attention_modules = True, # Attention good for GRPO
|
||||
finetune_mlp_modules = True, # Should leave on always!
|
||||
|
||||
r = 8, # Larger = higher accuracy, but might overfit
|
||||
lora_alpha = 8, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
)
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
|
||||
#
|
||||
# ```
|
||||
# <bos><|turn>user
|
||||
# Hello<turn|>
|
||||
# <|turn>model
|
||||
# Hey there!<turn|>
|
||||
# ```
|
||||
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
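# The turn layout shown above can be sketched in a few lines of plain Python. This is only an illustration of the format: the real template is the Jinja template shipped with the tokenizer and handles more cases (multimodal parts, thinking, and so on). `render_gemma4_turns` is a hypothetical helper, not part of Unsloth.

```python
def render_gemma4_turns(messages, add_generation_prompt=False):
    """Render role/content pairs in the turn layout shown above (illustrative only)."""
    parts = ["<bos>"]
    for message in messages:
        # The format above uses "model" for the assistant's turns.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues from here.
        parts.append("<|turn>model\n")
    return "".join(parts)

rendered = render_gemma4_turns([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
])
print(rendered)
```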
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4-thinking",
|
||||
)
|
||||
|
||||
|
||||
# We get the first 3000 rows of the dataset
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
|
||||
|
||||
|
||||
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import standardize_data_formats
|
||||
dataset = standardize_data_formats(dataset)
|
||||
|
||||
|
||||
# Let's see what row 100 looks like!
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
dataset[100]
|
||||
|
||||
|
||||
# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize = False, add_generation_prompt = False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return { "text" : texts, }
|
||||
|
||||
dataset = dataset.map(formatting_prompts_func, batched = True)
|
||||
|
||||
|
||||
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
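# A small standalone check of why the `<bos>` prefix is stripped. The sample string is hand-written in the turn format used above, and prepending `"<bos>"` merely simulates what the processor does at training time (an assumption made for illustration).

```python
rendered = "<bos><|turn>user\nHello<turn|>\n<|turn>model\nHey there!<turn|>\n"

# str.removeprefix (Python 3.9+) strips the prefix only if it is present.
text = rendered.removeprefix("<bos>")

def count_bos(s):
    """Count how many <bos> markers a string contains."""
    return s.count("<bos>")

# Simulate the processor prepending its own <bos>.
print(count_bos("<bos>" + text))      # prints 1
print(count_bos("<bos>" + rendered))  # prints 2, the duplicate we want to avoid
```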
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
dataset[100]["text"]
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
tokenizer = tokenizer,
|
||||
train_dataset = dataset,
|
||||
eval_dataset = None, # Can set up evaluation!
|
||||
args = SFTConfig(
|
||||
dataset_text_field = "text",
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
|
||||
warmup_steps = 5,
|
||||
# num_train_epochs = 1, # Set this for 1 full training run.
|
||||
max_steps = 60,
|
||||
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
|
||||
logging_steps = 1,
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "linear",
|
||||
seed = 3407,
|
||||
report_to = "none", # Use TrackIO/WandB etc
|
||||
),
|
||||
)
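# To see what these settings mean in practice, the arithmetic below works out the effective batch size and how much of the 3000-row subset a 60-step run covers. It assumes a single GPU and one row per sample (no packing), which are assumptions rather than facts stated in this notebook.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
dataset_rows = 3000
num_devices = 1  # assumption: single GPU

# Gradients are accumulated over several small batches before each update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
rows_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(dataset_rows / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"rows seen in {max_steps} steps: {rows_seen} ({100 * rows_seen / dataset_rows:.0f}% of the subset)")
print(f"optimizer steps for one full epoch: {steps_per_epoch}")
```

# Under these assumptions the 60-step demo touches only a small fraction of the data, which is why a full run uses `num_train_epochs` instead.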
|
||||
|
||||
|
||||
# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import train_on_responses_only
|
||||
trainer = train_on_responses_only(
|
||||
trainer,
|
||||
instruction_part = "<|turn>user\n",
|
||||
response_part = "<|turn>model\n",
|
||||
)
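# The idea behind response-only training can be sketched with toy tokens: every label outside a model turn is replaced by `-100`, the value PyTorch's cross-entropy loss ignores by default. This is a simplified illustration, not Unsloth's implementation; the real function works on token ids and multi-token markers, and `mask_non_response` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_non_response(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif tok == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels

# Toy "tokens": whole strings stand in for token ids.
tokens = ["<|turn>user", "Hello", "<turn|>", "<|turn>model", "Hey", "there!", "<turn|>"]
labels = mask_non_response(tokens, "<|turn>user", "<|turn>model")
print(labels)
```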
|
||||
|
||||
|
||||
# Let's verify that the instruction part is masked. Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
|
||||
|
||||
|
||||
# Now let's print the masked out example - you should see only the answer is present:
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
|
||||
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# # Let's train the model!
|
||||
#
|
||||
# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model via Unsloth native inference! The recommended Gemma 4 settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4-thinking",
|
||||
)
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{
|
||||
"type" : "text",
|
||||
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
|
||||
}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
use_cache = True,
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
)
|
||||
tokenizer.batch_decode(outputs)
|
||||
|
||||
|
||||
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
use_cache = True,
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[23]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
|
||||
|
||||
# In[24]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 128, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# ### Saving to float16 for VLLM
|
||||
#
|
||||
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
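# Conceptually, a merged save folds the LoRA update back into the base weights. The sketch below shows the standard LoRA merge formula, W + (alpha / r) * B A, on tiny hand-written matrices. What `save_pretrained_merged` does internally (for example dequantizing 4-bit weights first) is not shown here and is an assumption; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, r, lora_alpha):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0], [0.0, 1.0]]  # 2 x 2 base weight
lora_a = [[1.0, 2.0]]              # r x d_in, with r = 1
lora_b = [[0.5], [0.0]]            # d_out x r
merged = merge_lora(weight, lora_a, lora_b, r=1, lora_alpha=1)
print(merged)  # [[1.5, 1.0], [0.0, 1.0]]
```

# After merging, the adapter matrices are no longer needed at inference time, which is what makes the merged checkpoint convenient for deployment.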
|
||||
|
||||
# In[25]:
|
||||
|
||||
|
||||
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
|
||||
|
||||
|
||||
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[26]:
|
||||
|
||||
|
||||
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
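# For intuition about what `Q8_0` means, here is a minimal sketch of block-wise 8-bit quantization. It assumes the layout commonly described for llama.cpp's Q8_0 (blocks of 32 values sharing one scale, values stored as signed 8-bit integers). The real format stores the scale in half precision and is implemented in C, so treat this as an approximation; the helper names are hypothetical.

```python
BLOCK_SIZE = 32  # assumption: Q8_0 groups weights in blocks of 32

def quantize_q8_0_block(values):
    """Quantize one block to (scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 0.0
    quants = [round(v / scale) if scale else 0 for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate float values from a quantized block."""
    return [scale * q for q in quants]

block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale = {scale:.6f}, max round-trip error = {max_error:.6f}")
```

# The round-trip error per value is bounded by half the block's scale, which is why 8-bit GGUF files stay close to the original weights.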
|
||||
|
||||
# In[27]:
|
||||
|
||||
|
||||
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
|
||||
|
||||
|
||||
# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[28]:
|
||||
|
||||
|
||||
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
|
||||
|
||||
|
||||
# Now, use the exported `.gguf` file (saved with the `Q8_0`, `F16` or `BF16` method chosen above) in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,448 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel # FastLanguageModel for LLMs
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, processor = FastVisionModel.from_pretrained(
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
|
||||
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
|
||||
)
|
||||
|
||||
|
||||
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
|
||||
#
|
||||
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = True, # False if not finetuning vision layers
|
||||
finetune_language_layers = True, # False if not finetuning language layers
|
||||
finetune_attention_modules = True, # False if not finetuning attention layers
|
||||
finetune_mlp_modules = True, # False if not finetuning MLP layers
|
||||
|
||||
r = 32, # The larger, the higher the accuracy, but might overfit
|
||||
lora_alpha = 32, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
use_rslora = False, # We support rank stabilized LoRA
|
||||
loftq_config = None, # And LoftQ
|
||||
target_modules = "all-linear", # Optional now! Can specify a list if needed
|
||||
)
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
|
||||
#
|
||||
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
|
||||
|
||||
|
||||
# Let's take an overview of the dataset. We'll examine the image at index 2 and its corresponding caption.
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
dataset
|
||||
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
dataset[2]["image"]
|
||||
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
dataset[2]["text"]
|
||||
|
||||
|
||||
# We can also render LaTeX directly in the browser!
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
from IPython.display import display, Math, Latex
|
||||
|
||||
latex = dataset[3]["text"]
|
||||
display(Math(latex))
|
||||
|
||||
|
||||
# To format the dataset, all vision fine-tuning tasks should follow this format:
|
||||
#
|
||||
# ```python
# [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": instruction},
#             {"type": "image", "image": sample["image"]},
#         ],
#     },
#     {
#         "role": "assistant",
#         "content": [
#             {"type": "text", "text": sample["text"]},
#         ],
#     },
# ]
# ```
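# Before converting a whole dataset it can help to check a sample against this structure. The validator below is a minimal sketch: it only checks the roles and content types that appear in the format above, and `validate_conversation` is a hypothetical helper, not an Unsloth function.

```python
def validate_conversation(messages):
    """Return a list of problems found in a vision chat sample (empty if none)."""
    problems = []
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ("user", "assistant"):
            problems.append(f"message {i}: unexpected role {role!r}")
        for j, part in enumerate(message.get("content", [])):
            kind = part.get("type")
            if kind not in ("text", "image"):
                problems.append(f"message {i} part {j}: unexpected type {kind!r}")
            elif kind not in part:
                # A "text" part needs a "text" field, an "image" part an "image" field.
                problems.append(f"message {i} part {j}: missing {kind!r} field")
    return problems

good = [
    {"role": "user", "content": [
        {"type": "text", "text": "Write the LaTeX representation for this image."},
        {"type": "image", "image": "IMAGE_PLACEHOLDER"},  # stands in for a PIL image
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "x^2"}]},
]
bad = [{"role": "bot", "content": [{"type": "text"}]}]

print(validate_conversation(good))
print(validate_conversation(bad))
```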
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
def convert_to_conversation(sample):
    # Pair the instruction and image (user turn) with the ground-truth LaTeX (assistant turn).
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
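
# As a quick sanity check on converted training samples, the sketch below verifies that a sample has one user turn followed by one assistant turn and that every content part carries its payload. `check_vision_sample` and `demo_sample` are hypothetical names introduced only for this illustration (they are not part of Unsloth), and a placeholder string stands in for the PIL image.

```python
# Hypothetical helper (not an Unsloth API): inspects one {"messages": [...]} training sample.
def check_vision_sample(sample):
    """Return a list of problems found in a converted sample (empty list = looks fine)."""
    problems = []
    messages = sample.get("messages", [])
    roles = [message.get("role") for message in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles ['user', 'assistant'], got {roles}")
    for message in messages:
        for part in message.get("content", []):
            if part.get("type") not in ("text", "image"):
                problems.append(f"unknown content type: {part.get('type')!r}")
            elif part["type"] not in part:
                problems.append(f"{part['type']} part is missing its payload")
    return problems

# Stand-in sample: a string replaces the PIL image that the real dataset provides.
demo_sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Write the LaTeX representation for this image."},
            {"type": "image", "image": "<PIL image placeholder>"},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": "x^2 + y^2 = z^2"}]},
    ]
}
print(check_vision_sample(demo_sample))  # an empty list means no problems were found
```

# The check is meant for training samples only: inference prompts pass `{"type": "image"}` without a payload and supply the image to the processor separately.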
|
||||
|
||||
|
||||
# Let's convert the dataset into the "correct" format for finetuning:
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
|
||||
|
||||
|
||||
# The first example is now structured like below:
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
converted_dataset[0]
|
||||
|
||||
|
||||
# Let's take the Gemma 4 instruction chat template and use it with our base model.
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
from unsloth import get_chat_template
|
||||
|
||||
processor = get_chat_template(
|
||||
processor,
|
||||
"gemma-4-thinking"
|
||||
)
|
||||
|
||||
|
||||
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
image = dataset[2]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# You can see it's absolutely terrible! It doesn't follow the instruction at all.
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
|
||||
#
|
||||
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
from unsloth.trainer import UnslothVisionDataCollator
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
train_dataset = converted_dataset,
|
||||
processing_class = processor.tokenizer,
|
||||
data_collator = UnslothVisionDataCollator(model, processor),
|
||||
args = SFTConfig(
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4,
|
||||
max_grad_norm = 0.3,
|
||||
warmup_ratio = 0.03,
|
||||
max_steps = 60,
|
||||
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
|
||||
learning_rate = 2e-4,
|
||||
logging_steps = 1,
|
||||
save_strategy = "steps",
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "cosine",
|
||||
seed = 3407,
|
||||
output_dir = "outputs",
|
||||
report_to = "none", # For Weights and Biases or others
|
||||
|
||||
# You MUST put the below items for vision finetuning:
|
||||
remove_unused_columns = False,
|
||||
dataset_text_field = "",
|
||||
dataset_kwargs = {"skip_prepare_dataset": True},
|
||||
max_length = 2048,
|
||||
)
|
||||
)
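
# To put the step count in perspective, here is a small arithmetic sketch of how many samples a run like the one configured above processes. `samples_seen` is a hypothetical helper, and the single-GPU default (`num_devices=1`) is an assumption matching a single Colab instance.

```python
# Hypothetical helper: effective batch size and total samples processed by a run.
def samples_seen(per_device_batch_size, gradient_accumulation_steps, max_steps, num_devices=1):
    # One optimizer step consumes batch_size * accumulation_steps samples per device.
    effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
    return effective_batch_size, effective_batch_size * max_steps

# Values taken from the SFTConfig above: batch size 1, 4 accumulation steps, 60 steps.
effective_batch, total_samples = samples_seen(1, 4, 60)
print(effective_batch, total_samples)  # 4 240
```

# With these settings the 60-step demo run touches 240 samples, which is why switching to `num_train_epochs` is suggested for a full run.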
|
||||
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model! You can modify the instruction and input—just leave the output blank.
|
||||
#
|
||||
# We'll use the recommended inference settings for Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
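
# To make the three sampling settings concrete, the sketch below applies temperature scaling, then top-k, then top-p (nucleus) filtering to a toy list of logits. It is a simplified illustration in plain Python, not the implementation used by `model.generate`; `filter_top_k_top_p` and the toy logits are hypothetical.

```python
import math

# Hypothetical helper: returns {token_index: probability} for the tokens that survive filtering.
def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    # Temperature rescales the logits; 1.0 leaves them unchanged.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]  # unnormalized softmax
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_k_total = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_k_total
        if cumulative >= top_p:
            break
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

demo_probs = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(demo_probs)  # only the two most likely tokens remain
```

# The next token is then sampled from the surviving tokens according to the renormalized probabilities.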
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
image = dataset[10]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
processor.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastVisionModel

    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
|
||||
|
||||
sample = dataset[1]
|
||||
image = sample["image"].convert("RGB")
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
            {
                "type": "text",
                "text": instruction, # Prompt with the instruction, not the ground-truth LaTeX
            },
|
||||
{
|
||||
"type": "image",
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
|
||||
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
# Select ONLY 1 to save! (Both not needed!)
|
||||
|
||||
# Save locally to 16bit
|
||||
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
|
||||
|
||||
# To export and save to your Hugging Face account
|
||||
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,513 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# ### Unsloth
|
||||
#
|
||||
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastModel
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-31B-it",
|
||||
dtype = None, # None for auto detection
|
||||
max_seq_length = 8192, # Choose any for long context!
|
||||
load_in_4bit = True, # 4 bit quantization to reduce memory
|
||||
full_finetuning = False, # [NEW!] We have full finetuning now!
|
||||
# token = "YOUR_HF_TOKEN", # HF Token for gated models
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can process Text, Vision and Audio!
|
||||
#
|
||||
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
from transformers import TextStreamer
|
||||
# Helper function for inference
|
||||
def do_gemma_4_inference(messages, max_new_tokens = 128):
    # Render the chat template, move the inputs to the GPU and stream the reply.
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        use_cache = True,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
|
||||
|
||||
|
||||
# # Gemma 4 can see images!
|
||||
#
|
||||
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "image", "image" : sloth_link },
|
||||
{ "type": "text", "text" : "Which films does this animal feature in?" }
|
||||
]
|
||||
}]
|
||||
# You might have to wait 1 minute for Unsloth's auto compiler
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# Let's make a poem about sloths!
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{ "type" : "text",
|
||||
"text" : "Write a poem about sloths." }]
|
||||
}]
|
||||
do_gemma_4_inference(messages)
|
||||
|
||||
|
||||
# # Let's finetune Gemma 4!
|
||||
#
|
||||
# For now, you can select whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!
|
||||
|
||||
# We now add LoRA adapters so we only need to update a small number of parameters!
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
model = FastModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = False, # Turn off for just text!
|
||||
finetune_language_layers = True, # Should leave on!
|
||||
finetune_attention_modules = True, # Attention good for GRPO
|
||||
finetune_mlp_modules = True, # Should leave on always!
|
||||
|
||||
r = 8, # Larger = higher accuracy, but might overfit
|
||||
lora_alpha = 8, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
)
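
# To see why `r = 8` keeps the trainable parameter count small, here is a short arithmetic sketch of the standard LoRA formulation: a weight of shape `d_out x d_in` gets two low-rank factors with `r * (d_in + d_out)` parameters in total, and the update is scaled by `lora_alpha / r`. The helper names and the 4096 x 4096 layer size are hypothetical and are not Gemma 4's real dimensions.

```python
# Hypothetical helpers illustrating standard LoRA arithmetic.
def lora_param_count(d_in, d_out, r):
    # Factor A has shape (r, d_in) and factor B has shape (d_out, r).
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    # The low-rank update B @ A is multiplied by lora_alpha / r.
    return lora_alpha / r

demo_d_in, demo_d_out = 4096, 4096  # hypothetical layer size, for illustration only
demo_full = demo_d_in * demo_d_out
demo_adapter = lora_param_count(demo_d_in, demo_d_out, r=8)
print(demo_adapter, demo_full, demo_adapter / demo_full)  # 65536 16777216 0.00390625
print(lora_scaling(lora_alpha=8, r=8))  # 1.0
```

# With `lora_alpha == r` the scaling factor is 1.0, so the adapter update is applied unscaled.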
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We now use the `Gemma-4` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi-turn conversations as shown below:
|
||||
#
|
||||
# ```
|
||||
# <bos><|turn>user
|
||||
# Hello<turn|>
|
||||
# <|turn>model
|
||||
# Hey there!<turn|>
|
||||
# ```
|
||||
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
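
# The layout shown above can be sketched in a few lines of plain Python. This is a simplified illustration of the rendered text only, not the real Jinja chat template: `render_gemma4_turns` and `DEMO_ROLE_NAMES` are hypothetical, and the mapping of the `assistant` role to `model` and the trailing newline after each `<turn|>` are assumptions read off the example.

```python
# Assumed role mapping: the example above labels the reply turn "model".
DEMO_ROLE_NAMES = {"user": "user", "assistant": "model"}

# Hypothetical helper: renders text-only messages in the layout shown above.
def render_gemma4_turns(messages, add_bos=True):
    text = "<bos>" if add_bos else ""
    for message in messages:
        role = DEMO_ROLE_NAMES[message["role"]]
        # Each turn opens with <|turn>ROLE, then the content, then the closing <turn|>.
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    return text

demo_messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma4_turns(demo_messages))
```

# In practice `tokenizer.apply_chat_template` should be used, since the real template also handles images, system prompts and thinking blocks.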
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4-thinking",
|
||||
)
|
||||
|
||||
|
||||
# We get the first 3000 rows of the dataset
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
|
||||
|
||||
|
||||
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import standardize_data_formats
|
||||
dataset = standardize_data_formats(dataset)
|
||||
|
||||
|
||||
# Let's see what row 100 looks like!
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
dataset[100]
|
||||
|
||||
|
||||
# We now have to apply the chat template for `Gemma-4` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning: the processor will add this token before training, and the model expects only one.
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    # Render each conversation to text and drop the leading <bos>; it is added back at tokenization time.
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize = False, add_generation_prompt = False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return { "text" : texts, }
|
||||
|
||||
dataset = dataset.map(formatting_prompts_func, batched = True)
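
# The effect of `removeprefix('<bos>')` can be checked on a plain string. The sketch below uses a hand-written stand-in for the rendered text and simulates the tokenizer's own `<bos>` with string concatenation; the `demo_*` names are hypothetical. `str.removeprefix` requires Python 3.9 or newer.

```python
# Hand-written stand-in for the output of apply_chat_template(..., tokenize=False).
demo_rendered = "<bos><|turn>user\nHello<turn|>\n"

# removeprefix strips one leading occurrence and leaves the rest untouched.
demo_stripped = demo_rendered.removeprefix("<bos>")
print(repr(demo_stripped))  # '<|turn>user\nHello<turn|>\n'

# It is a no-op when the prefix is absent, so applying it twice is safe.
assert demo_stripped.removeprefix("<bos>") == demo_stripped

# Simulate the tokenizer prepending its own <bos>: exactly one remains.
demo_final = "<bos>" + demo_stripped
print(demo_final.count("<bos>"))  # 1
```

# Without the strip, the same simulation would produce two `<bos>` tokens at the start of every training sample.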
|
||||
|
||||
|
||||
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
dataset[100]["text"]
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
tokenizer = tokenizer,
|
||||
train_dataset = dataset,
|
||||
eval_dataset = None, # Can set up evaluation!
|
||||
args = SFTConfig(
|
||||
dataset_text_field = "text",
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
|
||||
warmup_steps = 5,
|
||||
# num_train_epochs = 1, # Set this for 1 full training run.
|
||||
max_steps = 60,
|
||||
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
|
||||
logging_steps = 1,
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "linear",
|
||||
seed = 3407,
|
||||
report_to = "none", # Use TrackIO/WandB etc
|
||||
),
|
||||
)
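
# For intuition about `warmup_steps = 5` with `lr_scheduler_type = "linear"`, the sketch below computes the learning rate at a few steps, assuming the usual shape of linear warmup from 0 to the peak followed by linear decay to 0 at the last step. `linear_schedule_lr` is a hypothetical stand-alone helper, not the trainer's own scheduler object.

```python
# Hypothetical helper: linear warmup followed by linear decay (assumed schedule shape).
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for demo_step in (0, 1, 5, 30, 60):
    print(demo_step, linear_schedule_lr(demo_step))
```

# The peak of `2e-4` is reached at step 5 and the rate falls back to 0 by step 60, so a short run spends most of its steps in the decay phase.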
|
||||
|
||||
|
||||
# We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import train_on_responses_only
|
||||
trainer = train_on_responses_only(
|
||||
trainer,
|
||||
instruction_part = "<|turn>user\n",
|
||||
response_part = "<|turn>model\n",
|
||||
)
|
||||
|
||||
|
||||
# Let's verify that the instruction part is masked! Let's print row 100 again. Notice how the sample has only a single `<bos>`, as expected!
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
|
||||
|
||||
|
||||
# Now let's print the masked out example - you should see only the answer is present:
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
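
# The idea behind response-only training can be sketched on a toy token list: every label outside a model turn is replaced with `-100`, the value that PyTorch's cross-entropy loss ignores by default. This is a simplified illustration, not Unsloth's implementation: `mask_non_response_labels` is hypothetical, and each turn marker is treated as a single token for brevity.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

# Hypothetical helper: keep labels only for tokens that follow a response marker.
def mask_non_response_labels(tokens, token_ids, instruction_part, response_part):
    labels = []
    in_response = False
    for token, token_id in zip(tokens, token_ids):
        if token == instruction_part:
            in_response = False
            labels.append(IGNORE_INDEX)  # the marker itself is not trained on
        elif token == response_part:
            in_response = True
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id if in_response else IGNORE_INDEX)
    return labels

demo_tokens = ["<|turn>user\n", "Hello", "<turn|>\n", "<|turn>model\n", "Hey", " there!", "<turn|>\n"]
demo_ids = list(range(10, 10 + len(demo_tokens)))  # made-up token ids 10..16
demo_labels = mask_non_response_labels(demo_tokens, demo_ids, "<|turn>user\n", "<|turn>model\n")
print(demo_labels)  # [-100, -100, -100, -100, 14, 15, 16]
```

# Only the reply tokens and the closing `<turn|>` keep their labels, so the loss is computed on the assistant output alone.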
|
||||
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# # Let's train the model!
|
||||
#
|
||||
# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4-thinking",
|
||||
)
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{
|
||||
"type" : "text",
|
||||
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
|
||||
}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
use_cache = True,
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
)
|
||||
tokenizer.batch_decode(outputs)
|
||||
|
||||
|
||||
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
use_cache = True,
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[23]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
|
||||
|
||||
# In[24]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 128, # Increase for longer outputs!
|
||||
use_cache = True,
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
|
||||
|
||||
# In[25]:
|
||||
|
||||
|
||||
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
|
||||
|
||||
|
||||
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[26]:
|
||||
|
||||
|
||||
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
|
||||
|
||||
# In[27]:
|
||||
|
||||
|
||||
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
|
||||
|
||||
|
||||
# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[28]:
|
||||
|
||||
|
||||
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
|
||||
|
||||
|
||||
# Now, use the exported GGUF file from the `gemma_4_finetune` folder in llama.cpp. Since `Q4_K_M` is not available yet, this will be the `Q8_0`, `BF16` or `F16` export you selected.
|
||||
#
|
||||
# And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,448 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to [save it](#Save)
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel # FastLanguageModel for LLMs
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, processor = FastVisionModel.from_pretrained(
|
||||
"unsloth/gemma-4-31B-it",
|
||||
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
|
||||
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
|
||||
)
|
||||
|
||||
|
||||
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
|
||||
#
|
||||
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = True, # False if not finetuning vision layers
|
||||
finetune_language_layers = True, # False if not finetuning language layers
|
||||
finetune_attention_modules = True, # False if not finetuning attention layers
|
||||
finetune_mlp_modules = True, # False if not finetuning MLP layers
|
||||
|
||||
r = 32, # The larger, the higher the accuracy, but might overfit
|
||||
lora_alpha = 32, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
use_rslora = False, # We support rank stabilized LoRA
|
||||
loftq_config = None, # And LoftQ
|
||||
target_modules = "all-linear", # Optional now! Can specify a list if needed
|
||||
)
|
||||
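# As a rough illustration of why LoRA trains so few parameters, the sketch below counts the weights a rank-`r` adapter adds to one linear layer and compares them with the layer's own weights. The 4096 x 4096 layer shape and the helper name `lora_param_counts` are assumptions made for illustration; they are not taken from Gemma 4 or from Unsloth.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (frozen_weight_params, lora_adapter_params) for one linear layer.

    LoRA keeps the d_out x d_in weight frozen and trains two small matrices:
    A with shape (r, d_in) and B with shape (d_out, r).
    """
    frozen = d_in * d_out
    adapter = r * d_in + d_out * r
    return frozen, adapter


# Hypothetical layer size, chosen only to make the arithmetic concrete.
frozen, adapter = lora_param_counts(d_in=4096, d_out=4096, r=32)
print(f"frozen weights:  {frozen:,}")
print(f"LoRA weights:    {adapter:,}")
print(f"trainable share: {adapter / frozen:.2%}")
```

# Doubling `r` doubles the adapter size, which is why a larger rank costs more memory and can overfit on small datasets.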
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
|
||||
#
|
||||
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
|
||||
|
||||
|
||||
# Let's take an overview of the dataset. We'll examine the image at index 2 (the third sample) and its corresponding LaTeX text.
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
dataset
|
||||
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
dataset[2]["image"]
|
||||
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
dataset[2]["text"]
|
||||
|
||||
|
||||
# We can also render LaTeX directly in the browser!
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
from IPython.display import display, Math, Latex
|
||||
|
||||
latex = dataset[3]["text"]
|
||||
display(Math(latex))
|
||||
|
||||
|
||||
# To format the dataset, all vision fine-tuning tasks should follow this format:
|
||||
#
|
||||
# ```python
|
||||
# [
|
||||
# {
|
||||
# "role": "user",
|
||||
# "content": [
|
||||
# {"type": "text", "text": instruction},
|
||||
# {"type": "image", "image": sample["image"]},
|
||||
# ],
|
||||
# },
|
||||
# {
# "role": "assistant",
# "content": [
# {"type": "text", "text": sample["text"]},
# ],
# },
|
||||
# ]
|
||||
# ```
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
pass
|
||||
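# A malformed conversation is easy to miss until training fails, so a small structural check can help. The sketch below validates the user/assistant layout that `convert_to_conversation` produces. The helper name `check_vision_conversation` and the stand-in sample are hypothetical and not part of Unsloth.

```python
def check_vision_conversation(messages):
    """Raise ValueError unless `messages` is one user turn (text + image)
    followed by one assistant turn (text only)."""
    if len(messages) != 2:
        raise ValueError(f"expected 2 turns, got {len(messages)}")
    user, assistant = messages
    if user["role"] != "user" or assistant["role"] != "assistant":
        raise ValueError("turns must be ordered user, then assistant")
    user_types = {part["type"] for part in user["content"]}
    if user_types != {"text", "image"}:
        raise ValueError(f"user turn needs text and image parts, got {sorted(user_types)}")
    if any(part["type"] != "text" for part in assistant["content"]):
        raise ValueError("assistant turn must contain only text parts")


# Stand-in sample: a real run would carry a PIL image from the dataset instead of None.
example = [
    {"role": "user", "content": [
        {"type": "text", "text": "Write the LaTeX representation for this image."},
        {"type": "image", "image": None},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": r"\frac{a}{b}"}]},
]
check_vision_conversation(example)  # passes silently when the layout is right
```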
|
||||
|
||||
# Let's convert the dataset into the "correct" format for finetuning:
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
|
||||
|
||||
|
||||
# The first example is now structured like below:
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
converted_dataset[0]
|
||||
|
||||
|
||||
# Let's take the Gemma 4 instruction chat template and use it in our base model.
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
from unsloth import get_chat_template
|
||||
|
||||
processor = get_chat_template(
|
||||
processor,
|
||||
"gemma-4-thinking"
|
||||
)
|
||||
|
||||
|
||||
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
image = dataset[2]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# You can see it's absolutely terrible! It doesn't follow instructions at all
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step cap with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
|
||||
#
|
||||
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
from unsloth.trainer import UnslothVisionDataCollator
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
train_dataset = converted_dataset,
|
||||
processing_class = processor.tokenizer,
|
||||
data_collator = UnslothVisionDataCollator(model, processor),
|
||||
args = SFTConfig(
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4,
|
||||
max_grad_norm = 0.3,
|
||||
warmup_ratio = 0.03,
|
||||
max_steps = 60,
|
||||
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
|
||||
learning_rate = 2e-4,
|
||||
logging_steps = 1,
|
||||
save_strategy = "steps",
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "cosine",
|
||||
seed = 3407,
|
||||
output_dir = "outputs",
|
||||
report_to = "none", # Set to "wandb" for Weights and Biases, or another tracker
|
||||
|
||||
# You MUST put the below items for vision finetuning:
|
||||
remove_unused_columns = False,
|
||||
dataset_text_field = "",
|
||||
dataset_kwargs = {"skip_prepare_dataset": True},
|
||||
max_length = 2048,
|
||||
)
|
||||
)
|
||||
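# To get a feel for how much data a short run actually sees, the sketch below multiplies out the batch settings used above. It assumes a single GPU, and the helper name `samples_seen` is made up for illustration.

```python
def samples_seen(per_device_batch: int, grad_accum: int, max_steps: int,
                 num_devices: int = 1) -> int:
    """Number of training examples consumed after `max_steps` optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch * max_steps


# Values copied from the SFTConfig above; num_devices=1 is an assumption.
print(samples_seen(per_device_batch=1, grad_accum=4, max_steps=60))
```

# With these settings each optimizer step averages gradients over 4 examples, so the 60-step run touches only a small slice of the dataset.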
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
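# The memory report above is plain arithmetic on byte counts. The sketch below repeats that arithmetic on made-up numbers so it can be checked without a GPU. The helper name `memory_summary` and the 20/30/40 GiB figures are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB, matching the / 1024 / 1024 / 1024 used above


def memory_summary(start_reserved: int, peak_reserved: int, total: int) -> dict:
    """Mirror the report above: peak memory, the share added by training, and percentages."""
    start_gb = round(start_reserved / GIB, 3)
    peak_gb = round(peak_reserved / GIB, 3)
    max_gb = round(total / GIB, 3)
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


# Made-up byte counts: 20 GiB reserved after loading, 30 GiB at peak, 40 GiB card.
print(memory_summary(20 * GIB, 30 * GIB, 40 * GIB))
```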
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model! You can modify the instruction and input—just leave the output blank.
|
||||
#
|
||||
# We'll use the recommended sampling settings for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
image = dataset[10]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
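# To make the three sampling settings concrete, here is a simplified pure-Python sketch of temperature scaling followed by top-k and then top-p filtering. It is an illustration under stated assumptions, not the `transformers` implementation; the function names and the toy logits are made up.

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k, then top-p filtering."""
    # Temperature-scaled softmax weights (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k highest-weight tokens, most likely first.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_logits(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_logits(toy_logits, top_k=2))
print(sample_token(toy_logits, random.Random(0), top_k=2, top_p=0.5))
```

# Lower `top_p` or `top_k` narrows the candidate set; a `temperature` below 1.0 sharpens the distribution before filtering.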
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
processor.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastVisionModel

    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
|
||||
|
||||
sample = dataset[1]
|
||||
image = sample["image"].convert("RGB")
|
||||
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                # Prompt with the instruction, not the ground-truth LaTeX in sample["text"]
                "text": instruction,
            },
            {
                "type": "image",
            },
        ],
    },
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
|
||||
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
# Select ONLY 1 to save! (Both not needed!)
|
||||
|
||||
# Save locally to 16bit
|
||||
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
|
||||
|
||||
# To export and save to your Hugging Face account
|
||||
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,478 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
#
|
||||
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastModel
|
||||
import torch
|
||||
from huggingface_hub import snapshot_download
|
||||
|
||||
fourbit_models = [
    # Gemma 4 models
    "unsloth/gemma-4-E2B-it",
    "unsloth/gemma-4-E2B",
    "unsloth/gemma-4-E4B-it",
    "unsloth/gemma-4-E4B",
    "unsloth/gemma-4-31B-it",
    "unsloth/gemma-4-31B",
    "unsloth/gemma-4-26B-A4B-it",
    "unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, processor = FastModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E2B-it",
|
||||
dtype = None, # None for auto detection
|
||||
max_seq_length = 8192, # Choose any for long context!
|
||||
load_in_4bit = False, # 4 bit quantization to reduce memory
|
||||
full_finetuning = False, # [NEW!] We have full finetuning now!
|
||||
# token = "YOUR_HF_TOKEN", # HF Token for gated models
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can process Text, Vision and Audio!
|
||||
#
|
||||
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
from transformers import TextStreamer
|
||||
# Helper function for inference
|
||||
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **processor.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        do_sample = False,
        streamer = TextStreamer(processor, skip_prompt = True),
    )
|
||||
|
||||
|
||||
# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
from datasets import load_dataset,Audio,concatenate_datasets
|
||||
|
||||
dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
|
||||
|
||||
# Select a single audio sample to reserve for testing.
|
||||
# This index is chosen from the full dataset before we create the smaller training split.
|
||||
test_audio = dataset[7546]
|
||||
|
||||
dataset = dataset.select(range(3000))
|
||||
|
||||
dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
|
||||
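# Casting the audio column to 16 kHz changes how many samples each clip holds. The sketch below shows that arithmetic. The 24 kHz source rate is an assumption chosen for the example (the notebook does not state the dataset's native rate), and `resampled_length` is a hypothetical helper.

```python
def resampled_length(num_samples: int, orig_rate: int, target_rate: int = 16_000) -> int:
    """Approximate sample count after resampling; the clip duration stays the same."""
    return round(num_samples * target_rate / orig_rate)


# A 5-second clip at an assumed 24 kHz source rate.
clip_samples = 5 * 24_000
print(resampled_length(clip_samples, orig_rate=24_000))
```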
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
# Alias the IPython player so it does not shadow `datasets.Audio` imported earlier
from IPython.display import Audio as IPythonAudio, display
print(test_audio['text'])
IPythonAudio(test_audio['audio']['array'], rate = test_audio['audio']['sampling_rate'])
|
||||
|
||||
|
||||
# And the translation of the audio from German to English is:
|
||||
#
|
||||
# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "You are an assistant that transcribes speech accurately.",
|
||||
}
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "audio", "audio": test_audio['audio']['array']},
|
||||
{"type": "text", "text": "Please transcribe this audio."}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
|
||||
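# The notebook does not show how the WER figure was computed. As a reference point, the sketch below implements the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The example strings are made up, and no text normalization such as lowercasing or punctuation stripping is applied, so it will not necessarily reproduce the number above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up strings: one reference word is missing from the hypothesis.
print(word_error_rate("ich rechne das direkt an", "ich rechne direkt an"))
```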
|
||||
# # Let's finetune Gemma 4!
|
||||
#
|
||||
# You can finetune the vision, text, and audio parts.
|
||||
|
||||
# We now add LoRA adapters so we only need to update a small number of parameters!
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
model = FastModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = False, # False if not finetuning vision layers
|
||||
finetune_language_layers = True, # False if not finetuning language layers
|
||||
finetune_attention_modules = True, # False if not finetuning attention layers
|
||||
finetune_mlp_modules = True, # False if not finetuning MLP layers
|
||||
|
||||
r = 8, # The larger, the higher the accuracy, but might overfit
|
||||
lora_alpha = 16, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
use_rslora = False, # We support rank stabilized LoRA
|
||||
loftq_config = None, # And LoftQ
|
||||
target_modules = [
|
||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||
"gate_proj", "up_proj", "down_proj",
|
||||
|
||||
# Audio layers
|
||||
"post", "linear_start", "linear_end",
|
||||
"embedding_projection",
|
||||
"ffw_layer_1", "ffw_layer_2",
|
||||
"output_proj",
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
|
||||
#
|
||||
# ```
|
||||
# <bos><|turn>system
|
||||
# You are an assistant that transcribes speech accurately.<turn|>
|
||||
# <|turn>user
|
||||
# <|audio|>Please transcribe this audio.<turn|>
|
||||
# <|turn>model
|
||||
# Ich, ich rechne direkt mich an.<turn|>
# ```
|
||||
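# To show how the message list maps onto the layout shown above, here is a hypothetical renderer that reproduces that exact string. It is only an illustration: the real formatting is done by the processor's chat template, and `render_turns` is not an Unsloth or `transformers` function.

```python
def render_turns(messages):
    """Hypothetical renderer for the turn layout shown above (not the real chat template)."""
    rendered = []
    for message in messages:
        # The layout above labels the assistant turn "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        body = "".join(
            "<|audio|>" if part["type"] == "audio" else part["text"]
            for part in message["content"]
        )
        rendered.append(f"<|turn>{role}\n{body}<turn|>")
    return "<bos>" + "\n".join(rendered)


example = [
    {"role": "system", "content": [
        {"type": "text", "text": "You are an assistant that transcribes speech accurately."}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": [0.0, 0.0]},  # placeholder waveform
        {"type": "text", "text": "Please transcribe this audio."}]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Ich, ich rechne direkt mich an."}]},
]
print(render_turns(example))
```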
|
||||
# In[9]:
|
||||
|
||||
|
||||
def format_intersection_data(samples: dict) -> dict[str, list]:
    """Format intersection dataset to match expected message format"""
    formatted_samples = {"messages": []}
    for idx in range(len(samples["audio"])):
        audio = samples["audio"][idx]["array"]
        label = str(samples["text"][idx])

        message = [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are an assistant that transcribes speech accurately.",
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": "Please transcribe this audio."}
                ]
            },
            {
                "role": "assistant",
                "content":[{"type": "text", "text": label}]
            }
        ]
        formatted_samples["messages"].append(message)
    return formatted_samples
|
||||
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
|
||||
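# With `batched = True`, the mapped function receives a dict of lists (one list per column) and returns a dict of lists. The sketch below imitates that contract in plain Python so the shape of the data is easy to see. `batched_map` is a simplified stand-in written for this example, not the `datasets` implementation, and the transcripts are made up.

```python
def batched_map(rows, fn, batch_size):
    """Simplified stand-in for `Dataset.map(fn, batched=True, batch_size=...)`."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is columnar: one list per column instead of one dict per row.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            extra = {key: values[i] for key, values in new_columns.items()}
            mapped.append({**row, **extra})
    return mapped


def add_messages(batch):
    """Build one single-turn conversation per transcript in the batch."""
    return {
        "messages": [
            [{"role": "assistant", "content": [{"type": "text", "text": text}]}]
            for text in batch["text"]
        ]
    }


rows = [{"text": "eins"}, {"text": "zwei"}, {"text": "drei"}]  # made-up transcripts
mapped = batched_map(rows, add_messages, batch_size=2)
print(len(mapped), mapped[2]["messages"][0]["content"][0]["text"])
```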
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step cap with `max_steps=None`.
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
# Use UnslothVisionDataCollator which handles audio token alignment correctly
|
||||
from unsloth.trainer import UnslothVisionDataCollator
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
train_dataset = dataset,
|
||||
processing_class = processor.tokenizer,
|
||||
data_collator = UnslothVisionDataCollator(model, processor),
|
||||
args = SFTConfig(
|
||||
per_device_train_batch_size = 8,
|
||||
gradient_accumulation_steps = 1,
|
||||
warmup_ratio = 0.03,
|
||||
# num_train_epochs = 1, # Use for full training runs
|
||||
max_steps = 60,
|
||||
learning_rate = 5e-5,
|
||||
logging_steps = 1,
|
||||
save_strategy = "steps",
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "cosine",
|
||||
seed = 3407,
|
||||
output_dir = "outputs",
|
||||
report_to = "none",
|
||||
remove_unused_columns = False,
|
||||
|
||||
# The below are a must for audio finetuning:
|
||||
dataset_text_field = "",
|
||||
dataset_kwargs = {"skip_prepare_dataset": True},
|
||||
max_length = 8192,
|
||||
)
|
||||
)
|
||||
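# To visualize what `warmup_ratio = 0.03` and `lr_scheduler_type = "cosine"` mean over a 60-step run, the sketch below computes a linear-warmup, cosine-decay learning rate per step. It shows the general shape only, assuming the warmup length is the ratio times `max_steps` rounded up; the trainer's exact scheduler may differ in details, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step: int, max_steps: int = 60, base_lr: float = 5e-5,
               warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 2, 30, 60):
    print(step, f"{lr_at_step(step):.2e}")
```

# With these values the warmup lasts only 2 steps, so almost the whole run is spent on the cosine decay.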
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# # Let's train the model!
|
||||
#
|
||||
# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "You are an assistant that transcribes speech accurately.",
|
||||
}
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "audio", "audio": test_audio['audio']['array']},
|
||||
{"type": "text", "text": "Please transcribe this audio."}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
processor.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastModel
    model, processor = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
|
||||
}]
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 128, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(processor, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", processor)
|
||||
|
||||
|
||||
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", processor,
        token = "YOUR_HF_TOKEN"
    )
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        processor,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
|
||||
|
||||
|
||||
# Likewise, to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        processor,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
|
||||
|
||||
|
||||
# Now, use the exported `.gguf` file (`Q8_0`, `BF16` or `F16`, depending on the `quantization_method` you chose) in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,556 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
#
|
||||
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastModel
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E2B-it",
|
||||
dtype = None, # None for auto detection
|
||||
max_seq_length = 1024, # Choose any for long context!
|
||||
load_in_4bit = False, # 4 bit quantization to reduce memory
|
||||
full_finetuning = False, # [NEW!] We have full finetuning now!
|
||||
# token = "YOUR_HF_TOKEN", # HF Token for gated models
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can process Text, Vision and Audio!
|
||||
#
|
||||
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
from transformers import TextStreamer
|
||||
# Helper function for inference
|
||||
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True)
    )
|
||||
|
||||
|
||||
# # Gemma 4 can see images!
|
||||
#
|
||||
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "image", "image" : sloth_link },
|
||||
{ "type": "text", "text" : "Which films does this animal feature in?" }
|
||||
]
|
||||
}]
|
||||
# You might have to wait 1 minute for Unsloth's auto compiler
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# Let's make a poem about sloths!
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{ "type" : "text",
|
||||
"text" : "Write a poem about sloths." }]
|
||||
}]
|
||||
do_gemma_4_inference(messages)
|
||||
|
||||
|
||||
# # Gemma 4 can also hear!
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
from IPython.display import Audio, display
|
||||
Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
|
||||
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
|
||||
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
audio_file = "audio.mp3"
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "audio", "audio" : audio_file },
|
||||
{ "type": "text", "text" : "What is this audio about?" }
|
||||
]
|
||||
}]
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# # Let's combine all 3 modalities together!
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "audio", "audio" : audio_file },
|
||||
{ "type": "image", "image" : sloth_link },
|
||||
{ "type": "text", "text" : "What is this audio and image about? "\
|
||||
"How are they related?" }
|
||||
]
|
||||
}]
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# # Let's finetune Gemma 4!
|
||||
#
|
||||
# For now you can select whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working on making it selectable as well!
|
||||
|
||||
# We now add LoRA adapters so we only need to update a small number of parameters!
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
model = FastModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = False, # Turn off for just text!
|
||||
finetune_language_layers = True, # Should leave on!
|
||||
finetune_attention_modules = True, # Attention good for GRPO
|
||||
finetune_mlp_modules = True, # Should leave on always!
|
||||
|
||||
r = 8, # Larger = higher accuracy, but might overfit
|
||||
lora_alpha = 8, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
)
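# To make the parameter savings concrete, here is a minimal sketch of how many trainable weights a rank-`r` LoRA adapter adds to a single linear layer. The 2048 x 2048 layer size is a made-up example rather than a Gemma 4 dimension, and the `lora_alpha / r` scaling is the standard LoRA convention, assumed here.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    # LoRA learns A with shape (r, d_in) and B with shape (d_out, r)
    return r * d_in + d_out * r

d_in, d_out = 2048, 2048   # hypothetical layer size, for illustration only
r, lora_alpha = 8, 8       # same values as in the cell above

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r   # factor applied to the low-rank update B @ A

print(f"full layer  : {full_params:,} params")
print(f"LoRA (r={r}) : {lora_params:,} params "
      f"({100 * lora_params / full_params:.2f}% of the layer)")
print(f"scaling     : {scaling}")
```

# Doubling `r` doubles the adapter size, which is why a larger rank can fit the data better but is also more prone to overfitting.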
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
|
||||
#
|
||||
# ```
|
||||
# <bos><|turn>user
|
||||
# Hello<turn|>
|
||||
# <|turn>model
|
||||
# Hey there!<turn|>
|
||||
# ```
|
||||
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4",
|
||||
)
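# The turn layout shown above can be sketched in a few lines of plain Python. This is only an illustration of the format, not the real chat template: the function name `render_gemma4_turns` is hypothetical, mapping the `assistant` role to `model` is an assumption based on the example, and so is the newline after each `<turn|>`.

```python
def render_gemma4_turns(messages, add_generation_prompt=False):
    """Render {"role", "content"} dicts in the turn layout shown above."""
    text = "<bos>"
    for message in messages:
        # Assumption: "assistant" is rendered as "model", as in the example above
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues from here
        text += "<|turn>model\n"
    return text

example = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma4_turns(example))
```

# For training data you would render complete conversations, while for inference `add_generation_prompt = True` leaves an open model turn for the model to fill in.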
|
||||
|
||||
|
||||
# We get the first 3000 rows of the dataset
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
|
||||
|
||||
|
||||
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import standardize_data_formats
|
||||
dataset = standardize_data_formats(dataset)
|
||||
|
||||
|
||||
# Let's see what row 100 looks like!
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
dataset[100]
|
||||
|
||||
|
||||
# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }
|
||||
|
||||
dataset = dataset.map(formatting_prompts_func, batched = True)
|
||||
|
||||
|
||||
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
dataset[100]["text"]
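# A small, self-contained illustration of why the `<bos>` prefix is stripped before training. `fake_add_bos` is a hypothetical stand-in for a tokenizer that prepends `<bos>` on its own, and the sample string is made up.

```python
rendered = "<bos><|turn>user\nHello<turn|>\n"   # template output, includes <bos>
text = rendered.removeprefix("<bos>")            # what is stored in the "text" column

def fake_add_bos(s):
    # Hypothetical stand-in for a tokenizer that prepends <bos> by itself
    return "<bos>" + s

print(fake_add_bos(rendered).count("<bos>"))  # 2: the token would be duplicated
print(fake_add_bos(text).count("<bos>"))      # 1: exactly one, as the model expects
```

# `str.removeprefix` needs Python 3.9 or newer and leaves the string unchanged when the prefix is absent.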
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and `max_steps = None`.
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
tokenizer = tokenizer,
|
||||
train_dataset = dataset,
|
||||
eval_dataset = None, # Can set up evaluation!
|
||||
args = SFTConfig(
|
||||
dataset_text_field = "text",
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
|
||||
warmup_steps = 5,
|
||||
# num_train_epochs = 1, # Set this for 1 full training run.
|
||||
max_steps = 60,
|
||||
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
|
||||
logging_steps = 1,
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "linear",
|
||||
seed = 3407,
|
||||
report_to = "none", # Use TrackIO/WandB etc
|
||||
),
|
||||
)
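# To put `max_steps = 60` in perspective, the arithmetic below estimates the effective batch size and the number of optimizer steps in one full epoch over the 3000 rows loaded earlier. It assumes a single GPU and ignores details such as dropped partial batches, so treat the numbers as approximate. The helper name `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Return (effective batch size, optimizer steps for one pass over the data)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_examples / effective_batch)

# Values from the cells above: 3000 rows, batch size 1, 4 accumulation steps
effective_batch, full_epoch_steps = steps_per_epoch(3000, 1, 4)
examples_seen_in_60_steps = 60 * effective_batch

print(f"effective batch size     : {effective_batch}")
print(f"steps for one full epoch : {full_epoch_steps}")
print(f"examples seen in 60 steps: {examples_seen_in_60_steps}")
```

# Under these assumptions the 60-step demo run covers only a small fraction of the dataset, which is why a full run uses `num_train_epochs = 1` instead.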
|
||||
|
||||
|
||||
# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import train_on_responses_only
|
||||
trainer = train_on_responses_only(
|
||||
trainer,
|
||||
instruction_part = "<|turn>user\n",
|
||||
response_part = "<|turn>model\n",
|
||||
)
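# Conceptually, training on responses only means setting the label of every token outside a model turn to `-100`, the index the loss ignores. The sketch below shows that idea on toy token ids. It is not Unsloth's implementation: the function names and the token ids are made up, and whether the marker tokens themselves are masked is an assumption.

```python
IGNORE_INDEX = -100  # labels with this value do not contribute to the loss

def find_sublist(seq, sub, start):
    """Index of the first occurrence of `sub` in `seq` at or after `start`, else -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sublist(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)            # first token of the answer
        end = find_sublist(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)                     # last answer runs to the end
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy ids: [10, 11] stands for "<|turn>user\n", [10, 12] for "<|turn>model\n"
ids = [2, 10, 11, 5, 6, 20, 10, 12, 7, 8, 20, 10, 11, 9, 20, 10, 12, 3, 20]
print(mask_non_responses(ids, [10, 11], [10, 12]))
```

# Only the two answer spans keep their ids; everything else, including both markers and the user turns, becomes `-100`.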
|
||||
|
||||
|
||||
# Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
|
||||
|
||||
|
||||
# Now let's print the masked out example - you should see only the answer is present:
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
|
||||
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# # Let's train the model!
|
||||
#
|
||||
# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
|
||||
|
||||
# In[23]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[24]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
|
||||
|
||||
# In[25]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4",
|
||||
)
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{
|
||||
"type" : "text",
|
||||
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
|
||||
}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
)
|
||||
tokenizer.batch_decode(outputs)
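# For intuition about `temperature`, `top_k` and `top_p`, here is a simplified pure-Python sketch of how these settings narrow the next-token distribution. It is not the code path `model.generate` uses: the function name `filter_probs` is hypothetical, and the order of operations (temperature, then top-k, then top-p on the renormalized top-k set) is an assumption about the usual convention.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]      # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most likely tokens
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:top_k]
    mass_k = sum(probs[i] for i in top)

    # top-p: keep the smallest prefix whose renormalized mass reaches top_p
    kept, cumulative = [], 0.0
    for i in top:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy distribution over 4 tokens with probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
```

# With a real vocabulary, `top_k = 64` and `top_p = 0.95` discard the long tail of unlikely tokens while `temperature = 1.0` leaves the remaining probabilities unscaled.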
|
||||
|
||||
|
||||
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
|
||||
|
||||
# In[26]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[27]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
|
||||
|
||||
# In[28]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 128, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
|
||||
|
||||
# In[29]:
|
||||
|
||||
|
||||
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
|
||||
|
||||
|
||||
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[30]:
|
||||
|
||||
|
||||
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
|
||||
|
||||
# In[31]:
|
||||
|
||||
|
||||
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
|
||||
|
||||
|
||||
# Likewise, to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[32]:
|
||||
|
||||
|
||||
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
|
||||
|
||||
|
||||
# Now, use the exported `.gguf` file (`Q8_0`, `BF16` or `F16`, depending on the `quantization_method` you chose) in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,448 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[ ]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[ ]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel # FastLanguageModel for LLMs
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, processor = FastVisionModel.from_pretrained(
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
|
||||
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
|
||||
)
|
||||
|
||||
|
||||
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
|
||||
#
|
||||
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = True, # False if not finetuning vision layers
|
||||
finetune_language_layers = True, # False if not finetuning language layers
|
||||
finetune_attention_modules = True, # False if not finetuning attention layers
|
||||
finetune_mlp_modules = True, # False if not finetuning MLP layers
|
||||
|
||||
r = 32, # The larger, the higher the accuracy, but might overfit
|
||||
lora_alpha = 32, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
use_rslora = False, # We support rank stabilized LoRA
|
||||
loftq_config = None, # And LoftQ
|
||||
target_modules = "all-linear", # Optional now! Can specify a list if needed
|
||||
)
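# As a rough illustration of why LoRA trains only a small fraction of the weights, the sketch below compares the parameter count of one rank-`r` adapter with the dense layer it attaches to. The layer sizes are made-up values for illustration, not Gemma 4's actual dimensions, and `lora_param_fraction` is a hypothetical helper.

```python
def lora_param_fraction(d_in: int, d_out: int, r: int) -> float:
    """Fraction of a dense d_in x d_out layer's parameter count that a rank-r LoRA adds.

    LoRA learns two small matrices, A (r x d_in) and B (d_out x r), while the
    original weight matrix stays frozen.
    """
    dense_params = d_in * d_out
    lora_params = r * (d_in + d_out)
    return lora_params / dense_params


# Hypothetical layer sizes, chosen only for illustration.
fraction = lora_param_fraction(d_in=4096, d_out=4096, r=32)
print(f"LoRA adds {fraction:.2%} of the dense layer's parameter count")
```

# The fraction grows linearly with `r`, which is why a larger rank costs more memory and can overfit more easily.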
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
|
||||
#
|
||||
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
|
||||
|
||||
|
||||
# Let's take a look at the dataset. We'll examine the third image (index 2) and its corresponding caption.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
dataset
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
dataset[2]["image"]
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
dataset[2]["text"]
|
||||
|
||||
|
||||
# We can also render LaTeX directly in the browser!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from IPython.display import display, Math, Latex
|
||||
|
||||
latex = dataset[3]["text"]
|
||||
display(Math(latex))
|
||||
|
||||
|
||||
# To format the dataset, all vision fine-tuning tasks should follow this format:
|
||||
#
|
||||
# ```python
|
||||
# [
|
||||
# {
|
||||
# "role": "user",
|
||||
# "content": [
|
||||
# {"type": "text", "text": instruction},
|
||||
# {"type": "image", "image": sample["image"]},
|
||||
# ],
|
||||
# },
|
||||
# {
|
||||
# "role": "assistant",
|
||||
# "content": [
|
||||
# {"type": "text", "text": sample["text"]},
|
||||
# ],
|
||||
# },
|
||||
# ]
|
||||
# ```
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
def convert_to_conversation(sample):
|
||||
conversation = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": instruction},
|
||||
{"type": "image", "image": sample["image"]},
|
||||
],
|
||||
},
|
||||
{"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
|
||||
]
|
||||
return {"messages": conversation}
|
||||
pass
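# Before training, it can help to sanity-check that every converted sample really has this shape. The sketch below is one possible check, under our own assumptions: `validate_conversation` is a hypothetical helper, not part of Unsloth, and it only knows about the `user` and `assistant` roles and the `text` and `image` part types used in this notebook.

```python
def validate_conversation(messages):
    """Return a list of problems found in a chat-style message list (empty if none)."""
    problems = []
    allowed_roles = {"user", "assistant"}
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in allowed_roles:
            problems.append(f"message {index}: unexpected role {role!r}")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append(f"message {index}: content must be a list of parts")
            continue
        for part in content:
            if part.get("type") == "text" and "text" not in part:
                problems.append(f"message {index}: text part has no 'text' key")
            elif part.get("type") not in {"text", "image"}:
                problems.append(f"message {index}: unknown part type {part.get('type')!r}")
    if messages and messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant's target answer")
    return problems


example_conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "Write the LaTeX representation for this image."},
        {"type": "image", "image": None},  # placeholder instead of a real PIL image
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": r"\frac{a}{b}"}]},
]
print(validate_conversation(example_conversation))  # an empty list means no problems
```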
|
||||
|
||||
|
||||
# Let's convert the dataset into the "correct" format for finetuning:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
|
||||
|
||||
|
||||
# The first example is now structured as follows:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
converted_dataset[0]
|
||||
|
||||
|
||||
# Let's take the Gemma 4 instruction chat template and use it with our base model:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import get_chat_template
|
||||
|
||||
processor = get_chat_template(
|
||||
processor,
|
||||
"gemma-4"
|
||||
)
|
||||
|
||||
|
||||
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
image = dataset[2]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# You can see it's absolutely terrible! It doesn't follow instructions at all
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the step cap with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
|
||||
#
|
||||
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth.trainer import UnslothVisionDataCollator
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
train_dataset = converted_dataset,
|
||||
processing_class = processor.tokenizer,
|
||||
data_collator = UnslothVisionDataCollator(model, processor),
|
||||
args = SFTConfig(
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4,
|
||||
max_grad_norm = 0.3,
|
||||
warmup_ratio = 0.03,
|
||||
max_steps = 60,
|
||||
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
|
||||
learning_rate = 2e-4,
|
||||
logging_steps = 1,
|
||||
save_strategy = "steps",
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "cosine",
|
||||
seed = 3407,
|
||||
output_dir = "outputs",
|
||||
report_to = "none", # For Weights and Biases or others
|
||||
|
||||
# You MUST put the below items for vision finetuning:
|
||||
remove_unused_columns = False,
|
||||
dataset_text_field = "",
|
||||
dataset_kwargs = {"skip_prepare_dataset": True},
|
||||
max_length = 2048,
|
||||
)
|
||||
)
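# To make the numbers in this config concrete, the sketch below works out the effective batch size, how many samples the 60 steps cover, and the approximate shape of the learning-rate curve. It assumes a single GPU and the usual linear-warmup-then-cosine formula; the exact values the trainer uses may differ slightly, and `approx_lr` is a hypothetical helper.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
warmup_ratio = 0.03
learning_rate = 2e-4
num_devices = 1  # assumption: a single GPU, as on a Colab instance

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps
warmup_steps = math.ceil(warmup_ratio * max_steps)


def approx_lr(step: int) -> float:
    """Approximate learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size = {effective_batch_size}, samples seen = {samples_seen}")
print(f"warmup steps = {warmup_steps}")
for step in (0, warmup_steps, max_steps // 2, max_steps):
    print(f"step {step:>2}: lr ~ {approx_lr(step):.2e}")
```

# With these settings the short run sees only 240 samples, which is why a full run uses `num_train_epochs` instead.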
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model! You can modify the instruction and input—just leave the output blank.
|
||||
#
|
||||
# We'll use the sampling settings recommended for Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
image = dataset[10]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
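# To show how `temperature`, `top_k` and `top_p` interact, here is a simplified pure-Python sketch of sampling one token from a toy distribution. It is not the implementation `model.generate` uses; `sample_token` and the logits are made up for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Sample an index from `logits` using temperature, then top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [(weight / total, index) for index, weight in enumerate(weights)]

    # Top-k: keep only the k most likely tokens.
    probs.sort(reverse=True)
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, index in probs:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Draw one token in proportion to the surviving probabilities.
    draw = rng.random() * sum(prob for prob, _ in kept)
    for prob, index in kept:
        draw -= prob
        if draw <= 0:
            return index
    return kept[-1][1]


toy_logits = [4.0, 3.0, 1.0, -2.0, -5.0]  # made-up scores for a 5-token vocabulary
picks = [sample_token(toy_logits, rng=random.Random(seed)) for seed in range(200)]
print(sorted(set(picks)))
```

# For these toy logits the two most likely tokens already cover more than 95% of the probability mass, so the top-p step discards the other three.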
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
processor.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# To load the LoRA adapters we just saved for inference, change `False` to `True`:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
if False:
|
||||
from unsloth import FastVisionModel
|
||||
|
||||
model, processor = FastVisionModel.from_pretrained(
|
||||
model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
|
||||
load_in_4bit = True, # Set to False for 16bit LoRA
|
||||
)
|
||||
|
||||
sample = dataset[1]
|
||||
image = sample["image"].convert("RGB")
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": instruction, # Ask the task question; sample["text"] is the ground-truth LaTeX
|
||||
},
|
||||
{
|
||||
"type": "image",
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
|
||||
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Select ONLY 1 to save! (Both not needed!)
|
||||
|
||||
# Save locally to 16bit
|
||||
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
|
||||
|
||||
# To export and save to your Hugging Face account
|
||||
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or would like to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,911 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[ ]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[ ]:
|
||||
#
|
||||
#
|
||||
# #@title Colab Extra Install { display-mode: "form" }
|
||||
# get_ipython().run_line_magic('%capture', '')
|
||||
# import os
|
||||
# get_ipython().system('pip install --upgrade -qqq uv')
|
||||
# if "COLAB_" not in "".join(os.environ.keys()):
|
||||
# # If you're not in Colab, just use pip install!
|
||||
# get_ipython().system('pip install unsloth vllm')
|
||||
# else:
|
||||
# try: import numpy, PIL; _numpy = f'numpy=={numpy.__version__}'; _pil = f'pillow=={PIL.__version__}'
|
||||
# except: _numpy = "numpy"; _pil = "pillow"
|
||||
# try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
|
||||
# except: is_t4 = False
|
||||
# _vllm, _triton = ('vllm==0.9.2', 'triton==3.2.0') if is_t4 else ('vllm==0.15.1', 'triton')
|
||||
# get_ipython().system('uv pip install -qqq --upgrade {_vllm} {_numpy} {_pil} torchvision bitsandbytes xformers unsloth')
|
||||
# get_ipython().system('uv pip install -qqq {_triton}')
|
||||
# get_ipython().system('uv pip install transformers==4.56.2')
|
||||
# get_ipython().system('uv pip install --no-deps trl==0.22.2')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
|
||||
# # Goal: Make faster kernels with Reinforcement Learning
|
||||
#
|
||||
# Our goal is to make a faster matrix multiplication kernel by doing RL on Gemma 4 with Unsloth.
|
||||
#
|
||||
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Matrix_multiplication_qtl1.svg/500px-Matrix_multiplication_qtl1.svg.png" height=200 />
|
||||
#
|
||||
# You will learn how to:
|
||||
# 1. Counteract **reward hacking** such as cheating, caching, and laziness.
|
||||
# 2. Measure kernel timing and correctness, and enforce time limits.
|
||||
# 3. Design good **reward functions**.
|
||||
# 4. Use RL seriously to produce optimized kernels.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel
|
||||
import torch
|
||||
max_seq_length = 4096 # Can increase for longer reasoning traces
|
||||
lora_rank = 32 # Larger rank = smarter, but slower
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastVisionModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E2B-it",
|
||||
max_seq_length = max_seq_length,
|
||||
load_in_4bit = False, # False for LoRA 16bit
|
||||
fast_inference = False, # Set to True to enable vLLM fast inference
|
||||
)
|
||||
|
||||
|
||||
# We now add a small number of LoRA weights to Gemma 4 so that we only need to train those instead of the full model.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
|
||||
target_modules = [
|
||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||
"gate_proj", "up_proj", "down_proj",
|
||||
],
|
||||
lora_alpha = lora_rank*2, # *2 speeds up training
|
||||
use_gradient_checkpointing = "unsloth", # Reduces memory usage
|
||||
random_state = 3407,
|
||||
)
|
||||
|
||||
|
||||
# # Optimized matrix multiplication
|
||||
#
|
||||
# NumPy provides optimized matrix multiplication on CPUs through BLAS. On GPUs, PyTorch calls CUDA-accelerated cuBLAS kernels under the hood.
|
||||
#
|
||||
# We can generate random matrices to multiply as follows:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
import numpy as np
|
||||
def generate_random_matrices(seed = 3407, n = 256):
|
||||
random_state = np.random.RandomState(seed)
|
||||
n, k, m = random_state.randint(1, n+1, size = 3)
|
||||
A = random_state.uniform(-10, 10, size = (n, k)) # Use the seeded generator so results are reproducible
|
||||
B = random_state.uniform(-10, 10, size = (k, m))
|
||||
return A, A.tolist(), B, B.tolist()
|
||||
|
||||
|
||||
# We generate two small matrices and print their product:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
A, A_list, B, B_list = generate_random_matrices(seed = 42, n = 5)
|
||||
print(A)
|
||||
print(B)
|
||||
print(np.matmul(A, B))
|
||||
|
||||
|
||||
# We can ask an LLM to generate a simple matrix multiply kernel in pure Python, then measure the difference between the reference result and the kernel's result:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def calculate_difference(pred, real):
|
||||
if pred is None: return 5, 5
|
||||
assert real is not None
|
||||
import numpy as np
|
||||
try:
|
||||
difference = pred - real
|
||||
except Exception: # e.g. shape mismatch or non-numeric output
|
||||
return 5, 5
|
||||
amax_error = float(np.amax(np.abs(difference))) # Largest absolute elementwise error
|
||||
mse_error = float(np.mean(np.square(difference)))
|
||||
return amax_error, mse_error
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Kernel generated by GPT-5
|
||||
def matmul(A, B):
|
||||
z, s = zip, sum
|
||||
Bt = list(z(*B))
|
||||
return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
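# The kernel above is compact, so here is the same zip-and-sum idea spelled out with a small worked example. The names `matmul_readable`, `small_A` and `small_B` are ours, chosen so they do not clash with the variables used elsewhere in this notebook.

```python
def matmul_readable(A, B):
    # Transpose B once so each column becomes a tuple that can be zipped with a row of A.
    columns_of_B = list(zip(*B))
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_B] for row in A]


small_A = [[1, 2], [3, 4]]
small_B = [[5, 6], [7, 8]]
print(matmul_readable(small_A, small_B))  # [[19, 22], [43, 50]]
```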
|
||||
|
||||
|
||||
# We see the error below is very small, so that's good!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
prediction = matmul(A_list, B_list)
|
||||
calculate_difference(prediction, np.matmul(A, B))
|
||||
|
||||
|
||||
# # Countering Reward Hacking
|
||||
#
|
||||
# The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric).
|
||||
#
|
||||
# But RL can **cheat**. When the RL algorithm learns a trick or exploits a loophole to increase the reward without actually doing the task, this is called "Reward Hacking".
|
||||
#
|
||||
# Some good examples are in https://en.wikipedia.org/wiki/Reward_hacking
|
||||
#
|
||||
# For matrix multiplication kernels, we might see the following issues:
|
||||
#
|
||||
# * Laziness: RL learns to use NumPy, Torch, or other libraries, which call optimized kernels.
|
||||
# * Caching: RL learns to cache the output.
|
||||
# * Cheating: RL learns to find the actual output by inspecting Python global variables.
|
||||
# * Timer tampering: RL learns to edit the timing function so that it reports zero elapsed time.
|
||||
#
|
||||
# And possibly more. We shall try to address each!
|
||||
|
||||
# # Countering Reward Hacking 1: Stop laziness
|
||||
# We can stop the RL algorithm from calling optimized code by checking whether the generated code imports non-standard Python libraries. We used GPT-5 to help generate this check, `check_only_stdlib_imports`:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
#@title (Collapsible code)
|
||||
import ast
|
||||
import sys
|
||||
import sysconfig
|
||||
from pathlib import Path
|
||||
|
||||
def _stdlib_names():
|
||||
"""
|
||||
Build a set of canonical stdlib top-level module/package names.
|
||||
Uses sys.stdlib_module_names when available (3.10+), with a
|
||||
filesystem fallback for older versions/edge cases.
|
||||
"""
|
||||
names = {m.lower() for m in getattr(sys, "stdlib_module_names", set())}
|
||||
names |= {m.lower() for m in sys.builtin_module_names}
|
||||
names.add("__future__") # special-case
|
||||
|
||||
# Fallback/augmentation: scan the stdlib directory
|
||||
try:
|
||||
stdlib_dir = Path(sysconfig.get_path("stdlib"))
|
||||
if stdlib_dir.exists():
|
||||
for p in stdlib_dir.iterdir():
|
||||
if p.name == "site-packages":
|
||||
continue
|
||||
if p.suffix == ".py":
|
||||
names.add(p.stem.lower())
|
||||
elif p.is_dir() and (p / "__init__.py").exists():
|
||||
names.add(p.name.lower())
|
||||
except Exception:
|
||||
# conservative fallback; the names set above will still work well
|
||||
pass
|
||||
|
||||
return names
|
||||
|
||||
_STDLIB_SET = _stdlib_names()
|
||||
|
||||
def check_only_stdlib_imports(code: str):
|
||||
"""
|
||||
Return (ok: bool, details: dict)
|
||||
|
||||
ok == True -> all absolute imports are from the stdlib.
|
||||
ok == False -> details['non_stdlib'] lists offending top-level modules.
|
||||
|
||||
details includes:
|
||||
- stdlib: sorted list of stdlib imports found
|
||||
- non_stdlib: sorted list of non-stdlib imports found
|
||||
- relative_imports: count of relative imports (always allowed here)
|
||||
"""
|
||||
try:
|
||||
tree = ast.parse(code)
|
||||
except SyntaxError as e:
|
||||
return False, {
|
||||
"error": f"SyntaxError: {e}",
|
||||
"stdlib": [],
|
||||
"non_stdlib": [],
|
||||
"relative_imports": 0,
|
||||
}
|
||||
|
||||
abs_imports = set()
|
||||
relative_count = 0
|
||||
|
||||
class Visitor(ast.NodeVisitor):
|
||||
def visit_Import(self, node: ast.Import):
|
||||
for alias in node.names:
|
||||
abs_imports.add(alias.name.split(".")[0])
|
||||
def visit_ImportFrom(self, node: ast.ImportFrom):
|
||||
nonlocal relative_count
|
||||
if (node.level or 0) > 0:
|
||||
# relative import
|
||||
relative_count += 1
|
||||
else:
|
||||
if node.module:
|
||||
abs_imports.add(node.module.split(".")[0])
|
||||
|
||||
Visitor().visit(tree)
|
||||
|
||||
stdlib_found = sorted(m for m in abs_imports if m.lower() in _STDLIB_SET)
|
||||
non_stdlib = sorted(m for m in abs_imports if m.lower() not in _STDLIB_SET)
|
||||
|
||||
return len(non_stdlib) == 0, {
|
||||
"stdlib": stdlib_found,
|
||||
"non_stdlib": non_stdlib,
|
||||
"relative_imports": relative_count,
|
||||
}
|
||||
|
||||
|
||||
# For example, let's call `check_only_stdlib_imports` on a random piece of matrix multiplication code generated by GPT-5:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
sample = """
|
||||
def matmul(A, B):
|
||||
import numpy as np
|
||||
from torch import matmul
|
||||
z, s = zip, sum
|
||||
Bt = list(z(*B))
|
||||
return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
|
||||
"""
|
||||
ok, info = check_only_stdlib_imports(sample)
|
||||
print("Only stdlib imports?", ok)
|
||||
print(info)
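# One caveat: the check above only sees `import` statements. Generated code could still load a library dynamically, for example through `__import__` or `importlib.import_module`. The sketch below is a heuristic companion check under our own assumptions; `find_dynamic_imports` is a hypothetical helper and is not part of the original notebook.

```python
import ast


def find_dynamic_imports(code: str):
    """List suspicious dynamic-import style calls found in `code` (heuristic only)."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            target = node.func
            # Direct calls such as __import__("numpy"), eval(...) or exec(...).
            if isinstance(target, ast.Name) and target.id in {"__import__", "eval", "exec"}:
                findings.append(target.id)
            # Attribute calls such as importlib.import_module("numpy").
            elif isinstance(target, ast.Attribute) and target.attr == "import_module":
                findings.append("import_module")
    return sorted(findings)


sneaky_sample = '''
def matmul(A, B):
    np = __import__("numpy")
    return np.matmul(A, B).tolist()
'''
print(find_dynamic_imports(sneaky_sample))
```

# A name-based scan like this can be evaded (for example by aliasing the function first), so treat it as one extra signal rather than a guarantee.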
|
||||
|
||||
|
||||
# # Countering Reward Hacking 2: Stop cheating
|
||||
# We can stop the RL algorithm from using global or cached variables by restricting its `locals` and `globals`.
|
||||
#
|
||||
# We are also going to use `exec` to create the function, so we have to save the output to an empty dict.
|
||||
#
|
||||
# We also disallow global variable access.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
output_function = {}
|
||||
exec(sample, {}, output_function)
|
||||
output_function["matmul"]
|
||||
|
||||
|
||||
# We also disallow global variable access via `types.FunctionType(f.__code__, {})`
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
import types
|
||||
output_function["matmul"] = types.FunctionType(output_function["matmul"].__code__, {})
|
||||
|
||||
def import_numpy():
|
||||
np.matmul
|
||||
print("Success")
|
||||
|
||||
import_numpy()
|
||||
import_numpy = types.FunctionType(import_numpy.__code__, {})
|
||||
try:
|
||||
import_numpy()
|
||||
except Exception as e:
|
||||
print(str(e))
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def create_locked_down_function(function):
|
||||
output_function = {}
|
||||
exec(function, {}, output_function)
|
||||
new_matmul = output_function["matmul"]
|
||||
new_matmul = types.FunctionType(new_matmul.__code__, {})
|
||||
return new_matmul
|
||||
|
||||
|
||||
# # Countering Reward Hacking 3: Stop caching
|
||||
# We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and trials.
|
||||
#
|
||||
# We also add a **timer** so that generated code cannot run in an endless loop.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
import os, gc, time, statistics
import signal
from contextlib import contextmanager

class TimeoutError(Exception): pass

@contextmanager
def time_limit(seconds):
    def _handler(signum, frame):
        raise TimeoutError(f"Timed out after {seconds}s")
    old = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0.0)
        signal.signal(signal.SIGALRM, old)

class Benchmarker:
    def __init__(self, trials = 3, loops = 1, timeout = 30):
        self.buffer = np.zeros(2 * 1024 * 1024 * 1024, dtype = np.uint8)
        self.trials = trials
        self.loops = loops
        assert timeout > 0 # Cannot be 0 since it won't work!
        self.timeout = timeout

    def thrash(self):
        # Edit the buffer to wipe cache lines
        self.buffer ^= 1
        return int(self.buffer[::4096].sum())

    def benchmark(self, function, arguments):
        assert len(arguments) == self.loops
        samples = []
        exceptions = []
        timed_out = 0
        for _ in range(self.trials):
            gc.collect(); gc.disable(); self.thrash()
            t_start = time.perf_counter_ns()
            for i in range(self.loops):
                try:
                    with time_limit(self.timeout):
                        function(*arguments[i])
                except TimeoutError as e:
                    timed_out += 1
                except Exception as e:
                    exceptions.append(str(e))
            t_end = time.perf_counter_ns()
            gc.enable()
            samples.append((t_end - t_start) // max(1, self.loops))
        return {
            "median_ns": int(statistics.median(samples)),
            "mean_ns": int(statistics.fmean(samples)),
            "stdev_ns": int(statistics.pstdev(samples) if len(samples) > 1 else 0),
            "exceptions" : exceptions,
            "timeouts" : timed_out,
        }
|
||||
|
||||
|
||||
# For example, we take the matmul kernel from before and benchmark it with a 10 second timeout:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
|
||||
Benchmarker(trials = 1, timeout = 10).benchmark(output_function["matmul"], [(A_list, B_list)])
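# To see the timer on its own, here is a minimal sketch that runs a deliberately endless function under a short limit. `BenchTimeout`, `spin_forever` and `run_with_limit` are hypothetical names for this illustration. It assumes a Unix-like system and the main thread, since `signal.SIGALRM` is not available elsewhere.

```python
import signal
from contextlib import contextmanager

class BenchTimeout(Exception):
    """Raised by the alarm handler when the time budget is used up."""

@contextmanager
def time_limit(seconds):
    def _handler(signum, frame):
        raise BenchTimeout(f"Timed out after {seconds}s")
    old = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0.0)
        signal.signal(signal.SIGALRM, old)

def spin_forever():
    while True:
        pass

def run_with_limit(fn, seconds):
    try:
        with time_limit(seconds):
            fn()
    except BenchTimeout:
        return "timed out"
    return "finished"

print(run_with_limit(spin_forever, 0.05), run_with_limit(lambda: None, 1.0))
```

# A runaway function is interrupted after the limit, while a well-behaved one completes normally and the timer is cleared.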
|
||||
|
||||
|
||||
# # Data & RL task setup
|
||||
#
|
||||
# We now have to create a prompt to the model for which it will do some task. For our matrix multiply example, we use the below:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
prompt = """
|
||||
Create a new fast matrix multiplication function using only native Python code.
|
||||
You are given a list of lists of numbers.
|
||||
Output your new function in backticks using the format below:
|
||||
```python
|
||||
def matmul(A, B):
|
||||
return ...
|
||||
```
|
||||
""".strip()
|
||||
print(prompt)
|
||||
|
||||
|
||||
# First, let's prompt Gemma 4 without RL and see how it goes:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
text = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
tokenize = False,
|
||||
add_generation_prompt = True,
|
||||
)
|
||||
|
||||
from transformers import TextStreamer
|
||||
print("=" * 50)
|
||||
print("BASE MODEL OUTPUT (before RL training):")
|
||||
print("=" * 50)
|
||||
|
||||
inputs = tokenizer(
|
||||
text = text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# # Reward functions
|
||||
#
|
||||
# We now design the `extract_function` function, which simply extracts the function wrapped in 3 backticks.
#
# And 4 reward functions:
#
# 1. `function_works`, which rewards the model if the strategy is a valid Python function.
# 2. `no_cheating`, which checks if the function imported other modules, and penalizes it if it did.
# 3. `correctness_check`, which checks if the kernel was correct or wrong - it shouldn't generate gibberish!
# 4. `speed_check`, which checks the performance relative to NumPy's `matmul` directly.
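# All four reward functions share the same shape: they receive a batch of `completions`, where each completion is a list holding one message dict, and return one score per completion. The sketch below shows that shape with a toy reward. `has_code_fence` and the sample completions are hypothetical and only illustrate the calling convention the functions in this notebook rely on.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def has_code_fence(completions, **kwargs):
    """Toy reward: +1 if the reply contains a fenced block, -1 otherwise."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        scores.append(1.0 if response.count(FENCE) >= 2 else -1.0)
    return scores

completions = [
    [{"role": "assistant",
      "content": f"{FENCE}python\ndef matmul(A, B):\n    return A\n{FENCE}"}],
    [{"role": "assistant", "content": "I cannot do that."}],
]

print(has_code_fence(completions))
```

# The returned list has exactly one float per completion, in the same order as the input.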
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def extract_function(text):
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first : second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def matmul(A, B):"): return fx
    return None
print(extract_function(prompt))
|
||||
|
||||
|
||||
# Below is our `function_works` reward function. It uses Python's `exec`, guarded so that local and global variables cannot leak into the generated function. We can also call `check_only_stdlib_imports` first to catch errors before even executing the function:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
ok, info = check_only_stdlib_imports("def a")
|
||||
ok, info
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def function_works(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        print(function)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        if function is None or "error" in info:
            score = -2.0
        else:
            try:
                new_matmul = create_locked_down_function(function)
                score = 1.0
            except:
                score = -0.5
        scores.append(score)
    return scores
|
||||
|
||||
|
||||
# `no_cheating` checks if the function cheated, since it might have imported optimized NumPy or Torch code.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def no_cheating(completions, **kwargs):
|
||||
scores = []
|
||||
for completion in completions:
|
||||
score = 0
|
||||
response = completion[0]["content"]
|
||||
function = extract_function(response)
|
||||
if function is not None:
|
||||
ok, info = check_only_stdlib_imports(function)
|
||||
else:
|
||||
ok = False
|
||||
scores.append(1.0 if ok else -20.0) # Penalize heavily!
|
||||
return scores
|
||||
|
||||
|
||||
# Next, `correctness_check` checks if the kernel was correct. We want to penalize if the absolute error is larger than 1, and if the mean squared error is somewhat bigger than machine epsilon.
|
||||
#
|
||||
# We have to execute the code now!
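# The scoring idea is a ladder of error thresholds applied twice, once to the maximum absolute error and once to the mean squared error, with the two results summed. The pure-Python sketch below mirrors the thresholds used in `correctness_check`. `max_abs_and_mse` and `tier` are hypothetical helpers for illustration and may differ from the notebook's `calculate_difference`.

```python
import sys

# Same tolerance as in the notebook: 100x the float64 machine epsilon.
EPS = 100 * sys.float_info.epsilon

def max_abs_and_mse(pred, true):
    """Max absolute error and mean squared error between two lists of lists."""
    diffs = [p - t for p_row, t_row in zip(pred, true) for p, t in zip(p_row, t_row)]
    amax = max(abs(d) for d in diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return amax, mse

def tier(error, eps = EPS):
    """Map one error value onto the reward ladder."""
    if error >= 3: return -3.0
    elif error >= 2: return -2.5
    elif error >= 1: return -2.0
    elif error >= 0.5: return -1.0
    elif error >= 100 * eps: return 0.0
    elif error >= eps: return 1.0
    else: return 3.0

amax, mse = max_abs_and_mse([[1.0, 2.0]], [[1.0, 4.0]])
print(amax, mse, tier(amax) + tier(mse))
```

# An exact match scores 3.0 on each ladder, 6.0 in total, while the off-by-two result above lands on -2.5 twice.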
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
np.finfo(np.float64).eps
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def correctness_check(completions, **kwargs):
    scores = []
    # Generate some random matrices of size less than 128
    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 128)
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_matmul = create_locked_down_function(function)
        except:
            scores.append(0)
            continue
        try:
            pred = new_matmul(A_list.copy(), B_list.copy())
        except:
            # Failed!
            scores.append(-2.0)
            continue
        true = np.matmul(A, B)
        amax_error, mse_error = calculate_difference(pred, true)

        # Check correctness and score!
        machine_epsilon = 100*np.finfo(np.float64).eps
        if amax_error >= 3: score = -3.0
        elif amax_error >= 2: score = -2.5
        elif amax_error >= 1: score = -2.0
        elif amax_error >= 0.5: score = -1.0
        elif amax_error >= 100*machine_epsilon: score = 0.0
        elif amax_error >= machine_epsilon: score = 1.0
        else: score = 3.0

        if mse_error >= 3: score += -3.0
        elif mse_error >= 2: score += -2.5
        elif mse_error >= 1: score += -2.0
        elif mse_error >= 0.5: score += -1.0
        elif mse_error >= 100*machine_epsilon: score += 0.0
        elif mse_error >= machine_epsilon: score += 1.0
        else: score += 3.0
        scores.append(score)
    return scores
|
||||
|
||||
|
||||
# Finally our benchmarking function for `speed_check`! We shall limit the timer to 10 seconds and do 3 trials.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
|
||||
benchmarker = Benchmarker(trials = 3, timeout = 10)
|
||||
numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
|
||||
numpy_results
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
new_matmul = create_locked_down_function(extract_function(prompt))
|
||||
new_results = benchmarker.benchmark(new_matmul, [(A_list, B_list)])
|
||||
new_results
|
||||
|
||||
|
||||
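# We take the ratio of the two median timings. If our function is slower than NumPy, the reward is the negated ratio. If it is faster (i.e. the ratio is less than 1), we invert the ratio so that faster code earns a larger positive reward. Both are divided by 100 to keep the reward small.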
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
|
||||
positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
|
||||
reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
|
||||
reward
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
new_results["median_ns"] = 3
|
||||
numpy_results["median_ns"] = 1000
|
||||
negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
|
||||
positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
|
||||
reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
|
||||
reward
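# The two cells above can be folded into one small helper, sketched below. `speed_reward` is a hypothetical name. The clip to [-10, 10] matches what `speed_check` does, and the guard against a zero timing is an added assumption that is not in the notebook.

```python
def speed_reward(new_ns, reference_ns, clip = 10.0):
    """Signed timing ratio divided by 100, clipped to [-clip, clip]."""
    new_ns = max(1, new_ns)  # assumption: avoid dividing by a zero timing
    if new_ns >= reference_ns:
        score = -(new_ns / reference_ns) / 100  # slower: negative reward
    else:
        score = +(reference_ns / new_ns) / 100  # faster: inverted ratio
    return max(-clip, min(clip, score))

print(speed_reward(3, 1000))     # faster than the reference
print(speed_reward(1000, 10))    # 100x slower than the reference
print(speed_reward(1, 100000))   # extreme speedup, clipped
```

# With the made-up timings from the cell above (3 ns against 1000 ns), the reward is about 3.33. A function 100x slower than the reference gets -1.0.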
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
import gc
def speed_check(completions, **kwargs):
    scores = []
    # Generate some random matrices of size less than 256
    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 256)
    numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_matmul = create_locked_down_function(function)
        except:
            scores.append(0)
            continue
        new_results = benchmarker.benchmark(new_matmul, [(A_list.copy(), B_list.copy())])

        # Get score and clip to -10, 10
        negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
        positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
        score = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
        if score >= 10: score = 10
        if score <= -10: score = -10
        scores.append(score)
    # Free memory to counteract OOMs
    gc.collect()
    torch.cuda.empty_cache()
    return scores
|
||||
|
||||
|
||||
# We create the dataset which includes a replica of our prompt.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from datasets import Dataset
|
||||
dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
|
||||
maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
|
||||
print(maximum_length)
|
||||
dataset[0]
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
#
|
||||
# Now set up GRPO Trainer and all configurations! We also support GSDP, GAPO, Dr GRPO and more! Go to our docs https://unsloth.ai/docs/ for more info!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Leave room for the prompt (plus 1 token safety margin)
|
||||
max_completion_length = max_seq_length - (maximum_length + 1)
|
||||
|
||||
from trl import GRPOConfig, GRPOTrainer
|
||||
training_args = GRPOConfig(
|
||||
temperature = 1.0,
|
||||
top_p = 0.95,
|
||||
top_k = 64,
|
||||
learning_rate = 5e-5,
|
||||
weight_decay = 0.001,
|
||||
warmup_ratio = 0.1,
|
||||
lr_scheduler_type = "linear",
|
||||
optim = "adamw_8bit",
|
||||
logging_steps = 1,
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
|
||||
num_generations = 2, # Decrease if out of memory
|
||||
max_completion_length = max_completion_length,
|
||||
# num_train_epochs = 1, # Set to 1 for a full training run
|
||||
max_steps = 100,
|
||||
save_steps = 100,
|
||||
report_to = "none", # Can use Weights & Biases, TrackIO
|
||||
output_dir = "outputs",
|
||||
epsilon = 0.2,
|
||||
epsilon_high = 0.28, # one sided
|
||||
delta = 1.5, # two sided
|
||||
loss_type = 'bnpo',
|
||||
mask_truncated_completions = True,
|
||||
# For optional training + evaluation
|
||||
# fp16_full_eval = True,
|
||||
# per_device_eval_batch_size = 4,
|
||||
# eval_accumulation_steps = 1,
|
||||
# eval_strategy = "steps",
|
||||
# eval_steps = 1,
|
||||
)
|
||||
|
||||
|
||||
# And let's run the trainer! The training log prints a table of rewards like the one below. The goal is to see the `reward` column increase!
#
# You might have to wait 150 to 200 steps for any action, and you'll probably get 0 reward for the first 100 steps, so raise `max_steps` above the 100 set in the config if you want to see the reward move. Please be patient!
|
||||
#
|
||||
# | Step | Training Loss | reward | reward_std | completion_length | kl |
|
||||
# |------|---------------|-----------|------------|-------------------|----------|
|
||||
# | 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
|
||||
# | 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
|
||||
# | 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# For optional training + evaluation
|
||||
# new_dataset = dataset.train_test_split(test_size = 0.01)
|
||||
|
||||
trainer = GRPOTrainer(
|
||||
model = model,
|
||||
processing_class = tokenizer,
|
||||
reward_funcs = [
|
||||
function_works,
|
||||
no_cheating,
|
||||
correctness_check,
|
||||
speed_check,
|
||||
],
|
||||
args = training_args,
|
||||
train_dataset = dataset,
|
||||
|
||||
# For optional training + evaluation
|
||||
# train_dataset = new_dataset["train"],
|
||||
# eval_dataset = new_dataset["test"],
|
||||
)
|
||||
|
||||
|
||||
# And let's train the model!
|
||||
#
|
||||
# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
trainer.train()
|
||||
|
||||
|
||||
# Now that we have trained the LoRA with GRPO, we first save it!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
|
||||
|
||||
# Verify LoRA is actually trained!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from safetensors import safe_open

tensors = {}
# Open the adapter we just saved to "gemma_4_lora"
with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Count the zeros so the comparison with numel() is meaningful
        n_zeros = (tensor == 0).sum()
        assert(n_zeros.item() != tensor.numel())
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# # Inference
|
||||
# Now let's try the model we just trained!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
text = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
tokenize = False,
|
||||
add_generation_prompt = True,
|
||||
)
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
_ = model.generate(
|
||||
**tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
max_new_tokens = 1024,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = False),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Merge to 16bit
|
||||
if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
|
||||
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Merge to 4bit
|
||||
if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
|
||||
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Just LoRA adapters
|
||||
if False:
|
||||
model.save_pretrained("gemma_4_lora")
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
if False:
|
||||
model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
|
||||
tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
|
||||
#
|
||||
# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
|
||||
# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
|
||||
# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
|
||||
# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
|
||||
#
|
||||
# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Save to 8bit Q8_0
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
|
||||
# Remember to go to https://huggingface.co/settings/tokens for a token!
|
||||
# And change HF_USERNAME to your username!
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to 16bit GGUF
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to q4_k_m GGUF
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to multiple GGUF options - much faster if you want multiple!
|
||||
if False:
|
||||
model.push_to_hub_gguf(
|
||||
"HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
|
||||
tokenizer,
|
||||
quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
|
||||
token = "YOUR_HF_TOKEN",
|
||||
)
|
||||
|
||||
|
||||
# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
+913
@@ -0,0 +1,913 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# # Goal: Make Gemma 4 play games with Reinforcement Learning
|
||||
#
|
||||
# Our goal is to make Gemma 4 play the 2048 game with reinforcement learning, or a variant of it called [GRPO](https://arxiv.org/abs/2501.12948).
|
||||
#
|
||||
# We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose. We then reward the model if it created a good strategy (winning the game), and we'll penalize it (negative reward) if the strategy was a bad one.
|
||||
#
|
||||
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/2048_win.png/500px-2048_win.png" height=300 />
|
||||
|
||||
# # Installation
|
||||
# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n except: _numpy = "numpy"; _pil = "pillow"\n # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n !uv pip install -qqq \\\n "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
|
||||
|
||||
# ### Unsloth
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel
|
||||
import torch
|
||||
max_seq_length = 4096 # Can increase for longer reasoning traces
|
||||
lora_rank = 32 # Larger rank = smarter, but slower
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastVisionModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E2B-it",
|
||||
max_seq_length = max_seq_length,
|
||||
load_in_4bit = False, # False for LoRA 16bit
|
||||
fast_inference = False, # Set to True to enable vLLM fast inference
|
||||
)
|
||||
|
||||
|
||||
# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
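# To get a feel for where the "few percent" figure comes from: for a single weight matrix of shape `d_out x d_in`, a rank-`r` LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below works through the arithmetic. The 4096 x 4096 layer size is a made-up example, not a Gemma 4 dimension, and the helper names are hypothetical.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_fraction(d_in, d_out, rank):
    """Added parameters relative to the frozen d_out x d_in weight matrix."""
    return lora_extra_params(d_in, d_out, rank) / (d_in * d_out)

# Hypothetical square projection with the notebook's lora_rank = 32
print(lora_extra_params(4096, 4096, 32))  # extra trainable parameters
print(lora_fraction(4096, 4096, 32))      # as a fraction of the original matrix
```

# For this made-up layer the adapter is about 1.6% of the original weights. The exact share for a real model depends on its layer shapes and on which `target_modules` are adapted.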
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
|
||||
target_modules = [
|
||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||
"gate_proj", "up_proj", "down_proj",
|
||||
],
|
||||
lora_alpha = lora_rank*2, # *2 speeds up training
|
||||
use_gradient_checkpointing = "unsloth", # Reduces memory usage
|
||||
random_state = 3407,
|
||||
)
|
||||
|
||||
|
||||
# # 2048 game
|
||||
#
|
||||
# We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
#@title (Collapsible) 2048 Game Implementation
|
||||
from dataclasses import dataclass, field
from typing import List, Tuple, Optional
import random
import copy

def _compress_and_merge_row_left(row: List[int]) -> Tuple[List[int], int, bool]:
    n = len(row)
    tiles = [x for x in row if x != 0]
    gained = 0
    i = 0
    merged = []
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            v = tiles[i] * 2
            gained += v
            merged.append(v)
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    merged += [0] * (n - len(merged))
    changed = merged != row
    return merged, gained, changed

def _move_left(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    changed_any = False
    total_gain = 0
    new_board = []
    for row in board:
        new_row, gained, changed = _compress_and_merge_row_left(row)
        new_board.append(new_row)
        total_gain += gained
        changed_any = changed_any or changed
    return new_board, total_gain, changed_any

def _move_right(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    changed_any = False
    total_gain = 0
    new_board = []
    for row in board:
        rev = list(reversed(row))
        new_rev, gained, changed = _compress_and_merge_row_left(rev)
        new_row = list(reversed(new_rev))
        new_board.append(new_row)
        total_gain += gained
        changed_any = changed_any or changed
    return new_board, total_gain, changed_any

def _transpose(board: List[List[int]]) -> List[List[int]]:
    return [list(row) for row in zip(*board)]

def _move_up(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    t = _transpose(board)
    moved, gain, changed = _move_left(t)
    return _transpose(moved), gain, changed

def _move_down(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    t = _transpose(board)
    moved, gain, changed = _move_right(t)
    return _transpose(moved), gain, changed

def _empty_cells(board: List[List[int]]) -> List[Tuple[int, int]]:
    size = len(board)
    return [(r, c) for r in range(size) for c in range(size) if board[r][c] == 0]

def _can_move(board: List[List[int]]) -> bool:
    if _empty_cells(board):
        return True
    size = len(board)
    for r in range(size):
        for c in range(size - 1):
            if board[r][c] == board[r][c + 1]:
                return True
    for r in range(size - 1):
        for c in range(size):
            if board[r][c] == board[r + 1][c]:
                return True
    return False
|
||||
|
||||
@dataclass
class GameBoard:
    size: int
    seed: Optional[int] = None
    target: int = 2048
    probability_fours: float = 0.10 # originally spawns (4) 10% of the time!
    _rng: random.Random = field(init = False, repr = False)
    _board: List[List[int]] = field(init = False, repr = False)
    _score: int = field(default = 0, init = False, repr = False)
    _state: str = field(default = "ongoing", init = False, repr = False)

    def __post_init__(self):
        if self.size < 2:
            raise ValueError("Board size must be at least 2.")
        self._rng = random.Random(self.seed)
        self._board = [[0 for _ in range(self.size)] for _ in range(self.size)]
        self._add_random_tile()
        self._add_random_tile()
        self._update_state_after_change()

    class _BoardView:
        def __init__(self, game: "GameBoard"):
            self._game = game
        def __iter__(self):
            return iter(self._game._board)
        def __len__(self):
            return len(self._game._board)
        def __getitem__(self, idx):
            return self._game._board[idx]
        def __repr__(self) -> str:
            return repr(self._game._board)
        __str__ = __repr__
        def do_action(self, key: str) -> None:
            self._game.do_action(key)
        def state(self) -> str:
            return self._game.state()
        def pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
            return self._game._render_pretty(colors = colors, border = border, dot_for_zero = dot_for_zero)

    def board(self) -> "_BoardView":
        return GameBoard._BoardView(self)
    def state(self) -> str:
        return self._state
    def score(self) -> int:
        return self._score
    def do_action(self, key: str) -> None:
        if self._state != "ongoing":
            return
        if not isinstance(key, str) or len(key) == 0:
            self._state = "failed"
            return
        k = key.strip().lower()
        if k == "q":
            self._state = "failed"
            return
        move_map = {"a": _move_left, "d": _move_right, "w": _move_up, "s": _move_down}
        if k not in move_map:
            self._state = "failed"
            return
        mover = move_map[k]
        new_board, gain, changed = mover(self._board)
        if changed:
            self._board = new_board
            self._score += gain
            self._add_random_tile()
        self._update_state_after_change()
    def _add_random_tile(self) -> bool:
        empties = _empty_cells(self._board)
        if not empties:
            return False
        r, c = self._rng.choice(empties)
        self._board[r][c] = 4 if self._rng.random() < self.probability_fours else 2
        return True
    def _update_state_after_change(self) -> None:
        if any(self.target in row for row in self._board):
            self._state = "success"
            return
        if not _can_move(self._board):
            self._state = "failed"
            return
        self._state = "ongoing"
    def _render_pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
        """
        Pretty-print the board with colors that scale from 0 up to self.target.
        Uses ANSI 256-color codes (works in most terminals). Set colors = False to disable.
        """
        import math

        b = self._board
        mx = max((max(row) for row in b), default = 0)
        cell_w = max(3, len(str(mx)))

        RESET = "\x1b[0m"

        # A smooth-ish gradient from cool → warm
        # (blue/cyan/green → yellow/orange/red). Tweak or expand as you like.
        GRAD = [33, 39, 45, 51, 50, 49, 48, 47, 46, 82, 118, 154, 190, 226, 220, 214, 208, 202, 196]
        ZERO_FG = 239 # dim gray

        def color_code(v: int) -> str:
            if not colors:
                return ""
            if v == 0:
                return f"\x1b[38;5;{ZERO_FG}m"
            # Normalize by exponent relative to target: r in [0,1]
            t = max(2, self.target) # safety; avoid log2(1)
            # Guard: if v is not a power of two or is <1, handle gracefully
            try:
                r = max(0.0, min(1.0, math.log2(v) / math.log2(t)))
            except ValueError:
                r = 0.0
            idx = int(round(r * (len(GRAD) - 1)))
            return f"\x1b[38;5;{GRAD[idx]}m"

        def fmt(v: int) -> str:
            s = "." if (v == 0 and dot_for_zero) else str(v)
            s = s.rjust(cell_w)
            return color_code(v) + s + (RESET if colors else "")

        def hline(left: str, mid: str, right: str) -> str:
            return left + mid.join("─" * cell_w for _ in range(self.size)) + right

        rows = []
        if border:
            rows.append(hline("┌", "┬", "┐"))
        for r in range(self.size):
            content = "│".join(fmt(v) for v in b[r])
            rows.append(("│" + content + "│") if border else content)
            if border:
                rows.append(hline("└" if r == self.size - 1 else "├",
                                  "┴" if r == self.size - 1 else "┼",
                                  "┘" if r == self.size - 1 else "┤"))
        return "\n".join(rows)
|
||||
|
||||
|
||||
# For example let's create a board of size 5 X 5 and set the target to 8 instead of 2048.
|
||||
#
|
||||
# **[NOTE]** The original 2048 spawns a 4 tile 10% of the time! Set `probability_fours = 0` to disable this for harder games. See the [Wikipedia page](https://en.wikipedia.org/wiki/2048_(video_game)) for more details.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game = GameBoard(size = 5, seed = 42, target = 8, probability_fours = 0.10)
|
||||
print(game.board().pretty(), game.state())
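# The spawn rule controlled by `probability_fours` can be checked in isolation. The snippet below is a minimal sketch, assuming the same rule as `_add_random_tile` (a 4 when `rng.random() < probability_fours`, otherwise a 2). The helper name `spawn_value` is hypothetical and not part of `GameBoard`.

```python
import random

def spawn_value(rng: random.Random, probability_fours: float = 0.10) -> int:
    """Return 4 with probability `probability_fours`, otherwise 2 (hypothetical helper)."""
    return 4 if rng.random() < probability_fours else 2

# Draw many tiles with a fixed seed and measure how often a 4 appears.
rng = random.Random(42)
samples = [spawn_value(rng) for _ in range(10_000)]
share_of_fours = samples.count(4) / len(samples)
print(f"share of fours: {share_of_fours:.3f}")
```

# With `probability_fours = 0` every spawned tile is a 2, so reaching the target takes more merges.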
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game
|
||||
|
||||
|
||||
# We'll use WASD for the action space:
|
||||
#
|
||||
# ```
|
||||
# W
|
||||
# A S D
|
||||
# ```
|
||||
# Also `game.state()` will say `success` if we succeeded in getting the target!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game.do_action("A")
|
||||
print(game.board().pretty(), game.state())
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game.do_action("W")
|
||||
print(game.board().pretty(), game.state())
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game.do_action("D")
|
||||
print(game.board().pretty(), game.state())
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game.do_action("W")
|
||||
print(game.board().pretty(), game.state())
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game.do_action("D")
|
||||
print(game.board().pretty(), game.state())
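# Each WASD action boils down to sliding and merging rows or columns. The snippet below is a rough sketch of the "A" (left) move on a plain list of lists, assuming the usual 2048 rule that a tile merges at most once per move. `merge_row_left` and `sketch_move_left` are hypothetical names; the `_move_left` helper used by `GameBoard` may differ in detail.

```python
from typing import List, Tuple

def merge_row_left(row: List[int]) -> Tuple[List[int], int]:
    """Slide non-zero tiles left and merge equal neighbours once. Returns (new_row, gain)."""
    tiles = [v for v in row if v != 0]
    merged, gain, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)   # two equal tiles become one doubled tile
            gain += tiles[i] * 2
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    merged += [0] * (len(row) - len(merged))  # pad with empty cells on the right
    return merged, gain

def sketch_move_left(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    """Apply the left move to every row. Returns (new_board, gain, changed)."""
    new_board, total = [], 0
    for row in board:
        new_row, gain = merge_row_left(row)
        new_board.append(new_row)
        total += gain
    return new_board, total, new_board != board

print(merge_row_left([2, 2, 4, 0]))
```

# The other three directions can be built from the same primitive by reversing rows ("D") or transposing the board ("W" and "S").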
|
||||
|
||||
|
||||
# If we do some other action that's not part of the action space, the game state becomes `failed`, and the game will not accept any more actions.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game = GameBoard(size = 3, seed = 42, target = 8, probability_fours = 0.10)
|
||||
game.do_action("AA") # Not in WASD
|
||||
game.do_action("W") # Doesn't do anything
|
||||
game.do_action("A") # Doesn't do anything
|
||||
print(game.board().pretty(), game.state())
|
||||
|
||||
|
||||
# # RL Environment Setup
|
||||
#
|
||||
# We'll set up a function to accept some strategy that'll emit an action within `WASD` and check the game state.
|
||||
#
|
||||
# We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from typing import Callable
|
||||
from unsloth import execute_with_time_limit
|
||||
|
||||
def _execute_strategy(strategy : Callable, game : GameBoard):
|
||||
assert callable(strategy)
|
||||
|
||||
steps = 0
|
||||
while game.state() == "ongoing":
|
||||
action = strategy(list(game.board()))
|
||||
steps += 1
|
||||
if type(action) is not str:
|
||||
return steps, "failed"
|
||||
game.do_action(action)
|
||||
return steps, game.state()
|
||||
|
||||
@execute_with_time_limit(2)
|
||||
def execute_strategy(strategy : Callable, game : GameBoard):
|
||||
return _execute_strategy(strategy, game)
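# `execute_with_time_limit` comes from Unsloth. To illustrate the idea without relying on its internals, here is a minimal standard-library sketch of a time limit decorator. The name `time_limit` is hypothetical, and this is an assumption about the general technique, not Unsloth's implementation. One caveat: a worker thread cannot be killed, so the timed-out function keeps running in the background until it returns.

```python
import concurrent.futures
import time
from functools import wraps

def time_limit(seconds: float):
    """Raise TimeoutError if the wrapped function takes longer than `seconds` (sketch)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                raise TimeoutError(f"Timed out after {seconds} seconds") from None
            finally:
                executor.shutdown(wait=False)  # do not block on the still-running worker
        return wrapper
    return decorator

@time_limit(0.05)
def slow():
    time.sleep(0.3)
    return "done"

@time_limit(1.0)
def fast():
    return "done"

print(fast())
try:
    slow()
except TimeoutError as e:
    print(f"Timed out with error = {e}")
```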
|
||||
|
||||
|
||||
# Let's make a generic strategy to just hit `W`. We should expect this generic strategy to fail:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def always_move_up(board):
|
||||
return "W"
|
||||
|
||||
game = GameBoard(size = 8, seed = 42, target = 2048, probability_fours = 0.10)
|
||||
try:
|
||||
execute_strategy(always_move_up, game)
|
||||
except TimeoutError as e:
|
||||
print(f"Timed out with error = {str(e)}")
|
||||
|
||||
|
||||
# To allow longer strategies for Gemma 4 Reinforcement Learning, we shall allow a 5 second timer.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
@execute_with_time_limit(5)
|
||||
def execute_strategy(strategy : Callable, game : GameBoard):
|
||||
return _execute_strategy(strategy, game)
|
||||
|
||||
|
||||
# # Code Execution
|
||||
#
|
||||
# Before we compile and execute a generated Python function, we first have to check that it does not access global variables or otherwise cheat. This is called countering reward hacking, since we don't want the function to cheat its way to a reward.
|
||||
#
|
||||
# For example the below piece of code is fine, since it only imports Python level functions. We use `check_python_modules`:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import check_python_modules
|
||||
|
||||
sample = """
|
||||
def strategy(board):
|
||||
import math
|
||||
from typing import Callable
|
||||
return "W"
|
||||
"""
|
||||
ok, info = check_python_modules(sample)
|
||||
print("Only Python imports?", ok)
|
||||
print(info)
|
||||
|
||||
|
||||
# For the below piece of code, since we import `numpy`, we should not allow the execution:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
sample = """
|
||||
def strategy(board):
|
||||
from numpy import matmul
|
||||
return "W"
|
||||
"""
|
||||
ok, info = check_python_modules(sample)
|
||||
print("Only Python imports?", ok)
|
||||
print(info)
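# To make the import check more concrete, here is a minimal sketch of how such a check could be written with the standard `ast` module. This is an assumption about the general approach, not the implementation of `check_python_modules`, and the name `find_non_stdlib_imports` is hypothetical. It relies on `sys.stdlib_module_names` (Python 3.10+) and falls back to a small hand-picked set on older versions.

```python
import ast
import sys
from typing import List, Tuple

# Hand-picked fallback for Python < 3.10 (illustrative only, far from complete).
_FALLBACK_STDLIB = frozenset({"math", "typing", "random", "itertools", "collections", "functools"})
STDLIB = getattr(sys, "stdlib_module_names", _FALLBACK_STDLIB)

def find_non_stdlib_imports(source: str) -> Tuple[bool, List[str]]:
    """Return (ok, offenders), where offenders are imported modules outside the stdlib."""
    tree = ast.parse(source)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            if name.split(".")[0] not in STDLIB:
                offenders.append(name)
    return (not offenders), offenders

good = "def strategy(board):\n    import math\n    from typing import Callable\n    return 'W'\n"
bad = "def strategy(board):\n    from numpy import matmul\n    return 'W'\n"
print(find_non_stdlib_imports(good))
print(find_non_stdlib_imports(bad))
```

# A static check like this does not catch dynamic imports such as `__import__("numpy")`, which is one reason the notebook also locks down global access before executing the function.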
|
||||
|
||||
|
||||
# We also disallow global variable access. We'll use Unsloth's `create_locked_down_function` function:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import create_locked_down_function
|
||||
function = """
|
||||
def import_numpy():
|
||||
np.matmul
|
||||
print("Success")
|
||||
"""
|
||||
f = create_locked_down_function(function)
|
||||
try:
|
||||
f()
|
||||
except Exception as e:
|
||||
print(str(e))
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import create_locked_down_function
|
||||
function = """
|
||||
def add(a, b):
|
||||
def adder(a):
|
||||
return a + b
|
||||
return adder(b) + b
|
||||
"""
|
||||
f = create_locked_down_function(function)
|
||||
try:
|
||||
print(f(10, 20))
|
||||
except Exception as e:
|
||||
print(str(e))
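# The two cells above show the effect of locking a function down: globals such as `np` are not visible, while arguments and nested helpers still work. Below is a minimal sketch of the underlying idea using plain `exec` with a fresh namespace. `build_isolated_function` is a hypothetical name, and this is an assumption about the technique, not how `create_locked_down_function` is implemented. It is also not a security sandbox, since builtins remain available.

```python
import builtins
from typing import Callable

def build_isolated_function(source: str, name: str) -> Callable:
    """Compile `source` in a fresh namespace and return the function called `name`."""
    namespace = {"__builtins__": builtins}  # no notebook globals leak in
    exec(source, namespace)
    return namespace[name]

uses_global = build_isolated_function("def f():\n    return np.zeros(3)\n", "f")
pure = build_isolated_function("def add(a, b):\n    return a + b\n", "add")

print(pure(10, 20))
try:
    uses_global()
except NameError as e:
    print(f"Blocked: {e}")
```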
|
||||
|
||||
|
||||
# # Data & RL task setup
|
||||
#
|
||||
# We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this prompt for a different RL task.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
prompt = """
|
||||
Create a new short 2048 strategy using only native Python code.
|
||||
You are given a list of list of numbers for the current board state.
|
||||
Output one action for "W", "A", "S", "D" on what is the optimal next step.
|
||||
Output your new short function in backticks using the format below:
|
||||
```python
|
||||
def strategy(board):
|
||||
return "W" # Example
|
||||
```
|
||||
All helper functions should be inside def strategy. Only output the short function `strategy`.
|
||||
""".strip()
|
||||
print(prompt)
|
||||
|
||||
|
||||
# First, let's prompt Gemma 4 without RL and see how it goes:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
text = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
tokenize = False,
|
||||
add_generation_prompt = True,
|
||||
)
|
||||
|
||||
from transformers import TextStreamer
|
||||
print("=" * 50)
|
||||
print("BASE MODEL OUTPUT (before RL training):")
|
||||
print("=" * 50)
|
||||
|
||||
inputs = tokenizer(
|
||||
text = text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# # Reward functions
|
||||
#
|
||||
# We now design an `extract_function` function which simply extracts the function wrapped in 3 backticks.
|
||||
#
|
||||
# And 3 reward functions:
|
||||
#
|
||||
# 1. `function_works` which rewards the model if the strategy is a valid Python function.
|
||||
# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
|
||||
# 3. `strategy_succeeds` which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def extract_function(text):
|
||||
if text.count("```") >= 2:
|
||||
first = text.find("```") + 3
|
||||
second = text.find("```", first)
|
||||
fx = text[first : second].strip()
|
||||
fx = fx.removeprefix("python\n")
|
||||
fx = fx[fx.find("def"):]
|
||||
if fx.startswith("def strategy(board):"): return fx
|
||||
return None
|
||||
print(extract_function(prompt))
|
||||
|
||||
|
||||
# Below is our `function_works` reward function which uses Python's `exec` but guarded by not allowing leakage of local and global variables. We can also use `check_python_modules` first to check if there are errors before even executing the function:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
ok, info = check_python_modules("def a")
|
||||
ok, info
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def function_works(completions, **kwargs):
|
||||
scores = []
|
||||
for completion in completions:
|
||||
score = 0
|
||||
response = completion[0]["content"]
|
||||
function = extract_function(response)
|
||||
if function is not None:
|
||||
ok, info = check_python_modules(function)
|
||||
if function is None or "error" in info:
|
||||
score = -2.0
|
||||
else:
|
||||
try:
|
||||
new_strategy = create_locked_down_function(function)
|
||||
score = 1.0
|
||||
except Exception:
|
||||
score = -0.5
|
||||
scores.append(score)
|
||||
return scores
|
||||
|
||||
|
||||
# `no_cheating` checks if the function cheated, since it might have imported NumPy or other non-standard modules:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def no_cheating(completions, **kwargs):
|
||||
scores = []
|
||||
for completion in completions:
|
||||
score = 0
|
||||
response = completion[0]["content"]
|
||||
function = extract_function(response)
|
||||
if function is not None:
|
||||
ok, info = check_python_modules(function)
|
||||
scores.append(1.0 if ok else -20.0) # Penalize heavily!
|
||||
else:
|
||||
scores.append(-1.0) # Failed creating function
|
||||
return scores
|
||||
|
||||
|
||||
# Next `strategy_succeeds` checks if the strategy actually allows the game to terminate. Imagine if the strategy simply returned "W": the game would never finish, and the run would fail once it hits the time limit of 5 seconds.
|
||||
#
|
||||
# We also add a global `PRINTER` to print out the strategy and board state.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
import numpy as np
|
||||
global PRINTER
|
||||
PRINTER = 0
|
||||
def strategy_succeeds(completions, **kwargs):
|
||||
global PRINTER
|
||||
scores = []
|
||||
# Generate a random game board with seed
|
||||
seed = np.random.randint(10000)
|
||||
for completion in completions:
|
||||
printed = False
|
||||
score = 0
|
||||
response = completion[0]["content"]
|
||||
function = extract_function(response)
|
||||
if PRINTER % 5 == 0:
|
||||
printed = True
|
||||
print(function)
|
||||
PRINTER += 1
|
||||
if function is not None:
|
||||
ok, info = check_python_modules(function)
|
||||
if function is None or "error" in info:
|
||||
scores.append(0)
|
||||
continue
|
||||
try:
|
||||
new_strategy = create_locked_down_function(function)
|
||||
except Exception:
|
||||
scores.append(0)
|
||||
continue
|
||||
try:
|
||||
game = GameBoard(size = 6, seed = seed, target = 2048, probability_fours = 0.10)
|
||||
steps, game_state = execute_strategy(new_strategy, game)
|
||||
print(f"Steps = {steps} State = {game_state}")
|
||||
if printed is False:
|
||||
print(function)
|
||||
print(game.board().pretty())
|
||||
if game_state == "success":
|
||||
scores.append(20.0) # Success - massively reward!
|
||||
else:
|
||||
scores.append(2.0) # Failed but function works!
|
||||
except TimeoutError as e:
|
||||
print("Timeout")
|
||||
scores.append(-1.0) # Failed with timeout
|
||||
except Exception as e:
|
||||
print(f"Exception = {str(e)}")
|
||||
scores.append(-3.0) # Failed
|
||||
return scores
|
||||
|
||||
|
||||
# We'll now create the dataset which includes a replica of our prompt.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from datasets import Dataset
|
||||
dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
|
||||
maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
|
||||
print(maximum_length)
|
||||
dataset[0]
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
#
|
||||
# Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Leave room for the prompt (plus 1 token safety margin)
|
||||
max_completion_length = max_seq_length - (maximum_length + 1)
|
||||
|
||||
from trl import GRPOConfig, GRPOTrainer
|
||||
training_args = GRPOConfig(
|
||||
temperature = 1.0,
|
||||
top_p = 0.95,
|
||||
top_k = 64,
|
||||
learning_rate = 5e-5,
|
||||
weight_decay = 0.001,
|
||||
warmup_ratio = 0.1,
|
||||
lr_scheduler_type = "linear",
|
||||
optim = "adamw_8bit",
|
||||
logging_steps = 1,
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
|
||||
num_generations = 2, # Decrease if out of memory
|
||||
max_completion_length = max_completion_length,
|
||||
# num_train_epochs = 1, # Set to 1 for a full training run
|
||||
max_steps = 60,
|
||||
save_steps = 100,
|
||||
report_to = "none", # Can use Weights & Biases, TrackIO
|
||||
output_dir = "outputs",
|
||||
epsilon = 0.2,
|
||||
epsilon_high = 0.28, # one sided
|
||||
delta = 1.5, # two sided
|
||||
loss_type = 'bnpo',
|
||||
mask_truncated_completions = True,
|
||||
# For optional training + evaluation
|
||||
# fp16_full_eval = True,
|
||||
# per_device_eval_batch_size = 4,
|
||||
# eval_accumulation_steps = 1,
|
||||
# eval_strategy = "steps",
|
||||
# eval_steps = 1,
|
||||
)
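# The `epsilon`, `epsilon_high` and `delta` settings all bound the probability ratio between the new and old policy. The snippet below is a rough per-token sketch of how these bounds could interact, assuming a PPO-style clipped surrogate with an extra upper cap `delta` on the unclipped ratio. The function name `clipped_term` is hypothetical, and this is my reading of the technique rather than a copy of TRL's loss, so check the TRL documentation before relying on it.

```python
def clipped_term(ratio: float, advantage: float,
                 epsilon: float = 0.2, epsilon_high: float = 0.28,
                 delta: float = 1.5) -> float:
    """Per-token surrogate objective (to be maximized) under the assumed clipping rule."""
    # Asymmetric clip: the ratio may fall to 1 - epsilon and rise to 1 + epsilon_high.
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon_high)
    # Extra cap so a very large ratio cannot dominate when the advantage is negative.
    capped_ratio = min(ratio, delta)
    return min(capped_ratio * advantage, clipped_ratio * advantage)

for ratio, advantage in [(2.0, 1.0), (2.0, -1.0), (0.5, -1.0), (1.0, 1.0)]:
    print(ratio, advantage, clipped_term(ratio, advantage))
```

# Under these assumptions, a ratio of 2.0 with a positive advantage is limited by `1 + epsilon_high`, while the same ratio with a negative advantage is limited by `delta`.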
|
||||
|
||||
|
||||
# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
|
||||
#
|
||||
# You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
|
||||
#
|
||||
# | Step | Training Loss | reward | reward_std | completion_length | kl |
|
||||
# |------|---------------|-----------|------------|-------------------|----------|
|
||||
# | 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
|
||||
# | 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
|
||||
# | 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# For optional training + evaluation
|
||||
# new_dataset = dataset.train_test_split(test_size = 0.01)
|
||||
|
||||
trainer = GRPOTrainer(
|
||||
model = model,
|
||||
processing_class = tokenizer,
|
||||
reward_funcs = [
|
||||
function_works,
|
||||
no_cheating,
|
||||
strategy_succeeds,
|
||||
],
|
||||
args = training_args,
|
||||
train_dataset = dataset,
|
||||
|
||||
# For optional training + evaluation
|
||||
# train_dataset = new_dataset["train"],
|
||||
# eval_dataset = new_dataset["test"],
|
||||
)
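# With several reward functions, each completion ends up with one combined reward, and GRPO compares completions within the same group. The snippet below is a minimal sketch of that idea, assuming the scores are summed with equal weights and normalized by the group mean and standard deviation. The helper names are hypothetical, and details such as the exact normalization depend on the trainer settings (for example `loss_type`).

```python
from statistics import mean, pstdev
from typing import List

def total_rewards(per_function_scores: List[List[float]]) -> List[float]:
    """Sum the scores each completion received across all reward functions."""
    return [sum(scores) for scores in zip(*per_function_scores)]

def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Center and scale rewards within one group of generations."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two generations for the same prompt (num_generations = 2).
scores = [
    [1.0, 1.0],   # function_works
    [1.0, 1.0],   # no_cheating
    [20.0, 2.0],  # strategy_succeeds: first one reached the target, second did not
]
rewards = total_rewards(scores)
advantages = group_advantages(rewards)
print(rewards, advantages)
```

# If every generation in a group gets the same reward, all advantages are zero and that group provides no learning signal, which is one reason the reward can stay flat early in training.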
|
||||
|
||||
|
||||
# And let's train the model!
|
||||
#
|
||||
# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
trainer.train()
|
||||
|
||||
|
||||
# Now that we have trained the LoRA with GRPO, we first save it!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
|
||||
|
||||
# Verify LoRA is actually trained!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from safetensors import safe_open
|
||||
|
||||
tensors = {}
|
||||
with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
|
||||
# Verify both A and B are non zero
|
||||
for key in f.keys():
|
||||
tensor = f.get_tensor(key)
|
||||
n_zeros = (tensor == 0).sum() / tensor.numel()
|
||||
assert n_zeros.item() != 1.0, f"{key} is all zeros" # n_zeros is a fraction, so 1.0 means every entry is zero
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# # Inference
|
||||
# Now let's try the model we just trained!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
text = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
tokenize = False,
|
||||
add_generation_prompt = True,
|
||||
)
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
_ = model.generate(
|
||||
**tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
max_new_tokens = 1024,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = False),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Merge to 16bit
|
||||
if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
|
||||
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Merge to 4bit
|
||||
if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
|
||||
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Just LoRA adapters
|
||||
if False:
|
||||
model.save_pretrained("gemma_4_lora")
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
if False:
|
||||
model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
|
||||
tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods like `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
|
||||
#
|
||||
# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
|
||||
# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
|
||||
# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
|
||||
# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
|
||||
#
|
||||
# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Save to 8bit Q8_0
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
|
||||
# Remember to go to https://huggingface.co/settings/tokens for a token!
|
||||
# And change HF_USERNAME to your username!
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to 16bit GGUF
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to q4_k_m GGUF
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to multiple GGUF options - much faster if you want multiple!
|
||||
if False:
|
||||
model.push_to_hub_gguf(
|
||||
"HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
|
||||
tokenizer,
|
||||
quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
|
||||
token = "YOUR_HF_TOKEN",
|
||||
)
|
||||
|
||||
|
||||
# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
+897
@@ -0,0 +1,897 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# # Goal: Make Gemma 4 solve Sudoku puzzles with Reinforcement Learning
|
||||
#
|
||||
# Our goal is to make Gemma 4 learn to solve Sudoku puzzles using reinforcement learning (GRPO).
|
||||
# The model will devise a strategy to fill in empty cells, and we'll reward it for correct placements
|
||||
# and completing valid puzzles.
|
||||
#
|
||||
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg/1280px-Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg.png" height="300" />
|
||||
|
||||
# # Installation
|
||||
# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n except: _numpy = "numpy"; _pil = "pillow"\n # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n !uv pip install -qqq \\\n "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
|
||||
|
||||
# ### Unsloth
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel
|
||||
import torch
|
||||
max_seq_length = 4096 # Can increase for longer reasoning traces
|
||||
lora_rank = 32 # Larger rank = smarter, but slower
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastVisionModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E2B-it",
|
||||
max_seq_length = max_seq_length,
|
||||
load_in_4bit = False, # False for LoRA 16bit
|
||||
fast_inference = False, # Set to True to enable vLLM fast inference
|
||||
)
|
||||
|
||||
|
||||
# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
|
||||
target_modules = [
|
||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||
"gate_proj", "up_proj", "down_proj",
|
||||
],
|
||||
lora_alpha = lora_rank*2, # *2 speeds up training
|
||||
use_gradient_checkpointing = "unsloth", # Reduces memory usage
|
||||
random_state = 3407,
|
||||
)
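# To see why LoRA only adds a few percent of extra weights, note that a rank `r` adapter on a `d_in x d_out` weight matrix adds two small matrices with `r * (d_in + d_out)` parameters in total. The snippet below is a small worked example. The layer size of 2048 is a hypothetical value chosen for illustration and is not taken from the Gemma 4 architecture.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

d_in = d_out = 2048  # hypothetical layer size, not a Gemma 4 value
rank = 32            # same value as lora_rank in this notebook

extra = lora_extra_params(d_in, d_out, rank)
fraction = extra / (d_in * d_out)
print(f"extra parameters: {extra}, fraction of the layer: {fraction}")
```

# Doubling the rank doubles the number of extra parameters, which is the tradeoff hinted at by the "Larger rank = smarter, but slower" comment.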
|
||||
|
||||
|
||||
# # Sudoku Game Implementation
|
||||
#
|
||||
# We use GPT-5 to create a clean Sudoku solver environment. The strategy outputs "row,col,value" to fill cells.
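# Since the strategy communicates through a "row,col,value" string, that string has to be parsed and validated before it reaches the board. The snippet below is a minimal sketch of such a parser, assuming zero-indexed rows and columns in the range 0 to 8 and values from 1 to 9. The name `parse_move` is hypothetical and is not part of the environment defined in this notebook.

```python
from typing import Optional, Tuple

def parse_move(action: str) -> Optional[Tuple[int, int, int]]:
    """Parse "row,col,value" into integers, or return None if the action is malformed."""
    if not isinstance(action, str):
        return None
    parts = action.split(",")
    if len(parts) != 3:
        return None
    try:
        row, col, value = (int(p.strip()) for p in parts)
    except ValueError:
        return None
    # Assumed ranges: zero-indexed cell position, Sudoku digit between 1 and 9.
    if not (0 <= row < 9 and 0 <= col < 9 and 1 <= value <= 9):
        return None
    return row, col, value

print(parse_move("0,4,7"))
print(parse_move("9,0,1"))
```

# Returning `None` for malformed actions lets the caller decide whether to treat them as a failed game or simply as an invalid move.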
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
#@title Sudoku Game Implementation
|
||||
from dataclasses import dataclass, field
|
||||
from typing import List, Tuple, Optional
|
||||
import random
|
||||
import copy
|
||||
|
||||
def _is_valid_placement(board: List[List[int]], row: int, col: int, num: int) -> bool:
|
||||
"""Check if placing num at (row, col) is valid."""
|
||||
# Check row
|
||||
if num in board[row]:
|
||||
return False
|
||||
|
||||
# Check column
|
||||
if num in [board[r][col] for r in range(9)]:
|
||||
return False
|
||||
|
||||
# Check 3x3 box
|
||||
box_row, box_col = 3 * (row // 3), 3 * (col // 3)
|
||||
for r in range(box_row, box_row + 3):
|
||||
for c in range(box_col, box_col + 3):
|
||||
if board[r][c] == num:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _solve_sudoku(board: List[List[int]]) -> bool:
|
||||
"""Solve sudoku using backtracking (for puzzle generation)."""
|
||||
for row in range(9):
|
||||
for col in range(9):
|
||||
if board[row][col] == 0:
|
||||
for num in range(1, 10):
|
||||
if _is_valid_placement(board, row, col, num):
|
||||
board[row][col] = num
|
||||
if _solve_sudoku(board):
|
||||
return True
|
||||
board[row][col] = 0
|
||||
return False
|
||||
return True
|
||||
|
||||
def _generate_complete_board(rng: random.Random) -> List[List[int]]:
|
||||
"""Generate a complete valid Sudoku board."""
|
||||
board = [[0 for _ in range(9)] for _ in range(9)]
|
||||
|
||||
# Fill diagonal 3x3 boxes first (they don't affect each other)
|
||||
for box in range(3):
|
||||
nums = list(range(1, 10))
|
||||
rng.shuffle(nums)
|
||||
for i in range(3):
|
||||
for j in range(3):
|
||||
board[box * 3 + i][box * 3 + j] = nums[i * 3 + j]
|
||||
|
||||
# Solve the rest
|
||||
_solve_sudoku(board)
|
||||
return board
|
||||
|
||||
@dataclass
|
||||
class SudokuGame:
|
||||
    difficulty: int = 40  # Number of cells to remove (20 = easy, 40 = medium, 50 = hard)
    seed: Optional[int] = None
    _rng: random.Random = field(init = False, repr = False)
    _board: List[List[int]] = field(init = False, repr = False)
    _solution: List[List[int]] = field(init = False, repr = False)
    _initial_board: List[List[int]] = field(init = False, repr = False)
    _moves: int = field(default = 0, init = False, repr = False)
    _state: str = field(default = "ongoing", init = False, repr = False)

    def __post_init__(self):
        self._rng = random.Random(self.seed)

        # Generate complete board
        complete_board = _generate_complete_board(self._rng)
        self._solution = copy.deepcopy(complete_board)

        # Remove cells to create puzzle
        self._board = copy.deepcopy(complete_board)
        cells = [(r, c) for r in range(9) for c in range(9)]
        self._rng.shuffle(cells)

        for r, c in cells[:self.difficulty]:
            self._board[r][c] = 0

        self._initial_board = copy.deepcopy(self._board)
        self._update_state()

    def board(self) -> List[List[int]]:
        """Return current board state."""
        return [row[:] for row in self._board]

    def initial_board(self) -> List[List[int]]:
        """Return initial puzzle state."""
        return [row[:] for row in self._initial_board]

    def state(self) -> str:
        """Return game state: 'ongoing', 'success', or 'failed'."""
        return self._state

    def moves(self) -> int:
        """Return number of moves made."""
        return self._moves

    def place_number(self, row: int, col: int, num: int) -> bool:
        """Place a number on the board. Returns True if valid move."""
        # Validate input
        if not (0 <= row < 9 and 0 <= col < 9):
            self._state = "failed"
            return False

        if not (1 <= num <= 9):
            self._state = "failed"
            return False

        # Can't modify initial cells
        if self._initial_board[row][col] != 0:
            self._state = "failed"
            return False
        # Can't overwrite a cell that was already filled by an earlier move
        if self._board[row][col] != 0:
            self._state = "failed"
            return False
        # Check if placement is valid
        if not _is_valid_placement(self._board, row, col, num):
            self._state = "failed"
            return False

        # Place number
        self._board[row][col] = num
        self._moves += 1
        self._update_state()
        return True

    def _update_state(self) -> None:
        """Update game state based on current board."""
        # Check if puzzle is complete
        if all(self._board[r][c] != 0 for r in range(9) for c in range(9)):
            # Verify solution is correct
            if self._board == self._solution:
                self._state = "success"
            else:
                self._state = "failed"
        else:
            self._state = "ongoing"

    def pretty(self, colors: bool = True) -> str:
        """Pretty print the Sudoku board."""
        RESET = "\x1b[0m"
        INITIAL = "\x1b[38;5;45m"  # Cyan for initial numbers
        PLACED = "\x1b[38;5;226m"  # Yellow for placed numbers
        EMPTY = "\x1b[38;5;239m"  # Gray for empty cells

        lines = []
        lines.append("┌───────┬───────┬───────┐")

        for row in range(9):
            row_str = "│ "
            for col in range(9):
                num = self._board[row][col]

                if colors:
                    if num == 0:
                        row_str += f"{EMPTY}.{RESET}"
                    elif self._initial_board[row][col] != 0:
                        row_str += f"{INITIAL}{num}{RESET}"
                    else:
                        row_str += f"{PLACED}{num}{RESET}"
                else:
                    row_str += str(num) if num != 0 else "."

                if col % 3 == 2:
                    row_str += " │ "
                else:
                    row_str += " "

            lines.append(row_str.rstrip())

            if row == 8:
                lines.append("└───────┴───────┴───────┘")
            elif row % 3 == 2:
                lines.append("├───────┼───────┼───────┤")

        return "\n".join(lines)
|
||||
|
||||
|
||||
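# `place_number` relies on `_is_valid_placement`, which is defined earlier in this notebook.
# For readers who only see this part, a check of that kind can be sketched as below.
# `is_valid_placement_sketch` is a hypothetical stand-in written for illustration; the
# notebook's own helper may differ in detail.

```python
from typing import List


def is_valid_placement_sketch(board: List[List[int]], row: int, col: int, num: int) -> bool:
    """Hypothetical stand-in: True if `num` does not clash in its row, column, or 3x3 box."""
    # The row and the column must not already contain the digit.
    if any(board[row][c] == num for c in range(9)):
        return False
    if any(board[r][col] == num for r in range(9)):
        return False
    # Top-left corner of the 3x3 box that contains (row, col).
    box_r, box_c = 3 * (row // 3), 3 * (col // 3)
    for r in range(box_r, box_r + 3):
        for c in range(box_c, box_c + 3):
            if board[r][c] == num:
                return False
    return True


empty = [[0] * 9 for _ in range(9)]
empty[0][0] = 5
print(is_valid_placement_sketch(empty, 0, 8, 5))  # False: 5 is already in row 0
print(is_valid_placement_sketch(empty, 4, 4, 5))  # True: no clash
```

# Note that a check like this only rules out immediate clashes. A digit can pass it and
# still differ from the stored solution, which `_update_state` only detects once the board is full.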
# Test the Sudoku environment:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Create an easy puzzle
|
||||
game = SudokuGame(difficulty = 30, seed = 42)
|
||||
print("Initial puzzle:")
|
||||
print(game.pretty())
|
||||
print(f"\nState: {game.state()}, Moves: {game.moves()}")
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
game
|
||||
|
||||
|
||||
# Try making some moves:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Try placing 7 at row 0, col 1 (this only succeeds if the cell is empty and 7 is legal there)
game.place_number(0, 1, 7)
print("\nAfter placing 7 at (row 0, col 1):")
|
||||
print(game.pretty())
|
||||
print(f"State: {game.state()}, Moves: {game.moves()}")
|
||||
|
||||
|
||||
# If we take an action that is not part of the action space, `place_number` returns `False` and the game state becomes `"failed"`, so the game will not accept any more actions.
|
||||
|
||||
# # RL Environment Setup
|
||||
#
|
||||
# Execute strategies with time limits to prevent infinite loops.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from typing import Callable
|
||||
from unsloth import execute_with_time_limit
|
||||
|
||||
def _execute_strategy(strategy: Callable, game: SudokuGame):
    """Execute a strategy function on a Sudoku game."""
    assert callable(strategy)

    max_moves = 100
    valid_moves = 0  # Track successful moves

    while game.state() == "ongoing" and valid_moves < max_moves:
        try:
            board = game.board()
            initial = game.initial_board()
            result = strategy(board, initial)

            # Validate result format
            if not isinstance(result, (tuple, list)) or len(result) != 3:
                # Invalid format = immediate fail, but return valid moves made
                return valid_moves, "failed"

            row, col, num = result

            # Validate types
            if not all(isinstance(x, int) for x in [row, col, num]):
                return valid_moves, "failed"

            # Try to place number
            success = game.place_number(row, col, num)

            if success:
                valid_moves += 1  # Count this valid move
            else:
                # Invalid move = game fails, but return valid_moves made so far
                return valid_moves, "failed"

        except Exception:
            return valid_moves, "failed"

    if valid_moves >= max_moves and game.state() == "ongoing":
        return valid_moves, "failed"

    return valid_moves, game.state()
|
||||
|
||||
|
||||
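# The format checks that `_execute_strategy` applies to each returned move can be read in
# isolation as the sketch below. `parse_move_sketch` is a hypothetical helper, not part of
# the notebook, and it is slightly stricter than the runner above: it also rejects `bool`
# values (which pass `isinstance(x, int)`) and checks the ranges that `place_number` checks later.

```python
from typing import Any, Optional, Tuple


def parse_move_sketch(result: Any) -> Optional[Tuple[int, int, int]]:
    """Hypothetical helper: return (row, col, num) if the move is well formed, else None."""
    # A move must be a 3-element tuple or list.
    if not isinstance(result, (tuple, list)) or len(result) != 3:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(x, int) and not isinstance(x, bool) for x in result):
        return None
    row, col, num = result
    # Indices are 0-8, digits are 1-9.
    if not (0 <= row < 9 and 0 <= col < 9 and 1 <= num <= 9):
        return None
    return (row, col, num)


print(parse_move_sketch((0, 1, 7)))    # (0, 1, 7)
print(parse_move_sketch((0, 1, "7")))  # None
```

# Keeping the format check separate from the game logic makes it easy to see which failures
# come from malformed output and which come from illegal Sudoku moves.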
# To allow longer-running strategies during Reinforcement Learning, we set a 10 second time limit.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
@execute_with_time_limit(10)
def execute_strategy(strategy: Callable, game: SudokuGame):
    """Execute strategy with 10 second time limit."""
    return _execute_strategy(strategy, game)
|
||||
|
||||
|
||||
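# To make the idea of a time limit concrete, here is a minimal sketch of a decorator with a
# similar shape. This is NOT Unsloth's `execute_with_time_limit`; `time_limit_sketch` is a
# hypothetical name, and the thread-based approach is an assumption chosen because it is
# portable. Its main weakness is that the timed-out worker thread keeps running in the background.

```python
import functools
import threading
import time


def time_limit_sketch(seconds: float):
    """Hypothetical decorator: raise TimeoutError if the call takes longer than `seconds`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                # Capture either the return value or the exception raised by fn.
                try:
                    outcome["value"] = fn(*args, **kwargs)
                except BaseException as exc:
                    outcome["error"] = exc

            worker = threading.Thread(target = target, daemon = True)
            worker.start()
            worker.join(seconds)  # Wait at most `seconds`
            if worker.is_alive():
                raise TimeoutError(f"call exceeded {seconds} seconds")
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]
        return wrapper
    return decorator


@time_limit_sketch(0.05)
def too_slow():
    time.sleep(0.5)
    return "done"


try:
    too_slow()
except TimeoutError as e:
    print(f"Timed out: {e}")
```

# A production implementation would more likely interrupt or isolate the running code
# (for example with signals or a subprocess) so that a runaway strategy cannot keep consuming CPU.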
# Test with a simple strategy:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def simple_strategy(board, initial):
    """Simple strategy: fill the first empty cell with 7."""
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0 and initial[r][c] == 0:
                return (r, c, 7)
    return (0, 0, 7)

game = SudokuGame(difficulty = 30, seed = 42)
try:
    moves, state = execute_strategy(simple_strategy, game)
    print(f"Moves: {moves}, State: {state}")
except TimeoutError as e:
    print(f"Timed out: {e}")
|
||||
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
print(game.pretty())
|
||||
|
||||
|
||||
# # Code Execution
|
||||
#
|
||||
# Before we create and execute a new Python function, we first have to check that it does not reach for global variables or otherwise cheat. This is called `countering reward hacking`: we don't want the function to earn reward without actually solving the task.
#
# For example, the piece of code below is fine, since it uses only Python built-ins and imports nothing. We check it with `check_python_modules`:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from unsloth import check_python_modules, create_locked_down_function
|
||||
|
||||
# Test safe code
|
||||
sample = """
def strategy(board, initial):
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                return (r, c, 1)
    return (0, 0, 1)
"""
|
||||
|
||||
ok, info = check_python_modules(sample)
|
||||
print("Safe Python code?", ok)
|
||||
print(info)
|
||||
|
||||
|
||||
# For the below piece of code, since we import `numpy`, we should not allow the execution:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
sample = """
def strategy(board, initial):
    import numpy as np
    return (0, 0, 1)
"""
|
||||
|
||||
ok, info = check_python_modules(sample)
|
||||
print("Safe Python code?", ok)
|
||||
print(info)
|
||||
|
||||
|
||||
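# The two checks above can be approximated with the standard library alone. The sketch below
# is an illustration of the idea, not how `check_python_modules` is implemented:
# `find_imports_sketch` is a hypothetical helper that parses the source with `ast` and
# collects every imported top-level module name.

```python
import ast
from typing import Set


def find_imports_sketch(source: str) -> Set[str]:
    """Hypothetical helper: top-level module names imported anywhere in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from a.b import c` contributes "a"
            modules.add(node.module.split(".")[0])
    return modules


safe = "def strategy(board, initial):\n    return (0, 0, 1)\n"
unsafe = "def strategy(board, initial):\n    import numpy as np\n    return (0, 0, 1)\n"
print(find_imports_sketch(safe))    # set()
print(find_imports_sketch(unsafe))  # {'numpy'}
```

# A static scan like this misses dynamic tricks such as `__import__("numpy")`, which is one
# reason the notebook also runs the code through `create_locked_down_function`.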
# # Data & RL task setup
|
||||
#
|
||||
# Create the prompt that instructs the model to generate a Sudoku solving strategy. You can customize this to some other task for another RL task.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
prompt = """
|
||||
Create a Sudoku solving strategy using only native Python built-in functions without any import statements.
|
||||
You are given two lists of lists (9x9 grids):
|
||||
- board: current state (0 means empty)
|
||||
- initial: starting puzzle (0 means was empty, numbers are fixed)
|
||||
|
||||
Return a tuple (row, col, number) for the next move.
|
||||
- row: 0-8 (row index)
|
||||
- col: 0-8 (column index)
|
||||
- number: 1-9 (digit to place)
|
||||
|
||||
Only place numbers in cells that are BOTH empty in initial AND empty in board (initial[row][col] == 0 AND board[row][col] == 0)
|
||||
Use Sudoku rules: no duplicates in rows, columns, or 3x3 boxes.
|
||||
Output your function in backticks:
|
||||
```python
|
||||
def strategy(board, initial):
|
||||
    # Your logic here
    return (row, col, number)
|
||||
```
|
||||
All helper functions must be inside def strategy. Output only the function.
|
||||
""".strip()
|
||||
|
||||
print(prompt)
|
||||
|
||||
|
||||
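# To give a feel for what a completion that follows this prompt might look like, here is a
# hand-written sketch. It is an assumption for illustration, not model output: it keeps its
# helper inside `strategy`, uses no imports, and plays the empty cell with the fewest legal
# digits. It is a greedy heuristic, not a full solver, so it can still pick a wrong digit
# when a cell has several candidates.

```python
def strategy(board, initial):
    # Helper lives inside `strategy`, as the prompt requires.
    def candidates(r, c):
        used = set(board[r]) | {board[i][c] for i in range(9)}
        br, bc = 3 * (r // 3), 3 * (c // 3)
        used |= {board[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
        return [n for n in range(1, 10) if n not in used]

    best = None
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0 and initial[r][c] == 0:
                options = candidates(r, c)
                # Prefer the cell with the fewest legal digits.
                if options and (best is None or len(options) < len(best[2])):
                    best = (r, c, options)
    if best is None:
        return (0, 0, 1)  # Nothing playable; the environment will reject this move
    return (best[0], best[1], best[2][0])


# A valid solved grid built from a standard shifting pattern, with one cell blanked out.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [row[:] for row in solved]
puzzle[4][4] = 0
print(strategy(puzzle, [row[:] for row in puzzle]))  # (4, 4, 9)
```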
# First, let's prompt the model without RL and see how it goes:
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
text = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
tokenize = False,
|
||||
add_generation_prompt = True,
|
||||
)
|
||||
|
||||
from transformers import TextStreamer
|
||||
print("=" * 50)
|
||||
print("BASE MODEL OUTPUT (before RL training):")
|
||||
print("=" * 50)
|
||||
|
||||
inputs = tokenizer(
|
||||
text = text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# # Reward functions
|
||||
#
|
||||
# We now design an `extract_function` function, which simply extracts the function wrapped in 3 backticks.
#
# And 3 reward functions:
#
# 1. `function_works`, which rewards the model if the strategy is a valid Python function.
# 2. `no_cheating`, which checks whether the function imported other modules and penalizes it if it did.
# 3. `strategy_succeeds`, which checks whether the auto-generated strategy actually solves the Sudoku puzzle when it is run.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def extract_function(text):
    """Extract Python function from markdown code blocks."""
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def strategy(board, initial):"):
            return fx
    return None
|
||||
|
||||
|
||||
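# A quick way to see what `extract_function` accepts and rejects is to run it on a few
# hand-made completions. The snippet below repeats the same logic so that it runs on its
# own (it needs Python 3.9+ for `str.removeprefix`); the three sample strings are made up
# for illustration, and the fence marker is built programmatically only to keep this
# snippet free of literal triple backticks.

```python
FENCE = "`" * 3  # Three backticks


def extract_function(text):
    """Same logic as the notebook cell above, repeated so this snippet runs standalone."""
    if text.count(FENCE) >= 2:
        first = text.find(FENCE) + 3
        second = text.find(FENCE, first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def strategy(board, initial):"):
            return fx
    return None


good = f"Sure!\n{FENCE}python\ndef strategy(board, initial):\n    return (0, 0, 1)\n{FENCE}"
renamed = f"{FENCE}python\ndef solve(board, initial):\n    return (0, 0, 1)\n{FENCE}"
unclosed = f"{FENCE}python\ndef strategy(board, initial):\n    return (0, 0, 1)"

print(extract_function(good) is not None)  # True
print(extract_function(renamed))           # None: wrong function name
print(extract_function(unclosed))          # None: only one fence marker
```

# The exact-signature check is strict on purpose: a completion that renames the function or
# its arguments is treated the same as one that produced no code at all.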
# **Reward 1: Function Works**
|
||||
#
|
||||
# Checks if the generated code is valid Python and can be executed.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def function_works(completions, **kwargs):
    """Reward for generating valid executable Python code."""
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)

        if function is not None:
            ok, info = check_python_modules(function)

        # `info` is only read when `function` is not None (short-circuit evaluation)
        if function is None or "error" in info:
            score = -2.0  # Invalid function
        else:
            try:
                new_strategy = create_locked_down_function(function)
                score = 1.0  # Valid function
            except Exception:
                score = -1.0  # Function has errors

        scores.append(score)
    return scores
|
||||
|
||||
|
||||
# **Reward 2: No Cheating**
|
||||
#
|
||||
# Penalizes functions that import external libraries.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
def no_cheating(completions, **kwargs):
    """Penalize use of external imports."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function(response)

        if function is not None:
            ok, info = check_python_modules(function)
            scores.append(1.0 if ok else -20.0)  # Heavy penalty for cheating
        else:
            scores.append(-1.0)  # Failed to create function

    return scores
|
||||
|
||||
|
||||
# **Reward 3: Strategy Succeeds**
|
||||
#
|
||||
# Rewards strategies that successfully solve Sudoku puzzles.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
import numpy as np
|
||||
|
||||
global PRINTER
|
||||
PRINTER = 0
|
||||
|
||||
def strategy_succeeds(completions, **kwargs):
    """Reward valid moves even if strategy eventually fails."""
    global PRINTER
    scores = []

    # All completions in this batch are scored on the same puzzle
    seed = np.random.randint(10000)
    difficulty = 40
    for completion in completions:
        printed = False
        response = completion[0]["content"]
        function = extract_function(response)

        # Print every 5th extracted function for inspection
        if PRINTER % 5 == 0:
            printed = True
            print("\n" + "=" * 60)
            print(function)
            print("=" * 60)
        PRINTER += 1

        if function is not None:
            ok, info = check_python_modules(function)

        if function is None or "error" in info:
            scores.append(0)
            continue

        try:
            new_strategy = create_locked_down_function(function)
        except Exception:
            scores.append(0)
            continue

        try:
            game = SudokuGame(difficulty = difficulty, seed = seed)
            valid_moves, game_state = execute_strategy(new_strategy, game)
            if valid_moves == difficulty:
                game_state = "success"

            print(f"\n Valid moves: {valid_moves}, Final state: {game_state}")

            if not printed:
                print("Strategy:")
                print(function[:200] + "..." if len(function) > 200 else function)

            print("\nFinal board:")
            print(game.pretty())

            if game_state == "success":
                scores.append(30.0)  # Solved the puzzle!
            elif valid_moves > 0:
                # Reward based on valid moves made before failure
                # Each valid move is worth 0.2 points
                reward = valid_moves * 0.2
                scores.append(reward)
            else:
                scores.append(-2.0)  # Failed immediately with no valid moves

        except TimeoutError:
            print("Timeout")
            scores.append(-1.0)
        except Exception as e:
            print(f"Exception: {str(e)[:100]}")
            scores.append(-3.0)

    return scores
|
||||
|
||||
|
||||
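# The scoring rule inside `strategy_succeeds` is easier to reason about when it is pulled
# out on its own. `shaped_score_sketch` below is a hypothetical helper that mirrors only the
# success / partial-credit / immediate-failure branch; the timeout and exception penalties
# are left out. With `difficulty = 40`, partial credit tops out at 39 valid moves, or about
# 7.8 points, which stays well below the 30 points for a full solve.

```python
def shaped_score_sketch(valid_moves: int, game_state: str) -> float:
    """Hypothetical helper mirroring the reward branch of `strategy_succeeds`."""
    if game_state == "success":
        return 30.0                # Solved the puzzle
    if valid_moves > 0:
        return valid_moves * 0.2   # Partial credit: 0.2 points per valid move
    return -2.0                    # Failed immediately with no valid moves


for moves, state in [(0, "failed"), (5, "failed"), (39, "failed"), (40, "success")]:
    print(moves, state, shaped_score_sketch(moves, state))
```

# The gap between the best partial score and the success score is what keeps the model from
# settling for strategies that make many legal moves but never finish the puzzle.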
# # Dataset Preparation
|
||||
#
|
||||
# Create the training dataset.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from datasets import Dataset
|
||||
|
||||
dataset = Dataset.from_list([
|
||||
{
|
||||
"prompt": [{"role": "user", "content": prompt.strip()}],
|
||||
"answer": 0,
|
||||
}
|
||||
] * 1000)
|
||||
|
||||
maximum_length = len(tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
add_generation_prompt = True
|
||||
))
|
||||
|
||||
print(f"Maximum prompt length: {maximum_length}")
|
||||
print("\nDataset sample:")
|
||||
print(dataset[0])
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
#
|
||||
# Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Leave room for the prompt (plus 1 token safety margin)
|
||||
max_completion_length = max_seq_length - (maximum_length + 1)
|
||||
|
||||
from trl import GRPOConfig, GRPOTrainer
|
||||
training_args = GRPOConfig(
|
||||
temperature = 1.0,
|
||||
learning_rate = 5e-5,
|
||||
weight_decay = 0.001,
|
||||
warmup_ratio = 0.1,
|
||||
lr_scheduler_type = "linear",
|
||||
optim = "adamw_8bit",
|
||||
logging_steps = 1,
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
|
||||
num_generations = 2, # Decrease if out of memory
|
||||
max_completion_length = max_completion_length,
|
||||
# num_train_epochs = 1, # Set to 1 for a full training run
|
||||
max_steps = 60,
|
||||
save_steps = 100,
|
||||
report_to = "none", # Can use Weights & Biases, TrackIO
|
||||
output_dir = "outputs",
|
||||
epsilon = 0.2,
|
||||
epsilon_high = 0.28, # one sided
|
||||
delta = 1.5, # two sided
|
||||
loss_type = 'bnpo',
|
||||
mask_truncated_completions = True,
|
||||
# For optional training + evaluation
|
||||
# fp16_full_eval = True,
|
||||
# per_device_eval_batch_size = 4,
|
||||
# eval_accumulation_steps = 1,
|
||||
# eval_strategy = "steps",
|
||||
# eval_steps = 1,
|
||||
)
|
||||
|
||||
|
||||
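# GRPO scores each completion relative to the other completions sampled for the same
# prompt (`num_generations` of them). The sketch below illustrates that idea under stated
# assumptions: the rewards from the three reward functions are summed per completion and
# then standardized within the group. `group_advantages_sketch`, the use of the population
# standard deviation, and the `eps` value are all assumptions for illustration; TRL's
# actual computation differs in details and depends on settings such as `loss_type`.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages_sketch(rewards_per_function: List[List[float]], eps: float = 1e-4) -> List[float]:
    """Hypothetical helper: standardize summed rewards within one group of completions."""
    # Sum the rewards across reward functions for each completion.
    totals = [sum(column) for column in zip(*rewards_per_function)]
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


# One prompt, two completions (num_generations = 2), three reward functions.
rewards = [
    [1.0, 1.0],  # function_works
    [1.0, 1.0],  # no_cheating
    [0.4, 2.0],  # strategy_succeeds
]
print(group_advantages_sketch(rewards))

# Identical totals give zero advantage, so such a group carries no learning signal.
print(group_advantages_sketch([[1.0, 1.0]]))  # [0.0, 0.0]
```

# This is why a larger `num_generations` tends to help when memory allows: with only two
# completions per prompt, ties are common and many steps contribute nothing.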
# And let's run the trainer! As training runs, you'll see a table of rewards like the one below. The goal is to see the `reward` column increase!
#
# You might have to wait 150 to 200 steps before rewards improve, and you'll probably get low reward for the first 100 steps. The configuration above stops at `max_steps = 60`, so raise it for a longer run. Please be patient!
|
||||
#
|
||||
# | Step | Training Loss | reward | reward_std | completion_length | kl |
|
||||
# |------|---------------|-----------|------------|-------------------|----------|
|
||||
# | 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
|
||||
# | 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
|
||||
# | 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# For optional training + evaluation
|
||||
# new_dataset = dataset.train_test_split(test_size = 0.01)
|
||||
|
||||
trainer = GRPOTrainer(
|
||||
model = model,
|
||||
processing_class = tokenizer,
|
||||
reward_funcs = [
|
||||
function_works,
|
||||
no_cheating,
|
||||
strategy_succeeds,
|
||||
],
|
||||
args = training_args,
|
||||
train_dataset = dataset,
|
||||
|
||||
# For optional training + evaluation
|
||||
# train_dataset = new_dataset["train"],
|
||||
# eval_dataset = new_dataset["test"],
|
||||
)
|
||||
|
||||
|
||||
# And let's train the model!
|
||||
#
|
||||
# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
trainer.train()
|
||||
|
||||
|
||||
# Now that we have trained a LoRA with GRPO, we first save it!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
|
||||
|
||||
# Verify LoRA is actually trained!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
from safetensors import safe_open

tensors = {}
# Read the adapter that was saved to "gemma_4_lora" above
with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Fraction of entries that are exactly zero (1.0 means the tensor is all zeros)
        n_zeros = (tensor == 0).sum() / tensor.numel()
        assert n_zeros.item() != 1.0, f"{key} is all zeros"
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# # Inference
|
||||
# Now let's try the model we just trained!
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
text = tokenizer.apply_chat_template(
|
||||
[{"role": "user", "content": prompt.strip()}],
|
||||
tokenize = False,
|
||||
add_generation_prompt = True,
|
||||
)
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
_ = model.generate(
|
||||
**tokenizer(images = None,text = text, return_tensors = "pt").to("cuda"),
|
||||
temperature = 1.0,
|
||||
max_new_tokens = 512,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = False),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Merge to 16bit
|
||||
if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
|
||||
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Merge to 4bit
|
||||
if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
|
||||
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Just LoRA adapters
|
||||
if False:
    model.save_pretrained("gemma_4_lora")
    tokenizer.save_pretrained("gemma_4_lora")
if False:
    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
|
||||
#
|
||||
# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
|
||||
# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
|
||||
# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
|
||||
# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
|
||||
#
|
||||
# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
|
||||
# In[ ]:
|
||||
|
||||
|
||||
# Save to 8bit Q8_0
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
|
||||
# Remember to go to https://huggingface.co/settings/tokens for a token!
|
||||
# And change HF_USERNAME to your username!
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to 16bit GGUF
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to q4_k_m GGUF
|
||||
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
|
||||
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
|
||||
|
||||
# Save to multiple GGUF options - much faster if you want multiple!
|
||||
if False:
    model.push_to_hub_gguf(
        "HF_USERNAME/gemma_4_finetune",  # Change HF_USERNAME to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "YOUR_HF_TOKEN",
    )
|
||||
|
||||
|
||||
# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,478 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
#
|
||||
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastModel
|
||||
import torch
|
||||
from huggingface_hub import snapshot_download
|
||||
|
||||
fourbit_models = [
|
||||
# Gemma 4 models
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, processor = FastModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E4B-it",
|
||||
dtype = None, # None for auto detection
|
||||
max_seq_length = 8192, # Choose any for long context!
|
||||
load_in_4bit = True, # 4 bit quantization to reduce memory
|
||||
full_finetuning = False, # [NEW!] We have full finetuning now!
|
||||
# token = "YOUR_HF_TOKEN", # HF Token for gated models
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can process Text, Vision and Audio!
|
||||
#
|
||||
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
from transformers import TextStreamer
|
||||
# Helper function for inference
|
||||
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **processor.apply_chat_template(
            messages,
            add_generation_prompt = True,  # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        do_sample = False,
        streamer = TextStreamer(processor, skip_prompt = True),
    )
|
||||
|
||||
|
||||
# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
from datasets import load_dataset,Audio,concatenate_datasets
|
||||
|
||||
dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
|
||||
|
||||
# Select a single audio sample to reserve for testing.
|
||||
# This index is chosen from the full dataset before we create the smaller training split.
|
||||
test_audio = dataset[7546]
|
||||
|
||||
dataset = dataset.select(range(3000))
|
||||
|
||||
dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
|
||||
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
from IPython.display import Audio, display
|
||||
print(test_audio['text'])
|
||||
Audio(test_audio['audio']['array'],rate = test_audio['audio']['sampling_rate'])
|
||||
|
||||
|
||||
# And the translation of the audio from German to English is:
|
||||
#
|
||||
# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "You are an assistant that transcribes speech accurately.",
|
||||
}
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "audio", "audio": test_audio['audio']['array']},
|
||||
{"type": "text", "text": "Please transcribe this audio."}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
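# The notebook does not show how the WER figure above was computed. As a minimal sketch, assuming the usual definition (word-level edit distance divided by the number of reference words) and simple lowercasing plus whitespace tokenization, it could be measured like this. The function name and the toy strings are illustrative, not part of the original notebook.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy strings (hypothetical, not the model's actual output).
reference = "ich halte mich direkt daran"
hypothesis = "ich halte dich direkt"
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

# Real evaluations usually also normalize punctuation and numerals before scoring, so results from this sketch may differ from the number quoted above.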
|
||||
|
||||
# # Let's finetune Gemma 4!
|
||||
#
|
||||
# You can finetune the vision, text, and audio parts.
|
||||
|
||||
# We now add LoRA adapters so we only need to update a small number of parameters!
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
model = FastModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = False, # False if not finetuning vision layers
|
||||
finetune_language_layers = True, # False if not finetuning language layers
|
||||
finetune_attention_modules = True, # False if not finetuning attention layers
|
||||
finetune_mlp_modules = True, # False if not finetuning MLP layers
|
||||
|
||||
r = 8, # The larger, the higher the accuracy, but might overfit
|
||||
lora_alpha = 16, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
use_rslora = False, # We support rank stabilized LoRA
|
||||
loftq_config = None, # And LoftQ
|
||||
target_modules = [
|
||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||
"gate_proj", "up_proj", "down_proj",
|
||||
|
||||
# Audio layers
|
||||
"post", "linear_start", "linear_end",
|
||||
"embedding_projection",
|
||||
"ffw_layer_1", "ffw_layer_2",
|
||||
"output_proj",
|
||||
]
|
||||
)
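# To make the effect of `r` and `lora_alpha` concrete, here is a rough sketch of the standard LoRA arithmetic: each adapted linear layer of shape `d_out x d_in` gains two small matrices with `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The 2048 x 2048 layer size is a made-up example, not a Gemma 4 dimension.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


# Hypothetical layer size, chosen only for illustration.
d_in, d_out = 2048, 2048
full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, r=8)

print(f"full layer:   {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({adapter_params / full_params:.2%} of the layer)")
print(f"scaling with r = 8, lora_alpha = 16: {lora_scaling(16, 8)}")
```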
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
|
||||
#
|
||||
# ```
|
||||
# <bos><|turn>system
|
||||
# You are an assistant that transcribes speech accurately.<turn|>
|
||||
# <|turn>user
|
||||
# <|audio|>Please transcribe this audio.<turn|>
|
||||
# <|turn>model
|
||||
# Ich, ich rechne direkt mich an.<turn|>
# ```
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
def format_intersection_data(samples: dict) -> dict[str, list]:
    """Format a batch of audio-text pairs into the chat message format expected by the processor"""
|
||||
formatted_samples = {"messages": []}
|
||||
for idx in range(len(samples["audio"])):
|
||||
audio = samples["audio"][idx]["array"]
|
||||
label = str(samples["text"][idx])
|
||||
|
||||
message = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "You are an assistant that transcribes speech accurately.",
|
||||
}
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "audio", "audio": audio},
|
||||
{"type": "text", "text": "Please transcribe this audio."}
|
||||
]
|
||||
},
|
||||
{
|
||||
"role": "assistant",
|
||||
"content":[{"type": "text", "text": label}]
|
||||
}
|
||||
]
|
||||
formatted_samples["messages"].append(message)
|
||||
return formatted_samples
|
||||
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the step limit by setting `max_steps=None`.
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
# Use UnslothVisionDataCollator which handles audio token alignment correctly
|
||||
from unsloth.trainer import UnslothVisionDataCollator
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
train_dataset = dataset,
|
||||
processing_class = processor.tokenizer,
|
||||
data_collator = UnslothVisionDataCollator(model, processor),
|
||||
args = SFTConfig(
|
||||
per_device_train_batch_size = 8,
|
||||
gradient_accumulation_steps = 1,
|
||||
warmup_ratio = 0.03,
|
||||
# num_train_epochs = 1, # Use for full training runs
|
||||
max_steps = 60,
|
||||
learning_rate = 5e-5,
|
||||
logging_steps = 1,
|
||||
save_strategy = "steps",
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "cosine",
|
||||
seed = 3407,
|
||||
output_dir = "outputs",
|
||||
report_to = "none",
|
||||
remove_unused_columns = False,
|
||||
|
||||
# The below are a must for audio finetuning:
|
||||
dataset_text_field = "",
|
||||
dataset_kwargs = {"skip_prepare_dataset": True},
|
||||
max_length = 8192,
|
||||
)
|
||||
)
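# As a quick sanity check on the configuration above, this sketch works out how much data a 60-step run actually sees. It assumes a single GPU and that `warmup_ratio` is converted to warmup steps by rounding up; both are assumptions rather than something the notebook states.

```python
import math

# Values copied from the SFTConfig above and the 3000-row training split.
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
max_steps = 60
warmup_ratio = 0.03
dataset_rows = 3000
num_devices = 1  # assumption: one GPU

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
fraction_of_dataset = samples_seen / dataset_rows
warmup_steps = math.ceil(max_steps * warmup_ratio)  # assumption: rounded up
steps_per_epoch = math.ceil(dataset_rows / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"samples seen in {max_steps} steps: {samples_seen} ({fraction_of_dataset:.0%} of the split)")
print(f"warmup steps: {warmup_steps}")
print(f"steps for one full epoch: {steps_per_epoch}")
```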
|
||||
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# # Let's train the model!
|
||||
#
|
||||
# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
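# The memory statistics above can be factored into a small pure function, which makes the arithmetic easier to check without a GPU. This is only a sketch: `summarize_memory` is a hypothetical helper, and the byte counts passed to it below are invented. Note that dividing by 1024 three times yields GiB, even though the notebook labels the values "GB".

```python
GIB = 1024 ** 3


def summarize_memory(start_bytes: int, peak_bytes: int, total_bytes: int) -> dict:
    """Mirror the notebook's statistics: peak usage, the share attributed to training, and percentages."""
    start = round(start_bytes / GIB, 3)
    peak = round(peak_bytes / GIB, 3)
    total = round(total_bytes / GIB, 3)
    training = round(peak - start, 3)
    return {
        "peak_gib": peak,
        "training_gib": training,
        "peak_pct": round(peak / total * 100, 3),
        "training_pct": round(training / total * 100, 3),
    }


# Invented numbers: 8 GiB reserved before training, 12 GiB at peak, 40 GiB card.
stats = summarize_memory(8 * GIB, 12 * GIB, 40 * GIB)
for name, value in stats.items():
    print(f"{name} = {value}")
```

# In the notebook the three inputs would come from `torch.cuda.max_memory_reserved()` before and after training and from `gpu_stats.total_memory`.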
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`, but for this ASR example we use greedy decoding (`do_sample=False`) so that the transcription does not vary between runs.
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "You are an assistant that transcribes speech accurately.",
|
||||
}
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "audio", "audio": test_audio['audio']['array']},
|
||||
{"type": "text", "text": "Please transcribe this audio."}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
processor.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
if False:
|
||||
from unsloth import FastModel
|
||||
model, processor = FastModel.from_pretrained(
|
||||
model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
|
||||
max_seq_length = 2048,
|
||||
load_in_4bit = True,
|
||||
)
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
|
||||
}]
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 128, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(processor, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
if False: # Change to True to save finetune!
|
||||
    model.save_pretrained_merged("gemma-4-finetune", processor)
|
||||
|
||||
|
||||
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
if False: # Change to True to upload finetune
|
||||
model.push_to_hub_merged(
|
||||
"HF_ACCOUNT/gemma-4-finetune", processor,
|
||||
token = "YOUR_HF_TOKEN"
|
||||
)
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# Unsloth now natively supports saving to `GGUF` for `llama.cpp` for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4bit will come later!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
if False: # Change to True to save to GGUF
|
||||
model.save_pretrained_gguf(
|
||||
"gemma_4_finetune",
|
||||
processor,
|
||||
quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
|
||||
)
|
||||
|
||||
|
||||
# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
if False: # Change to True to upload GGUF
|
||||
model.push_to_hub_gguf(
|
||||
"HF_ACCOUNT/gemma_4_finetune",
|
||||
processor,
|
||||
quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
|
||||
token = "YOUR_HF_TOKEN",
|
||||
)
|
||||
|
||||
|
||||
# Now use the exported `.gguf` file (for example the `Q8_0` one from the `gemma_4_finetune` export above) in llama.cpp.
|
||||
#
|
||||
# And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or need help, join our [Discord](https://discord.gg/unsloth)!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,557 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# ### Unsloth
|
||||
#
|
||||
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastModel
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, tokenizer = FastModel.from_pretrained(
|
||||
model_name = "unsloth/gemma-4-E4B-it",
|
||||
dtype = None, # None for auto detection
|
||||
max_seq_length = 1024, # Choose any for long context!
|
||||
load_in_4bit = True, # 4 bit quantization to reduce memory
|
||||
full_finetuning = False, # [NEW!] We have full finetuning now!
|
||||
# token = "YOUR_HF_TOKEN", # HF Token for gated models
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can process Text, Vision and Audio!
|
||||
#
|
||||
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
from transformers import TextStreamer
|
||||
# Helper function for inference
|
||||
def do_gemma_4_inference(messages, max_new_tokens = 128):
|
||||
_ = model.generate(
|
||||
**tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
return_tensors = "pt",
|
||||
).to("cuda"),
|
||||
max_new_tokens = max_new_tokens,
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
use_cache = True
|
||||
)
|
||||
|
||||
|
||||
# # Gemma 4 can see images!
|
||||
#
|
||||
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "image", "image" : sloth_link },
|
||||
{ "type": "text", "text" : "Which films does this animal feature in?" }
|
||||
]
|
||||
}]
|
||||
# You might have to wait 1 minute for Unsloth's auto compiler
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# Let's make a poem about sloths!
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{ "type" : "text",
|
||||
"text" : "Write a poem about sloths." }]
|
||||
}]
|
||||
do_gemma_4_inference(messages)
|
||||
|
||||
|
||||
# # Gemma 4 can also hear!
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
from IPython.display import Audio, display
|
||||
Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
|
||||
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
|
||||
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
audio_file = "audio.mp3"
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "audio", "audio" : audio_file },
|
||||
{ "type": "text", "text" : "What is this audio about?" }
|
||||
]
|
||||
}]
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# # Let's combine all 3 modalities together!
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role" : "user",
|
||||
"content": [
|
||||
{ "type": "audio", "audio" : audio_file },
|
||||
{ "type": "image", "image" : sloth_link },
|
||||
{ "type": "text", "text" : "What is this audio and image about? "\
|
||||
"How are they related?" }
|
||||
]
|
||||
}]
|
||||
do_gemma_4_inference(messages, max_new_tokens = 256)
|
||||
|
||||
|
||||
# # Let's finetune Gemma 4!
|
||||
#
|
||||
# For now, you can choose whether to finetune the vision and text parts via the flags below. The audio part can also be finetuned, and we're working to make it selectable as well!
|
||||
|
||||
# We now add LoRA adapters so we only need to update a small number of parameters!
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
model = FastModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = False, # Turn off for just text!
|
||||
finetune_language_layers = True, # Should leave on!
|
||||
finetune_attention_modules = True, # Attention good for GRPO
|
||||
finetune_mlp_modules = True, # Should leave on always!
|
||||
|
||||
r = 8, # Larger = higher accuracy, but might overfit
|
||||
lora_alpha = 8, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
)
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
|
||||
#
|
||||
# ```
|
||||
# <bos><|turn>user
|
||||
# Hello<turn|>
|
||||
# <|turn>model
|
||||
# Hey there!<turn|>
|
||||
# ```
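# To see how the rendered text above relates to a list of messages, here is a simplified stand-in for the chat template. It is only a sketch: `render_turns` is a hypothetical helper that reproduces the layout shown above for plain-text turns, and the real `gemma-4` template handles more cases (system prompts, images, audio) that are not modelled here. The trailing newline after each `<turn|>` is also an assumption.

```python
def render_turns(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Render plain-text chat messages in the <|turn>role ... <turn|> layout."""
    role_names = {"assistant": "model"}  # the rendered format calls the assistant "model"
    text = "<bos>"
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues from here.
        text += "<|turn>model\n"
    return text


conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_turns(conversation))
```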
|
||||
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4",
|
||||
)
|
||||
|
||||
|
||||
# We get the first 3000 rows of the dataset
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
|
||||
|
||||
|
||||
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import standardize_data_formats
|
||||
dataset = standardize_data_formats(dataset)
|
||||
|
||||
|
||||
# Let's see what row 100 looks like!
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
dataset[100]
|
||||
|
||||
|
||||
# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning: the processor adds this token before training, and the model expects exactly one.
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
def formatting_prompts_func(examples):
|
||||
convos = examples["conversations"]
|
||||
texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
|
||||
return { "text" : texts, }
|
||||
|
||||
dataset = dataset.map(formatting_prompts_func, batched = True)
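# The reason for `removeprefix('<bos>')` can be shown with plain strings. In this sketch, `add_bos` is a hypothetical stand-in for the later tokenization step that prepends `<bos>`; the example text is made up.

```python
BOS = "<bos>"


def add_bos(text: str) -> str:
    """Stand-in for the tokenization step that prepends <bos> to every sample."""
    return BOS + text


rendered = "<bos><|turn>user\nHello<turn|>\n"  # chat template output, already has <bos>
stripped = rendered.removeprefix(BOS)          # what the notebook stores in `text`

doubled = add_bos(rendered)  # without stripping: two <bos> tokens
single = add_bos(stripped)   # with stripping: exactly one

print(doubled.count(BOS), single.count(BOS))
```

# `str.removeprefix` requires Python 3.9 or newer and leaves the string untouched when the prefix is absent, so applying it to text without `<bos>` is harmless.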
|
||||
|
||||
|
||||
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
dataset[100]["text"]
|
||||
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the step limit by setting `max_steps=None`.
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
tokenizer = tokenizer,
|
||||
train_dataset = dataset,
|
||||
eval_dataset = None, # Can set up evaluation!
|
||||
args = SFTConfig(
|
||||
dataset_text_field = "text",
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
|
||||
warmup_steps = 5,
|
||||
# num_train_epochs = 1, # Set this for 1 full training run.
|
||||
max_steps = 60,
|
||||
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
|
||||
logging_steps = 1,
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "linear",
|
||||
seed = 3407,
|
||||
report_to = "none", # Use TrackIO/WandB etc
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import train_on_responses_only
|
||||
trainer = train_on_responses_only(
|
||||
trainer,
|
||||
instruction_part = "<|turn>user\n",
|
||||
response_part = "<|turn>model\n",
|
||||
)
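# Conceptually, response-only training replaces the labels of every non-response token with `-100`, the index that the loss ignores. The sketch below illustrates that idea on a toy token list. It is not Unsloth's implementation: `mask_non_response` is a hypothetical helper, and treating each marker as a single token is a simplification.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_non_response(tokens: list, instruction_marker: str, response_marker: str) -> list:
    """Keep labels only for tokens that follow a response marker; mask everything else."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    for i, token in enumerate(tokens):
        if token == response_marker:
            in_response = True   # the marker itself stays masked
            continue
        if token == instruction_marker:
            in_response = False  # a new user turn starts: mask again
            continue
        if in_response:
            labels[i] = token
    return labels


toy_tokens = [
    "<bos>", "<|turn>user\n", "Hello", "<turn|>",
    "<|turn>model\n", "Hey", "there!", "<turn|>",
]
labels = mask_non_response(toy_tokens, "<|turn>user\n", "<|turn>model\n")
print(labels)
```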
|
||||
|
||||
|
||||
# Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
|
||||
|
||||
|
||||
# Now let's print the masked out example - you should see only the answer is present:
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
|
||||
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# # Let's train the model!
|
||||
#
|
||||
# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
|
||||
|
||||
# In[23]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[24]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
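# For readers unfamiliar with these three settings, the sketch below shows one common formulation of temperature, top-k and top-p (nucleus) filtering on a toy logit vector. `filter_probs` is a hypothetical helper written for illustration; library implementations differ in details such as the order of the filters and how ties are handled.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the survivors


toy_logits = [2.0, 1.0, 0.0, -3.0]
candidates = filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9)

rng = random.Random(0)
sampled = rng.choices(list(candidates), weights=list(candidates.values()))[0]
greedy = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
print(f"candidates: {sorted(candidates)}, sampled: {sampled}, greedy: {greedy}")
```

# Greedy decoding (`do_sample=False`) corresponds to always taking the `greedy` index, while sampling draws from the filtered candidates.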
|
||||
|
||||
# In[25]:
|
||||
|
||||
|
||||
from unsloth.chat_templates import get_chat_template
|
||||
tokenizer = get_chat_template(
|
||||
tokenizer,
|
||||
chat_template = "gemma-4",
|
||||
)
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{
|
||||
"type" : "text",
|
||||
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
|
||||
}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
)
|
||||
tokenizer.batch_decode(outputs)
|
||||
|
||||
|
||||
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
|
||||
|
||||
# In[26]:
|
||||
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 64, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[27]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
tokenizer.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
|
||||
|
||||
# In[28]:
|
||||
|
||||
|
||||
if False:
|
||||
from unsloth import FastModel
|
||||
model, tokenizer = FastModel.from_pretrained(
|
||||
model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
|
||||
max_seq_length = 2048,
|
||||
load_in_4bit = True,
|
||||
)
|
||||
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
|
||||
}]
|
||||
inputs = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt = True, # Must add for generation
|
||||
return_tensors = "pt",
|
||||
tokenize = True,
|
||||
return_dict = True,
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
_ = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens = 128, # Increase for longer outputs!
|
||||
# Recommended Gemma-4 settings!
|
||||
temperature = 1.0, top_p = 0.95, top_k = 64,
|
||||
streamer = TextStreamer(tokenizer, skip_prompt = True),
|
||||
)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
|
||||
|
||||
# In[29]:
|
||||
|
||||
|
||||
if False: # Change to True to save finetune!
|
||||
model.save_pretrained_merged("gemma-4-finetune", tokenizer)
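# Saving a merged model differs from saving adapters in that the low-rank update is folded into the base weights. The sketch below shows the standard LoRA merge, `W' = W + (lora_alpha / r) * B @ A`, on tiny hand-written matrices. It illustrates the general technique only; whether `save_pretrained_merged` performs exactly these steps internally is an assumption, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Fold the LoRA update into the base weight: W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Tiny made-up example with d_in = d_out = 2 and rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x d_in
B = [[3.0], [4.0]]  # d_out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# The merged weight gives the same output as base + adapter applied separately.
x = [[1.0], [1.0]]
separate = [[base[0] + 2 * extra[0]]
            for base, extra in zip(matmul(W, x), matmul(B, matmul(A, x)))]
print(merged, matmul(merged, x) == separate)
```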
|
||||
|
||||
|
||||
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[30]:
|
||||
|
||||
|
||||
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
|
||||
|
||||
|
||||
# ### GGUF / llama.cpp Conversion
|
||||
# We now support saving to `GGUF` / `llama.cpp` natively for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4bit will come later!
|
||||
|
||||
# In[31]:
|
||||
|
||||
|
||||
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
|
||||
|
||||
|
||||
# Likewise, to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
|
||||
|
||||
# In[32]:
|
||||
|
||||
|
||||
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
|
||||
|
||||
|
||||
# Now, use the exported `.gguf` file in llama.cpp. Since `Q4_K_M` is not supported yet, the file will be in the precision you selected above (`Q8_0`, `F16` or `BF16`).
|
||||
#
|
||||
# And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
@@ -0,0 +1,448 @@
|
||||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
|
||||
# </div>
|
||||
#
|
||||
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
|
||||
#
|
||||
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to [save it](#Save)
|
||||
|
||||
# ### News
|
||||
|
||||
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
|
||||
#
|
||||
# <table><tr>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
|
||||
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
|
||||
# </tr></table>
|
||||
#
|
||||
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
|
||||
#
|
||||
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
|
||||
#
|
||||
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
|
||||
#
|
||||
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
|
||||
|
||||
# # ### Installation
|
||||
#
|
||||
# # In[1]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
|
||||
#
|
||||
#
|
||||
# # In[2]:
|
||||
#
|
||||
#
|
||||
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
|
||||
#
|
||||
#
|
||||
# # ### Unsloth
|
||||
|
||||
# In[3]:
|
||||
|
||||
|
||||
from unsloth import FastVisionModel # FastLanguageModel for LLMs
|
||||
import torch
|
||||
|
||||
gemma4_models = [
|
||||
# Gemma-4 instruct models:
|
||||
"unsloth/gemma-4-E2B-it",
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
"unsloth/gemma-4-31B-it",
|
||||
"unsloth/gemma-4-26B-A4B-it",
|
||||
# Gemma-4 base models:
|
||||
"unsloth/gemma-4-E2B",
|
||||
"unsloth/gemma-4-E4B",
|
||||
"unsloth/gemma-4-31B",
|
||||
"unsloth/gemma-4-26B-A4B",
|
||||
] # More models at https://huggingface.co/unsloth
|
||||
|
||||
model, processor = FastVisionModel.from_pretrained(
|
||||
"unsloth/gemma-4-E4B-it",
|
||||
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
|
||||
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
|
||||
)
|
||||
|
||||
|
||||
# We now add LoRA adapters for parameter-efficient fine-tuning, so that only about 1% of all model parameters are trained.
|
||||
#
|
||||
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
|
||||
|
||||
# In[4]:
|
||||
|
||||
|
||||
model = FastVisionModel.get_peft_model(
|
||||
model,
|
||||
finetune_vision_layers = True, # False if not finetuning vision layers
|
||||
finetune_language_layers = True, # False if not finetuning language layers
|
||||
finetune_attention_modules = True, # False if not finetuning attention layers
|
||||
finetune_mlp_modules = True, # False if not finetuning MLP layers
|
||||
|
||||
r = 32, # The larger, the higher the accuracy, but might overfit
|
||||
lora_alpha = 32, # Recommended alpha == r at least
|
||||
lora_dropout = 0,
|
||||
bias = "none",
|
||||
random_state = 3407,
|
||||
use_rslora = False, # We support rank stabilized LoRA
|
||||
loftq_config = None, # And LoftQ
|
||||
target_modules = "all-linear", # Optional now! Can specify a list if needed
|
||||
)
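# As a rough illustration of why LoRA trains so few parameters: for one linear layer of shape `d_in x d_out`, a rank-`r` adapter adds `r * (d_in + d_out)` weights, and its update is scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The sketch below is plain arithmetic only. The 4096 x 4096 layer size and the helper names are hypothetical, not Gemma 4's real dimensions or part of Unsloth.

```python
import math

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA

# Hypothetical layer size, combined with r and lora_alpha from the cell above.
d_in, d_out, r, lora_alpha = 4096, 4096, 32, 32
frozen = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"adapter weights: {added} ({added / frozen:.2%} of the frozen layer)")
print(f"scaling with use_rslora=False: {lora_scaling(lora_alpha, r)}")
```

# With `lora_alpha == r` the standard scaling is exactly 1.0, which is why the two values are usually changed together.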
|
||||
|
||||
|
||||
# <a name="Data"></a>
|
||||
# ### Data Prep
|
||||
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
|
||||
#
|
||||
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
|
||||
|
||||
# In[5]:
|
||||
|
||||
|
||||
from datasets import load_dataset
|
||||
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
|
||||
|
||||
|
||||
# Let's take a look at the dataset. We'll examine the image at index 2 (the third sample) and its corresponding LaTeX text.
|
||||
|
||||
# In[6]:
|
||||
|
||||
|
||||
dataset
|
||||
|
||||
|
||||
# In[7]:
|
||||
|
||||
|
||||
dataset[2]["image"]
|
||||
|
||||
|
||||
# In[8]:
|
||||
|
||||
|
||||
dataset[2]["text"]
|
||||
|
||||
|
||||
# We can also render LaTeX directly in the browser!
|
||||
|
||||
# In[9]:
|
||||
|
||||
|
||||
from IPython.display import display, Math, Latex
|
||||
|
||||
latex = dataset[3]["text"]
|
||||
display(Math(latex))
|
||||
|
||||
|
||||
# To format the dataset, all vision fine-tuning tasks should follow this format:
#
# ```python
# [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": instruction},
#             {"type": "image", "image": sample["image"]},
#         ],
#     },
#     {
#         "role": "assistant",
#         "content": [
#             {"type": "text", "text": sample["text"]},
#         ],
#     },
# ]
# ```
|
||||
|
||||
# In[10]:
|
||||
|
||||
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
pass
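# Before training, it can help to confirm that converted samples really have the user/assistant shape described above. The sketch below is a minimal structural check in plain Python. `check_conversation` is a hypothetical helper (not part of Unsloth), and the stand-in sample uses a bare `object()` where the real dataset has a PIL image.

```python
def check_conversation(example: dict) -> None:
    """Raise ValueError if `example` is not a user(text+image) / assistant(text) pair."""
    messages = example.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        raise ValueError("expected exactly two messages")
    roles = [message.get("role") for message in messages]
    if roles != ["user", "assistant"]:
        raise ValueError(f"expected roles ['user', 'assistant'], got {roles}")
    user_types = {part.get("type") for part in messages[0]["content"]}
    if user_types != {"text", "image"}:
        raise ValueError(f"user turn needs text and image parts, got {user_types}")
    assistant_types = [part.get("type") for part in messages[1]["content"]]
    if assistant_types != ["text"]:
        raise ValueError(f"assistant turn needs one text part, got {assistant_types}")

# Stand-in sample: object() is a placeholder for the real image.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write the LaTeX representation for this image."},
                {"type": "image", "image": object()},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": "x^2"}]},
    ]
}
check_conversation(example)
print("conversation structure looks OK")
```

# Running the same check over a handful of entries of the converted dataset is a cheap way to catch a mistyped role before a training run.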
|
||||
|
||||
|
||||
# Let's convert the dataset into the "correct" format for finetuning:
|
||||
|
||||
# In[11]:
|
||||
|
||||
|
||||
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
|
||||
|
||||
|
||||
# The first example is now structured like below:
|
||||
|
||||
# In[12]:
|
||||
|
||||
|
||||
converted_dataset[0]
|
||||
|
||||
|
||||
# Let's take the Gemma 4 instruction chat template and use it with our model:
|
||||
|
||||
# In[13]:
|
||||
|
||||
|
||||
from unsloth import get_chat_template
|
||||
|
||||
processor = get_chat_template(
|
||||
processor,
|
||||
"gemma-4"
|
||||
)
|
||||
|
||||
|
||||
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
|
||||
|
||||
# In[14]:
|
||||
|
||||
|
||||
image = dataset[2]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# You can see it's absolutely terrible! It doesn't follow instructions at all
|
||||
|
||||
# <a name="Train"></a>
|
||||
# ### Train the model
|
||||
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove `max_steps`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
|
||||
#
|
||||
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
|
||||
|
||||
# In[15]:
|
||||
|
||||
|
||||
from unsloth.trainer import UnslothVisionDataCollator
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model = model,
|
||||
train_dataset = converted_dataset,
|
||||
processing_class = processor.tokenizer,
|
||||
data_collator = UnslothVisionDataCollator(model, processor),
|
||||
args = SFTConfig(
|
||||
per_device_train_batch_size = 1,
|
||||
gradient_accumulation_steps = 4,
|
||||
max_grad_norm = 0.3,
|
||||
warmup_ratio = 0.03,
|
||||
max_steps = 60,
|
||||
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
|
||||
learning_rate = 2e-4,
|
||||
logging_steps = 1,
|
||||
save_strategy = "steps",
|
||||
optim = "adamw_8bit",
|
||||
weight_decay = 0.001,
|
||||
lr_scheduler_type = "cosine",
|
||||
seed = 3407,
|
||||
output_dir = "outputs",
|
||||
report_to = "none", # For Weights and Biases or others
|
||||
|
||||
# You MUST put the below items for vision finetuning:
|
||||
remove_unused_columns = False,
|
||||
dataset_text_field = "",
|
||||
dataset_kwargs = {"skip_prepare_dataset": True},
|
||||
max_length = 2048,
|
||||
)
|
||||
)
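# To make the numbers in the config above concrete, the sketch below works out the effective batch size, how many samples 60 steps cover, and what a linear-warmup cosine schedule does to the learning rate. It is plain arithmetic under stated assumptions: a single GPU, warmup length rounded up, and a half-cosine decay to zero. The helper names are hypothetical and the exact rounding in your installed trainer may differ.

```python
import math

def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Warmup length, assuming the ratio is rounded up to a whole step."""
    return math.ceil(total_steps * warmup_ratio)

def cosine_lr(step: int, total_steps: int, warmup: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Values copied from the SFTConfig above; num_gpus = 1 is an assumption.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
learning_rate = 2e-4
num_gpus = 1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch * max_steps
warmup = warmup_steps(max_steps, 0.03)

print(f"effective batch size: {effective_batch}")
print(f"samples seen in {max_steps} steps: {samples_seen}")
print(f"warmup steps: {warmup}")
for step in (0, 1, warmup, max_steps // 2, max_steps):
    print(f"step {step:>2}: lr = {cosine_lr(step, max_steps, warmup, learning_rate):.2e}")
```

# Under these assumptions a 60-step run covers only 240 samples, so it is a smoke test rather than a full pass over the dataset.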
|
||||
|
||||
|
||||
# In[16]:
|
||||
|
||||
|
||||
# @title Show current memory stats
|
||||
gpu_stats = torch.cuda.get_device_properties(0)
|
||||
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
|
||||
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
|
||||
print(f"{start_gpu_memory} GB of memory reserved.")
|
||||
|
||||
|
||||
# In[17]:
|
||||
|
||||
|
||||
trainer_stats = trainer.train()
|
||||
|
||||
|
||||
# In[18]:
|
||||
|
||||
|
||||
# @title Show final memory and time stats
|
||||
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
|
||||
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
|
||||
used_percentage = round(used_memory / max_memory * 100, 3)
|
||||
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
|
||||
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
|
||||
print(
|
||||
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
|
||||
)
|
||||
print(f"Peak reserved memory = {used_memory} GB.")
|
||||
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
|
||||
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
|
||||
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
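# The two memory cells above divide byte counts by 1024 three times, so the figures they print as "GB" are strictly GiB. The sketch below repeats the same arithmetic as a pure function that can be tried without a GPU. `memory_report` and the byte counts are hypothetical, made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB, the same divisor the cells above apply

def to_gib(num_bytes: int) -> float:
    return round(num_bytes / GIB, 3)

def memory_report(start_reserved: int, peak_reserved: int, total: int) -> dict:
    """Mirror the notebook's stats: peak usage, the share added by training, and percentages."""
    start = to_gib(start_reserved)
    peak = to_gib(peak_reserved)
    capacity = to_gib(total)
    training = round(peak - start, 3)
    return {
        "peak_gib": peak,
        "training_gib": training,
        "peak_pct": round(peak / capacity * 100, 3),
        "training_pct": round(training / capacity * 100, 3),
    }

# Made-up numbers: 6 GiB reserved after loading, 10 GiB at peak, 24 GiB card.
report = memory_report(6 * GIB, 10 * GIB, 24 * GIB)
print(report)
```

# The "for training" figure is simply the peak minus what was already reserved before `trainer.train()` was called.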
|
||||
|
||||
|
||||
# <a name="Inference"></a>
|
||||
# ### Inference
|
||||
# Let's run the model! You can modify the instruction and input—just leave the output blank.
|
||||
#
|
||||
# We'll use the recommended sampling settings for Gemma 4: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
|
||||
|
||||
# In[19]:
|
||||
|
||||
|
||||
image = dataset[10]["image"]
|
||||
instruction = "Write the LaTeX representation for this image."
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
|
||||
}
|
||||
]
|
||||
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor, skip_prompt = True)
|
||||
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
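# The sampling settings passed to `generate` above (`temperature`, `top_k`, `top_p`) each reshape the next-token distribution before a token is drawn. The sketch below is a toy illustration on a four-token vocabulary, not the transformers implementation: it rescales logits by the temperature, keeps the `top_k` most likely tokens, then keeps the smallest prefix whose cumulative probability reaches `top_p`. The function name and the toy logits are hypothetical.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [value / total for value in exps]

    # top-k: keep the k most likely tokens, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize the survivors

# Toy distribution: probabilities 0.5, 0.3, 0.15, 0.05 expressed as logits.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
nucleus = filter_probs(toy_logits, temperature=1.0, top_k=64, top_p=0.9)
print(sorted(nucleus))  # indices of the tokens that survive filtering
```

# With a 4-token vocabulary `top_k=64` never bites; on a real vocabulary it caps the candidate set before the `top_p` cut is applied.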
|
||||
|
||||
|
||||
# <a name="Save"></a>
|
||||
# ### Saving, loading finetuned models
|
||||
# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
|
||||
#
|
||||
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
|
||||
|
||||
# In[20]:
|
||||
|
||||
|
||||
model.save_pretrained("gemma_4_lora") # Local saving
|
||||
processor.save_pretrained("gemma_4_lora")
|
||||
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
|
||||
|
||||
|
||||
# To load the LoRA adapters we just saved for inference, change `if False` to `if True`:
|
||||
|
||||
# In[21]:
|
||||
|
||||
|
||||
if False:
    from unsloth import FastVisionModel

    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
|
||||
|
||||
sample = dataset[1]
|
||||
image = sample["image"].convert("RGB")
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
            {
                "type": "text",
                "text": instruction, # Prompt with the instruction, not the ground-truth LaTeX in sample["text"]
            },
|
||||
{
|
||||
"type": "image",
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
|
||||
inputs = processor(
|
||||
image,
|
||||
input_text,
|
||||
add_special_tokens = False,
|
||||
return_tensors = "pt",
|
||||
).to("cuda")
|
||||
|
||||
from transformers import TextStreamer
|
||||
|
||||
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
|
||||
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
|
||||
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
|
||||
|
||||
|
||||
# ### Saving to float16 for vLLM
|
||||
#
|
||||
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
|
||||
|
||||
# In[22]:
|
||||
|
||||
|
||||
# Select ONLY 1 to save! (Both not needed!)
|
||||
|
||||
# Save locally to 16bit
|
||||
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
|
||||
|
||||
# To export and save to your Hugging Face account
|
||||
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
|
||||
|
||||
|
||||
# And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
|
||||
#
|
||||
# Some other resources:
|
||||
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
|
||||
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
|
||||
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
|
||||
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
|
||||
#
|
||||
# <div class="align-center">
|
||||
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
|
||||
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
|
||||
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
|
||||
#
|
||||
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
|
||||
# </div>
|
||||
#
|
||||
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).