docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied), notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization: flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,512 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-26B-A4B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        use_cache = True,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True),
+    )
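As an aside on what `temperature`, `top_p`, and `top_k` control, here is a minimal, library-free sketch of one common way the three settings can be combined when picking the next token. The helper name `filter_next_token_probs` and the order of operations (temperature, then top-k, then top-p) are assumptions made for illustration; this is not the code path `model.generate` runs.

```python
import math

def filter_next_token_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive filtering.

    Hypothetical helper for illustration only.
    """
    # Temperature rescales the logits; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most likely tokens and renormalise.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    ranked_probs = [(i, probs[i] / mass) for i in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in ranked_probs:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise once more so the surviving probabilities sum to 1.
    mass = sum(prob for _, prob in kept)
    return {index: prob / mass for index, prob in kept}

# Toy logits for a four-token vocabulary.
survivors = filter_next_token_probs([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(sorted(survivors))
```

Lower `top_p` or `top_k` values shrink the candidate set, while `temperature = 1.0` samples from the model's distribution as-is.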
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision and text parts for now through selection - the audio part can also be finetuned - we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[7]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
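To make the effect of `r` and `lora_alpha` concrete, the sketch below counts the extra trainable weights that standard LoRA adds to a single linear layer and computes the scaling factor applied to the low-rank update. The layer size of 4096 x 4096 is a hypothetical value chosen only to make the arithmetic readable, and the helper names are not part of Unsloth or PEFT.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters that LoRA adds to one d_out x d_in linear layer."""
    # LoRA learns two small matrices: A with shape (r, d_in) and B with shape (d_out, r).
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Scale applied to the low-rank update B @ A in standard (non-rsLoRA) LoRA."""
    return lora_alpha / r

# Hypothetical layer size, used only for illustration.
d_in = d_out = 4096
full = d_in * d_out
added = lora_added_params(d_in, d_out, r=8)

print(f"full layer: {full:,} weights, LoRA adds {added:,} ({added / full:.2%})")
print("scaling with r = 8, lora_alpha = 8:", lora_scaling(8, 8))
```

Doubling `r` doubles the added parameters per layer, and keeping `lora_alpha` equal to `r` keeps the scaling factor at 1.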
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi-turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
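The turn layout shown above can be mimicked with a few lines of plain Python. This is only a sketch of the visible format, not the real chat template: the function name `render_turns` is hypothetical, and mapping the `assistant` role to `model` is an assumption based on the rendered example.

```python
def render_turns(messages):
    """Render messages in the turn layout shown above (illustrative only)."""
    # Assumption: the "assistant" role is written as "model" in the prompt.
    role_names = {"user": "user", "assistant": "model"}
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    return text

rendered = render_turns([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
])
print(rendered)

# For training text, the leading <bos> is stripped so that only one is present
# once the tokenizer adds its own.
print(rendered.removeprefix("<bos>"))
```

In practice the template returned by `get_chat_template` should be used, since it handles details this sketch ignores.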
+
+# In[8]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[9]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[10]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[11]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
+
+# In[12]:
+
+
+def formatting_prompts_func(examples):
+   convos = examples["conversations"]
+   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
+   return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[13]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` limit.
+
+# In[14]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
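A quick back-of-the-envelope check shows how little of the dataset a 60-step run covers. The sketch below uses the values from the `SFTConfig` above and assumes a single GPU; the helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

def examples_seen(max_steps, per_device_batch, grad_accum_steps, num_devices=1):
    """Total examples consumed after max_steps optimizer steps."""
    return max_steps * effective_batch_size(per_device_batch, grad_accum_steps, num_devices)

# Values from the config above; num_devices = 1 is an assumption.
batch = effective_batch_size(per_device_batch=1, grad_accum_steps=4)
seen = examples_seen(max_steps=60, per_device_batch=1, grad_accum_steps=4)
dataset_rows = 3000  # size of the train[:3000] split loaded earlier

print(f"effective batch size: {batch}")
print(f"examples seen in 60 steps: {seen} ({seen / dataset_rows:.0%} of the split)")
```

Under these assumptions, one full pass over the 3000-row split would take 750 optimizer steps.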
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
+
+# In[15]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
+
+
+# Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[16]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[17]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
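The idea behind response-only training can be shown on a toy token sequence: every label outside a model turn is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The token ids and the helper `mask_non_response` below are hypothetical, and whether the end-of-turn token stays in the labels is an assumption of this sketch rather than a statement about Unsloth's implementation.

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def mask_non_response(token_ids, response_start, turn_end):
    """Build labels that keep only the tokens inside model responses."""
    labels = []
    in_response = False
    for token in token_ids:
        if token == response_start:
            # The marker opens a response but is not itself a training target.
            in_response = True
            labels.append(IGNORE_INDEX)
        elif in_response:
            labels.append(token)
            if token == turn_end:
                in_response = False  # response finished; mask until the next one
        else:
            labels.append(IGNORE_INDEX)
    return labels

# Hypothetical ids: 2 = <bos>, 10 = user marker, 11 = model marker, 12 = end of turn.
BOS, USER, MODEL, END = 2, 10, 11, 12
tokens = [BOS, USER, 101, END, MODEL, 201, 202, END]
print(mask_non_response(tokens, response_start=MODEL, turn_end=END))
```

Only the positions that keep their original id contribute to the loss, which is why the decoded labels above show just the answer.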
+
+
+# In[18]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
+
+# In[19]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[20]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! The recommended Gemma 4 settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[21]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
+
+
+#  You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
+
+# In[22]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[23]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
+
+# In[24]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[25]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[26]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[27]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you instead want to push the GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[28]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[3]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-26B-A4B-it",
+    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[4]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[5]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
+
+# In[6]:
+
+
+dataset
+
+
+# In[7]:
+
+
+dataset[2]["image"]
+
+
+# In[8]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[9]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[3]["text"]
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": sample["text"]},
+#         ],
+#     },
+# ]
+# ```
+
+# In[10]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
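Before training, it can help to sanity-check that each converted sample really has the user/assistant layout built by `convert_to_conversation`. The checker below is a minimal sketch: `check_conversation` is a hypothetical helper, and the placeholder row stands in for a real dataset sample.

```python
def check_conversation(messages):
    """Raise ValueError if messages do not follow the user/assistant layout."""
    if [m["role"] for m in messages] != ["user", "assistant"]:
        raise ValueError("expected one user turn followed by one assistant turn")

    user_types = [item["type"] for item in messages[0]["content"]]
    if user_types.count("image") != 1 or "text" not in user_types:
        raise ValueError("user turn needs exactly one image item and a text item")

    if [item["type"] for item in messages[1]["content"]] != ["text"]:
        raise ValueError("assistant turn should contain only the target text")
    return True

# Placeholder row: a real sample would hold a PIL image and a LaTeX string.
sample = {"image": object(), "text": r"\frac{a}{b}"}
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Write the LaTeX representation for this image."},
        {"type": "image", "image": sample["image"]},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
]
print(check_conversation(messages))
```

A check like this catches a common mistake early, namely labelling the answer turn as `user` instead of `assistant`.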
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[11]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[12]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and use it in our base model
+
+# In[13]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4-thinking"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[14]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` limit. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[15]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
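+
+
+# To make the batch settings above concrete, here is a minimal sketch of the arithmetic behind them. The values are copied from the `SFTConfig` above; `num_devices = 1` is an assumption (a single GPU), not something the config states.
+

```python
# Values copied from the SFTConfig above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
num_devices = 1  # assumption: one GPU, as on a single Colab instance

# Samples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total training samples consumed before the run stops at max_steps.
samples_seen = effective_batch_size * max_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
```

+# So a 60-step run only touches a small slice of the dataset, which is why a full run switches to `num_train_epochs`.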
+
+
+# In[16]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[17]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[18]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
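+
+
+# The two memory cells above boil down to a few divisions. Here is a small sketch of the same arithmetic as a pure function, so it can be checked without a GPU. The function name `summarize_memory` and the byte counts are hypothetical, chosen only for illustration.
+

```python
def summarize_memory(peak_bytes, start_bytes, total_bytes):
    """Convert raw byte counts into the GB and percentage figures printed above."""
    gib = 1024 ** 3  # the cells above divide by 1024 three times
    peak_gb = round(peak_bytes / gib, 3)
    start_gb = round(start_bytes / gib, 3)
    total_gb = round(total_bytes / gib, 3)
    # Memory attributed to training = peak minus what was reserved before training.
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

# Hypothetical numbers: 20 GiB reserved after loading, 30 GiB at peak, 40 GiB card.
stats = summarize_memory(30 * 1024**3, 20 * 1024**3, 40 * 1024**3)
print(stats)
```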
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
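+# 
+# If you are wondering what these three knobs do, here is a toy sketch of temperature, top-k and top-p filtering over a handful of logits. It is a simplified pure-Python illustration, not the code `model.generate` runs; the function name `filter_distribution` and the toy logits are made up for this example.
+

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    # Temperature rescales the logits; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    # Top-k: keep the indices of the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens.
    peak = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(ranked, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}

# Toy 4-token vocabulary with smaller top_k / top_p so the filtering is visible.
toy_logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
result = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(result)
```

+# With the recommended `top_k=64` and `top_p=0.95`, far more tokens survive than in this toy run, so sampling stays diverse while the unlikely tail is cut off.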
+
+# In[19]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[20]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[21]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": sample["text"],
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[22]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,513 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-31B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        use_cache = True,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True),
+    )
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision and text parts for now through selection - the audio part can also be finetuned - we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small amount of parameters!
+
+# In[7]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
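+
+
+# To see why LoRA only updates a small amount of parameters, here is a rough sketch of the count for a single linear layer. It assumes the standard LoRA formulation (two low-rank matrices, update scaled by `lora_alpha / r`); the layer shape is hypothetical and is not Gemma 4's real dimension.
+

```python
def lora_param_count(d_in, d_out, r):
    """Parameters LoRA adds to one d_out x d_in linear layer (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical layer shape, for illustration only.
d_in, d_out = 4096, 4096
r, lora_alpha = 8, 8  # values from the get_peft_model call above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # standard LoRA scales the low-rank update by alpha / r

print(f"Full layer: {full:,} params; LoRA adds {added:,} ({added / full:.2%})")
print(f"Update scaling alpha / r = {scaling}")
```

+# Doubling `r` doubles the added parameters, which is the trade-off behind the "larger = higher accuracy, but might overfit" comment.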
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
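+# 
+# As a rough illustration of the layout above, here is a tiny pure-Python renderer. It is only a sketch: the real template comes from `get_chat_template`, and the function name `render_gemma4_turns`, the trailing newline handling, and the generation prompt are assumptions made for this example.
+

```python
def render_gemma4_turns(messages, add_generation_prompt=False):
    """Render messages in the turn layout shown above. Illustrative only."""
    text = "<bos>"
    for message in messages:
        # The example above uses "model" as the role name for assistant replies.
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    if add_generation_prompt:
        # Assumption: generation starts from an opened model turn.
        text += "<|turn>model\n"
    return text

conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma4_turns(conversation)
print(rendered)

# str.removeprefix (Python 3.9+) strips only a leading "<bos>", nothing else.
without_bos = rendered.removeprefix("<bos>")
print(without_bos)
```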
+
+# In[8]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[9]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[10]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[11]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.
+
+# In[12]:
+
+
+def formatting_prompts_func(examples):
+    convos = examples["conversations"]
+    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
+    return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[13]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.
+
+# In[14]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
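+
+
+# To picture what `warmup_steps = 5` and `lr_scheduler_type = "linear"` do together, here is a small sketch of the usual warmup-then-linear-decay shape. It is a standalone approximation, not the trainer's own scheduler code, and the function name `linear_schedule_lr` is made up for this example.
+

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)

for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

+# The peak is only reached at step 5, so a 60-step run spends most of its time decaying.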
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
+
+# In[15]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
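+
+
+# To build intuition for what response-only training does to the labels, here is a simplified sketch. It is not Unsloth's implementation: whole strings stand in for token ids, and the helper name `mask_non_responses` is hypothetical. The idea is that every position outside a model turn gets the label `-100`, which the loss ignores.
+

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_non_responses(tokens, instruction_marker, response_marker):
    """Return labels where only tokens after a response marker keep their value."""
    labels = []
    in_response = False
    for token in tokens:
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels

# Toy "tokens": whole strings stand in for token ids.
tokens = ["<|turn>user\n", "Hello", "<turn|>", "<|turn>model\n", "Hey", "there!", "<turn|>"]
labels = mask_non_responses(tokens, "<|turn>user\n", "<|turn>model\n")
print(labels)
```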
+
+
+# Let's verify that the instruction part is masked! Let's print row 100 again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[16]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[17]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[18]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[19]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[20]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[21]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
+
+
+# You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!
+
+# In[22]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[23]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[24]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[25]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[26]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[27]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[28]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported GGUF file in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[3]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-31B-it",
+    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[4]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
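+
+# Optional check (an addition, not part of the upstream notebook): count how many
+# parameters LoRA left trainable. This is a sketch using plain PyTorch. With
+# `load_in_4bit = True` the frozen weights are stored packed, so `total_params`
+# undercounts them and the printed percentage is only a rough upper bound.
+trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+total_params = sum(p.numel() for p in model.parameters())
+print(f"Trainable parameters: {trainable_params:,} of {total_params:,} stored "
+      f"({100 * trainable_params / total_params:.2f}%)")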
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[5]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take an overview of the dataset. We'll examine the image at index 2 (the third sample) and its corresponding caption.
+
+# In[6]:
+
+
+dataset
+
+
+# In[7]:
+
+
+dataset[2]["image"]
+
+
+# In[8]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[9]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[2]["text"]  # Same sample as the image and caption shown above
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": answer},
+#         ],
+#     },
+# ]
+# ```
+
+# In[10]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
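+
+# Optional sanity check (an addition, not part of the upstream notebook). The helper
+# name `check_conversation` is hypothetical; it only verifies the layout described above.
+def check_conversation(example):
+    """Raise ValueError unless `example` holds one user turn (text + image) and one assistant turn."""
+    roles = [message["role"] for message in example["messages"]]
+    if roles != ["user", "assistant"]:
+        raise ValueError(f"Expected roles ['user', 'assistant'], got {roles}")
+    user_types = sorted(part["type"] for part in example["messages"][0]["content"])
+    if user_types != ["image", "text"]:
+        raise ValueError(f"Expected one image part and one text part, got {user_types}")
+    return True
+
+# A stand-in sample: the image slot is None because only the structure is checked.
+check_conversation(convert_to_conversation({"image": None, "text": r"x^2 + y^2"}))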
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[11]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[12]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and use it in our base model.
+
+# In[13]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4-thinking"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[14]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all.
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We run 60 steps to keep things fast; for a full run, set `num_train_epochs=1` and remove the step cap with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[15]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # Set to "wandb" for Weights and Biases, or another tracker
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
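+
+# Rough bookkeeping (an addition, not part of the upstream notebook): how much data
+# the short run actually sees. Assumes a single GPU, so no data-parallel factor.
+# With the settings above this is 1 * 4 = 4 samples per step and 240 samples in 60 steps.
+effective_batch_size = (
+    trainer.args.per_device_train_batch_size * trainer.args.gradient_accumulation_steps
+)
+samples_seen = effective_batch_size * trainer.args.max_steps
+print(f"Effective batch size = {effective_batch_size}, samples seen = {samples_seen}")
+print(f"Fraction of one epoch = {samples_seen / len(converted_dataset):.3f}")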
+
+
+# In[16]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[17]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[18]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[19]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[20]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
+
+# In[21]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": instruction,  # Prompt with the instruction, not the ground-truth label
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[22]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or need help with a project, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,478 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+from huggingface_hub import snapshot_download
+
+fourbit_models = [
+    # Gemma 4 models
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B-it",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = False,  # Set to True for 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **processor.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        do_sample = False,
+        streamer = TextStreamer(processor, skip_prompt = True),
+    )
+
+
+# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
+
+# In[5]:
+
+
+from datasets import load_dataset, Audio, concatenate_datasets
+
+dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
+
+# Select a single audio sample to reserve for testing.
+# This index is chosen from the full dataset before we create the smaller training split.
+test_audio = dataset[7546]
+
+dataset = dataset.select(range(3000))
+
+dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
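+
+# Optional check (an addition, not part of the upstream notebook): `test_audio` was
+# picked before `cast_column`, so it may still carry the dataset's original sampling
+# rate rather than 16 kHz. Printing both makes any mismatch visible.
+print("Held-out sample rate:", test_audio["audio"]["sampling_rate"])
+print("Training sample rate:", dataset[0]["audio"]["sampling_rate"])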
+
+
+# In[6]:
+
+
+from IPython.display import Audio, display
+print(test_audio['text'])
+Audio(test_audio['audio']['array'],rate = test_audio['audio']['sampling_rate'])
+
+
+# And the translation of the audio from German to English is:
+# 
+# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
+
+# In[7]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
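+
+# The notebook does not show how the WER figure above was computed. A minimal sketch
+# of the standard definition (an addition, not the upstream method): word-level edit
+# distance divided by the number of reference words, assuming plain whitespace
+# tokenization and no text normalization. `word_error_rate` is a hypothetical helper.
+def word_error_rate(reference: str, hypothesis: str) -> float:
+    ref_words, hyp_words = reference.split(), hypothesis.split()
+    if not ref_words:
+        raise ValueError("reference must contain at least one word")
+    # previous[j] = edit distance between the reference words seen so far and hyp_words[:j]
+    previous = list(range(len(hyp_words) + 1))
+    for i, ref_word in enumerate(ref_words, start = 1):
+        current = [i]
+        for j, hyp_word in enumerate(hyp_words, start = 1):
+            substitution = previous[j - 1] + (ref_word != hyp_word)
+            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
+        previous = current
+    return previous[-1] / len(ref_words)
+
+# Usage: compare the reference transcript with text copied from the model output.
+# The hypothesis string here is a made-up placeholder, not real model output.
+print(f"WER = {100 * word_error_rate(str(test_audio['text']), 'placeholder hypothesis'):.2f}%")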
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision, text, and audio parts.
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[8]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # False if not finetuning vision layers
+    finetune_language_layers   = True,  # False if not finetuning language layers
+    finetune_attention_modules = True,  # False if not finetuning attention layers
+    finetune_mlp_modules       = True,  # False if not finetuning MLP layers
+
+    r = 8,                              # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 16,                    # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,                 # We support rank stabilized LoRA
+    loftq_config = None,                # And LoftQ
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+
+        # Audio layers
+        "post", "linear_start", "linear_end",
+        "embedding_projection",
+        "ffw_layer_1", "ffw_layer_2",
+        "output_proj",
+    ]
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
+# 
+# ```
+# <bos><|turn>system
+# You are an assistant that transcribes speech accurately.<turn|>
+# <|turn>user
+# <|audio|>Please transcribe this audio.<turn|>
+# <|turn>model
+# Ich, ich rechne direkt mich an.<turn|>
+# ```
+
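+# Illustration only (an addition, not part of the upstream notebook): a hand-rolled
+# renderer for the layout shown above. `render_turns` is a hypothetical helper. The
+# real template is applied by `processor.apply_chat_template` and may differ in
+# details such as whitespace; this sketch just mirrors the example text.
+def render_turns(messages):
+    role_names = {"assistant": "model"}  # The example labels assistant turns "model"
+    rendered = "<bos>"
+    for message in messages:
+        body = "".join(
+            "<|audio|>" if part["type"] == "audio" else part["text"]
+            for part in message["content"]
+        )
+        role = role_names.get(message["role"], message["role"])
+        rendered += f"<|turn>{role}\n{body}<turn|>\n"
+    return rendered
+
+print(render_turns([
+    {"role": "user", "content": [{"type": "audio", "audio": None},
+                                 {"type": "text", "text": "Please transcribe this audio."}]},
+    {"role": "assistant", "content": [{"type": "text", "text": "Ich, ich rechne direkt mich an."}]},
+]))
+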
+# In[9]:
+
+
+def format_intersection_data(samples: dict) -> dict[str, list]:
+    """Format a batch of audio/text samples into the chat message format used for training"""
+    formatted_samples = {"messages": []}
+    for idx in range(len(samples["audio"])):
+        audio = samples["audio"][idx]["array"]
+        label = str(samples["text"][idx])
+
+        message = [
+            {
+                "role": "system",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "You are an assistant that transcribes speech accurately.",
+                    }
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "audio", "audio": audio},
+                    {"type": "text", "text": "Please transcribe this audio."}
+                ]
+            },
+            {
+                "role": "assistant",
+                "content":[{"type": "text", "text": label}]
+            }
+        ]
+        formatted_samples["messages"].append(message)
+    return formatted_samples
+
+
+# In[10]:
+
+
+dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We run 60 steps to keep things fast; for a full run, set `num_train_epochs=1` and remove the step cap with `max_steps=None`.
+
+# In[11]:
+
+
+# Use UnslothVisionDataCollator which handles audio token alignment correctly
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 8,
+        gradient_accumulation_steps = 1,
+        warmup_ratio = 0.03,
+        # num_train_epochs = 1, # Use for full training runs
+        max_steps = 60,
+        learning_rate = 5e-5,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none",
+        remove_unused_columns = False,
+
+        # The below are a must for audio finetuning:
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 8192,
+    )
+)
+
+
+# In[12]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.
+
+# In[13]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[14]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
+
+# In[15]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[16]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
+
+# In[17]:
+
+
+if False:
+    from unsloth import FastModel
+    model, processor = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(processor, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[18]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", processor)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[19]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", processor,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[20]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[21]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file in llama.cpp. Only `Q8_0`, `F16` and `BF16` exports are supported for now, so a `Q4_K_M` file is not produced yet.
+# 
+# And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or need help with a project, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,556 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 1024, # Choose any for long context!
+    load_in_4bit = False,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
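+# 
+# To make the three sampling settings concrete, here is a minimal pure-Python sketch of how temperature, `top_k` and `top_p` are commonly combined. It is an illustration only: `sample_next_token` is a hypothetical helper, not the code path `model.generate` actually runs, and real implementations differ in details.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Pick one token index from raw logits (illustrative sketch only)."""
    # 1. Temperature rescales the logits; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    # 2. top_k: keep only the k highest-scoring token indices.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the survivors (shifted by the max for stability).
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, weights, mass = [], [], 0.0
    for index, prob in zip(ranked, probs):
        kept.append(index)
        weights.append(prob)
        mass += prob
        if mass >= top_p:
            break
    # 5. Draw one token from what is left (weights need not sum to 1).
    return rng.choices(kept, weights=weights, k=1)[0]
```

+# Passing `rng = random.Random(0)` makes the draw reproducible. With `top_k = 1` the function always returns the highest-scoring token.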
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True)
+    )
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Gemma 4 can also hear!
+
+# In[7]:
+
+
+from IPython.display import Audio, display
+Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
+
+
+# In[8]:
+
+
+get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
+
+
+# In[9]:
+
+
+audio_file = "audio.mp3"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "text",  "text" : "What is this audio about?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's combine all 3 modalities together!
+
+# In[10]:
+
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "What is this audio and image about? "\
+                                    "How are they related?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's finetune Gemma 4!
+# 
+# You can choose whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small amount of parameters!
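+# 
+# As a rough sketch of why LoRA updates so few parameters: an adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two small matrices of rank `r`, so it adds `r * (d_in + d_out)` weights. The 2048 x 2048 layer size below is a made-up example, not a Gemma 4 dimension, and the helper names are hypothetical.

```python
def lora_extra_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_fraction(d_in, d_out, r):
    """Adapter weights relative to the frozen d_in x d_out weight matrix."""
    return lora_extra_params(d_in, d_out, r) / (d_in * d_out)

def lora_scale(lora_alpha, r):
    """Standard LoRA multiplies the adapter output by alpha / r."""
    return lora_alpha / r

# Hypothetical 2048 x 2048 projection with r = 8, as in the cell below.
extra = lora_extra_params(2048, 2048, 8)
fraction = lora_fraction(2048, 2048, 8)
scale = lora_scale(8, 8)
print(f"{extra} adapter weights, {fraction:.2%} of the layer, scale {scale}")
```

+# Doubling `r` doubles the adapter size, which is why a larger rank costs more memory and can overfit more easily.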
+
+# In[11]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
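+# 
+# The turn layout above can be approximated in a few lines of plain Python. This is a sketch only: `render_gemma4_sketch` is a hypothetical helper, the real template is the one installed by `get_chat_template`, and the mapping of `assistant` to `model` and the trailing newline after each turn are assumptions based on the example shown.

```python
def render_gemma4_sketch(conversation, add_generation_prompt=False):
    """Approximate the turn layout shown above (not the real chat template)."""
    # Assumption: "assistant" messages are written under the "model" role.
    role_names = {"assistant": "model"}
    text = "<bos>"
    for message in conversation:
        role = role_names.get(message["role"], message["role"])
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    if add_generation_prompt:
        # Open a model turn so generation would continue from here.
        text += "<|turn>model\n"
    return text

example = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma4_sketch(example))
```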
+
+# In[12]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[13]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[14]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[15]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
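+# 
+# A small sketch of the double `<bos>` problem this avoids. `simulate_tokenizer_prefix` is a hypothetical stand-in for a tokenizer that always prepends `<bos>`; it is not the real processor.

```python
BOS = "<bos>"

def strip_leading_bos(rendered):
    """Drop one leading <bos> so later tokenization does not produce two."""
    return rendered.removeprefix(BOS)  # str.removeprefix needs Python 3.9+

def simulate_tokenizer_prefix(text):
    """Stand-in for a tokenizer that always prepends <bos> (an assumption)."""
    return BOS + text

rendered = "<bos><|turn>user\nHello<turn|>\n"
doubled = simulate_tokenizer_prefix(rendered)
single = simulate_tokenizer_prefix(strip_leading_bos(rendered))
print(doubled.count(BOS), single.count(BOS))
```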
+
+# In[16]:
+
+
+def formatting_prompts_func(examples):
+   convos = examples["conversations"]
+   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
+   return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[17]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and remove `max_steps`.
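+# 
+# To see how much of the data 60 steps actually covers, here is a short arithmetic sketch using the values from this script (3000 rows, batch size 1, gradient accumulation 4). `training_coverage` is a hypothetical helper and assumes a single GPU.

```python
import math

def training_coverage(num_rows, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Relate optimizer steps to rows of data seen."""
    # Each optimizer step consumes this many rows.
    effective_batch = per_device_batch * grad_accum * num_devices
    rows_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "rows_seen": rows_seen,
        "epoch_fraction": rows_seen / num_rows,
        "steps_per_epoch": math.ceil(num_rows / effective_batch),
    }

stats = training_coverage(num_rows=3000, per_device_batch=1, grad_accum=4, max_steps=60)
print(stats)
```

+# Under these assumptions the 60-step run sees 240 rows, about 8% of one epoch, and a full epoch would take 750 steps.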
+
+# In[18]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
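+# 
+# The idea behind this masking can be sketched on plain lists of token ids: every label outside a response span is set to `-100`, the value the later decoding cell checks for. This is a simplified stand-in, not Unsloth's implementation; `mask_non_responses` and the toy ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def find_all(tokens, pattern):
    """Start positions of every occurrence of pattern inside tokens."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Copy input_ids into labels only inside response spans."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # A response runs until the next instruction marker or the end.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy ids: [1, 2] marks a user turn, [1, 3] marks a model turn.
toy_ids = [0, 1, 2, 10, 9, 1, 3, 20, 9]
toy_labels = mask_non_responses(toy_ids, instruction_ids=[1, 2], response_ids=[1, 3])
print(toy_labels)
```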
+
+# In[19]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
+
+
+# Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[20]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[21]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[22]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
+
+# In[23]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[24]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[25]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
+
+
+#  You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
+
+# In[26]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[27]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
+
+# In[28]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[29]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[30]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
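+# 
+# For a rough feel of how the precision choice affects file size, the sketch below estimates weight storage only. It assumes `Q8_0` costs about 8.5 bits per weight (blocks of 32 8-bit values plus one 16-bit scale) and `F16`/`BF16` cost 16 bits. The parameter count is a made-up example, `estimate_gguf_gib` is a hypothetical helper, and real files also carry metadata and may keep some tensors at higher precision.

```python
BITS_PER_WEIGHT = {"F16": 16.0, "BF16": 16.0, "Q8_0": 8.5}

def estimate_gguf_gib(num_params, quantization_method):
    """Approximate weight storage in GiB for a given precision."""
    bits = BITS_PER_WEIGHT[quantization_method]
    return num_params * bits / 8 / 1024**3

# Hypothetical model with 2**30 (about 1.07 billion) weights.
for method in ("F16", "Q8_0"):
    print(method, estimate_gguf_gib(2**30, method), "GiB")
```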
+
+# In[31]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[32]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file in llama.cpp. The cells above only produce `Q8_0`, `F16` or `BF16` files, so there is no `Q4_K_M` file yet.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[ ]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[ ]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-E2B-it",
+    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only a small fraction (around 1%) of all model parameters.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[ ]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's get an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
+
+# In[ ]:
+
+
+dataset
+
+
+# In[ ]:
+
+
+dataset[2]["image"]
+
+
+# In[ ]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[ ]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[3]["text"]
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": sample["text"]},
+#         ],
+#     },
+# ]
+# ```
+
+# In[ ]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[ ]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[ ]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and use it in our base model
+
+# In[ ]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[ ]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and remove `max_steps`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[ ]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
+
+
+# In[ ]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[ ]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[ ]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[ ]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[ ]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": sample["text"],
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or need help with a project, feel free to join our [Discord](https://discord.gg/unsloth) channel!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,911 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# # ### Installation
+# 
+# # In[ ]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[ ]:
+# 
+# 
+# #@title Colab Extra Install { display-mode: "form" }
+# get_ipython().run_line_magic('%capture', '')
+# import os
+# get_ipython().system('pip install --upgrade -qqq uv')
+# if "COLAB_" not in "".join(os.environ.keys()):
+#     # If you're not in Colab, just use pip install!
+#     get_ipython().system('pip install unsloth vllm')
+# else:
+#     try: import numpy, PIL; _numpy = f'numpy=={numpy.__version__}'; _pil = f'pillow=={PIL.__version__}'
+#     except: _numpy = "numpy"; _pil = "pillow"
+#     try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
+#     except: is_t4 = False
+#     _vllm, _triton = ('vllm==0.9.2', 'triton==3.2.0') if is_t4 else ('vllm==0.15.1', 'triton')
+#     get_ipython().system('uv pip install -qqq --upgrade {_vllm} {_numpy} {_pil} torchvision bitsandbytes xformers unsloth')
+#     get_ipython().system('uv pip install -qqq {_triton}')
+# get_ipython().system('uv pip install transformers==4.56.2')
+# get_ipython().system('uv pip install --no-deps trl==0.22.2')
+# 
+# 
+# # ### Unsloth
+
+# # Goal: Make faster kernels with Reinforcement Learning
+# 
+# Our goal is to make a faster matrix multiplication kernel by doing RL on Gemma 4 with Unsloth.
+# 
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Matrix_multiplication_qtl1.svg/500px-Matrix_multiplication_qtl1.svg.png" height=200 />
+# 
+# You will learn how to:
+# 1. Counteract **reward hacking** such as cheating, caching, and laziness.
+# 2. Measure the timing and correctness of kernels, and enforce time limits.
+# 3. Write good **reward functions**.
+# 4. Run RL seriously to produce optimized kernels.
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel
+import torch
+max_seq_length = 4096 # Can increase for longer reasoning traces
+lora_rank = 32 # Larger rank = smarter, but slower
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastVisionModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    max_seq_length = max_seq_length,
+    load_in_4bit = False, # False for LoRA 16bit
+    fast_inference = False, # Set to True to enable vLLM fast inference
+)
+
+
+# We now add some small amount of LoRA weights to Gemma 4 so we only need to train those, instead of training on the full model.
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha = lora_rank*2, # *2 speeds up training
+    use_gradient_checkpointing = "unsloth", # Reduces memory usage
+    random_state = 3407,
+)
+
+
+# # Optimized matrix multiplication
+# 
+# NumPy provides optimized matrix multiplication kernels for CPUs through BLAS. For GPUs, one can use the CUDA-accelerated cuBLAS kernels, which PyTorch calls under the hood.
+# 
+# To generate random matrices for testing matrix multiplication, we can use the function below:
+
+# In[ ]:
+
+
+import numpy as np
+def generate_random_matrices(seed = 3407, n = 256):
+    random_state = np.random.RandomState(seed)
+    n, k, m = random_state.randint(1, n+1, size = 3)
+    # Draw from the seeded generator so the same seed always gives the same matrices
+    A = random_state.uniform(-10, 10, size = (n, k))
+    B = random_state.uniform(-10, 10, size = (k, m))
+    return A, A.tolist(), B, B.tolist()
+
+
+# Let's generate a pair of small matrices and print their product:
+
+# In[ ]:
+
+
+A, A_list, B, B_list = generate_random_matrices(seed = 42, n = 5)
+print(A)
+print(B)
+print(np.matmul(A, B))
+
+
+# We can ask an LLM to generate a simple matrix multiply kernel in pure Python, and then calculate the difference between the actual result and the kernel's result:
+
+# In[ ]:
+
+
+def calculate_difference(pred, real):
+    if pred is None: return 5, 5
+    assert real is not None
+    import numpy as np
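+    try:
+        difference = pred - real
+    except Exception:
+        # Shape mismatch or non-numeric output
+        return 5, 5
+    # Use the absolute difference so large negative errors are not hidden
+    amax_error = float(np.amax(np.abs(difference)))
+    mse_error  = float(np.mean(np.square(difference)))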
+    return amax_error, mse_error
+
+
+# In[ ]:
+
+
+# Kernel generated by GPT-5
+def matmul(A, B):
+    z, s = zip, sum
+    Bt = list(z(*B))
+    return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
+
+
+# We see the error below is very small, so that's good!
+
+# In[ ]:
+
+
+prediction = matmul(A_list, B_list)
+calculate_difference(prediction, np.matmul(A, B))
+
+
+# # Countering Reward Hacking
+# 
+# The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric).
+# 
+# But RL can **cheat**. When the RL algorithm learns a trick or exploits a loophole to increase the reward without actually doing the task, this is called "reward hacking".
+# 
+# Some good examples are in https://en.wikipedia.org/wiki/Reward_hacking
+# 
+# For matrix multiplication kernels, we might see the following issues:
+# 
+# * Laziness: RL learns to use NumPy, Torch, or other libraries that call optimized kernels.
+# * Caching: RL learns to cache the result of the output.
+# * Cheating: RL learns to find the actual output by inspecting Python global variables.
+# * Timer tampering: RL learns to edit the timing function so that it reports zero elapsed time.
+# 
+# And possibly more. We shall try to address each!
+
+# # Countering Reward Hacking 1: Stop laziness
+# We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports any non-standard Python libraries. We used GPT-5 to help generate this check, `check_only_stdlib_imports`:
+
+# In[ ]:
+
+
+#@title (Collapsible code)
+import ast
+import sys
+import sysconfig
+from pathlib import Path
+
+def _stdlib_names():
+    """
+    Build a set of canonical stdlib top-level module/package names.
+    Uses sys.stdlib_module_names when available (3.10+), with a
+    filesystem fallback for older versions/edge cases.
+    """
+    names = {m.lower() for m in getattr(sys, "stdlib_module_names", set())}
+    names |= {m.lower() for m in sys.builtin_module_names}
+    names.add("__future__")  # special-case
+
+    # Fallback/augmentation: scan the stdlib directory
+    try:
+        stdlib_dir = Path(sysconfig.get_path("stdlib"))
+        if stdlib_dir.exists():
+            for p in stdlib_dir.iterdir():
+                if p.name == "site-packages":
+                    continue
+                if p.suffix == ".py":
+                    names.add(p.stem.lower())
+                elif p.is_dir() and (p / "__init__.py").exists():
+                    names.add(p.name.lower())
+    except Exception:
+        # conservative fallback; the names set above will still work well
+        pass
+
+    return names
+
+_STDLIB_SET = _stdlib_names()
+
+def check_only_stdlib_imports(code: str):
+    """
+    Return (ok: bool, details: dict)
+
+    ok == True  -> all absolute imports are from the stdlib.
+    ok == False -> details['non_stdlib'] lists offending top-level modules.
+
+    details includes:
+      - stdlib: sorted list of stdlib imports found
+      - non_stdlib: sorted list of non-stdlib imports found
+      - relative_imports: count of relative imports (always allowed here)
+    """
+    try:
+        tree = ast.parse(code)
+    except SyntaxError as e:
+        return False, {
+            "error": f"SyntaxError: {e}",
+            "stdlib": [],
+            "non_stdlib": [],
+            "relative_imports": 0,
+        }
+
+    abs_imports = set()
+    relative_count = 0
+
+    class Visitor(ast.NodeVisitor):
+        def visit_Import(self, node: ast.Import):
+            for alias in node.names:
+                abs_imports.add(alias.name.split(".")[0])
+        def visit_ImportFrom(self, node: ast.ImportFrom):
+            nonlocal relative_count
+            if (node.level or 0) > 0:
+                # relative import
+                relative_count += 1
+            else:
+                if node.module:
+                    abs_imports.add(node.module.split(".")[0])
+
+    Visitor().visit(tree)
+
+    stdlib_found = sorted(m for m in abs_imports if m.lower() in _STDLIB_SET)
+    non_stdlib = sorted(m for m in abs_imports if m.lower() not in _STDLIB_SET)
+
+    return len(non_stdlib) == 0, {
+        "stdlib": stdlib_found,
+        "non_stdlib": non_stdlib,
+        "relative_imports": relative_count,
+    }
+
+
+# For example, let's call `check_only_stdlib_imports` on a random piece of matrix multiplication code generated by GPT-5:
+
+# In[ ]:
+
+
+sample = """
+def matmul(A, B):
+    import numpy as np
+    from torch import matmul
+    z, s = zip, sum
+    Bt = list(z(*B))
+    return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
+"""
+ok, info = check_only_stdlib_imports(sample)
+print("Only stdlib imports?", ok)
+print(info)
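+
+
+# Note that a static check like this only sees `import` statements. As a minimal sketch (the helper `static_imports` below is hypothetical and not part of this notebook), a dynamic import through `__import__` leaves no `Import` node in the AST, so an import scan on its own may not catch it.

```python
import ast

def static_imports(code):
    """Hypothetical helper: collect top-level module names from import statements only."""
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and not node.level:
            found.add(node.module.split(".")[0])
    return found

# A plain import statement is visible to the AST scan
plain = static_imports("import numpy as np")
# A dynamic import is just a function call, so the scan finds nothing
dynamic = static_imports('np = __import__("numpy")')
print(plain, dynamic)
```

+# If this matters for your task, one option is to also reject generated code that mentions `__import__` or `importlib`; treat that as an assumption to validate rather than a complete defense.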
+
+
+# # Countering Reward Hacking 2: Stop cheating
+# We can stop the RL algorithm from using global or cached variables by restricting its `locals` and `globals`.
+# 
+# We are also going to use `exec` to create the function, so we have to save the output to an empty dict.
+
+# In[ ]:
+
+
+output_function = {}
+exec(sample, {}, output_function)
+output_function["matmul"]
+
+
+# We also disallow global variable access via `types.FunctionType(f.__code__, {})`
+
+# In[ ]:
+
+
+import types
+output_function["matmul"] = types.FunctionType(output_function["matmul"].__code__, {})
+
+def import_numpy():
+    np.matmul
+    print("Success")
+
+import_numpy()
+import_numpy = types.FunctionType(import_numpy.__code__, {})
+try:
+    import_numpy()
+except Exception as e:
+    print(str(e))
+
+
+# In[ ]:
+
+
+def create_locked_down_function(function):
+    output_function = {}
+    exec(function, {}, output_function)
+    new_matmul = output_function["matmul"]
+    new_matmul = types.FunctionType(new_matmul.__code__, {})
+    return new_matmul
+
+
+# # Countering Reward Hacking 3: Stop caching
+# We can stop the RL algorithm from benefiting from cached data by wiping the CPU cache with a large dummy buffer. We also have to benchmark carefully over multiple loops and trials.
+# 
+# We also add a **timeout** so that a generated kernel cannot run in an endless loop.
+
+# In[ ]:
+
+
+import os, gc, time, statistics
+import signal
+from contextlib import contextmanager
+class TimeoutError(Exception): pass
+
+@contextmanager
+def time_limit(seconds):
+    def _handler(signum, frame):
+        raise TimeoutError(f"Timed out after {seconds}s")
+    old = signal.signal(signal.SIGALRM, _handler)
+    signal.setitimer(signal.ITIMER_REAL, seconds)
+    try:
+        yield
+    finally:
+        signal.setitimer(signal.ITIMER_REAL, 0.0)
+        signal.signal(signal.SIGALRM, old)
+
+class Benchmarker:
+    def __init__(self, trials = 3, loops = 1, timeout = 30):
+        self.buffer = np.zeros(2 * 1024 * 1024 * 1024, dtype = np.uint8)
+        self.trials = trials
+        self.loops = loops
+        assert timeout > 0 # Cannot be 0 since it won't work!
+        self.timeout = timeout
+    def thrash(self):
+        # Edit the buffer to wipe cache lines
+        self.buffer ^= 1
+        return int(self.buffer[::4096].sum())
+
+    def benchmark(self, function, arguments):
+        assert len(arguments) == self.loops
+        samples = []
+        exceptions = []
+        timed_out = 0
+        for _ in range(self.trials):
+            gc.collect(); gc.disable(); self.thrash()
+            t_start = time.perf_counter_ns()
+            for i in range(self.loops):
+                try:
+                    with time_limit(self.timeout):
+                        function(*arguments[i])
+                except TimeoutError as e:
+                    timed_out += 1
+                except Exception as e:
+                    exceptions.append(str(e))
+            t_end = time.perf_counter_ns()
+            gc.enable()
+            samples.append((t_end - t_start) // max(1, self.loops))
+        return {
+            "median_ns": int(statistics.median(samples)),
+            "mean_ns": int(statistics.fmean(samples)),
+            "stdev_ns": int(statistics.pstdev(samples) if len(samples) > 1 else 0),
+            "exceptions" : exceptions,
+            "timeouts" : timed_out,
+        }
+
+
+# For example, we take the matmul kernel from earlier and benchmark it with a 10 second timeout:
+
+# In[ ]:
+
+
+A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
+Benchmarker(trials = 1, timeout = 10).benchmark(output_function["matmul"], [(A_list, B_list)])
+
+
+# # Data & RL task setup
+# 
+# We now have to create a prompt that tells the model which task to perform. For our matrix multiply example, we use the one below:
+
+# In[ ]:
+
+
+prompt = """
+Create a new fast matrix multiplication function using only native Python code.
+You are given a list of list of numbers.
+Output your new function in backticks using the format below:
+```python
+def matmul(A, B):
+    return ...
+```
+""".strip()
+print(prompt)
+
+
+# First, let's prompt Gemma 4 without RL and see how it goes:
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+print("=" * 50)
+print("BASE MODEL OUTPUT (before RL training):")
+print("=" * 50)
+
+inputs = tokenizer(
+    text = text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+text_streamer = TextStreamer(tokenizer, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# # Reward functions
+# 
+# We now design `extract_function`, which simply extracts the function wrapped in triple backticks.
+# 
+# We also define 4 reward functions:
+# 
+# 1. `function_works`, which rewards the model if the generated code is a valid Python function.
+# 2. `no_cheating`, which checks whether the function imported other modules and penalizes it if it did.
+# 3. `correctness_check`, which checks whether the kernel's output is correct - it shouldn't generate gibberish!
+# 4. `speed_check`, which compares the kernel's speed against NumPy's `matmul`.
+
+# In[ ]:
+
+
+def extract_function(text):
+    if text.count("```") >= 2:
+        first = text.find("```") + 3
+        second = text.find("```", first)
+        fx = text[first : second].strip()
+        fx = fx.removeprefix("python\n")
+        fx = fx[fx.find("def"):]
+        if fx.startswith("def matmul(A, B):"): return fx
+    return None
+print(extract_function(prompt))
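+
+
+# To see how the extraction behaves on typical completions, here is a small self-contained sketch. The function body is copied from above; the three sample responses are made up for illustration. Only a fenced block that starts with `def matmul(A, B):` is accepted, and anything else yields `None`. It relies on `str.removeprefix`, which needs Python 3.9 or newer.

```python
def extract_function(text):
    # Copy of the notebook's helper so this example runs on its own
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first : second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def matmul(A, B):"): return fx
    return None

fence = "`" * 3  # build the fence programmatically to keep this example readable

good = f"Here you go:\n{fence}python\ndef matmul(A, B):\n    return 1\n{fence}\nDone."
wrong_name = f"{fence}python\ndef multiply(A, B):\n    return 1\n{fence}"
no_fence = "def matmul(A, B):\n    return 1"

extracted = extract_function(good)
print(extracted)
print(extract_function(wrong_name))
print(extract_function(no_fence))
```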
+
+
+# Below is our `function_works` reward function. It uses Python's `exec`, guarded so that local and global variables cannot leak in. We can also call `check_only_stdlib_imports` first to catch syntax errors before even executing the function:
+
+# In[ ]:
+
+
+ok, info = check_only_stdlib_imports("def a")
+ok, info
+
+
+# In[ ]:
+
+
+def function_works(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        print(function)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        if function is None or "error" in info:
+            score = -2.0
+        else:
+            try:
+                new_matmul = create_locked_down_function(function)
+                score = 1.0
+            except:
+                score = -0.5
+        scores.append(score)
+    return scores
+
+
+# `no_cheating` checks if the function cheated since it might have imported Numpy or Torch optimized code.
+
+# In[ ]:
+
+
+def no_cheating(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        else:
+            ok = False
+        scores.append(1.0 if ok else -20.0) # Penalize heavily!
+    return scores
+
+
+# Next, `correctness_check` checks if the kernel was correct. We penalize the kernel when the maximum absolute error or the mean squared error reaches 0.5 or more, and give the highest reward only when both are within a small multiple of machine epsilon.
+# 
+# We have to execute the code now!
+
+# In[ ]:
+
+
+np.finfo(np.float64).eps
+
+
+# In[ ]:
+
+
+def correctness_check(completions, **kwargs):
+    scores = []
+    # Generate some random matrices of size less than 128
+    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 128)
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+        try:
+            new_matmul = create_locked_down_function(function)
+        except:
+            scores.append(0)
+            continue
+        try:
+            pred = new_matmul(A_list.copy(), B_list.copy())
+        except:
+            # Failed!
+            scores.append(-2.0)
+            continue
+        true = np.matmul(A, B)
+        amax_error, mse_error = calculate_difference(pred, true)
+
+        # Check correctness and score!
+        machine_epsilon = 100*np.finfo(np.float64).eps
+        if   amax_error >= 3:   score = -3.0
+        elif amax_error >= 2:   score = -2.5
+        elif amax_error >= 1:   score = -2.0
+        elif amax_error >= 0.5: score = -1.0
+        elif amax_error >= 100*machine_epsilon: score = 0.0
+        elif amax_error >= machine_epsilon: score = 1.0
+        else: score = 3.0
+
+        if   mse_error >= 3:   score += -3.0
+        elif mse_error >= 2:   score += -2.5
+        elif mse_error >= 1:   score += -2.0
+        elif mse_error >= 0.5: score += -1.0
+        elif mse_error >= 100*machine_epsilon: score += 0.0
+        elif mse_error >= machine_epsilon: score += 1.0
+        else: score += 3.0
+        scores.append(score)
+    return scores
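+
+
+# The two tier ladders above apply the same thresholds to each error metric. As a minimal sketch, the hypothetical helper `tier_score` below restates one ladder on its own, using `sys.float_info.epsilon` in place of `np.finfo(np.float64).eps` (both are the float64 machine epsilon), so you can check which tier a given error lands in.

```python
import sys

EPS = sys.float_info.epsilon   # float64 machine epsilon, about 2.22e-16
TOL = 100 * EPS                # the notebook's `machine_epsilon` variable

def tier_score(error):
    """Hypothetical helper: score for a single error metric, same tiers as above."""
    if   error >= 3:         return -3.0
    elif error >= 2:         return -2.5
    elif error >= 1:         return -2.0
    elif error >= 0.5:       return -1.0
    elif error >= 100 * TOL: return 0.0
    elif error >= TOL:       return 1.0
    else:                    return 3.0

for err in (0.0, 1e-13, 1e-6, 0.7, 5.0):
    print(err, tier_score(err))
```

+# The total `correctness_check` score is the sum of this ladder applied to the maximum error and to the mean squared error, so it ranges from -6.0 to 6.0.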
+
+
+# Finally our benchmarking function for `speed_check`! We shall limit the timer to 10 seconds and do 3 trials.
+
+# In[ ]:
+
+
+A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
+benchmarker = Benchmarker(trials = 3, timeout = 10)
+numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
+numpy_results
+
+
+# In[ ]:
+
+
+new_matmul = create_locked_down_function(extract_function(prompt))
+new_results = benchmarker.benchmark(new_matmul, [(A_list, B_list)])
+new_results
+
+
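+# We score speed as the ratio of the two median timings. If the new kernel is slower, the reward is the negated slowdown ratio; if it is faster (i.e. the ratio is less than 1), we invert the ratio to get a positive reward. Both are divided by 100.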
+
+# In[ ]:
+
+
+negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
+positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
+reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
+reward
+
+
+# In[ ]:
+
+
+new_results["median_ns"] = 3
+numpy_results["median_ns"] = 1000
+negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
+positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
+reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
+reward
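+
+
+# The same arithmetic can be wrapped in a small function. This is a sketch only: `speed_reward` is a hypothetical helper, not part of the notebook, and the clipping to [-10, 10] matches what `speed_check` does with its score. It assumes both timings are positive, since a zero timing would divide by zero.

```python
def speed_reward(new_ns, ref_ns, clip = 10.0):
    """Hypothetical helper mirroring the ratio-based speed score."""
    if new_ns >= ref_ns:
        # Slower than (or equal to) the reference: negative reward
        score = -(new_ns / ref_ns) / 100
    else:
        # Faster than the reference: positive reward from the inverted ratio
        score = +(ref_ns / new_ns) / 100
    return max(-clip, min(clip, score))

slower = speed_reward(2000, 1000)    # 2x slower than the reference
faster = speed_reward(3, 1000)       # about 333x faster than the reference
clipped = speed_reward(1, 10000)     # raw score of 100 is clipped
print(slower, faster, clipped)
```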
+
+
+# In[ ]:
+
+
+import gc
+def speed_check(completions, **kwargs):
+    scores = []
+    # Generate some random matrices of size less than 256
+    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 256)
+    numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+        try:
+            new_matmul = create_locked_down_function(function)
+        except:
+            scores.append(0)
+            continue
+        new_results = benchmarker.benchmark(new_matmul, [(A_list.copy(), B_list.copy())])
+
+        # Get score and clip to -10, 10
+        negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
+        positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
+        score = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
+        if score >= 10:  score = 10
+        if score <= -10: score = -10
+        scores.append(score)
+    # Free memory to counteract OOMs
+    gc.collect()
+    torch.cuda.empty_cache()
+    return scores
+
+
+# We create the dataset, which consists of 1,000 copies of our prompt.
+
+# In[ ]:
+
+
+from datasets import Dataset
+dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
+maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
+print(maximum_length)
+dataset[0]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# 
+# Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to our docs https://unsloth.ai/docs/ for more info!
+
+# In[ ]:
+
+
+# Leave room for the prompt (plus 1 token safety margin)
+max_completion_length = max_seq_length - (maximum_length + 1)
+
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    temperature = 1.0,
+    top_p = 0.95,
+    top_k = 64,
+    learning_rate = 5e-5,
+    weight_decay = 0.001,
+    warmup_ratio = 0.1,
+    lr_scheduler_type = "linear",
+    optim = "adamw_8bit",
+    logging_steps = 1,
+    per_device_train_batch_size = 1,
+    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
+    num_generations = 2, # Decrease if out of memory
+    max_completion_length = max_completion_length,
+    # num_train_epochs = 1, # Set to 1 for a full training run
+    max_steps = 100,
+    save_steps = 100,
+    report_to = "none", # Can use Weights & Biases, TrackIO
+    output_dir = "outputs",
+    epsilon = 0.2,
+    epsilon_high = 0.28, # one sided
+    delta = 1.5, # two sided
+    loss_type = 'bnpo',
+    mask_truncated_completions = True,
+    # For optional training + evaluation
+    # fp16_full_eval = True,
+    # per_device_eval_batch_size = 4,
+    # eval_accumulation_steps = 1,
+    # eval_strategy = "steps",
+    # eval_steps = 1,
+)
+
+
+# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
+# 
+# You might have to wait 150 to 200 steps before the reward starts to move, and you'll probably get 0 reward for the first 100 steps. Since `max_steps` is set to 100 above, increase it for a real run. Please be patient!
+# 
+# | Step | Training Loss | reward    | reward_std | completion_length | kl       |
+# |------|---------------|-----------|------------|-------------------|----------|
+# | 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
+# | 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
+# | 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
+
+# In[ ]:
+
+
+# For optional training + evaluation
+# new_dataset = dataset.train_test_split(test_size = 0.01)
+
+trainer = GRPOTrainer(
+    model = model,
+    processing_class = tokenizer,
+    reward_funcs = [
+        function_works,
+        no_cheating,
+        correctness_check,
+        speed_check,
+    ],
+    args = training_args,
+    train_dataset = dataset,
+
+    # For optional training + evaluation
+    # train_dataset = new_dataset["train"],
+    # eval_dataset = new_dataset["test"],
+)
+
+
+# And let's train the model!
+# 
+# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
+
+# In[ ]:
+
+
+trainer.train()
+
+
+# Now that we have trained the LoRA with GRPO, we first save it:
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+
+
+# Verify LoRA is actually trained!
+
+# In[ ]:
+
+
+from safetensors import safe_open
+
+# Load the adapter from the directory we saved to above
+with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
+    # Verify both A and B are non zero
+    for key in f.keys():
+        tensor = f.get_tensor(key)
+        zero_fraction = (tensor == 0).sum() / tensor.numel()
+        # A fraction of 1.0 means the tensor is entirely zeros, i.e. untrained
+        assert zero_fraction.item() != 1.0, f"{key} is all zeros"
+
+
+# <a name="Inference"></a>
+# # Inference
+# Now let's try the model we just trained!
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+
+_ = model.generate(
+    **tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    max_new_tokens = 1024,
+    streamer = TextStreamer(tokenizer, skip_prompt = False),
+)
+
+
+# <a name="Save"></a>
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Merge to 16bit
+if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
+
+# Merge to 4bit
+if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
+
+# Just LoRA adapters
+if False:
+    model.save_pretrained("gemma_4_lora")
+    tokenizer.save_pretrained("gemma_4_lora")
+if False:
+    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+
+
+# ### GGUF / llama.cpp Conversion
+# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
+# 
+# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
+# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
+# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
+# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
+# 
+# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+
+# In[ ]:
+
+
+# Save to 8bit Q8_0
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
+# Remember to go to https://huggingface.co/settings/tokens for a token!
+# And change HF_USERNAME to your username!
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
+
+# Save to 16bit GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
+
+# Save to q4_k_m GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
+
+# Save to multiple GGUF options - much faster if you want multiple!
+if False:
+    model.push_to_hub_gguf(
+        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
+        tokenizer,
+        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,913 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# # Goal: Make Gemma 4 play games with Reinforcement Learning
+# 
+# Our goal is to make Gemma 4 play the 2048 game using reinforcement learning, specifically an algorithm called [GRPO](https://arxiv.org/abs/2501.12948).
+# 
+# We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose. We then reward the model if it created a good strategy (winning the game), and we'll penalize it (negative reward) if the strategy was a bad one.
+# 
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/2048_win.png/500px-2048_win.png" height=300 />
+
+# # Installation
+# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n    try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n    except: _numpy = "numpy"; _pil = "pillow"\n    # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n    !uv pip install -qqq \\\n        "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n    !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
+
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+
+
+# ### Unsloth
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel
+import torch
+max_seq_length = 4096 # Can increase for longer reasoning traces
+lora_rank = 32 # Larger rank = smarter, but slower
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastVisionModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    max_seq_length = max_seq_length,
+    load_in_4bit = False, # False for LoRA 16bit
+    fast_inference = False, # Set to True to enable vLLM fast inference
+)
+
+
+# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha = lora_rank*2, # *2 speeds up training
+    use_gradient_checkpointing = "unsloth", # Reduces memory usage
+    random_state = 3407,
+)
+
+
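+# As a rough illustration of why LoRA adds so few weights: for one weight matrix of shape `d_out x d_in`, a rank `r` adapter adds `r * (d_in + d_out)` parameters. The sketch below uses hypothetical layer sizes (not Gemma 4's real dimensions) purely to show the arithmetic; `lora_extra_params` is a helper made up for this example.
+
+# In[ ]:
+
+
+def lora_extra_params(d_in : int, d_out : int, r : int) -> int:
+    # LoRA learns A (r x d_in) and B (d_out x r) instead of updating the full d_out x d_in matrix
+    return r * (d_in + d_out)
+
+# Hypothetical square projection, chosen only for illustration
+d_in, d_out = 2048, 2048
+extra = lora_extra_params(d_in, d_out, r = 32)
+full = d_in * d_out
+print(f"LoRA adds {extra:,} parameters on top of {full:,} ({100 * extra / full:.2f}%)")
+
+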
+# # 2048 game
+# 
+# We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).
+
+# In[ ]:
+
+
+#@title (Collapsible) 2048 Game Implementation
+from dataclasses import dataclass, field
+from typing import List, Tuple, Optional
+import random
+import copy
+
+def _compress_and_merge_row_left(row: List[int]) -> Tuple[List[int], int, bool]:
+    n = len(row)
+    tiles = [x for x in row if x != 0]
+    gained = 0
+    i = 0
+    merged = []
+    while i < len(tiles):
+        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
+            v = tiles[i] * 2
+            gained += v
+            merged.append(v)
+            i += 2
+        else:
+            merged.append(tiles[i])
+            i += 1
+    merged += [0] * (n - len(merged))
+    changed = merged != row
+    return merged, gained, changed
+
+def _move_left(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    changed_any = False
+    total_gain = 0
+    new_board = []
+    for row in board:
+        new_row, gained, changed = _compress_and_merge_row_left(row)
+        new_board.append(new_row)
+        total_gain += gained
+        changed_any = changed_any or changed
+    return new_board, total_gain, changed_any
+
+def _move_right(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    changed_any = False
+    total_gain = 0
+    new_board = []
+    for row in board:
+        rev = list(reversed(row))
+        new_rev, gained, changed = _compress_and_merge_row_left(rev)
+        new_row = list(reversed(new_rev))
+        new_board.append(new_row)
+        total_gain += gained
+        changed_any = changed_any or changed
+    return new_board, total_gain, changed_any
+
+def _transpose(board: List[List[int]]) -> List[List[int]]:
+    return [list(row) for row in zip(*board)]
+
+def _move_up(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    t = _transpose(board)
+    moved, gain, changed = _move_left(t)
+    return _transpose(moved), gain, changed
+
+def _move_down(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    t = _transpose(board)
+    moved, gain, changed = _move_right(t)
+    return _transpose(moved), gain, changed
+
+def _empty_cells(board: List[List[int]]) -> List[Tuple[int, int]]:
+    size = len(board)
+    return [(r, c) for r in range(size) for c in range(size) if board[r][c] == 0]
+
+def _can_move(board: List[List[int]]) -> bool:
+    if _empty_cells(board):
+        return True
+    size = len(board)
+    for r in range(size):
+        for c in range(size - 1):
+            if board[r][c] == board[r][c + 1]:
+                return True
+    for r in range(size - 1):
+        for c in range(size):
+            if board[r][c] == board[r + 1][c]:
+                return True
+    return False
+
+@dataclass
+class GameBoard:
+    size: int
+    seed: Optional[int] = None
+    target: int = 2048
+    probability_fours: float = 0.10 # originally spawns (4) 10% of the time!
+    _rng: random.Random = field(init = False, repr = False)
+    _board: List[List[int]] = field(init = False, repr = False)
+    _score: int = field(default = 0, init = False, repr = False)
+    _state: str = field(default = "ongoing", init = False, repr = False)
+
+    def __post_init__(self):
+        if self.size < 2:
+            raise ValueError("Board size must be at least 2.")
+        self._rng = random.Random(self.seed)
+        self._board = [[0 for _ in range(self.size)] for _ in range(self.size)]
+        self._add_random_tile()
+        self._add_random_tile()
+        self._update_state_after_change()
+
+    class _BoardView:
+        def __init__(self, game: "GameBoard"):
+            self._game = game
+        def __iter__(self):
+            return iter(self._game._board)
+        def __len__(self):
+            return len(self._game._board)
+        def __getitem__(self, idx):
+            return self._game._board[idx]
+        def __repr__(self) -> str:
+            return repr(self._game._board)
+        __str__ = __repr__
+        def do_action(self, key: str) -> None:
+            self._game.do_action(key)
+        def state(self) -> str:
+            return self._game.state()
+        def pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
+            return self._game._render_pretty(colors = colors, border = border, dot_for_zero = dot_for_zero)
+
+    def board(self) -> "_BoardView":
+        return GameBoard._BoardView(self)
+    def state(self) -> str:
+        return self._state
+    def score(self) -> int:
+        return self._score
+    def do_action(self, key: str) -> None:
+        if self._state != "ongoing":
+            return
+        if not isinstance(key, str) or len(key) == 0:
+            self._state = "failed"
+            return
+        k = key.strip().lower()
+        if k == "q":
+            self._state = "failed"
+            return
+        move_map = {"a": _move_left, "d": _move_right, "w": _move_up, "s": _move_down}
+        if k not in move_map:
+            self._state = "failed"
+            return
+        mover = move_map[k]
+        new_board, gain, changed = mover(self._board)
+        if changed:
+            self._board = new_board
+            self._score += gain
+            self._add_random_tile()
+        self._update_state_after_change()
+    def _add_random_tile(self) -> bool:
+        empties = _empty_cells(self._board)
+        if not empties:
+            return False
+        r, c = self._rng.choice(empties)
+        self._board[r][c] = 4 if self._rng.random() < self.probability_fours else 2
+        return True
+    def _update_state_after_change(self) -> None:
+        if any(self.target in row for row in self._board):
+            self._state = "success"
+            return
+        if not _can_move(self._board):
+            self._state = "failed"
+            return
+        self._state = "ongoing"
+    def _render_pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
+        """
+        Pretty-print the board with colors that scale from 0 up to self.target.
+        Uses ANSI 256-color codes (works in most terminals). Set colors = False to disable.
+        """
+        import math
+
+        b = self._board
+        mx = max((max(row) for row in b), default = 0)
+        cell_w = max(3, len(str(mx)))
+
+        RESET = "\x1b[0m"
+
+        # A smooth-ish gradient from cool → warm
+        # (blue/cyan/green → yellow/orange/red). Tweak or expand as you like.
+        GRAD = [33, 39, 45, 51, 50, 49, 48, 47, 46, 82, 118, 154, 190, 226, 220, 214, 208, 202, 196]
+        ZERO_FG = 239  # dim gray
+
+        def color_code(v: int) -> str:
+            if not colors:
+                return ""
+            if v == 0:
+                return f"\x1b[38;5;{ZERO_FG}m"
+            # Normalize by exponent relative to target: r in [0,1]
+            t = max(2, self.target)  # safety; avoid log2(1)
+            # Guard: if v is not a power of two or is <1, handle gracefully
+            try:
+                r = max(0.0, min(1.0, math.log2(v) / math.log2(t)))
+            except ValueError:
+                r = 0.0
+            idx = int(round(r * (len(GRAD) - 1)))
+            return f"\x1b[38;5;{GRAD[idx]}m"
+
+        def fmt(v: int) -> str:
+            s = "." if (v == 0 and dot_for_zero) else str(v)
+            s = s.rjust(cell_w)
+            return color_code(v) + s + (RESET if colors else "")
+
+        def hline(left: str, mid: str, right: str) -> str:
+            return left + mid.join("─" * cell_w for _ in range(self.size)) + right
+
+        rows = []
+        if border:
+            rows.append(hline("┌", "┬", "┐"))
+        for r in range(self.size):
+            content = "│".join(fmt(v) for v in b[r])
+            rows.append(("│" + content + "│") if border else content)
+            if border:
+                rows.append(hline("└" if r == self.size - 1 else "├",
+                                "┴" if r == self.size - 1 else "┼",
+                                "┘" if r == self.size - 1 else "┤"))
+        return "\n".join(rows)
+
+
+# For example let's create a board of size 5 X 5 and set the target to 8 instead of 2048.
+# 
+# **[NOTE]** 2048 originally spawns a (4) 10% of the time! We can disable this for harder games. See [Wikipedia page](https://en.wikipedia.org/wiki/2048_(video_game)) for more details.
+
+# In[ ]:
+
+
+game = GameBoard(size = 5, seed = 42, target = 8, probability_fours = 0.10)
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game
+
+
+# We'll use WASD for the action space:
+# 
+# ```
+#    W
+# A  S  D
+# ```
+# Also `game.state()` will say `success` if we succeeded in getting the target!
+
+# In[ ]:
+
+
+game.do_action("A")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("W")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("D")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("W")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("D")
+print(game.board().pretty(), game.state())
+
+
+# If we do some other action that's not part of the action space, the game state becomes `failed`, and the game will not accept any more actions.
+
+# In[ ]:
+
+
+game = GameBoard(size = 3, seed = 42, target = 8, probability_fours = 0.10)
+game.do_action("AA") # Not in WASD
+game.do_action("W")  # Doesn't do anything
+game.do_action("A")  # Doesn't do anything
+print(game.board().pretty(), game.state())
+
+
+# # RL Environment Setup
+# 
+# We'll set up a function to accept some strategy that'll emit an action within `WASD` and check the game state.
+# 
+# We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
+
+# In[ ]:
+
+
+from typing import Callable
+from unsloth import execute_with_time_limit
+
+def _execute_strategy(strategy : Callable, game : GameBoard):
+    assert callable(strategy)
+
+    steps = 0
+    while game.state() == "ongoing":
+        action = strategy(list(game.board()))
+        steps += 1
+        if type(action) is not str:
+            return steps, "failed"
+        game.do_action(action)
+    return steps, game.state()
+
+@execute_with_time_limit(2)
+def execute_strategy(strategy : Callable, game : GameBoard):
+    return _execute_strategy(strategy, game)
+
+
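+# `execute_with_time_limit` comes from Unsloth. To show the general idea behind such a timer, below is a minimal sketch of a time-limit decorator built on Unix signals. `time_limited` is a hypothetical stand-in and not Unsloth's implementation; it assumes a Unix system and code running in the main thread.
+
+# In[ ]:
+
+
+import signal
+from functools import wraps
+
+def time_limited(seconds : float):
+    # Hypothetical helper: raises TimeoutError if the wrapped call runs longer than `seconds`
+    def decorator(fn):
+        @wraps(fn)
+        def wrapper(*args, **kwargs):
+            def _on_alarm(signum, frame):
+                raise TimeoutError(f"Timed out after {seconds} seconds")
+            previous = signal.signal(signal.SIGALRM, _on_alarm)
+            signal.setitimer(signal.ITIMER_REAL, seconds)
+            try:
+                return fn(*args, **kwargs)
+            finally:
+                signal.setitimer(signal.ITIMER_REAL, 0) # Cancel the pending alarm
+                signal.signal(signal.SIGALRM, previous) # Restore the old handler
+        return wrapper
+    return decorator
+
+@time_limited(0.1)
+def spin_forever():
+    while True:
+        pass
+
+try:
+    spin_forever()
+except TimeoutError as e:
+    print(f"Timed out with error = {str(e)}")
+
+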
+# Let's make a generic strategy to just hit `W`. We should expect this generic strategy to fail:
+
+# In[ ]:
+
+
+def always_move_up(board):
+    return "W"
+
+game = GameBoard(size = 8, seed = 42, target = 2048, probability_fours = 0.10)
+try:
+    execute_strategy(always_move_up, game)
+except TimeoutError as e:
+    print(f"Timed out with error = {str(e)}")
+
+
+# To allow longer strategies for Gemma 4 Reinforcement Learning, we shall allow a 5 second timer.
+
+# In[ ]:
+
+
+@execute_with_time_limit(5)
+def execute_strategy(strategy : Callable, game : GameBoard):
+    return _execute_strategy(strategy, game)
+
+
+# # Code Execution
+# 
+# Before executing a newly generated Python function, we first have to check that it does not access global variables or otherwise cheat. This is called `countering reward hacking`.
+# 
+# For example the below piece of code is fine, since it only imports modules from the Python standard library. We use `check_python_modules`:
+
+# In[ ]:
+
+
+from unsloth import check_python_modules
+
+sample = """
+def strategy(board):
+    import math
+    from typing import Callable
+    return "W"
+"""
+ok, info = check_python_modules(sample)
+print("Only Python imports?", ok)
+print(info)
+
+
+# For the below piece of code, since we import `numpy`, we should not allow the execution:
+
+# In[ ]:
+
+
+sample = """
+def strategy(board):
+    from numpy import matmul
+    return "W"
+"""
+ok, info = check_python_modules(sample)
+print("Only Python imports?", ok)
+print(info)
+
+
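+# `check_python_modules` is Unsloth's helper. As a minimal sketch of how such a check could work, the cell below walks the syntax tree and compares imported module names against the standard library list. `imports_only_stdlib` and its return format are hypothetical and do not reproduce Unsloth's implementation; it assumes Python 3.10 or newer for `sys.stdlib_module_names`.
+
+# In[ ]:
+
+
+import ast
+import sys
+
+def imports_only_stdlib(source : str):
+    # Hypothetical helper: returns (ok, info) where info lists non standard library imports
+    try:
+        tree = ast.parse(source)
+    except SyntaxError as e:
+        return False, {"error" : str(e)}
+    imported = set()
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Import):
+            imported.update(alias.name.split(".")[0] for alias in node.names)
+        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
+            imported.add(node.module.split(".")[0])
+    # sys.stdlib_module_names requires Python 3.10 or newer
+    non_stdlib = sorted(m for m in imported if m not in sys.stdlib_module_names)
+    return len(non_stdlib) == 0, {"non_stdlib" : non_stdlib}
+
+print(imports_only_stdlib('def strategy(board):\n    import math\n    return "W"'))
+print(imports_only_stdlib('def strategy(board):\n    from numpy import matmul\n    return "W"'))
+
+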
+# We also disallow global variable access. We'll use Unsloth's `create_locked_down_function` function:
+
+# In[ ]:
+
+
+from unsloth import create_locked_down_function
+function = """
+def import_numpy():
+    np.matmul
+    print("Success")
+"""
+f = create_locked_down_function(function)
+try:
+    f()
+except Exception as e:
+    print(str(e))
+
+
+# In[ ]:
+
+
+from unsloth import create_locked_down_function
+function = """
+def add(a, b):
+    def adder(a):
+        return a + b
+    return adder(b) + b
+"""
+f = create_locked_down_function(function)
+try:
+    print(f(10, 20))
+except Exception as e:
+    print(str(e))
+
+
+# # Data & RL task setup
+# 
+# We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.
+
+# In[ ]:
+
+
+prompt = """
+Create a new short 2048 strategy using only native Python code.
+You are given a list of list of numbers for the current board state.
+Output one action for "W", "A", "S", "D" on what is the optimal next step.
+Output your new short function in backticks using the format below:
+```python
+def strategy(board):
+    return "W" # Example
+```
+All helper functions should be inside def strategy. Only output the short function `strategy`.
+""".strip()
+print(prompt)
+
+
+# First, let's prompt Gemma 4 without RL and see how it goes:
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+print("=" * 50)
+print("BASE MODEL OUTPUT (before RL training):")
+print("=" * 50)
+
+inputs = tokenizer(
+    text = text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+text_streamer = TextStreamer(tokenizer, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# # Reward functions
+# 
+# We now design an `extract_function` function which simply extracts the function wrapped in 3 back ticks.
+# 
+# And 3 reward functions:
+# 
+# 1. `function_works` which rewards the model if the strategy is a valid Python function.
+# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
+# 3. `strategy_succeeds` which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.
+
+# In[ ]:
+
+
+def extract_function(text):
+    if text.count("```") >= 2:
+        first = text.find("```") + 3
+        second = text.find("```", first)
+        fx = text[first : second].strip()
+        fx = fx.removeprefix("python\n")
+        fx = fx[fx.find("def"):]
+        if fx.startswith("def strategy(board):"): return fx
+    return None
+print(extract_function(prompt))
+
+
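+# To see how extraction behaves on different model responses, here is a small regex-based variant that follows similar (not identical) rules to `extract_function`. `extract_strategy_regex` and the sample responses are hypothetical and only for illustration; the reward functions below keep using `extract_function`.
+
+# In[ ]:
+
+
+import re
+from typing import Optional
+
+# Matches the first fenced block, with or without a `python` tag
+_FENCE = re.compile(r"```(?:python)?[ \t]*\n(.*?)```", re.DOTALL)
+
+def extract_strategy_regex(text : str) -> Optional[str]:
+    match = _FENCE.search(text)
+    if match is None:
+        return None # No complete fenced block
+    code = match.group(1).strip()
+    start = code.find("def")
+    if start == -1:
+        return None # No function definition at all
+    code = code[start:]
+    return code if code.startswith("def strategy(board):") else None
+
+good = 'Here you go:\n```python\ndef strategy(board):\n    return "A"\n```'
+wrong_name = '```python\ndef play(board):\n    return "A"\n```'
+no_fence = 'def strategy(board):\n    return "A"'
+for response in (good, wrong_name, no_fence):
+    print(repr(extract_strategy_regex(response)))
+
+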
+# Below is our `function_works` reward function which uses Python's `exec` but guarded by not allowing leakage of local and global variables. We can also use `check_python_modules` first to check if there are errors before even executing the function:
+
+# In[ ]:
+
+
+ok, info = check_python_modules("def a")
+ok, info
+
+
+# In[ ]:
+
+
+def function_works(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_python_modules(function)
+        if function is None or "error" in info:
+            score = -2.0
+        else:
+            try:
+                new_strategy = create_locked_down_function(function)
+                score = 1.0
+            except Exception:
+                score = -0.5
+        scores.append(score)
+    return scores
+
+
+# `no_cheating` checks if the function cheated since it might have imported Numpy or other functions:
+
+# In[ ]:
+
+
+def no_cheating(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_python_modules(function)
+            scores.append(1.0 if ok else -20.0) # Penalize heavily!
+        else:
+            scores.append(-1.0) # Failed creating function
+    return scores
+
+
+# Next `strategy_succeeds` checks if the strategy actually allows the game to terminate. Imagine a strategy that simply always returned "W": the game would get stuck and the run would hit the 5 second time limit.
+# 
+# We also add a global `PRINTER` to print out the strategy and board state.
+
+# In[ ]:
+
+
+import numpy as np
+global PRINTER
+PRINTER = 0
+def strategy_succeeds(completions, **kwargs):
+    global PRINTER
+    scores = []
+    # Generate a random game board with seed
+    seed = np.random.randint(10000)
+    for completion in completions:
+        printed = False
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if PRINTER % 5 == 0:
+            printed = True
+            print(function)
+        PRINTER += 1
+        if function is not None:
+            ok, info = check_python_modules(function)
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+        try:
+            new_strategy = create_locked_down_function(function)
+        except Exception:
+            scores.append(0)
+            continue
+        try:
+            game = GameBoard(size = 6, seed = seed, target = 2048, probability_fours = 0.10)
+            steps, game_state = execute_strategy(new_strategy, game)
+            print(f"Steps = {steps} State = {game_state}")
+            if printed is False:
+                print(function)
+            print(game.board().pretty())
+            if game_state == "success":
+                scores.append(20.0) # Success - massively reward!
+            else:
+                scores.append(2.0) # Failed but function works!
+        except TimeoutError as e:
+            print("Timeout")
+            scores.append(-1.0) # Failed with timeout
+        except Exception as e:
+            print(f"Exception = {str(e)}")
+            scores.append(-3.0) # Failed
+    return scores
+
+
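+# Each reward function returns one score per completion. As a rough sketch of how GRPO style training typically uses them, the cell below adds the scores per completion and then normalizes within the group of completions sampled for the same prompt. `group_relative_advantages` is a hypothetical helper, and the population standard deviation and `eps` value are simplifying assumptions; the trainer's real normalization has more options. If every completion in a group gets the same total, all advantages are 0 and that group gives no learning signal.
+
+# In[ ]:
+
+
+from statistics import mean, pstdev
+
+def group_relative_advantages(rewards_per_func, eps = 1e-4):
+    # rewards_per_func: one list of scores per reward function, one entry per completion
+    totals = [sum(scores) for scores in zip(*rewards_per_func)]
+    mu, sigma = mean(totals), pstdev(totals)
+    return [(total - mu) / (sigma + eps) for total in totals]
+
+# Hypothetical scores for a group of 2 completions (num_generations = 2)
+example_scores = [
+    [1.0, -2.0], # function_works
+    [1.0, -1.0], # no_cheating
+    [2.0,  0.0], # strategy_succeeds
+]
+print(group_relative_advantages(example_scores))
+
+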
+# We'll now create the dataset which includes a replica of our prompt.
+
+# In[ ]:
+
+
+from datasets import Dataset
+dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
+maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
+print(maximum_length)
+dataset[0]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# 
+# Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
+
+# In[ ]:
+
+
+# Leave room for the prompt (plus 1 token safety margin)
+max_completion_length = max_seq_length - (maximum_length + 1)
+
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    temperature = 1.0,
+    top_p = 0.95,
+    top_k = 64,
+    learning_rate = 5e-5,
+    weight_decay = 0.001,
+    warmup_ratio = 0.1,
+    lr_scheduler_type = "linear",
+    optim = "adamw_8bit",
+    logging_steps = 1,
+    per_device_train_batch_size = 1,
+    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
+    num_generations = 2, # Decrease if out of memory
+    max_completion_length = max_completion_length,
+    # num_train_epochs = 1, # Set to 1 for a full training run
+    max_steps = 60,
+    save_steps = 100,
+    report_to = "none", # Can use Weights & Biases, TrackIO
+    output_dir = "outputs",
+    epsilon = 0.2,
+    epsilon_high = 0.28, # one sided
+    delta = 1.5, # two sided
+    loss_type = 'bnpo',
+    mask_truncated_completions = True,
+    # For optional training + evaluation
+    # fp16_full_eval = True,
+    # per_device_eval_batch_size = 4,
+    # eval_accumulation_steps = 1,
+    # eval_strategy = "steps",
+    # eval_steps = 1,
+)
+
+
+# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
+# 
+# You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
+# 
+# | Step | Training Loss | reward    | reward_std | completion_length | kl       |
+# |------|---------------|-----------|------------|-------------------|----------|
+# | 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
+# | 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
+# | 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
+
+# In[ ]:
+
+
+# For optional training + evaluation
+# new_dataset = dataset.train_test_split(test_size = 0.01)
+
+trainer = GRPOTrainer(
+    model = model,
+    processing_class = tokenizer,
+    reward_funcs = [
+        function_works,
+        no_cheating,
+        strategy_succeeds,
+    ],
+    args = training_args,
+    train_dataset = dataset,
+
+    # For optional training + evaluation
+    # train_dataset = new_dataset["train"],
+    # eval_dataset = new_dataset["test"],
+)
+
+
+# And let's train the model!
+# 
+# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
+
+# In[ ]:
+
+
+trainer.train()
+
+
+# Now that the LoRA is trained with GRPO, we first save it!
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+
+
+# Verify LoRA is actually trained!
+
+# In[ ]:
+
+
+from safetensors import safe_open
+
+# The path must match the directory passed to save_pretrained above
+with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
+    # Verify both A and B are non zero
+    for key in f.keys():
+        tensor = f.get_tensor(key)
+        n_zeros = (tensor == 0).sum().item()  # Number of exactly-zero entries
+        assert n_zeros != tensor.numel(), f"{key} is all zeros"
+
+
+# <a name="Inference"></a>
+# # Inference
+# Now let's try the model we just trained!
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+
+_ = model.generate(
+    **tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    max_new_tokens = 1024,
+    streamer = TextStreamer(tokenizer, skip_prompt = False),
+)
+
+
+# <a name="Save"></a>
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Merge to 16bit
+if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
+
+# Merge to 4bit
+if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
+
+# Just LoRA adapters
+if False:
+    model.save_pretrained("gemma_4_lora")
+    tokenizer.save_pretrained("gemma_4_lora")
+if False:
+    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+
+
+# ### GGUF / llama.cpp Conversion
+# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
+# 
+# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
+# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
+# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
+# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
+# 
+# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+
+# In[ ]:
+
+
+# Save to 8bit Q8_0
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
+# Remember to go to https://huggingface.co/settings/tokens for a token!
+# And change HF_USERNAME to your username!
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
+
+# Save to 16bit GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
+
+# Save to q4_k_m GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
+
+# Save to multiple GGUF options - much faster if you want multiple!
+if False:
+    model.push_to_hub_gguf(
+        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
+        tokenizer,
+        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,897 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# # Goal: Make Gemma 4 solve Sudoku puzzles with Reinforcement Learning
+# 
+# Our goal is to make Gemma 4 learn to solve Sudoku puzzles using reinforcement learning (GRPO).
+# The model will devise a strategy to fill in empty cells, and we'll reward it for correct placements
+# and completing valid puzzles.
+# 
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg/1280px-Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg.png" height="300" />
+
+# # Installation
+# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster.
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n    try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n    except: _numpy = "numpy"; _pil = "pillow"\n    # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n    !uv pip install -qqq \\\n        "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n    !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
+
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+
+
+# ### Unsloth
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel
+import torch
+max_seq_length = 4096 # Can increase for longer reasoning traces
+lora_rank = 32 # Larger rank = smarter, but slower
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastVisionModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    max_seq_length = max_seq_length,
+    load_in_4bit = False, # False for LoRA 16bit
+    fast_inference = False, # Set to True to enable vLLM fast inference
+)
+
+
+# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha = lora_rank*2, # *2 speeds up training
+    use_gradient_checkpointing = "unsloth", # Reduces memory usage
+    random_state = 3407,
+)
+
+
+# # Sudoku Game Implementation
+# 
+# We use GPT-5 to create a clean Sudoku solver environment. The strategy returns a `(row, col, number)` tuple to fill a cell.
+
+# In[ ]:
+
+
+#@title Sudoku Game Implementation
+from dataclasses import dataclass, field
+from typing import List, Tuple, Optional
+import random
+import copy
+
+def _is_valid_placement(board: List[List[int]], row: int, col: int, num: int) -> bool:
+    """Check if placing num at (row, col) is valid."""
+    # Check row
+    if num in board[row]:
+        return False
+
+    # Check column
+    if num in [board[r][col] for r in range(9)]:
+        return False
+
+    # Check 3x3 box
+    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
+    for r in range(box_row, box_row + 3):
+        for c in range(box_col, box_col + 3):
+            if board[r][c] == num:
+                return False
+
+    return True
+
+def _solve_sudoku(board: List[List[int]]) -> bool:
+    """Solve sudoku using backtracking (for puzzle generation)."""
+    for row in range(9):
+        for col in range(9):
+            if board[row][col] == 0:
+                for num in range(1, 10):
+                    if _is_valid_placement(board, row, col, num):
+                        board[row][col] = num
+                        if _solve_sudoku(board):
+                            return True
+                        board[row][col] = 0
+                return False
+    return True
+
+def _generate_complete_board(rng: random.Random) -> List[List[int]]:
+    """Generate a complete valid Sudoku board."""
+    board = [[0 for _ in range(9)] for _ in range(9)]
+
+    # Fill diagonal 3x3 boxes first (they don't affect each other)
+    for box in range(3):
+        nums = list(range(1, 10))
+        rng.shuffle(nums)
+        for i in range(3):
+            for j in range(3):
+                board[box * 3 + i][box * 3 + j] = nums[i * 3 + j]
+
+    # Solve the rest
+    _solve_sudoku(board)
+    return board
+
+@dataclass
+class SudokuGame:
+    difficulty: int = 40  # Number of cells to remove (20 = easy, 40 = medium, 50 = hard)
+    seed: Optional[int] = None
+    _rng: random.Random = field(init = False, repr = False)
+    _board: List[List[int]] = field(init = False, repr = False)
+    _solution: List[List[int]] = field(init = False, repr = False)
+    _initial_board: List[List[int]] = field(init = False, repr = False)
+    _moves: int = field(default = 0, init = False, repr = False)
+    _state: str = field(default = "ongoing", init = False, repr = False)
+
+    def __post_init__(self):
+        self._rng = random.Random(self.seed)
+
+        # Generate complete board
+        complete_board = _generate_complete_board(self._rng)
+        self._solution = copy.deepcopy(complete_board)
+
+        # Remove cells to create puzzle
+        self._board = copy.deepcopy(complete_board)
+        cells = [(r, c) for r in range(9) for c in range(9)]
+        self._rng.shuffle(cells)
+
+        for r, c in cells[:self.difficulty]:
+            self._board[r][c] = 0
+
+        self._initial_board = copy.deepcopy(self._board)
+        self._update_state()
+
+    def board(self) -> List[List[int]]:
+        """Return current board state."""
+        return [row[:] for row in self._board]
+
+    def initial_board(self) -> List[List[int]]:
+        """Return initial puzzle state."""
+        return [row[:] for row in self._initial_board]
+
+    def state(self) -> str:
+        """Return game state: 'ongoing', 'success', or 'failed'."""
+        return self._state
+
+    def moves(self) -> int:
+        """Return number of moves made."""
+        return self._moves
+
+    def place_number(self, row: int, col: int, num: int) -> bool:
+        """Place a number on the board. Returns True if valid move."""
+        # Validate input
+        if not (0 <= row < 9 and 0 <= col < 9):
+            self._state = "failed"
+            return False
+
+        if not (1 <= num <= 9):
+            self._state = "failed"
+            return False
+
+        # Can't modify initial cells
+        if self._initial_board[row][col] != 0:
+            self._state = "failed"
+            return False
+        # Can't overwrite a cell that has already been filled
+        if self._board[row][col] != 0:
+            self._state = "failed"
+            return False
+        # Check if placement is valid
+        if not _is_valid_placement(self._board, row, col, num):
+            self._state = "failed"
+            return False
+
+        # Place number
+        self._board[row][col] = num
+        self._moves += 1
+        self._update_state()
+        return True
+
+    def _update_state(self) -> None:
+        """Update game state based on current board."""
+        # Check if puzzle is complete
+        if all(self._board[r][c] != 0 for r in range(9) for c in range(9)):
+            # Verify solution is correct
+            if self._board == self._solution:
+                self._state = "success"
+            else:
+                self._state = "failed"
+        else:
+            self._state = "ongoing"
+
+    def pretty(self, colors: bool = True) -> str:
+        """Pretty print the Sudoku board."""
+        RESET = "\x1b[0m"
+        INITIAL = "\x1b[38;5;45m"   # Cyan for initial numbers
+        PLACED = "\x1b[38;5;226m"    # Yellow for placed numbers
+        EMPTY = "\x1b[38;5;239m"     # Gray for empty cells
+
+        lines = []
+        lines.append("┌───────┬───────┬───────┐")
+
+        for row in range(9):
+            row_str = "│ "
+            for col in range(9):
+                num = self._board[row][col]
+
+                if colors:
+                    if num == 0:
+                        row_str += f"{EMPTY}.{RESET}"
+                    elif self._initial_board[row][col] != 0:
+                        row_str += f"{INITIAL}{num}{RESET}"
+                    else:
+                        row_str += f"{PLACED}{num}{RESET}"
+                else:
+                    row_str += str(num) if num != 0 else "."
+
+                if col % 3 == 2:
+                    row_str += " │ "
+                else:
+                    row_str += " "
+
+            lines.append(row_str.rstrip())
+
+            if row == 8:
+                lines.append("└───────┴───────┴───────┘")
+            elif row % 3 == 2:
+                lines.append("├───────┼───────┼───────┤")
+
+        return "\n".join(lines)
+
+
+# Test the Sudoku environment:
+
+# In[ ]:
+
+
+# Create a fairly easy puzzle (30 cells removed)
+game = SudokuGame(difficulty = 30, seed = 42)
+print("Initial puzzle:")
+print(game.pretty())
+print(f"\nState: {game.state()}, Moves: {game.moves()}")
+
+
+# In[ ]:
+
+
+game
+
+
+# Try making some moves:
+
+# In[ ]:
+
+
+# Make a move: place 7 at row 0, col 1
+game.place_number(0, 1, 7)
+print("\nAfter placing 7 at (row 0, col 1):")
+print(game.pretty())
+print(f"State: {game.state()}, Moves: {game.moves()}")
+
+
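+# If we make a move that's not part of the action space (an out-of-range index or digit, an already filled cell, or a placement that breaks the Sudoku rules), `place_number` returns `False` and the game state becomes `"failed"`, so the executor below stops accepting actions.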
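+# 
+# A minimal sketch of an out-of-range move, using a fresh game (`bad_game` is just an illustrative name) so the game above is left untouched:
+
+# In[ ]:
+
+
+# 10 is not a valid Sudoku digit, so the move is rejected and the game is marked as failed
+bad_game = SudokuGame(difficulty = 30, seed = 42)
+accepted = bad_game.place_number(0, 0, 10)
+print(f"Move accepted: {accepted}, State: {bad_game.state()}")
+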
+
+# # RL Environment Setup
+# 
+# Execute strategies with time limits to prevent infinite loops.
+
+# In[ ]:
+
+
+from typing import Callable
+from unsloth import execute_with_time_limit
+
+def _execute_strategy(strategy: Callable, game: SudokuGame):
+    """Execute a strategy function on a Sudoku game."""
+    assert callable(strategy)
+
+    max_moves = 100
+    valid_moves = 0  # Track successful moves
+
+    while game.state() == "ongoing" and valid_moves < max_moves:
+        try:
+            board = game.board()
+            initial = game.initial_board()
+            result = strategy(board, initial)
+
+            # Validate result format
+            if not isinstance(result, (tuple, list)) or len(result) != 3:
+                # Invalid format = immediate fail, but return valid moves made
+                return valid_moves, "failed"
+
+            row, col, num = result
+
+            # Validate types
+            if not all(isinstance(x, int) for x in [row, col, num]):
+                return valid_moves, "failed"
+
+            # Try to place number
+            success = game.place_number(row, col, num)
+
+            if success:
+                valid_moves += 1  # Count this valid move
+            else:
+                # Invalid move = game fails, but return valid_moves made so far
+                return valid_moves, "failed"
+
+        except Exception:
+            return valid_moves, "failed"
+
+    if valid_moves >= max_moves and game.state() == "ongoing":
+        return valid_moves, "failed"
+
+    return valid_moves, game.state()
+
+
+# To leave room for longer strategies during reinforcement learning, each strategy gets a 10 second time limit.
+
+# In[ ]:
+
+
+@execute_with_time_limit(10)
+def execute_strategy(strategy: Callable, game: SudokuGame):
+    """Execute strategy with 10 second time limit."""
+    return _execute_strategy(strategy, game)
+
+
+# Test with a simple strategy:
+
+# In[ ]:
+
+
+def simple_strategy(board, initial):
+    """Simple strategy: fill first empty cell with 7."""
+    for r in range(9):
+        for c in range(9):
+            if board[r][c] == 0 and initial[r][c] == 0:
+                return (r, c, 7)
+    return (0, 0, 7)
+
+game = SudokuGame(difficulty = 30, seed = 42)
+try:
+    moves, state = execute_strategy(simple_strategy, game)
+    print(f"Moves: {moves}, State: {state}")
+except TimeoutError as e:
+    print(f"Timed out: {e}")
+
+
+# In[ ]:
+
+
+print(game.pretty())
+
+
+# # Code Execution
+# 
+# Before we create and execute a new Python function, we first have to check that it does not use global variables or otherwise cheat. This is called countering reward hacking, since we don't want the function to cheat.
+# 
+# For example, the piece of code below is fine, since it uses only built-in Python functions and imports nothing. We check it with `check_python_modules`:
+
+# In[ ]:
+
+
+from unsloth import check_python_modules, create_locked_down_function
+
+# Test safe code
+sample = """
+def strategy(board, initial):
+    for r in range(9):
+        for c in range(9):
+            if board[r][c] == 0:
+                return (r, c, 1)
+    return (0, 0, 1)
+"""
+
+ok, info = check_python_modules(sample)
+print("Safe Python code?", ok)
+print(info)
+
+
+# For the below piece of code, since we import `numpy`, we should not allow the execution:
+
+# In[ ]:
+
+
+sample = """
+def strategy(board, initial):
+    import numpy as np
+    return (0, 0, 1)
+"""
+
+ok, info = check_python_modules(sample)
+print("Safe Python code?", ok)
+print(info)
+
+
+# # Data & RL task setup
+# 
+# Create the prompt that instructs the model to generate a Sudoku solving strategy. You can customize this to some other task for another RL task.
+
+# In[ ]:
+
+
+prompt = """
+Create a Sudoku solving strategy using only native Python built-in functions without any import statements.
+You are given two lists of lists (9x9 grids):
+- board: current state (0 means empty)
+- initial: starting puzzle (0 means was empty, numbers are fixed)
+
+Return a tuple (row, col, number) for the next move.
+- row: 0-8 (row index)
+- col: 0-8 (column index)
+- number: 1-9 (digit to place)
+
+Only place numbers in cells that are BOTH empty in initial AND empty in board (initial[row][col] == 0 AND board[row][col] == 0)
+Use Sudoku rules: no duplicates in rows, columns, or 3x3 boxes.
+Output your function in backticks:
+```python
+def strategy(board, initial):
+    # Your logic here
+    return (row, col, number)
+```
+All helper functions must be inside def strategy. Output only the function.
+""".strip()
+
+print(prompt)
+
+
+# First, let's prompt the model without RL and see how it goes:
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+print("=" * 50)
+print("BASE MODEL OUTPUT (before RL training):")
+print("=" * 50)
+
+inputs = tokenizer(
+    text = text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+text_streamer = TextStreamer(tokenizer, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# # Reward functions
+# 
+# We now design an `extract_function` function, which simply extracts the function wrapped in triple backticks.
+# 
+# And 3 reward functions:
+# 
+# 1. `function_works` which rewards the model if the strategy is a valid Python function.
+# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
+# 3. `strategy_succeeds` which checks if the auto-generated strategy actually solves the Sudoku puzzle when it is run.
+
+# In[ ]:
+
+
+def extract_function(text):
+    """Extract Python function from markdown code blocks."""
+    if text.count("```") >= 2:
+        first = text.find("```") + 3
+        second = text.find("```", first)
+        fx = text[first:second].strip()
+        fx = fx.removeprefix("python\n")
+        fx = fx[fx.find("def"):]
+        if fx.startswith("def strategy(board, initial):"):
+            return fx
+    return None
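+
+
+# As a quick sanity check (a sketch with made-up replies, not real model output), `extract_function` should return the code for a well-formed reply and `None` when no fenced function is present:
+
+# In[ ]:
+
+
+# Hypothetical replies used only for illustration
+fence = "`" * 3
+good_reply = f"Here is my strategy:\n{fence}python\ndef strategy(board, initial):\n    return (0, 0, 1)\n{fence}"
+bad_reply = "I would scan each row for the first empty cell."
+
+print(extract_function(good_reply))
+print(extract_function(bad_reply))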
+
+
+# **Reward 1: Function Works**
+# 
+# Checks if the generated code is valid Python and can be executed.
+
+# In[ ]:
+
+
+def function_works(completions, **kwargs):
+    """Reward for generating valid executable Python code."""
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+
+        if function is not None:
+            ok, info = check_python_modules(function)
+
+        if function is None or "error" in info:
+            score = -2.0  # Invalid function
+        else:
+            try:
+                new_strategy = create_locked_down_function(function)
+                score = 1.0  # Valid function
+            except Exception:
+                score = -1.0  # Function has errors
+
+        scores.append(score)
+    return scores
+
+
+# **Reward 2: No Cheating**
+# 
+# Penalizes functions that import external libraries.
+
+# In[ ]:
+
+
+def no_cheating(completions, **kwargs):
+    """Penalize use of external imports."""
+    scores = []
+    for completion in completions:
+        response = completion[0]["content"]
+        function = extract_function(response)
+
+        if function is not None:
+            ok, info = check_python_modules(function)
+            scores.append(1.0 if ok else -20.0)  # Heavy penalty for cheating
+        else:
+            scores.append(-1.0)  # Failed to create function
+
+    return scores
+
+
+# **Reward 3: Strategy Succeeds**
+# 
+# Rewards strategies that successfully solve Sudoku puzzles.
+
+# In[ ]:
+
+
+import numpy as np
+
+# Counter used to print every 5th generated strategy
+PRINTER = 0
+
+def strategy_succeeds(completions, **kwargs):
+    """Reward valid moves even if strategy eventually fails."""
+    global PRINTER
+    scores = []
+
+    seed = np.random.randint(10000)
+    difficulty = 40
+    for completion in completions:
+        printed = False
+        response = completion[0]["content"]
+        function = extract_function(response)
+
+        if PRINTER % 5 == 0:
+            printed = True
+            print("\n" + "=" * 60)
+            print(function)
+            print("=" * 60)
+        PRINTER += 1
+
+        if function is not None:
+            ok, info = check_python_modules(function)
+
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+
+        try:
+            new_strategy = create_locked_down_function(function)
+        except Exception:
+            scores.append(0)
+            continue
+
+        try:
+            game = SudokuGame(difficulty = difficulty, seed = seed)
+            valid_moves, game_state = execute_strategy(new_strategy, game)
+            if valid_moves == difficulty:
+                game_state = "success"
+
+            print(f"\n Valid moves: {valid_moves}, Final state: {game_state}")
+
+            if not printed:
+                print("Strategy:")
+                print(function[:200] + "..." if len(function) > 200 else function)
+
+            print("\nFinal board:")
+            print(game.pretty())
+
+            if game_state == "success":
+                scores.append(30.0)  # Solved the puzzle!
+            elif valid_moves > 0:
+                # Reward based on valid moves made before failure
+                # Each valid move is worth 0.2 points
+                reward = valid_moves * 0.2
+                scores.append(reward)
+            else:
+                scores.append(-2.0)  # Failed immediately with no valid moves
+
+        except TimeoutError:
+            print("Timeout")
+            scores.append(-1.0)
+        except Exception as e:
+            print(f"Exception: {str(e)[:100]}")
+            scores.append(-3.0)
+
+    return scores
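+
+
+# Before training, it can help to run the reward functions by hand on a made-up completion. The sketch below assumes the cells above have been run; the completion is hypothetical and uses the same chat format the reward functions index into (`completion[0]["content"]`). To the best of our knowledge, TRL's `GRPOTrainer` adds the outputs of all reward functions together (equally weighted unless `reward_weights` is set), so the per-completion sum is what feeds the `reward` column.
+
+# In[ ]:
+
+
+# Hypothetical completion: always tries to place 1 in the first empty cell
+fence = "`" * 3
+fake_completions = [[{
+    "role": "assistant",
+    "content": f"{fence}python\ndef strategy(board, initial):\n    for r in range(9):\n        for c in range(9):\n            if board[r][c] == 0:\n                return (r, c, 1)\n    return (0, 0, 1)\n{fence}",
+}]]
+
+# Each reward function returns one score per completion
+rewards = [
+    reward_func(fake_completions)[0]
+    for reward_func in (function_works, no_cheating, strategy_succeeds)
+]
+print("Per-function rewards:", rewards)
+print("Summed reward:", sum(rewards))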
+
+
+# # Dataset Preparation
+# 
+# Create the training dataset.
+
+# In[ ]:
+
+
+from datasets import Dataset
+
+dataset = Dataset.from_list([
+    {
+        "prompt": [{"role": "user", "content": prompt.strip()}],
+        "answer": 0,
+    }
+] * 1000)
+
+maximum_length = len(tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    add_generation_prompt = True
+))
+
+print(f"Maximum prompt length: {maximum_length}")
+print("\nDataset sample:")
+print(dataset[0])
+
+
+# <a name="Train"></a>
+# ### Train the model
+# 
+# Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
+
+# In[ ]:
+
+
+# Leave room for the prompt (plus 1 token safety margin)
+max_completion_length = max_seq_length - (maximum_length + 1)
+
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    temperature = 1.0,
+    learning_rate = 5e-5,
+    weight_decay = 0.001,
+    warmup_ratio = 0.1,
+    lr_scheduler_type = "linear",
+    optim = "adamw_8bit",
+    logging_steps = 1,
+    per_device_train_batch_size = 1,
+    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
+    num_generations = 2, # Decrease if out of memory
+    max_completion_length = max_completion_length,
+    # num_train_epochs = 1, # Set to 1 for a full training run
+    max_steps = 60,
+    save_steps = 100,
+    report_to = "none", # Can use Weights & Biases, TrackIO
+    output_dir = "outputs",
+    epsilon = 0.2,
+    epsilon_high = 0.28, # one sided
+    delta = 1.5, # two sided
+    loss_type = 'bnpo',
+    mask_truncated_completions = True,
+    # For optional training + evaluation
+    # fp16_full_eval = True,
+    # per_device_eval_batch_size = 4,
+    # eval_accumulation_steps = 1,
+    # eval_strategy = "steps",
+    # eval_steps = 1,
+)
+
+
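+# And let's run the trainer! While it runs, it prints a table of rewards like the one below. The goal is to see the `reward` column increase!
+# 
+# You might have to wait 150 to 200 steps before the reward starts to move, and you'll probably get low reward for the first 100 steps. With `max_steps = 60` above, this run is only a quick demo, so increase `max_steps` for a real training run. Please be patient!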
+# 
+# | Step | Training Loss | reward    | reward_std | completion_length | kl       |
+# |------|---------------|-----------|------------|-------------------|----------|
+# | 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
+# | 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
+# | 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
+
+# In[ ]:
+
+
+# For optional training + evaluation
+# new_dataset = dataset.train_test_split(test_size = 0.01)
+
+trainer = GRPOTrainer(
+    model = model,
+    processing_class = tokenizer,
+    reward_funcs = [
+        function_works,
+        no_cheating,
+        strategy_succeeds,
+    ],
+    args = training_args,
+    train_dataset = dataset,
+
+    # For optional training + evaluation
+    # train_dataset = new_dataset["train"],
+    # eval_dataset = new_dataset["test"],
+)
+
+
+# And let's train the model!
+# 
+# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
+
+# In[ ]:
+
+
+trainer.train()
+
+
+# Now that the LoRA has been trained with GRPO, we first save it!
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+
+
+# Verify LoRA is actually trained!
+
+# In[ ]:
+
+
+from safetensors import safe_open
+
+# The path must match the directory passed to save_pretrained above
+with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
+    # Verify both A and B are non zero
+    for key in f.keys():
+        tensor = f.get_tensor(key)
+        n_zeros = (tensor == 0).sum().item()  # Number of exactly-zero entries
+        assert n_zeros != tensor.numel(), f"{key} is all zeros"
+
+
+# <a name="Inference"></a>
+# # Inference
+# Now let's try the model we just trained!
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+
+_ = model.generate(
+    **tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
+    temperature = 1.0,
+    max_new_tokens = 512,
+    streamer = TextStreamer(tokenizer, skip_prompt = False),
+)
+
+
+# <a name="Save"></a>
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Merge to 16bit
+if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
+
+# Merge to 4bit
+if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
+
+# Just LoRA adapters
+if False:
+    model.save_pretrained("gemma_4_lora")
+    tokenizer.save_pretrained("gemma_4_lora")
+if False:
+    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
+# 
+# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
+# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
+# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
+# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
+# 
+# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+
+# In[ ]:
+
+
+# Save to 8bit Q8_0
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
+# Remember to go to https://huggingface.co/settings/tokens for a token!
+# And change HF_USERNAME to your username!
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
+
+# Save to 16bit GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
+
+# Save to q4_k_m GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
+
+# Save to multiple GGUF options - much faster if you want multiple!
+if False:
+    model.push_to_hub_gguf(
+        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
+        tokenizer,
+        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,478 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+from huggingface_hub import snapshot_download
+
+fourbit_models = [
+    # Gemma 4 models
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B-it",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E4B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **processor.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        do_sample = False,
+        streamer = TextStreamer(processor, skip_prompt = True),
+    )
+
+
+# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
+
+# In[5]:
+
+
+from datasets import load_dataset,Audio,concatenate_datasets
+
+dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
+
+# Select a single audio sample to reserve for testing.
+# This index is chosen from the full dataset before we create the smaller training split.
+test_audio = dataset[7546]
+
+dataset = dataset.select(range(3000))
+
+dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
+
+
+# In[6]:
+
+
+from IPython.display import Audio, display
+print(test_audio['text'])
+Audio(test_audio['audio']['array'],rate = test_audio['audio']['sampling_rate'])
+
+
+# And the translation of the audio from German to English is:
+# 
+# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
+
+# In[7]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
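+
+# For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the toy strings are assumptions for illustration. The notebook does not say how the figure above was computed, so a different text normalization would give a different number.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Toy example: one substituted word and one missing word out of four.
print(word_error_rate("a b c d", "a x c"))
```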
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision, text, and audio parts.
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[8]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # False if not finetuning vision layers
+    finetune_language_layers   = True,  # False if not finetuning language layers
+    finetune_attention_modules = True,  # False if not finetuning attention layers
+    finetune_mlp_modules       = True,  # False if not finetuning MLP layers
+
+    r = 8,                              # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 16,                    # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,                 # We support rank stabilized LoRA
+    loftq_config = None,                # And LoftQ
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+
+        # Audio layers
+        "post", "linear_start", "linear_end",
+        "embedding_projection",
+        "ffw_layer_1", "ffw_layer_2",
+        "output_proj",
+    ]
+)
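+
+# To make the "small number of parameters" claim concrete, here is a back-of-the-envelope sketch. A LoRA adapter on a `d_out x d_in` weight adds two low-rank matrices, A (`r x d_in`) and B (`d_out x r`), and scales their product by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The helper names and the 2048 x 2048 layer size are hypothetical and are not taken from the Gemma 4 config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


# Hypothetical projection size, chosen only for illustration.
d_in, d_out = 2048, 2048
full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, r=8)

print(f"full weight:  {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({adapter_params / full_params:.2%} of the full weight)")
print(f"scaling with r = 8, lora_alpha = 16: {lora_scaling(16, 8)}")
```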
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
+# 
+# ```
+# <bos><|turn>system
+# You are an assistant that transcribes speech accurately.<turn|>
+# <|turn>user
+# <|audio|>Please transcribe this audio.<turn|>
+# <|turn>model
+# Ich, ich rechne direkt mich an.<turn|>
+# ```
+
+# In[9]:
+
+
+def format_intersection_data(samples: dict) -> dict[str, list]:
+    """Format a batch of audio-text samples into the chat message format"""
+    formatted_samples = {"messages": []}
+    for idx in range(len(samples["audio"])):
+        audio = samples["audio"][idx]["array"]
+        label = str(samples["text"][idx])
+
+        message = [
+            {
+                "role": "system",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "You are an assistant that transcribes speech accurately.",
+                    }
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "audio", "audio": audio},
+                    {"type": "text", "text": "Please transcribe this audio."}
+                ]
+            },
+            {
+                "role": "assistant",
+                "content":[{"type": "text", "text": label}]
+            }
+        ]
+        formatted_samples["messages"].append(message)
+    return formatted_samples
+
+
+# In[10]:
+
+
+dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and remove the `max_steps` limit.
+
+# In[11]:
+
+
+# Use UnslothVisionDataCollator which handles audio token alignment correctly
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 8,
+        gradient_accumulation_steps = 1,
+        warmup_ratio = 0.03,
+        # num_train_epochs = 1, # Use for full training runs
+        max_steps = 60,
+        learning_rate = 5e-5,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none",
+        remove_unused_columns = False,
+
+        # The below are a must for audio finetuning:
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 8192,
+    )
+)
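+
+# As a rough sketch of what the settings above imply: with `max_steps = 60`, `warmup_ratio = 0.03`, and a cosine schedule, the learning rate ramps up linearly for the first couple of steps and then decays towards zero. The function `lr_at_step` is a hypothetical helper that mirrors the usual linear-warmup plus cosine-decay formula. It assumes the warmup length is rounded up to a whole number of steps, and the trainer's exact values may differ slightly.

```python
import math


def lr_at_step(step, max_steps=60, peak_lr=5e-5, warmup_ratio=0.03):
    """Learning rate after `step` optimizer steps (linear warmup, cosine decay)."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Samples consumed: batch size 8 x gradient accumulation 1 x 60 steps.
samples_seen = 8 * 1 * 60

for step in (0, 1, 2, 30, 60):
    print(step, lr_at_step(step))
print("samples seen:", samples_seen)
```

+# With 3000 rows in the training split, 60 steps at this batch size cover only part of the data, which is why a full epoch is suggested for real runs.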
+
+
+# In[12]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
+
+# In[13]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[14]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
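+
+# The arithmetic in the cell above can be checked without a GPU. This sketch repeats the same byte-to-GB conversion and percentage calculations on made-up numbers. The function name `memory_report` and the 16 / 4 / 10 GB figures are hypothetical.

```python
def memory_report(peak_bytes: int, start_bytes: int, total_bytes: int) -> dict:
    """Same conversions as the stats cell: bytes -> GB (1024^3) and percentages."""
    gb = 1024 ** 3
    peak = round(peak_bytes / gb, 3)
    start = round(start_bytes / gb, 3)
    total = round(total_bytes / gb, 3)
    training = round(peak - start, 3)  # memory attributed to training itself
    return {
        "peak_gb": peak,
        "training_gb": training,
        "peak_pct": round(peak / total * 100, 3),
        "training_pct": round(training / total * 100, 3),
    }


# Hypothetical GPU: 16 GB total, 4 GB reserved before training, 10 GB at peak.
report = memory_report(10 * 1024 ** 3, 4 * 1024 ** 3, 16 * 1024 ** 3)
print(report)
```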
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
+
+# In[15]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[16]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
+
+# In[17]:
+
+
+if False:
+    from unsloth import FastModel
+    model, processor = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(processor, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[18]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", processor)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[19]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", processor,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[20]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[21]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `Q8_0`, `F16` or `BF16` GGUF file in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,557 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E4B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 1024, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True),
+        use_cache = True
+    )
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Gemma 4 can also hear!
+
+# In[7]:
+
+
+from IPython.display import Audio, display
+Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
+
+
+# In[8]:
+
+
+get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
+
+
+# In[9]:
+
+
+audio_file = "audio.mp3"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "text",  "text" : "What is this audio about?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's combine all 3 modalities together!
+
+# In[10]:
+
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "What is this audio and image about? "\
+                                    "How are they related?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
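+
+
+# The three cells above all build the same message shape by hand: a single `user` turn whose `content` is a list of typed parts. Below is a minimal sketch of a builder for that shape. `make_user_message` is a hypothetical helper written for illustration (it is not part of Unsloth or transformers), and the audio, image, text ordering simply mirrors the combined example above.
+

```python
def make_user_message(text, image=None, audio=None):
    """Build one user turn in the typed-content format used by the cells above.

    Hypothetical helper: `image` and `audio` are passed through untouched
    (a URL or a local path, as in the cells above).
    """
    content = []
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    if image is not None:
        content.append({"type": "image", "image": image})
    # The text instruction goes last, after any media parts.
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


messages = make_user_message(
    "What is this audio and image about? How are they related?",
    image="sloth.jpg",  # placeholder path, not a real file
    audio="audio.mp3",
)
print([part["type"] for part in messages[0]["content"]])
```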
+
+
+# # Let's finetune Gemma 4!
+# 
+# For now you can choose whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[11]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
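+
+
+# To get a feel for what `r` and `lora_alpha` control, here is a small worked example. In standard LoRA each adapted linear layer gets two low-rank matrices, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`, and the update `B @ A` is scaled by `lora_alpha / r`. The 4096 x 4096 layer size is a hypothetical value chosen for illustration, not a Gemma 4 dimension.
+

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


d_in = d_out = 4096   # hypothetical layer size, for illustration only
r, lora_alpha = 8, 8  # the values used in the cell above

added = lora_param_count(d_in, d_out, r)
frozen = d_in * d_out
print(f"LoRA adds {added:,} trainable params next to {frozen:,} frozen ones ({added / frozen:.2%})")
print(f"Update scaling = {lora_scaling(lora_alpha, r)}")
```
+
+# Doubling `r` doubles the adapter size, and keeping `lora_alpha == r` keeps the scaling factor at 1.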
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
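+# 
+# As a rough illustration of that layout, here is a minimal pure-Python sketch that reproduces the two-turn example above. `render_gemma4_turns` is a hypothetical helper, and the exact whitespace is inferred from the rendered sample only. The real chat template returned by `get_chat_template` remains the source of truth.
+

```python
def render_gemma4_turns(messages, bos="<bos>"):
    """Render a list of {"role", "content"} dicts in the turn layout shown above.

    Illustration only: the real Jinja chat template may differ in whitespace
    and also handles system prompts, images, audio and tool calls.
    """
    role_names = {"assistant": "model"}  # the layout above labels replies as `model`
    parts = [bos]
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        parts.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    return "".join(parts)


print(render_gemma4_turns([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]))
```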
+
+# In[12]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[13]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[14]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[15]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.
+
+# In[16]:
+
+
+def formatting_prompts_func(examples):
+   convos = examples["conversations"]
+   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
+   return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
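+
+
+# A small sketch of why the prefix is stripped, using a toy string rather than a real tokenizer. `str.removeprefix` (Python 3.9+) removes at most one leading occurrence, so the text ends up with exactly one `<bos>` once something prepends its own. The "tokenizer" here is simulated by plain string concatenation, which is an assumption made for illustration.
+

```python
BOS = "<bos>"

rendered = "<bos><|turn>user\nHello<turn|>\n"  # toy chat-template output

# removeprefix strips at most one leading occurrence and returns the
# string unchanged when the prefix is absent.
stripped = rendered.removeprefix(BOS)

# Simulate a tokenizer that always prepends its own BOS token.
with_strip = BOS + stripped
without_strip = BOS + rendered

print(with_strip.count(BOS), without_strip.count(BOS))
```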
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[17]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.
+
+# In[18]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
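+
+
+# As a quick sanity check on the settings above, this worked example computes how much data the 60-step run actually sees. It assumes a single GPU (`num_devices = 1` is an assumption, not something set in the config) and uses the 3000 rows loaded earlier.
+

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
dataset_rows = 3000  # rows loaded from FineTome-100k above
num_devices = 1      # assumption: a single GPU

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps
fraction_of_epoch = samples_seen / dataset_rows
steps_per_epoch = -(-dataset_rows // effective_batch_size)  # ceiling division

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen} ({fraction_of_epoch:.0%} of one epoch)")
print(f"Optimizer steps for one full epoch: {steps_per_epoch}")
```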
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
+
+# In[19]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
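+
+
+# To make the idea of response-only training concrete, here is a minimal sketch of label masking on plain lists of token ids. It is not Unsloth's implementation: `mask_to_responses` and `find_sub` are hypothetical helpers, and the toy ids stand in for the tokenized `instruction_part` and `response_part` markers. Every label outside a response span is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default.
+

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss


def find_sub(seq, sub, start=0):
    """Return the first index >= start where `sub` occurs in `seq`, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_to_responses(input_ids, instruction_ids, response_ids):
    """Copy input_ids into labels only inside response spans."""
    if not response_ids:
        raise ValueError("response_ids must not be empty")
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)  # first token of the reply
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)  # last reply runs to the end of the sample
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Toy ids: [1, 2] plays the user marker, [1, 3] plays the model marker.
USER, MODEL = [1, 2], [1, 3]
ids = [0] + USER + [10, 11] + MODEL + [20, 21] + USER + [12] + MODEL + [22]
print(mask_to_responses(ids, USER, MODEL))
```
+
+# Only the ids that follow a model marker (20, 21 and 22 in the toy data) keep their labels, so the loss is computed on the replies alone.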
+
+
+# Let's verify that the instruction part is masked! Let's print row 100 again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[20]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[21]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[22]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
+
+# In[23]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[24]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[25]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
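+
+
+# The sampling settings used above can be illustrated on a toy distribution. The sketch below applies temperature, then top-k, then top-p (nucleus) filtering in pure Python. It is a simplified illustration under the assumption that the filters are applied in that order with renormalisation after top-k. `filter_probs` is a hypothetical helper, not the code path `model.generate` actually runs.
+

```python
import math


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature rescales logits before the softmax; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most likely tokens, then renormalise over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_mass = sum(probs[i] for i in ranked)

    # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [math.log(p) for p in (0.6, 0.3, 0.08, 0.02)]
print(filter_probs(toy_logits))           # nucleus filtering drops the rarest token
print(filter_probs(toy_logits, top_k=2))  # top-k keeps only the two likeliest
```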
+
+
+# You can also use a `TextStreamer` for streaming inference, so you can see the generation token by token instead of waiting for the whole output!
+
+# In[26]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[27]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
+
+# In[28]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[29]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[30]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[31]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you instead want to push the GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[32]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file (for example the `Q8_0` export from the cell above) in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[3]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-E4B-it",
+    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[4]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[5]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take an overview of the dataset. We'll examine the image at index 2 and its corresponding caption.
+
+# In[6]:
+
+
+dataset
+
+
+# In[7]:
+
+
+dataset[2]["image"]
+
+
+# In[8]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[9]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[3]["text"]
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": sample["text"]},
+#         ],
+#     },
+# ]
+# ```
+
+# In[10]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[11]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[12]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and apply it to our model's processor
+
+# In[13]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[14]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[15]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
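+
+
+# To see what `warmup_ratio = 0.03` and `lr_scheduler_type = "cosine"` mean over a 60-step run, here is a small sketch of a linear-warmup plus cosine-decay schedule. It assumes the warmup step count is rounded up and that the decay is a single half cosine down to zero. `lr_at` is a hypothetical helper, and the scheduler inside transformers remains the source of truth.
+

```python
import math


def lr_at(step, base_lr=2e-4, max_steps=60, warmup_ratio=0.03):
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)  # 60 * 0.03 = 1.8 -> 2
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 2, 31, 60):
    print(step, f"{lr_at(step):.2e}")
```
+
+# The rate peaks right after warmup, passes through half of `learning_rate` at the midpoint of the decay, and reaches zero at the final step.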
+
+
+# In[16]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[17]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[18]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[19]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[20]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
+
+# In[21]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": instruction,  # sample["text"] is the ground-truth LaTeX, not a prompt
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[22]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).