# AI Trainer for Qwen Models on Google Colab (T4 GPU)

This notebook allows you to train Qwen models on GitHub repositories using Google Colab's T4 GPU with 13GB VRAM.

## 1. Setup Environment

First, let's install the required dependencies.

In [2]:
# Install required packages
!pip install unsloth bitsandbytes
!pip install transformers datasets
!pip install accelerate peft
!pip install GitPython PyYAML

Collecting unsloth
  Downloading unsloth-2025.8.9-py3-none-any.whl.metadata (52 kB)
[?25l     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/52.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m52.3/52.3 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting bitsandbytes
  Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Collecting unsloth_zoo>=2025.8.8 (from unsloth)
  Downloading unsloth_zoo-2025.8.8-py3-none-any.whl.metadata (9.4 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.28-py3-none-any.whl.metadata (11 kB)
Collecting datasets<4.0.0,>=3.4.1 (from unsloth)
  Downloa

In [3]:
# Set environment variables for optimal GPU performance
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['DISABLE_TORCH_COMPILE'] = '1'

print("Environment variables set successfully!")

Environment variables set successfully!


## 2. Import Libraries

Let's import all necessary libraries.

In [4]:
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import git
from pathlib import Path

ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.
ü¶• Unsloth Zoo will now patch everything to make training faster!


## 3. Configuration

Configuration optimized for T4 GPU with 13GB VRAM.

In [22]:
# Model configuration
MODEL_NAME = "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 2048

# Training configuration for T4 GPU (13GB VRAM)
TRAINING_CONFIG = {
    'per_device_train_batch_size': 1,
    'gradient_accumulation_steps': 8,
    'max_steps': 100,
    'learning_rate': 2e-4,
    'use_gradient_checkpointing': True,
    'bf16': False
}

## 4. Load Model

Load the Qwen model with Unsloth for memory efficiency.

In [15]:
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=True,
)

# Configure model for training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=TRAINING_CONFIG['use_gradient_checkpointing'],
    random_state=3407,
)

==((====))==  Unsloth 2025.8.9: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## 5. Process GitHub Repositories

Extract code from GitHub repositories for training.

## 5.1 Advanced Dataset Processing

For more comprehensive dataset processing with support for multiple file types, you can use this advanced processor:

In [18]:
from unsloth.chat_templates import get_chat_template
from datasets import Dataset

class AdvancedDatasetProcessor:
    """Advanced processor for GitHub repositories with comprehensive file support"""

    # Supported file extensions
    CODE_EXTENSIONS = {
        '.py': 'python', '.js': 'javascript', '.ts': 'typescript',
        '.java': 'java', '.cpp': 'cpp', '.c': 'c', '.cs': 'csharp',
        '.php': 'php', '.rb': 'ruby', '.go': 'go', '.rs': 'rust',
        '.swift': 'swift', '.kt': 'kotlin', '.scala': 'scala',
        '.sql': 'sql', '.sh': 'bash', '.yaml': 'yaml', '.yml': 'yaml',
        '.json': 'json', '.xml': 'xml', '.html': 'html', '.css': 'css',
        '.md': 'markdown'
    }

    def __init__(self):
        pass

    def process_github_repos(self, repo_urls, max_files_per_repo=500000):
        """Process multiple GitHub repositories into a training dataset"""
        all_code_samples = []

        for repo_url in repo_urls:
            try:
                print(f"Processing repository: {repo_url}")
                repo_samples = self._process_single_repo(repo_url, max_files_per_repo)
                all_code_samples.extend(repo_samples)
                print(f"Extracted {len(repo_samples)} samples from {repo_url}")
            except Exception as e:
                print(f"Failed to process repository {repo_url}: {str(e)}")
                continue

        if not all_code_samples:
            raise ValueError("No code samples extracted from any repository")

        print(f"Total samples collected: {len(all_code_samples)}")

        # Create HuggingFace dataset
        from datasets import Dataset
        dataset = Dataset.from_list(all_code_samples)
        return dataset

    def _process_single_repo(self, repo_url, max_files_per_repo):
        """Process a single GitHub repository"""
        import tempfile

        with tempfile.TemporaryDirectory() as temp_dir:
            try:
                # Clone repository
                repo_name = repo_url.split('/')[-1].replace('.git', '')
                repo_path = f"{temp_dir}/{repo_name}"

                print(f"Cloning {repo_url}...")
                repo = git.Repo.clone_from(repo_url, repo_path, depth=1, branch="18.0")

                # Extract code samples
                code_samples = self._extract_code_samples(repo_path, max_files_per_repo)

                return code_samples

            finally:
                print(f"Finished processing {repo_url}")

    def _extract_code_samples(self, repo_path, max_files_per_repo):
        """Extract code samples from a repository"""
        code_samples = []
        repo_path_obj = Path(repo_path)

        # Find all code files
        code_files = []
        for ext in self.CODE_EXTENSIONS:
            code_files.extend(repo_path_obj.rglob(f'*{ext}'))

        print(f"Found {len(code_files)} code files")

        # Limit files per repo to prevent memory issues
        code_files = code_files[:max_files_per_repo]

        for code_file in code_files:
            try:
                if self._should_exclude_file(str(code_file.relative_to(repo_path))):
                    continue

                sample = self._process_code_file(code_file, repo_path_obj)
                if sample:
                    code_samples.append(sample)

            except Exception as e:
                print(f"Failed to process {code_file}: {str(e)}")
                continue

        return code_samples

    def _should_exclude_file(self, relative_path):
        """Check if a file should be excluded based on patterns"""
        import re
        exclude_patterns = [
            r'\.git/', r'__pycache__/', r'node_modules/',
            r'\.venv/', r'venv/', r'package-lock\.json$',
            r'\.log$', r'\.tmp$', r'~\$.*', r'\.swp$',
            r'\.DS_Store', r'\.pyc$'
        ]
        for pattern in exclude_patterns:
            if re.search(pattern, relative_path):
                return True
        return False

    def _process_code_file(self, file_path, repo_path):
        """Process a single code file into a training sample"""
        try:
            # Read file content
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()

            # Skip if file is too small or too large
            if len(content.strip()) < 10:
                return None
            if len(content) > 100000:  # Rough limit
                return None

            # Get relative path for context
            relative_path = file_path.relative_to(repo_path)

            # Determine language
            extension = file_path.suffix.lower()
            language = self.CODE_EXTENSIONS.get(extension, 'unknown')

            # Create training sample
            sample = {
                'text': content,
                'language': language,
                'file_path': str(relative_path),
                'repo_name': repo_path.name,
                'file_size': len(content),
                'line_count': len(content.splitlines())
            }

            return sample

        except Exception as e:
            print(f"Error processing {file_path}: {str(e)}")
            return None

    def _prepare_dataset(self, train_dataset: Dataset) -> Dataset:
        """Prepare and tokenize the dataset for Qwen2.5-Coder"""
        print("Preparing dataset for Qwen2.5-Coder...")

        # Apply chat template for Qwen2.5-Coder if available
        try:
            chat_template = get_chat_template("qwen")
            if chat_template and isinstance(chat_template, str):
                tokenizer.chat_template = chat_template
                print("Applied Qwen chat template from string")
            else:
                print(f"Invalid chat template received: {type(chat_template)}")
        except Exception as e:
            print(f"Could not apply Qwen chat template: {e}")
            # Fallback to default formatting
            pass

        def tokenize_function(examples):
            # Format examples as instruction-following pairs for code training
            formatted_texts = []
            for text in examples["text"]:
                # Create an instruction format appropriate for code training
                # For Qwen2.5-Coder, we can use a code completion or analysis format
                messages = [
                    {"role": "user", "content": "Analyze and understand the following code:"},
                    {"role": "assistant", "content": text}
                ]

                # Apply chat template if available, otherwise use simple formatting
                try:
                    # Log tokenizer state in multiprocessing context
                    import multiprocessing
                    if hasattr(tokenizer, 'chat_template'):
                        print(f"Tokenize function - chat_template type: {type(tokenizer.chat_template)}")

                    # Check if tokenizer has apply_chat_template method
                    if hasattr(tokenizer, 'apply_chat_template'):
                        formatted_text = tokenizer.apply_chat_template(
                            messages,
                            tokenize=False,
                            add_generation_prompt=False  # We're training on the full conversation
                        )
                    else:
                        print("Tokenizer does not have apply_chat_template method, using fallback")
                        formatted_text = f"<|im_start|>user\nAnalyze and understand the following code:<|im_end|>\n<|im_start|>assistant\n{text}<|im_end|>"

                except AttributeError as e:
                    if 'unsloth_push_to_hub' in str(e):
                        print(f"AttributeError in multiprocessing context: {e}")
                    elif 'padding_side' in str(e):
                        print(f"Chat template padding_side error: {e}")
                        print("Using fallback formatting due to chat template issue")
                        formatted_text = f"<|im_start|>user\nAnalyze and understand the following code:<|im_end|>\n<|im_start|>assistant\n{text}<|im_end|>"
                    else:
                        raise
                except Exception as e:
                    print(f"Error applying chat template: {e}, using fallback formatting")
                    # Fallback to simple formatting with special tokens
                    formatted_text = f"<|im_start|>user\nAnalyze and understand the following code:<|im_end|>\n<|im_start|>assistant\n{text}<|im_end|>"

                formatted_texts.append(formatted_text)

            # Tokenize with proper padding and truncation for Qwen2.5-Coder
            tokenized = tokenizer(
                formatted_texts,
                padding="max_length",
                truncation=True,
                max_length=MAX_SEQ_LENGTH,
                return_tensors="pt",
                add_special_tokens=True
            )

            # For causal language modeling, we need to create proper labels
            # Clone input_ids to create labels
            labels = tokenized["input_ids"].clone()

            # Try to mask the user part of the conversation
            # Find the assistant token to determine where the assistant response starts
            try:
                # Convert to string to find the assistant token
                decoded_tokens = tokenizer.batch_decode(tokenized["input_ids"], skip_special_tokens=False)
                for i, decoded in enumerate(decoded_tokens):
                    # Find where the assistant response starts
                    assistant_start = decoded.find("<|im_start|>assistant")
                    if assistant_start != -1:
                        # Find the actual token position
                        # We'll mask everything before the assistant response with -100
                        assistant_tokens = tokenizer("<|im_start|>assistant", add_special_tokens=False)["input_ids"]
                        if len(assistant_tokens) > 0:
                            # Find where the assistant token first appears
                            assistant_token_id = assistant_tokens[0]
                            assistant_positions = (tokenized["input_ids"][i] == assistant_token_id).nonzero(as_tuple=True)[0]
                            if len(assistant_positions) > 0:
                                # Mask everything before the assistant token
                                labels[i, :assistant_positions[0]] = -100
            except Exception as e:
                print(f"Could not mask user tokens: {e}")
                # Fallback: Just use the input_ids as labels
                pass

            tokenized["labels"] = labels
            return tokenized

        # Tokenize dataset
        tokenized_dataset = train_dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=["text", "language", "file_path", "repo_name", "file_size", "line_count"],
            desc="Tokenizing dataset for Qwen2.5-Coder"
        )

        print(f"Dataset tokenized for Qwen2.5-Coder. Size: {len(tokenized_dataset)}")
        return tokenized_dataset

# Example usage:
processor = AdvancedDatasetProcessor()
train_dataset = processor.process_github_repos(["https://github.com/odoo/odoo.git"])
tokenized_dataset = processor._prepare_dataset(train_dataset)

Processing repository: https://github.com/odoo/odoo.git
Cloning https://github.com/odoo/odoo.git...
Found 17769 code files
Finished processing https://github.com/odoo/odoo.git
Extracted 17631 samples from https://github.com/odoo/odoo.git
Total samples collected: 17631
Preparing dataset for Qwen2.5-Coder...
Could not apply Qwen chat template: 'str' object has no attribute 'padding_side'


Tokenizing dataset for Qwen2.5-Coder:   0%|          | 0/17631 [00:00<?, ? examples/s]

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function - chat_template type: <class 'str'>
Tokenize function

## 6. Training

Set up and run the training process.

In [20]:
try:
  if torch.cuda.is_available():
      torch.cuda.empty_cache()
      torch.cuda.synchronize()

except Exception as e:
  print(f"Warning: Failed to clear GPU cache: {str(e)}")

In [25]:
# Set up trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=TRAINING_CONFIG['per_device_train_batch_size'],
        gradient_accumulation_steps=TRAINING_CONFIG['gradient_accumulation_steps'],
        max_steps=TRAINING_CONFIG['max_steps'],
        learning_rate=TRAINING_CONFIG['learning_rate'],
        num_train_epochs=3,
        fp16=not TRAINING_CONFIG['bf16'],
        bf16=TRAINING_CONFIG['bf16'],
        logging_steps=1,
        save_steps=50,
        output_dir="./model_output",
        optim="adamw_torch",
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
    ),
)

In [26]:
# Start training
print("Starting training...")
trainer.train()
print("Training completed!")

Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 17,631 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 40,370,176 of 7,655,986,688 (0.53% trained)
  | |_| | '_ \/ _` / _` |  _/ -_)


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33msuherdy[0m ([33msuherdy-personal[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,9.1229
2,8.3787
3,11.6311
4,10.0826
5,12.3623
6,11.8164
7,11.1036
8,12.131
9,6.6401
10,6.0631


Training completed!


## 7. Save Model

Save the trained model.

In [None]:
# Save the model
model.save_pretrained("/content/drive/My Drive/trained_model")
tokenizer.save_pretrained("/content/drive/My Drive/trained_model")
#model.save_pretrained_gguf("/content/drive/My Drive/trained_model_quant", tokenizer) # Saves in GGUF format for Ollama
model.save_pretrained_gguf("/content/drive/My Drive/trained_model_quant", tokenizer, quantization_method = "q4_k_m")
#model.save_pretrained_gguf("/content/drive/My Drive/trained_model_quant", tokenizer, quantization_method = "q8_0")
print("Model saved successfully!")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.37 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 28/28 [03:00<00:00,  6.43s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /content/drive/My Drive/trained_model_quant/pytorch_model-00001-of-00004.bin...
