AI Trainer

A Python application for fine-tuning unsloth models on code from GitHub repositories. Supports both Qwen2.5-Coder and Qwen3 models, optimized for an RTX3070 with 8GB of VRAM.

Supported Models

1. Qwen2.5-Coder-7B-Instruct (Default)

  • Model: unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
  • Best for: Code generation, code completion, programming tasks
  • Memory Usage: Moderate (~6-7GB VRAM)
  • Config: configs/training_config.yaml

2. Qwen3-8B

  • Model: unsloth/Qwen3-8B-bnb-4bit
  • Best for: General instruction following, broader language tasks
  • Memory Usage: Higher (~7-8GB VRAM)
  • Config: configs/training_config_qwen3.yaml

Features

  • Dataset Processing: Automatically processes code from GitHub repositories (see the sketch after this list)
  • Memory Optimized: Designed for RTX3070 8GB VRAM with no CPU offloading
  • Configurable Training: YAML-based configuration system
  • Progress Logging: Comprehensive logging and monitoring
  • Modular Design: Clean separation of concerns with dataset processing, training, and utilities
  • Multi-Model Support: Easy switching between different model architectures
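
Dataset processing is handled by src/dataset_processor.py. As a rough illustration of the idea only (the helper below and its record format are hypothetical, not the project's actual API), turning a cloned repository into training records might look like:

from pathlib import Path

# Hypothetical sketch: walk a cloned repository and turn its source
# files into instruction-style training records. The real logic lives
# in src/dataset_processor.py and may differ.
SOURCE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".java"}

def collect_training_records(repo_dir: str) -> list:
    records = []
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in SOURCE_EXTENSIONS:
            code = path.read_text(encoding="utf-8", errors="ignore")
            records.append({
                "instruction": f"Explain the code in {path.name}.",
                "output": code,
            })
    return records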

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (tested with RTX3070 8GB VRAM)
  • Git
  • Dependencies listed in requirements.txt

Private Repository Support

The application supports processing private GitHub repositories using a GitHub token for authentication. To use this feature:

  1. Generate a GitHub personal access token with appropriate permissions
  2. Pass the token using the --github_token command line argument
  3. Use private repository URLs in the same format as public repositories

Supported URL formats for private repositories (all three are normalized in the sketch after this list):

  • https://github.com/user/private-repo.git
  • github.com/user/private-repo
  • user/private-repo
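
A common way to authenticate a clone with a token is to embed it in the HTTPS URL. The following is a minimal sketch of how the three formats might be normalized and cloned; the normalize_repo_url and clone_repo helpers are hypothetical, not the project's actual code:

import subprocess
from typing import Optional

def normalize_repo_url(repo: str, token: Optional[str] = None) -> str:
    # Accept "user/repo", "github.com/user/repo", or a full HTTPS URL
    # and return a clone URL, embedding the token when one is given.
    for prefix in ("https://", "github.com/"):
        if repo.startswith(prefix):
            repo = repo[len(prefix):]
    if repo.endswith(".git"):
        repo = repo[:-len(".git")]
    auth = f"{token}@" if token else ""
    return f"https://{auth}github.com/{repo}.git"

def clone_repo(repo: str, dest: str, token: Optional[str] = None) -> None:
    # Shells out to git; the token ends up in the URL, so avoid logging it.
    subprocess.run(["git", "clone", normalize_repo_url(repo, token), dest], check=True)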

Installation

  1. Clone this repository
  2. If you have a CUDA GPU, install PyTorch with CUDA support:
    pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
    
  3. Install dependencies:
    pip install -r requirements.txt
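
After installation, you can verify that PyTorch sees the GPU:

import torch
print(torch.cuda.is_available())      # True if the CUDA build is working
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3070"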
    

Usage

Training Qwen2.5-Coder-7B (Default)

# Using the main script
python src/main.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2 \
    --config configs/training_config.yaml \
    --output_dir ./models \
    --log_level INFO

# Or using the runner script
python run_training.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2

# Using private repositories with a GitHub token
python run_training.py \
    --repo1 https://github.com/user/private-repo1 \
    --repo2 https://github.com/user/private-repo2 \
    --github_token YOUR_GITHUB_TOKEN

Training Qwen3-8B

# Using the main script with Qwen3 config
python src/main.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2 \
    --config configs/training_config_qwen3.yaml \
    --output_dir ./models \
    --log_level INFO

# Or using the dedicated Qwen3 runner
python run_training_qwen3.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2

# Using private repositories with a GitHub token
python run_training_qwen3.py \
    --repo1 https://github.com/user/private-repo1 \
    --repo2 https://github.com/user/private-repo2 \
    --github_token YOUR_GITHUB_TOKEN

Command Line Arguments

  • --repo1: First GitHub repository URL (required)
  • --repo2: Second GitHub repository URL (required)
  • --config: Path to training configuration file (default: configs/training_config.yaml)
  • --output_dir: Directory to save trained model (default: ./models)
  • --log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
  • --github_token: GitHub token for accessing private repositories (optional)
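
Wired up with argparse, the interface implied by these flags would look roughly like this (a sketch, not necessarily the exact code in src/main.py):

import argparse

parser = argparse.ArgumentParser(
    description="Train an unsloth model on code from two GitHub repositories")
parser.add_argument("--repo1", required=True, help="First GitHub repository URL")
parser.add_argument("--repo2", required=True, help="Second GitHub repository URL")
parser.add_argument("--config", default="configs/training_config.yaml",
                    help="Path to training configuration file")
parser.add_argument("--output_dir", default="./models",
                    help="Directory to save the trained model")
parser.add_argument("--log_level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"])
parser.add_argument("--github_token", default=None,
                    help="GitHub token for accessing private repositories")
args = parser.parse_args()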

Project Structure

ai_trainer/
├── src/
│   ├── __init__.py
│   ├── main.py              # Main entry point
│   ├── trainer.py           # Model training logic
│   ├── dataset_processor.py # GitHub repository processing
│   ├── config.py            # Configuration management
│   └── utils.py             # Utility functions
├── configs/
│   ├── training_config.yaml       # Qwen2.5-Coder training configuration
│   └── training_config_qwen3.yaml # Qwen3-8B training configuration
├── data/
│   └── processed/           # Processed datasets
├── models/                  # Trained models
├── logs/                    # Training logs
├── requirements.txt
└── README.md

Memory Optimization

This application is specifically optimized for RTX3070 8GB VRAM:

  • Uses 4-bit quantization (bnb-4bit)
  • Gradient checkpointing enabled
  • No CPU offloading
  • Optimized batch sizes for 8GB VRAM
  • Memory-efficient data loading
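
In unsloth's API, the 4-bit load and gradient checkpointing correspond roughly to the sketch below (the LoRA values are illustrative, not necessarily this project's exact settings):

from unsloth import FastLanguageModel

# Load the model in 4-bit (bnb-4bit) so it fits in 8GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with unsloth's gradient checkpointing,
# which trades recomputation for lower activation memory.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # illustrative LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)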

Configuration

Qwen2.5-Coder-7B Configuration

File: configs/training_config.yaml

model:
  name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  learning_rate: 2.0e-4
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.85

Qwen3-8B Configuration

File: configs/training_config_qwen3.yaml

model:
  name: "unsloth/Qwen3-8B-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 1  # More conservative
  gradient_accumulation_steps: 8  # Higher accumulation
  learning_rate: 1.0e-4  # Lower learning rate
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.95  # More aggressive memory usage
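
src/config.py is responsible for loading these files; with PyYAML that amounts to something like the following sketch (the actual module may add validation and defaults):

import yaml

def load_config(path: str) -> dict:
    # Parse the YAML training configuration into nested dicts.
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config("configs/training_config_qwen3.yaml")
print(config["training"]["per_device_train_batch_size"])  # -> 1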

Key Differences

Setting                Qwen2.5-Coder   Qwen3-8B   Reason
Batch Size             2               1          Larger model needs smaller batches
Gradient Accumulation  4               8          Maintains the effective batch size
Learning Rate          2e-4            1e-4       Larger model needs a more conservative LR
Memory Usage           85%             95%        Qwen3 can use more VRAM
Effective Batch Size   8               8          Same training dynamics

The effective batch size is per_device_train_batch_size × gradient_accumulation_steps, so both configurations see 8 examples per optimizer step (2 × 4 = 1 × 8 = 8).

Model Selection Guide

Choose Qwen2.5-Coder-7B when:

  • You want to fine-tune specifically for code generation tasks
  • Working with programming languages and technical content
  • Need code completion and code understanding capabilities
  • Prefer moderate memory usage (~6-7GB VRAM)

Choose Qwen3-8B when:

  • You need general instruction following capabilities
  • Working with mixed content (code + natural language)
  • Want broader language understanding and generation
  • Have sufficient VRAM (~7-8GB) and prefer the newer architecture

License

MIT License