AI Trainer

A Python application for fine-tuning unsloth models on code from GitHub repositories. Supports both Qwen2.5-Coder and Qwen3 models, optimized for an RTX3070 with 8GB of VRAM.

Supported Models

1. Qwen2.5-Coder-7B-Instruct (Default)

  • Model: unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
  • Best for: Code generation, code completion, programming tasks
  • Memory Usage: Moderate (~6-7GB VRAM)
  • Config: configs/training_config.yaml

2. Qwen3-8B

  • Model: unsloth/Qwen3-8B-bnb-4bit
  • Best for: General instruction following, broader language tasks
  • Memory Usage: Higher (~7-8GB VRAM)
  • Config: configs/training_config_qwen3.yaml

Features

  • Dataset Processing: Automatically processes code from GitHub repositories (see the sketch after this list)
  • Memory Optimized: Designed for RTX3070 8GB VRAM with no CPU offloading
  • Configurable Training: YAML-based configuration system
  • Progress Logging: Comprehensive logging and monitoring
  • Modular Design: Clean separation of concerns with dataset processing, training, and utilities
  • Multi-Model Support: Easy switching between different model architectures
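
Dataset processing is handled by src/dataset_processor.py. As a rough illustration of the idea only (the helper below and its record format are hypothetical, not the project's actual API), turning a cloned repository into training records might look like:

from pathlib import Path

# Hypothetical sketch: walk a cloned repository and turn its source
# files into instruction-style training records. The real logic lives
# in src/dataset_processor.py and may differ.
SOURCE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".java"}

def collect_training_records(repo_dir: str) -> list:
    records = []
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in SOURCE_EXTENSIONS:
            code = path.read_text(encoding="utf-8", errors="ignore")
            records.append({
                "instruction": f"Explain the code in {path.name}.",
                "output": code,
            })
    return records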

Requirements

  • Python 3.8+
  • CUDA-compatible GPU (tested with RTX3070 8GB VRAM)
  • Git
  • Dependencies listed in requirements.txt

Private Repository Support

The application supports processing private GitHub repositories using a GitHub token for authentication. To use this feature:

  1. Generate a GitHub personal access token with appropriate permissions
  2. Pass the token using the --github_token command line argument
  3. Use private repository URLs in the same format as public repositories

Supported URL formats for private repositories (all three are normalized in the sketch after this list):

  • https://github.com/user/private-repo.git
  • github.com/user/private-repo
  • user/private-repo
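
A common way to authenticate a clone with a token is to embed it in the HTTPS URL. The following is a minimal sketch of how the three formats might be normalized and cloned; the normalize_repo_url and clone_repo helpers are hypothetical, not the project's actual code:

import subprocess
from typing import Optional

def normalize_repo_url(repo: str, token: Optional[str] = None) -> str:
    # Accept "user/repo", "github.com/user/repo", or a full HTTPS URL
    # and return a clone URL, embedding the token when one is given.
    for prefix in ("https://", "github.com/"):
        if repo.startswith(prefix):
            repo = repo[len(prefix):]
    if repo.endswith(".git"):
        repo = repo[:-len(".git")]
    auth = f"{token}@" if token else ""
    return f"https://{auth}github.com/{repo}.git"

def clone_repo(repo: str, dest: str, token: Optional[str] = None) -> None:
    # Shells out to git; the token ends up in the URL, so avoid logging it.
    subprocess.run(["git", "clone", normalize_repo_url(repo, token), dest], check=True)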

Installation

  1. Clone this repository
  2. If you have a CUDA GPU, install PyTorch with CUDA support:
    pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
    
  3. Install dependencies:
    pip install -r requirements.txt
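
After installation, you can verify that PyTorch sees the GPU:

import torch
print(torch.cuda.is_available())      # True if the CUDA build is working
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3070"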
    

Usage

Training Qwen2.5-Coder-7B (Default)

# Using the main script
python src/main.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2 \
    --config configs/training_config.yaml \
    --output_dir ./models \
    --log_level INFO

# Or using the runner script
python run_training.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2

# Using private repositories with a GitHub token
python run_training.py \
    --repo1 https://github.com/user/private-repo1 \
    --repo2 https://github.com/user/private-repo2 \
    --github_token YOUR_GITHUB_TOKEN

Training Qwen3-8B

# Using the main script with Qwen3 config
python src/main.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2 \
    --config configs/training_config_qwen3.yaml \
    --output_dir ./models \
    --log_level INFO

# Or using the dedicated Qwen3 runner
python run_training_qwen3.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2

# Using private repositories with a GitHub token
python run_training_qwen3.py \
    --repo1 https://github.com/user/private-repo1 \
    --repo2 https://github.com/user/private-repo2 \
    --github_token YOUR_GITHUB_TOKEN

Command Line Arguments

  • --repo1: First GitHub repository URL (required)
  • --repo2: Second GitHub repository URL (required)
  • --config: Path to training configuration file (default: configs/training_config.yaml)
  • --output_dir: Directory to save trained model (default: ./models)
  • --log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
  • --github_token: GitHub token for accessing private repositories (optional)
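
Wired up with argparse, the interface implied by these flags would look roughly like this (a sketch, not necessarily the exact code in src/main.py):

import argparse

parser = argparse.ArgumentParser(
    description="Train an unsloth model on code from two GitHub repositories")
parser.add_argument("--repo1", required=True, help="First GitHub repository URL")
parser.add_argument("--repo2", required=True, help="Second GitHub repository URL")
parser.add_argument("--config", default="configs/training_config.yaml",
                    help="Path to training configuration file")
parser.add_argument("--output_dir", default="./models",
                    help="Directory to save the trained model")
parser.add_argument("--log_level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"])
parser.add_argument("--github_token", default=None,
                    help="GitHub token for accessing private repositories")
args = parser.parse_args()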

Project Structure

ai_trainer/
├── src/
│   ├── __init__.py
│   ├── main.py              # Main entry point
│   ├── trainer.py           # Model training logic
│   ├── dataset_processor.py # GitHub repository processing
│   ├── config.py            # Configuration management
│   └── utils.py             # Utility functions
├── configs/
│   ├── training_config.yaml       # Qwen2.5-Coder training configuration
│   └── training_config_qwen3.yaml # Qwen3-8B training configuration
├── data/
│   └── processed/           # Processed datasets
├── models/                  # Trained models
├── logs/                    # Training logs
├── requirements.txt
└── README.md

Memory Optimization

This application is specifically optimized for RTX3070 8GB VRAM:

  • Uses 4-bit quantization (bnb-4bit)
  • Gradient checkpointing enabled
  • No CPU offloading
  • Optimized batch sizes for 8GB VRAM
  • Memory-efficient data loading
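
In unsloth's API, the 4-bit load and gradient checkpointing correspond roughly to the sketch below (the LoRA values are illustrative, not necessarily this project's exact settings):

from unsloth import FastLanguageModel

# Load the model in 4-bit (bnb-4bit) so it fits in 8GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with unsloth's gradient checkpointing,
# which trades recomputation for lower activation memory.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # illustrative LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)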

Configuration

Qwen2.5-Coder-7B Configuration

File: configs/training_config.yaml

model:
  name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  learning_rate: 2.0e-4
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.85

Qwen3-8B Configuration

File: configs/training_config_qwen3.yaml

model:
  name: "unsloth/Qwen3-8B-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 1  # More conservative
  gradient_accumulation_steps: 8  # Higher accumulation
  learning_rate: 1.0e-4  # Lower learning rate
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.95  # More aggressive memory usage
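
src/config.py is responsible for loading these files; with PyYAML that amounts to something like the following sketch (the actual module may add validation and defaults):

import yaml

def load_config(path: str) -> dict:
    # Parse the YAML training configuration into nested dicts.
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config("configs/training_config_qwen3.yaml")
print(config["training"]["per_device_train_batch_size"])  # -> 1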

Key Differences

Setting                Qwen2.5-Coder   Qwen3-8B   Reason
Batch Size             2               1          Larger model needs smaller batches
Gradient Accumulation  4               8          Maintains the effective batch size
Learning Rate          2e-4            1e-4       Larger model needs a more conservative LR
Memory Usage           85%             95%        Qwen3 can use more VRAM
Effective Batch Size   8               8          Same training dynamics

The effective batch size is per_device_train_batch_size × gradient_accumulation_steps, so both configurations see 8 examples per optimizer step (2 × 4 = 1 × 8 = 8).

Model Selection Guide

Choose Qwen2.5-Coder-7B when:

  • You want to fine-tune specifically for code generation tasks
  • Working with programming languages and technical content
  • Need code completion and code understanding capabilities
  • Prefer moderate memory usage (~6-7GB VRAM)

Choose Qwen3-8B when:

  • You need general instruction following capabilities
  • Working with mixed content (code + natural language)
  • Want broader language understanding and generation
  • Have sufficient VRAM (~7-8GB) and prefer the newer architecture

License

MIT License