# AI Trainer
A Python application for training unsloth models on data from GitHub repositories. Supports both Qwen2.5-Coder and Qwen3 models, optimized for an RTX 3070 with 8GB VRAM.
## Supported Models
### 1. Qwen2.5-Coder-7B-Instruct (Default)
- **Model**: `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit`
- **Best for**: Code generation, code completion, programming tasks
- **Memory Usage**: Moderate (~6-7GB VRAM)
- **Config**: `configs/training_config.yaml`
### 2. Qwen3-8B
- **Model**: `unsloth/Qwen3-8B-bnb-4bit`
- **Best for**: General instruction following, broader language tasks
- **Memory Usage**: Higher (~7-8GB VRAM)
- **Config**: `configs/training_config_qwen3.yaml`
## Features
- **Dataset Processing**: Automatically processes code from GitHub repositories
- **Memory Optimized**: Designed for an RTX 3070 (8GB VRAM) with no CPU offloading
- **Configurable Training**: YAML-based configuration system
- **Progress Logging**: Comprehensive logging and monitoring
- **Modular Design**: Clean separation of concerns with dataset processing, training, and utilities
- **Multi-Model Support**: Easy switching between different model architectures
## Requirements
- Python 3.8+
- CUDA-compatible GPU (tested with an RTX 3070, 8GB VRAM)
- Git
- Dependencies listed in `requirements.txt`
## Private Repository Support
The application supports processing private GitHub repositories by authenticating with a GitHub token.
To use this feature:
1. Generate a GitHub personal access token with the appropriate permissions
2. Pass the token via the `--github_token` command line argument
3. Use private repository URLs in the same format as public repositories

Supported URL formats for private repositories:
- `https://github.com/user/private-repo.git`
- `github.com/user/private-repo`
- `user/private-repo`
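
For illustration, here is a minimal sketch of how these formats can be normalized into a token-authenticated clone URL (the helper name is hypothetical; the actual logic lives in `src/dataset_processor.py`):
```python
# Hypothetical sketch -- the real normalization lives in src/dataset_processor.py.
def to_clone_url(repo, github_token=None):
    """Turn any supported repo reference into a cloneable HTTPS URL."""
    repo = repo.replace("https://", "", 1).replace("github.com/", "", 1)
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]  # strip trailing ".git" -> "user/repo"
    if github_token:
        # Git reads the token from the URL's userinfo field.
        return "https://{}@github.com/{}.git".format(github_token, repo)
    return "https://github.com/{}.git".format(repo)
```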
## Installation
1. Clone this repository.
2. If you have a CUDA-capable GPU, install PyTorch with CUDA support:
```bash
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
```
3. Install the remaining dependencies:
```bash
pip install -r requirements.txt
```
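Before starting a run, you can verify that PyTorch sees the GPU:
```python
import torch

# Quick sanity check that CUDA is available and reports the expected VRAM.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print("VRAM: {:.1f} GB".format(props.total_memory / 1024 ** 3))
```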
## Usage
### Training Qwen2.5-Coder-7B (Default)
```bash
# Using the main script
python src/main.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2 \
--config configs/training_config.yaml \
--output_dir ./models \
--log_level INFO
# Or using the runner script
python run_training.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2
# Using private repositories with a GitHub token
python run_training.py \
--repo1 https://github.com/user/private-repo1 \
--repo2 https://github.com/user/private-repo2 \
--github_token YOUR_GITHUB_TOKEN
```
### Training Qwen3-8B
```bash
# Using the main script with Qwen3 config
python src/main.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2 \
--config configs/training_config_qwen3.yaml \
--output_dir ./models \
--log_level INFO
# Or using the dedicated Qwen3 runner
python run_training_qwen3.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2
# Using private repositories with a GitHub token
python run_training_qwen3.py \
--repo1 https://github.com/user/private-repo1 \
--repo2 https://github.com/user/private-repo2 \
--github_token YOUR_GITHUB_TOKEN
```
### Command Line Arguments
- `--repo1`: First GitHub repository URL (required)
- `--repo2`: Second GitHub repository URL (required)
- `--config`: Path to training configuration file (default: configs/training_config.yaml)
- `--output_dir`: Directory to save trained model (default: ./models)
- `--log_level`: Logging level (DEBUG, INFO, WARNING, ERROR)
- `--github_token`: GitHub token for accessing private repositories (optional)
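
For reference, these flags correspond to a conventional `argparse` definition; a minimal sketch of what `src/main.py` might declare (defaults match the list above):
```python
import argparse

# Sketch of the documented CLI surface; the real definition lives in src/main.py.
parser = argparse.ArgumentParser(
    description="Train an unsloth model on code from two GitHub repositories."
)
parser.add_argument("--repo1", required=True, help="First GitHub repository URL")
parser.add_argument("--repo2", required=True, help="Second GitHub repository URL")
parser.add_argument("--config", default="configs/training_config.yaml",
                    help="Path to training configuration file")
parser.add_argument("--output_dir", default="./models",
                    help="Directory to save the trained model")
parser.add_argument("--log_level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"])
parser.add_argument("--github_token", default=None,
                    help="GitHub token for accessing private repositories")
args = parser.parse_args()
```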
## Project Structure
```
ai_trainer/
├── src/
│   ├── __init__.py
│   ├── main.py                    # Main entry point
│   ├── trainer.py                 # Model training logic
│   ├── dataset_processor.py       # GitHub repository processing
│   ├── config.py                  # Configuration management
│   └── utils.py                   # Utility functions
├── configs/
│   ├── training_config.yaml       # Qwen2.5-Coder training configuration
│   └── training_config_qwen3.yaml # Qwen3-8B training configuration
├── data/
│   └── processed/                 # Processed datasets
├── models/                        # Trained models
├── logs/                          # Training logs
├── run_training.py                # Runner script (Qwen2.5-Coder)
├── run_training_qwen3.py          # Runner script (Qwen3-8B)
├── requirements.txt
└── README.md
```
## Memory Optimization
This application is specifically optimized for an RTX 3070 with 8GB VRAM:
- Uses 4-bit quantization (bnb-4bit)
- Gradient checkpointing enabled
- No CPU offloading
- Optimized batch sizes for 8GB VRAM
- Memory-efficient data loading
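
As a rough illustration, these optimizations map onto an unsloth setup along the following lines (a sketch under the configs above, not the project's exact code; the LoRA parameters are illustrative):
```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit so a 7-8B model fits in 8GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; unsloth's gradient checkpointing trades compute for memory.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # illustrative LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)
```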
## Configuration
### Qwen2.5-Coder-7B Configuration
**File**: `configs/training_config.yaml`
```yaml
model:
  name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  learning_rate: 2.0e-4
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.85
```
### Qwen3-8B Configuration
**File**: `configs/training_config_qwen3.yaml`
```yaml
model:
  name: "unsloth/Qwen3-8B-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 1   # More conservative
  gradient_accumulation_steps: 8   # Higher accumulation
  learning_rate: 1.0e-4            # Lower learning rate
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.95           # More aggressive memory usage
```
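Both files share the same schema and can be read with PyYAML; a minimal sketch of the loading step (assuming `src/config.py` uses `yaml.safe_load`):
```python
import yaml

# Load a training configuration; keys mirror the YAML sections above.
with open("configs/training_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["name"])                            # unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
print(cfg["training"]["per_device_train_batch_size"])  # 2
```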
### Key Differences
| Setting | Qwen2.5-Coder | Qwen3-8B | Reason |
|---------|---------------|----------|---------|
| Batch Size | 2 | 1 | Larger model needs smaller batches |
| Gradient Accumulation | 4 | 8 | Maintains effective batch size |
| Learning Rate | 2e-4 | 1e-4 | Larger model needs more conservative LR |
| Memory Usage | 85% | 95% | Qwen3 can use more VRAM |
| Effective Batch Size | 8 | 8 | Same training dynamics |
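
In both cases the effective batch size is `per_device_train_batch_size` × `gradient_accumulation_steps`: 2 × 4 = 8 for Qwen2.5-Coder and 1 × 8 = 8 for Qwen3-8B, so both configurations feed the optimizer the same number of examples per step.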
## Model Selection Guide
### Choose Qwen2.5-Coder-7B when:
- You want to fine-tune specifically for **code generation** tasks
- Working with **programming languages** and technical content
- Need **code completion** and **code understanding** capabilities
- Prefer **moderate memory usage** (~6-7GB VRAM)
### Choose Qwen3-8B when:
- You need **general instruction following** capabilities
- Working with **mixed content** (code + natural language)
- Want **broader language understanding** and generation
- Have **sufficient VRAM** (~7-8GB) and prefer newer architecture
## License
MIT License