# AI Trainer
A Python application for training unsloth models on data from GitHub repositories. Supports both Qwen2.5-Coder and Qwen3 models, optimized for an RTX 3070 with 8GB VRAM.
## Supported Models
### 1. Qwen2.5-Coder-7B-Instruct (Default)
- **Model**: `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit`
- **Best for**: Code generation, code completion, programming tasks
- **Memory Usage**: Moderate (~6-7GB VRAM)
- **Config**: `configs/training_config.yaml`
### 2. Qwen3-8B
- **Model**: `unsloth/Qwen3-8B-bnb-4bit`
- **Best for**: General instruction following, broader language tasks
- **Memory Usage**: Higher (~7-8GB VRAM)
- **Config**: `configs/training_config_qwen3.yaml`
## Features
- **Dataset Processing**: Automatically processes code from GitHub repositories
- **Memory Optimized**: Designed for an RTX 3070 (8GB VRAM) with no CPU offloading
- **Configurable Training**: YAML-based configuration system
- **Progress Logging**: Comprehensive logging and monitoring
- **Modular Design**: Clean separation of concerns with dataset processing, training, and utilities
- **Multi-Model Support**: Easy switching between different model architectures
## Requirements
- Python 3.8+
- CUDA-compatible GPU (tested with an RTX 3070, 8GB VRAM)
- Git
- Dependencies listed in `requirements.txt`
## Private Repository Support
The application supports processing private GitHub repositories by authenticating with a GitHub token.
To use this feature:
1. Generate a GitHub personal access token with the appropriate permissions
2. Pass the token via the `--github_token` command line argument
3. Use private repository URLs in the same format as public repositories

Supported URL formats for private repositories:
- `https://github.com/user/private-repo.git`
- `github.com/user/private-repo`
- `user/private-repo`
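
For illustration, here is a minimal sketch of how these formats can be normalized into a token-authenticated clone URL (the helper name is hypothetical; the actual logic lives in `src/dataset_processor.py`):
```python
# Hypothetical sketch -- the real normalization lives in src/dataset_processor.py.
def to_clone_url(repo, github_token=None):
    """Turn any supported repo reference into a cloneable HTTPS URL."""
    repo = repo.replace("https://", "", 1).replace("github.com/", "", 1)
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]  # strip trailing ".git" -> "user/repo"
    if github_token:
        # Git reads the token from the URL's userinfo field.
        return "https://{}@github.com/{}.git".format(github_token, repo)
    return "https://github.com/{}.git".format(repo)
```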
## Installation
1. Clone this repository.
2. If you have a CUDA-capable GPU, install PyTorch with CUDA support:
```bash
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
```
3. Install the remaining dependencies:
```bash
pip install -r requirements.txt
```
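Before starting a run, you can verify that PyTorch sees the GPU:
```python
import torch

# Quick sanity check that CUDA is available and reports the expected VRAM.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print("VRAM: {:.1f} GB".format(props.total_memory / 1024 ** 3))
```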
## Usage
### Training Qwen2.5-Coder-7B (Default)
```bash
# Using the main script
python src/main.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2 \
--config configs/training_config.yaml \
--output_dir ./models \
--log_level INFO
# Or using the runner script
python run_training.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2
# Using private repositories with a GitHub token
python run_training.py \
--repo1 https://github.com/user/private-repo1 \
--repo2 https://github.com/user/private-repo2 \
--github_token YOUR_GITHUB_TOKEN
```
### Training Qwen3-8B
```bash
# Using the main script with Qwen3 config
python src/main.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2 \
--config configs/training_config_qwen3.yaml \
--output_dir ./models \
--log_level INFO
# Or using the dedicated Qwen3 runner
python run_training_qwen3.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2
# Using private repositories with a GitHub token
python run_training_qwen3.py \
--repo1 https://github.com/user/private-repo1 \
--repo2 https://github.com/user/private-repo2 \
--github_token YOUR_GITHUB_TOKEN
```
### Command Line Arguments
- `--repo1`: First GitHub repository URL (required)
- `--repo2`: Second GitHub repository URL (required)
- `--config`: Path to training configuration file (default: configs/training_config.yaml)
- `--output_dir`: Directory to save trained model (default: ./models)
- `--log_level`: Logging level (DEBUG, INFO, WARNING, ERROR)
- `--github_token`: GitHub token for accessing private repositories (optional)
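
For reference, these flags correspond to a conventional `argparse` definition; a minimal sketch of what `src/main.py` might declare (defaults match the list above):
```python
import argparse

# Sketch of the documented CLI surface; the real definition lives in src/main.py.
parser = argparse.ArgumentParser(
    description="Train an unsloth model on code from two GitHub repositories."
)
parser.add_argument("--repo1", required=True, help="First GitHub repository URL")
parser.add_argument("--repo2", required=True, help="Second GitHub repository URL")
parser.add_argument("--config", default="configs/training_config.yaml",
                    help="Path to training configuration file")
parser.add_argument("--output_dir", default="./models",
                    help="Directory to save the trained model")
parser.add_argument("--log_level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"])
parser.add_argument("--github_token", default=None,
                    help="GitHub token for accessing private repositories")
args = parser.parse_args()
```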
## Project Structure
```
ai_trainer/
├── src/
│   ├── __init__.py
│   ├── main.py                    # Main entry point
│   ├── trainer.py                 # Model training logic
│   ├── dataset_processor.py       # GitHub repository processing
│   ├── config.py                  # Configuration management
│   └── utils.py                   # Utility functions
├── configs/
│   ├── training_config.yaml       # Qwen2.5-Coder training configuration
│   └── training_config_qwen3.yaml # Qwen3-8B training configuration
├── data/
│   └── processed/                 # Processed datasets
├── models/                        # Trained models
├── logs/                          # Training logs
├── run_training.py                # Runner script (Qwen2.5-Coder)
├── run_training_qwen3.py          # Runner script (Qwen3-8B)
├── requirements.txt
└── README.md
```
## Memory Optimization
This application is specifically optimized for an RTX 3070 with 8GB VRAM:
- Uses 4-bit quantization (bnb-4bit)
- Gradient checkpointing enabled
- No CPU offloading
- Optimized batch sizes for 8GB VRAM
- Memory-efficient data loading
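
As a rough illustration, these optimizations map onto an unsloth setup along the following lines (a sketch under the configs above, not the project's exact code; the LoRA parameters are illustrative):
```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit so a 7-8B model fits in 8GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; unsloth's gradient checkpointing trades compute for memory.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # illustrative LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)
```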
## Configuration
### Qwen2.5-Coder-7B Configuration
**File**: `configs/training_config.yaml`
```yaml
model:
  name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  learning_rate: 2.0e-4
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.85
```
### Qwen3-8B Configuration
**File**: `configs/training_config_qwen3.yaml`
```yaml
model:
  name: "unsloth/Qwen3-8B-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 1   # More conservative
  gradient_accumulation_steps: 8   # Higher accumulation
  learning_rate: 1.0e-4            # Lower learning rate
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.95           # More aggressive memory usage
```
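Both files share the same schema and can be read with PyYAML; a minimal sketch of the loading step (assuming `src/config.py` uses `yaml.safe_load`):
```python
import yaml

# Load a training configuration; keys mirror the YAML sections above.
with open("configs/training_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["name"])                            # unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
print(cfg["training"]["per_device_train_batch_size"])  # 2
```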
### Key Differences
| Setting | Qwen2.5-Coder | Qwen3-8B | Reason |
|---------|---------------|----------|---------|
| Batch Size | 2 | 1 | Larger model needs smaller batches |
| Gradient Accumulation | 4 | 8 | Maintains effective batch size |
| Learning Rate | 2e-4 | 1e-4 | Larger model needs more conservative LR |
| Memory Usage | 85% | 95% | Qwen3 can use more VRAM |
| Effective Batch Size | 8 | 8 | Same training dynamics |
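
In both cases the effective batch size is `per_device_train_batch_size` × `gradient_accumulation_steps`: 2 × 4 = 8 for Qwen2.5-Coder and 1 × 8 = 8 for Qwen3-8B, so both configurations feed the optimizer the same number of examples per step.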
## Model Selection Guide
### Choose Qwen2.5-Coder-7B when:
- You want to fine-tune specifically for **code generation** tasks
- Working with **programming languages** and technical content
- Need **code completion** and **code understanding** capabilities
- Prefer **moderate memory usage** (~6-7GB VRAM)
### Choose Qwen3-8B when:
- You need **general instruction following** capabilities
- Working with **mixed content** (code + natural language)
- Want **broader language understanding** and generation
- Have **sufficient VRAM** (~7-8GB) and prefer newer architecture
## License
MIT License