# AI Trainer

A Python application for training various unsloth models using data from GitHub repositories. Supports both Qwen2.5-Coder and Qwen3 models, optimized for an RTX3070 with 8GB VRAM.

## Supported Models

### 1. Qwen2.5-Coder-7B-Instruct (Default)

- **Model**: `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit`
- **Best for**: Code generation, code completion, programming tasks
- **Memory Usage**: Moderate (~6-7GB VRAM)
- **Config**: `configs/training_config.yaml`

### 2. Qwen3-8B

- **Model**: `unsloth/Qwen3-8B-bnb-4bit`
- **Best for**: General instruction following, broader language tasks
- **Memory Usage**: Higher (~7-8GB VRAM)
- **Config**: `configs/training_config_qwen3.yaml`

## Features

- **Dataset Processing**: Automatically processes code from GitHub repositories
- **Memory Optimized**: Designed for RTX3070 8GB VRAM with no CPU offloading
- **Configurable Training**: YAML-based configuration system
- **Progress Logging**: Comprehensive logging and monitoring
- **Modular Design**: Clean separation of concerns between dataset processing, training, and utilities
- **Multi-Model Support**: Easy switching between different model architectures

## Requirements

- Python 3.8+
- CUDA-compatible GPU (tested with RTX3070 8GB VRAM)
- Git
- Dependencies listed in `requirements.txt`

## Private Repository Support

The application supports processing private GitHub repositories by using a GitHub token for authentication. To use this feature:

1. Generate a GitHub personal access token with appropriate permissions
2. Pass the token using the `--github_token` command line argument
3. Use private repository URLs in the same format as public repositories (see the sketch after the list below)

Supported URL formats for private repositories:

- `https://github.com/user/private-repo.git`
- `github.com/user/private-repo`
- `user/private-repo`
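The README does not prescribe how the token is applied internally; the actual logic lives in `src/dataset_processor.py`. The following is only a minimal sketch of one common approach, assuming the token is embedded in an HTTPS clone URL. The `build_clone_url` helper is hypothetical and not part of the codebase.

```python
from typing import Optional


def build_clone_url(repo: str, github_token: Optional[str] = None) -> str:
    """Normalize a repository reference and optionally embed a token for HTTPS auth."""
    # Accept the three formats listed above and normalize to a full HTTPS URL.
    if repo.startswith("https://"):
        url = repo
    elif repo.startswith("github.com/"):
        url = f"https://{repo}"
    else:  # "user/private-repo" shorthand
        url = f"https://github.com/{repo}"
    if not url.endswith(".git"):
        url += ".git"
    if github_token:
        # Token-authenticated clone URL, usable directly with `git clone`.
        url = url.replace("https://", f"https://{github_token}@", 1)
    return url


if __name__ == "__main__":
    # Example: build a token-authenticated URL for a private repository.
    print(build_clone_url("user/private-repo", "YOUR_GITHUB_TOKEN"))
```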
## Installation

1. Clone this repository
2. If you have a CUDA GPU, install PyTorch:

   ```bash
   pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Training Qwen2.5-Coder-7B (Default)

```bash
# Using the main script
python src/main.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2 \
    --config configs/training_config.yaml \
    --output_dir ./models \
    --log_level INFO

# Or using the runner script
python run_training.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2

# Using private repositories with a GitHub token
python run_training.py \
    --repo1 https://github.com/user/private-repo1 \
    --repo2 https://github.com/user/private-repo2 \
    --github_token YOUR_GITHUB_TOKEN
```

### Training Qwen3-8B

```bash
# Using the main script with the Qwen3 config
python src/main.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2 \
    --config configs/training_config_qwen3.yaml \
    --output_dir ./models \
    --log_level INFO

# Or using the dedicated Qwen3 runner
python run_training_qwen3.py \
    --repo1 https://github.com/user/repo1 \
    --repo2 https://github.com/user/repo2

# Using private repositories with a GitHub token
python run_training_qwen3.py \
    --repo1 https://github.com/user/private-repo1 \
    --repo2 https://github.com/user/private-repo2 \
    --github_token YOUR_GITHUB_TOKEN
```

### Command Line Arguments

- `--repo1`: First GitHub repository URL (required)
- `--repo2`: Second GitHub repository URL (required)
- `--config`: Path to training configuration file (default: `configs/training_config.yaml`)
- `--output_dir`: Directory to save the trained model (default: `./models`)
- `--log_level`: Logging level (DEBUG, INFO, WARNING, ERROR)
- `--github_token`: GitHub token for accessing private repositories (optional)

## Project Structure

```
ai_trainer/
├── src/
│   ├── __init__.py
│   ├── main.py                 # Main entry point
│   ├── trainer.py              # Model training logic
│   ├── dataset_processor.py    # GitHub repository processing
│   ├── config.py               # Configuration management
│   └── utils.py                # Utility functions
├── configs/
│   └── training_config.yaml    # Training configuration
├── data/
│   └── processed/              # Processed datasets
├── models/                     # Trained models
├── logs/                       # Training logs
├── requirements.txt
└── README.md
```

## Memory Optimization

This application is specifically optimized for RTX3070 8GB VRAM:

- Uses 4-bit quantization (bnb-4bit)
- Gradient checkpointing enabled
- No CPU offloading
- Optimized batch sizes for 8GB VRAM
- Memory-efficient data loading
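As a rough illustration of how these memory settings map onto unsloth's 4-bit loading path, a minimal sketch is shown below. This is not the actual `src/trainer.py` code: the LoRA rank, alpha, and target modules are assumed values for illustration, and the real implementation is driven by the YAML config rather than hard-coded literals.

```python
# Illustrative sketch only; assumes unsloth's FastLanguageModel API.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,   # matches model.max_seq_length in the config
    load_in_4bit=True,     # bnb-4bit quantization keeps weights within 8GB VRAM
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                   # LoRA rank (assumed value)
    lora_alpha=16,                          # assumed value
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",   # trades compute for memory
)
```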
## Configuration

### Qwen2.5-Coder-7B Configuration

**File**: `configs/training_config.yaml`

```yaml
model:
  name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 4
  learning_rate: 2.0e-4
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.85
```

### Qwen3-8B Configuration

**File**: `configs/training_config_qwen3.yaml`

```yaml
model:
  name: "unsloth/Qwen3-8B-bnb-4bit"
  max_seq_length: 2048

training:
  per_device_train_batch_size: 1   # More conservative
  gradient_accumulation_steps: 8   # Higher accumulation
  learning_rate: 1.0e-4            # Lower learning rate
  num_train_epochs: 3

memory:
  use_gradient_checkpointing: true
  offload_to_cpu: false
  max_memory_usage: 0.95           # More aggressive memory usage
```

### Key Differences

| Setting | Qwen2.5-Coder | Qwen3-8B | Reason |
|---------|---------------|----------|--------|
| Batch Size | 2 | 1 | Larger model needs smaller batches |
| Gradient Accumulation | 4 | 8 | Maintains effective batch size |
| Learning Rate | 2e-4 | 1e-4 | Larger model needs a more conservative LR |
| Memory Usage | 85% | 95% | Qwen3 can use more VRAM |
| Effective Batch Size | 8 | 8 | Same training dynamics |

## Model Selection Guide

### Choose Qwen2.5-Coder-7B when:

- You want to fine-tune specifically for **code generation** tasks
- You are working with **programming languages** and technical content
- You need **code completion** and **code understanding** capabilities
- You prefer **moderate memory usage** (~6-7GB VRAM)

### Choose Qwen3-8B when:

- You need **general instruction following** capabilities
- You are working with **mixed content** (code + natural language)
- You want **broader language understanding** and generation
- You have **sufficient VRAM** (~7-8GB) and prefer a newer architecture

## License

MIT License