# Odoo AI Model Trainer

A Python project for fine-tuning an AI model on Odoo documentation using Unsloth, tuned for an RTX 3070 with 8GB of VRAM. It scrapes both the English and Indonesian Odoo documentation and fine-tunes the `unsloth/Qwen3-8B-bnb-4bit` model.

## Features

- 🌐 **Bilingual Support**: Scrapes both English and Indonesian Odoo documentation
- 🚀 **Optimized Training**: Uses Unsloth for roughly 2x faster training and 70% less memory, per Unsloth's published figures
- 🎯 **RTX 3070 Optimized**: Configured for 8GB VRAM with memory-efficient settings
- 📊 **Data Pipeline**: Complete pipeline from data collection to model training
- 🔧 **Modular Design**: Separate scripts for scraping, preprocessing, and training
- 📈 **Progress Tracking**: Built-in statistics and progress monitoring

## Requirements

### Hardware
- NVIDIA RTX 3070 (8GB VRAM) or better
- 16GB+ RAM recommended
- 50GB+ free disk space

### Software
- Python 3.8+
- CUDA 11.8+
- PyTorch with CUDA support

## Installation

1. **Clone or download this project**

2. **Install dependencies**:
```bash
pip install -r requirements.txt
```

3. **Verify CUDA installation**:
```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
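
The one-liner above only reports whether CUDA is visible. Since the project targets 8GB of VRAM, it can also help to see which GPU PyTorch picked up and how much memory it reports. The following is a minimal sketch, not part of the project; `describe_gpu` is a hypothetical helper name, and it falls back gracefully when PyTorch or a GPU is missing.

```python
import importlib.util


def describe_gpu():
    """Report whether PyTorch can see a CUDA device and how much VRAM it has."""
    if importlib.util.find_spec("torch") is None:
        return {"available": False, "reason": "PyTorch is not installed"}

    import torch  # only imported once we know it is installed

    if not torch.cuda.is_available():
        return {"available": False, "reason": "PyTorch cannot see a CUDA device"}

    props = torch.cuda.get_device_properties(0)  # first visible GPU
    return {
        "available": True,
        "name": props.name,
        "vram_gib": round(props.total_memory / 1024**3, 1),
        "cuda_version": torch.version.cuda,  # CUDA version PyTorch was built with
    }


print(describe_gpu())
```

If `available` is `False`, the `reason` field points at which of the software requirements to revisit first.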

## Usage

### Full Pipeline (Recommended)
Run the complete training pipeline:
```bash
python main.py
```

### Step-by-Step Execution

1. **Data Collection Only**:
```bash
python main.py --only-collection
```

2. **Data Preprocessing Only**:
```bash
python main.py --only-preprocessing
```

3. **Model Training Only**:
```bash
python main.py --only-training
```

### Skip Specific Steps
```bash
# Skip data collection if you already have data
python main.py --skip-collection

# Skip preprocessing if you already have training data
python main.py --skip-preprocessing

# Skip training when testing other components
python main.py --skip-training
```
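
How `main.py` parses these flags is not shown in this README. As a rough sketch only, the `--only-*` and `--skip-*` options could be resolved into an ordered list of steps with `argparse`. The names `STEPS`, `build_parser`, and `plan_steps` are hypothetical and not taken from the project, and treating the `--only-*` flags as mutually exclusive is an assumption.

```python
import argparse

# Pipeline order as described in this README.
STEPS = ("collection", "preprocessing", "training")


def build_parser():
    """Build a parser exposing one --only-<step> and one --skip-<step> flag per step."""
    parser = argparse.ArgumentParser(description="Odoo AI Model Trainer pipeline (sketch)")
    only_group = parser.add_mutually_exclusive_group()  # assumption: at most one --only-* flag
    for step in STEPS:
        only_group.add_argument(f"--only-{step}", action="store_true",
                                help=f"run only the {step} step")
        parser.add_argument(f"--skip-{step}", action="store_true",
                            help=f"skip the {step} step")
    return parser


def plan_steps(args):
    """Return the steps to run, in pipeline order, for the parsed arguments."""
    for step in STEPS:
        if getattr(args, f"only_{step}"):
            return [step]
    return [step for step in STEPS if not getattr(args, f"skip_{step}")]


parser = build_parser()
print(plan_steps(parser.parse_args(["--skip-collection"])))
print(plan_steps(parser.parse_args(["--only-training"])))
```

With no flags at all, `plan_steps` returns every step, which matches the full-pipeline behaviour of `python main.py`.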

### Individual Scripts
You can also run individual scripts directly:

```bash
# Scrape Odoo documentation
python data_scraper.py

# Preprocess the scraped data
python data_preprocessor.py

# Train the model
python train_model.py
```

## Project Structure

```
.
├── main.py                 # Main orchestrator script
├── data_scraper.py         # Web scraping for Odoo docs
├── data_preprocessor.py    # Data cleaning and formatting
├── train_model.py          # Model training with Unsloth
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── odoo_docs_data.csv      # Scraped raw data (generated)
├── training_data.json      # Processed training data (generated)
└── odoo_model_output/      # Trained model (generated)
```

## Output Files

- **odoo_docs_data.csv**: Raw scraped documentation
- **training_data.json**: Processed training data in instruction format
- **odoo_model_output/**: Directory containing the fine-tuned model
- **odoo_model_output_gguf/**: GGUF quantized model for deployment
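
The exact schema of `training_data.json` is not documented here. A common instruction format uses `instruction`, `input`, and `output` keys (Alpaca-style); the sketch below assumes that layout. The function name, the `language` field, the prompt wording, and the sample text are all illustrative placeholders rather than the project's actual output.

```python
import json


def to_instruction_record(title, content, language="en"):
    """Turn one scraped page into an instruction-style training record (assumed schema)."""
    return {
        "instruction": f"Explain the following Odoo documentation topic: {title}",
        "input": "",
        "output": content.strip(),
        "language": language,  # hypothetical field marking English vs Indonesian pages
    }


records = [
    to_instruction_record("Example topic", "  Placeholder text standing in for a scraped page body.  "),
    to_instruction_record("Contoh topik", "Teks contoh.", language="id"),
]

# ensure_ascii=False keeps non-ASCII characters readable in the JSON file.
serialized = json.dumps(records, ensure_ascii=False, indent=2)
print(serialized)
```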

## Configuration

### Memory Optimization for RTX 3070
The training is configured with:
- Batch size: 1 (per device)
- Gradient accumulation: 4 (effective batch size: 4)
- Max sequence length: 2048 tokens
- 4-bit quantization to save VRAM
- Gradient checkpointing enabled
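
The effective batch size in the list above is the per-device batch size multiplied by the gradient accumulation steps. A minimal sketch of that arithmetic, assuming a single GPU; `MemoryConfig` is a hypothetical container for the listed values, not a class from `train_model.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """The memory-related settings listed above, gathered in one place."""
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 4
    max_seq_length: int = 2048
    load_in_4bit: bool = True
    gradient_checkpointing: bool = True

    @property
    def effective_batch_size(self):
        # Sequences contributing to each optimizer update (single-GPU assumption).
        return self.per_device_batch_size * self.gradient_accumulation_steps

    @property
    def max_tokens_per_update(self):
        # Upper bound, reached only if every sequence fills the full context.
        return self.effective_batch_size * self.max_seq_length


cfg = MemoryConfig()
print(cfg.effective_batch_size)   # 4
print(cfg.max_tokens_per_update)  # 8192
```

Only one sequence is resident on the GPU at a time, which is why raising the accumulation steps increases the effective batch size without a matching increase in peak activation memory.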

### Training Parameters
- Learning rate: 2e-4
- Max steps: 100 (increase for production)
- Warmup steps: 5
- LoRA rank: 16
- LoRA alpha: 16
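
To put these numbers in context: assuming the standard LoRA formulation, the adapter update is scaled by alpha / rank, and each adapted linear layer adds rank x (d_in + d_out) trainable parameters. The sketch below works through that arithmetic. The 4096 x 4096 layer size is a made-up example rather than a dimension taken from the model, and the effective batch size of 4 comes from the memory settings in this README.

```python
def lora_scaling(rank, alpha):
    """Scale factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 16
MAX_STEPS, EFFECTIVE_BATCH_SIZE = 100, 4

print(lora_scaling(RANK, ALPHA))                 # 1.0
print(lora_trainable_params(4096, 4096, RANK))   # 131072, versus 16777216 in the full matrix
print(MAX_STEPS * EFFECTIVE_BATCH_SIZE)          # 400 sequences seen in a 100-step run
```

With only 400 sequences processed, a 100-step run is best read as a smoke test, which is consistent with the note to increase the step count for production.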

## Troubleshooting

### CUDA Out of Memory
If you encounter CUDA OOM errors:
1. Reduce the batch size in `train_model.py` (if you raised it above the default of 1)
2. Increase gradient accumulation steps to keep the effective batch size unchanged
3. Reduce the max sequence length
4. Restart your Python session to release VRAM held by a previous run
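
The first three steps can be read as a fallback order: trade batch size for gradient accumulation first, and once the batch size is already 1, shorten the sequence length. A small sketch of that policy follows. `TrainSettings`, `next_fallback`, and the 512-token floor are hypothetical choices for illustration, not values from `train_model.py`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainSettings:
    batch_size: int
    grad_accum: int
    max_seq_length: int


def next_fallback(settings, min_seq_length=512):
    """Return the next, less memory-hungry settings to try, or None if none are left."""
    if settings.batch_size > 1:
        # Halve the batch and double accumulation; the effective batch size
        # stays constant when batch_size is even.
        return replace(settings,
                       batch_size=max(1, settings.batch_size // 2),
                       grad_accum=settings.grad_accum * 2)
    if settings.max_seq_length // 2 >= min_seq_length:
        # Batch size is already 1, so shorten the context instead.
        return replace(settings, max_seq_length=settings.max_seq_length // 2)
    return None


current = TrainSettings(batch_size=1, grad_accum=4, max_seq_length=2048)
while current is not None:
    print(current)
    current = next_fallback(current)
```

Starting from the defaults in this README, the batch size is already 1, so the only remaining lever in this sketch is the sequence length (2048, then 1024, then 512).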

### Data Collection Issues
- Check your internet connection
- The Odoo website may block rapid requests; the script includes delays between requests
- If scraping the Indonesian docs fails, they may have moved to a different URL
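
The idea of pausing between requests, and of carrying on when one URL fails, can be sketched as follows. This is not the code in `data_scraper.py`; `fetch_all`, the one-second default delay, and the injectable `fetch` and `sleep` parameters are assumptions made so the example stays testable without network access.

```python
import time
import urllib.request


def http_get(url, timeout=10):
    """Download one page and return its text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, delay_seconds=1.0, fetch=http_get, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests; failed URLs map to None."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # be polite: never send back-to-back requests
        try:
            pages[url] = fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            print(f"Failed to fetch {url}: {exc}")
            pages[url] = None
    return pages
```

Recording a failed URL as `None` instead of raising lets one language's documentation be collected even when the other language's URL has changed.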

### Training Issues
- Ensure CUDA is properly installed
- Check that your GPU drivers are up to date
- Verify PyTorch CUDA compatibility

## Model Usage

After training, you can use the model for Odoo-related questions:

```python
from train_model import OdooModelTrainer

# Create the trainer and call its model loader
trainer = OdooModelTrainer()
trainer.load_model()

# Load your trained model
# trainer.model = ... (load from odoo_model_output)

# Ask an Odoo-related question and print the answer
response = trainer.generate_response("How do I install Odoo?")
print(response)
```

## Performance Notes

- **Training Time**: ~30-60 minutes for 100 steps on an RTX 3070
- **Memory Usage**: ~6-7GB VRAM during training
- **Data Size**: ~20-50MB of documentation data
- **Model Size**: ~4-5GB for the fine-tuned model
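
When raising the step count for a longer run, the training-time figure above gives a rough way to budget. The sketch below simply scales the quoted 30-60 minutes per 100 steps linearly, which is an assumption; real throughput depends on sequence lengths, data loading, and thermals. `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(steps, baseline_steps=100, baseline_minutes=(30, 60)):
    """Linearly extrapolate the README's time range to a different step count."""
    low, high = baseline_minutes
    return (low * steps / baseline_steps, high * steps / baseline_steps)


# Seconds per step implied by the quoted range.
seconds_per_step = tuple(m * 60 / 100 for m in (30, 60))
print(seconds_per_step)        # (18.0, 36.0)

print(estimate_minutes(1000))  # (300.0, 600.0), i.e. roughly 5 to 10 hours
```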

## Contributing

Feel free to submit issues and enhancement requests!

## License

This project is open source. Please check individual component licenses for details.

## Disclaimer

This project is for educational and research purposes. Ensure compliance with Odoo's terms of service when scraping documentation.