Suherdy SYC. Yacob 55cfa53e9b first commit		2025-08-22 16:30:56 +07:00
data_preprocessor.py	first commit	2025-08-22 16:30:56 +07:00
data_scraper.py	first commit	2025-08-22 16:30:56 +07:00
main.py	first commit	2025-08-22 16:30:56 +07:00
README.md	first commit	2025-08-22 16:30:56 +07:00
requirements.txt	first commit	2025-08-22 16:30:56 +07:00
test_setup.py	first commit	2025-08-22 16:30:56 +07:00
train_model.py	first commit	2025-08-22 16:30:56 +07:00

README.md

Odoo AI Model Trainer

A Python project that fine-tunes a language model on Odoo documentation using Unsloth, configured to fit within the 8GB of VRAM on an RTX3070. It scrapes the English and Indonesian Odoo documentation and fine-tunes the unsloth/Qwen3-8B-bnb-4bit model on the result.

Features

🌐 Bilingual Support: Scrapes both English and Indonesian Odoo documentation
🚀 Optimized Training: Uses Unsloth, which advertises roughly 2x faster training and about 70% less memory use
🎯 RTX3070 Optimized: Configured for 8GB VRAM with memory-efficient settings
📊 Data Pipeline: Complete pipeline from data collection to model training
🔧 Modular Design: Separate scripts for scraping, preprocessing, and training
📈 Progress Tracking: Built-in statistics and progress monitoring

Requirements

Hardware

NVIDIA RTX3070 (8GB VRAM) or better
16GB+ RAM recommended
50GB+ free disk space

Software

Python 3.8+
CUDA 11.8+
PyTorch with CUDA support

Installation

Clone or download this project
Install dependencies:
```
pip install -r requirements.txt
```

Verify CUDA installation:

```
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
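For a slightly fuller check than the one-liner, the sketch below also reports the device name and total VRAM, which is useful for confirming the 8GB budget this project targets. The function name `describe_gpu` and the report keys are illustrative and not part of this project; the PyTorch calls used (`torch.cuda.is_available`, `torch.cuda.get_device_properties`, `torch.version.cuda`) are standard.

```python
def describe_gpu():
    """Return a small report on what PyTorch can see (hypothetical helper)."""
    report = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so nothing else can be checked.
        return report

    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["device_name"] = props.name
        # total_memory is in bytes; convert to GiB for readability.
        report["total_vram_gb"] = round(props.total_memory / 1024**3, 1)
        report["cuda_version"] = torch.version.cuda
    return report


for key, value in describe_gpu().items():
    print(f"{key}: {value}")
```

If `cuda_available` is False while `torch_installed` is True, the installed PyTorch build is probably CPU-only or does not match the installed driver.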

Usage

Full Pipeline (Recommended)

Run the complete training pipeline:

```
python main.py
```

Step-by-Step Execution

Data Collection Only:
```
python main.py --only-collection
```
Data Preprocessing Only:
```
python main.py --only-preprocessing
```
Model Training Only:
```
python main.py --only-training
```

Skip Specific Steps

```
# Skip data collection if you already have data
python main.py --skip-collection

# Skip preprocessing if you already have training data
python main.py --skip-preprocessing

# Skip training for testing other components
python main.py --skip-training
```
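To make the interaction between the `--only-*` and `--skip-*` flags concrete, here is a minimal sketch of how such flags could be resolved into a list of steps with `argparse`. It is an assumption about the behavior, not the actual contents of main.py; the names `STEPS`, `build_parser`, and `resolve_steps` are hypothetical.

```python
import argparse

# Pipeline steps in execution order (names taken from the flags in this README).
STEPS = ("collection", "preprocessing", "training")


def build_parser():
    parser = argparse.ArgumentParser(description="Odoo AI Model Trainer pipeline")
    for step in STEPS:
        parser.add_argument(f"--only-{step}", action="store_true",
                            help=f"run only the {step} step")
        parser.add_argument(f"--skip-{step}", action="store_true",
                            help=f"skip the {step} step")
    return parser


def resolve_steps(argv=None):
    """Translate command-line flags into the ordered list of steps to run."""
    args = build_parser().parse_args(argv)
    only = [s for s in STEPS if getattr(args, f"only_{s}")]
    if len(only) > 1:
        raise SystemExit("Use at most one --only-* flag")
    if only:
        return only
    return [s for s in STEPS if not getattr(args, f"skip_{s}")]


print(resolve_steps(["--skip-collection"]))  # ['preprocessing', 'training']
```

In this sketch an `--only-*` flag takes precedence over any `--skip-*` flags; the real script may resolve that conflict differently.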

Individual Scripts

You can also run individual scripts directly:

```
# Scrape Odoo documentation
python data_scraper.py

# Preprocess the scraped data
python data_preprocessor.py

# Train the model
python train_model.py
```

Project Structure

.
├── main.py                    # Main orchestrator script
├── data_scraper.py           # Web scraping for Odoo docs
├── data_preprocessor.py      # Data cleaning and formatting
├── train_model.py           # Model training with Unsloth
├── requirements.txt          # Python dependencies
├── README.md                # This file
├── odoo_docs_data.csv       # Scraped raw data (generated)
├── training_data.json       # Processed training data (generated)
└── odoo_model_output/       # Trained model (generated)

Output Files

odoo_docs_data.csv: Raw scraped documentation
training_data.json: Processed training data in instruction format
odoo_model_output/: Directory containing the fine-tuned model
odoo_model_output_gguf/: GGUF quantized model for deployment
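As an illustration of what "instruction format" can look like, the sketch below turns scraped CSV rows into instruction/input/output records. The column names (`title`, `content`, `language`), the prompt wording, and the record keys are assumptions made for this example; the actual schema used by data_preprocessor.py may differ.

```python
import csv
import io
import json


def rows_to_instruction_records(csv_text):
    """Convert CSV text into a list of instruction-style records.

    Assumes hypothetical columns 'title' and 'content'.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        content = (row.get("content") or "").strip()
        if not content:
            # Skip pages with no usable text.
            continue
        title = (row.get("title") or "").strip()
        records.append({
            "instruction": f"Explain the Odoo documentation topic: {title}",
            "input": "",
            "output": content,
        })
    return records


sample = (
    "title,content,language\n"
    "Invoicing,Create invoices from sales orders.,en\n"
    "Empty page,,en\n"
)
records = rows_to_instruction_records(sample)
print(json.dumps(records, indent=2))
```

The second sample row is dropped because its content is empty, so only one record is written.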

Configuration

Memory Optimization for RTX3070

The training is configured with:

Batch size: 1 (per device)
Gradient accumulation: 4 (effective batch size: 4)
Max sequence length: 2048 tokens
4-bit quantization to save VRAM
Gradient checkpointing enabled
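The relationship between these settings can be written down as a small configuration object. This is a sketch only: the class and field names are illustrative and are not taken from train_model.py, although the values match the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical container for the memory-related settings listed above."""
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 4
    max_seq_length: int = 2048
    load_in_4bit: bool = True
    gradient_checkpointing: bool = True

    @property
    def effective_batch_size(self):
        # Gradients from several small batches are summed before each update.
        return self.per_device_batch_size * self.gradient_accumulation_steps

    @property
    def max_tokens_per_optimizer_step(self):
        # Upper bound, reached only if every sequence is at full length.
        return self.effective_batch_size * self.max_seq_length


config = MemoryConfig()
print(config.effective_batch_size)           # 4
print(config.max_tokens_per_optimizer_step)  # 8192
```

Only one sequence is held in memory per forward pass, which is what keeps peak VRAM low; gradient accumulation recovers the larger effective batch at the cost of more passes per update.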

Training Parameters

Learning rate: 2e-4
Max steps: 100 (increase for production)
Warmup steps: 5
LoRA rank: 16
LoRA alpha: 16
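Two consequences of these values can be worked out directly. In the standard LoRA formulation the adapter update is scaled by alpha / rank, so rank 16 with alpha 16 gives a scale of 1. The learning-rate curve below assumes a linear warmup followed by linear decay to zero; this README does not state which scheduler train_model.py uses, so treat the schedule shape as an assumption. Function names are illustrative.

```python
def lora_scaling(alpha, rank):
    """Scale applied to the low-rank update in standard LoRA."""
    return alpha / rank


def learning_rate_at(step, peak_lr=2e-4, warmup_steps=5, max_steps=100):
    """Learning rate at a given step, assuming linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr (at the end of warmup) to 0 (at max_steps).
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)


print(lora_scaling(alpha=16, rank=16))  # 1.0
for step in (0, 5, 50, 100):
    print(step, learning_rate_at(step))
```

If Max steps is increased for a production run, the decay stretches accordingly while the warmup stays at 5 steps unless it is changed too.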

Troubleshooting

CUDA Out of Memory

If you encounter CUDA OOM errors:

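Reduce the batch size in train_model.py (the default listed under Configuration is already 1)
Increase gradient accumulation steps to keep the effective batch size unchanged
Reduce the max sequence length
Restart your Python session to release GPU memory held by a previous run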
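The first three suggestions can be ordered into a fallback sequence. The sketch below is a hypothetical helper, not something in train_model.py: it halves the batch size while doubling gradient accumulation (which preserves the effective batch size for power-of-two batch sizes), and only then starts halving the sequence length. The `min_seq_length` floor of 512 is an arbitrary assumption.

```python
def oom_fallbacks(batch_size, grad_accum, max_seq_length, min_seq_length=512):
    """Yield progressively lighter (batch_size, grad_accum, max_seq_length) settings."""
    # Trade batch size for accumulation steps first: same effective batch, less VRAM.
    while batch_size > 1:
        batch_size //= 2
        grad_accum *= 2
        yield batch_size, grad_accum, max_seq_length
    # Once the batch size is 1, shorter sequences are the remaining lever.
    while max_seq_length > min_seq_length:
        max_seq_length //= 2
        yield batch_size, grad_accum, max_seq_length


for settings in oom_fallbacks(batch_size=1, grad_accum=4, max_seq_length=2048):
    print(settings)
```

With the defaults from the Configuration section the batch size is already 1, so the only candidates produced are shorter sequence lengths (1024, then 512).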

Data Collection Issues

Check your internet connection
The Odoo website may block rapid requests; the script includes delays between requests to reduce this risk
If scraping the Indonesian docs fails, they may have moved to a different URL
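The idea of delayed requests with retries can be sketched as follows. This is not the code in data_scraper.py; the function names, the User-Agent string, the delay, and the retry count are all assumptions. The `fetch` and `sleep` parameters exist so the loop can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def _default_fetch(url):
    """Download one page as text using only the standard library."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "odoo-docs-research-scraper"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, delay_seconds=1.0, retries=2, fetch=None, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests and backing off on errors."""
    fetch = fetch or _default_fetch
    pages = {}
    for url in urls:
        for attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
                break
            except OSError:
                # URLError, HTTPError and socket timeouts are all OSError subclasses.
                sleep(delay_seconds * (attempt + 1))
        else:
            # Every attempt failed; record the gap instead of aborting the run.
            pages[url] = None
        sleep(delay_seconds)
    return pages
```

Pages that still fail after all retries are stored as `None`, so a missing Indonesian page does not stop the English pages from being collected.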

Training Issues

Ensure CUDA is properly installed
Check that your GPU drivers are up to date
Verify PyTorch CUDA compatibility

Model Usage

After training, you can use the model for Odoo-related questions:

```
from train_model import OdooModelTrainer

trainer = OdooModelTrainer()
trainer.load_model()

# Load your trained model
# trainer.model = ... (load from odoo_model_output)

response = trainer.generate_response("How do I install Odoo?")
print(response)
```

Performance Notes

Training Time: ~30-60 minutes for 100 steps on RTX3070
Memory Usage: ~6-7GB VRAM during training
Data Size: ~20-50MB of documentation data
Model Size: ~4-5GB for the fine-tuned model
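Since the Training Parameters section suggests raising Max steps for production, a rough time estimate can be extrapolated from the 100-step figure above. The sketch assumes training time grows linearly with the number of steps, which ignores fixed costs such as model loading and saving; the function name is illustrative.

```python
def estimate_training_minutes(steps, minutes_per_100_steps=(30, 60)):
    """Linearly extrapolate the (low, high) training time in minutes."""
    low, high = minutes_per_100_steps
    return steps * low / 100, steps * high / 100


print(estimate_training_minutes(1000))  # (300.0, 600.0)
```

By this estimate, 1000 steps would take roughly 5 to 10 hours on the same hardware.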

Contributing

Feel free to submit issues and enhancement requests!

License

This project is open source. Please check individual component licenses for details.

Disclaimer

This project is for educational and research purposes. Ensure compliance with Odoo's terms of service when scraping documentation.