# Dataset Processing from GitHub Repositories
This guide explains how to build and process training datasets from GitHub repositories using the provided tools.
## Prerequisites
Make sure you have installed the required dependencies:
```bash
pip install -r requirements.txt
```
## Using the DatasetProcessor Class
The `DatasetProcessor` class in `src/dataset_processor.py` turns GitHub repositories into training datasets of raw code samples (see the Output Format section below).
### Example Usage
```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
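For a quick sanity check after processing, you can inspect a single sample. The field names below follow the Output Format section at the end of this guide; integer indexing is assumed to be supported by the returned dataset object:
```python
sample = dataset[0]
print(sample["language"], sample["file_path"], sample["line_count"])
print(sample["text"][:200])  # first 200 characters of the file content
```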
## Using the DatasetProcessorSynthetic Class
The `DatasetProcessorSynthetic` class in `src/dataset_processor_synthetic.py` turns GitHub repositories into question-answer (QA) training datasets in ChatML format, with the QA pairs generated by a local AI model served through Ollama.
### Example Usage
```python
from src.dataset_processor_synthetic import DatasetProcessorSynthetic
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessorSynthetic()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
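Internally, the synthetic processor relies on a local Ollama model to turn each source file into a QA conversation. The sketch below is illustrative only: it uses Ollama's public HTTP API, but the prompt wording, model name, and helper function are assumptions rather than the project's actual code (see `src/dataset_processor_synthetic.py` for that):
```python
import requests

def generate_qa_sample(code: str, model: str = "llama3") -> dict:
    """Hypothetical helper: ask a local Ollama model for a QA pair about a code file."""
    prompt = f"Write one question and its answer about the following code:\n\n{code}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["response"]
    # Package the exchange in ChatML-style messages (system, user, assistant)
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }
```
Running a sketch like this requires a local Ollama server (`ollama serve`) with the chosen model already pulled.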
## Saving and Loading Datasets
Both dataset processors support saving and loading datasets to/from disk to avoid reprocessing:
```python
# Save dataset
processor.save_dataset(dataset, "./my_processed_dataset")
# Load dataset
loaded_dataset = processor.load_dataset("./my_processed_dataset")
```
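Continuing from the examples above, a common pattern is to reuse a saved dataset when one exists and only reprocess otherwise. A minimal sketch, assuming the dataset is saved as a directory:
```python
import os

dataset_path = "./my_processed_dataset"

if os.path.isdir(dataset_path):
    # A saved dataset exists: load it instead of reprocessing
    dataset = processor.load_dataset(dataset_path)
else:
    # No saved dataset yet: process the repositories and save the result
    dataset = processor.process_github_repos(repo_urls=repo_urls, config=config)
    processor.save_dataset(dataset, dataset_path)
```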
The main script also supports saving and loading datasets via command-line arguments. The same invocation covers both cases: if a dataset already exists at `--dataset_path`, it is loaded; otherwise the repositories are processed and the result is saved there:
```bash
# First run: process the repositories and save the dataset to ./my_dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset

# Later runs: load the existing dataset from ./my_dataset and train
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset
```
## Using the Example Script
You can run the example script directly:
```bash
python example_dataset_processing.py
```
This will process the example repository and show information about the processed dataset.
## Using in Google Colab
The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:
1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)
## Supported File Types
The `DatasetProcessor` supports the following file types (a filtering sketch follows the list):
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
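For illustration, filtering by these extensions could look like the sketch below; the set and helper names are hypothetical, not the actual identifiers used in `src/dataset_processor.py`:
```python
from pathlib import Path

# Hypothetical extension set mirroring the list above
SUPPORTED_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".cpp", ".hpp", ".c", ".h", ".cs",
    ".php", ".rb", ".go", ".rs", ".swift", ".kt", ".scala", ".sql",
    ".sh", ".yaml", ".yml", ".json", ".xml", ".html", ".css", ".md",
}

def is_supported(path: str) -> bool:
    """Return True when a file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```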
## Configuration
The dataset processing can be configured through the `DatasetConfig` class:
```python
dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]      # Patterns to exclude
)
```
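Filled in with concrete values, a configuration might look like this; the list entries are illustrative, not project defaults:
```python
dataset_config = DatasetConfig(
    min_file_size=10,
    max_file_size=10000,
    supported_languages=["python", "javascript", "go"],       # illustrative
    exclude_patterns=["node_modules/", "*.min.js", ".git/"]   # illustrative
)
```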
## Output Format
The processed dataset contains the following fields for each sample (illustrative samples follow the field lists):
For the standard `DatasetProcessor`:
- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
For the `DatasetProcessorSynthetic`:
- `messages`: List of messages in ChatML format (system, user, assistant)
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
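For reference, one sample from each processor might look like the following; every value is illustrative:
```python
standard_sample = {
    "text": "def add(a, b):\n    return a + b\n",  # raw file content
    "language": "python",
    "file_path": "utils/math.py",  # illustrative path
    "repo_name": "nanoGPT",
    "file_size": 32,
    "line_count": 2,
}

synthetic_sample = {
    "messages": [  # ChatML-style conversation
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "What does the add function do?"},
        {"role": "assistant", "content": "It returns the sum of its two arguments."},
    ],
    "language": "python",
    "file_path": "utils/math.py",
    "repo_name": "nanoGPT",
    "file_size": 32,
    "line_count": 2,
}
```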