# Dataset Processing from GitHub Repositories

This guide explains how to fetch GitHub repositories and process them into training datasets using the provided tools.

## Prerequisites

Make sure you have installed the required dependencies:

```bash
pip install -r requirements.txt
```
## Using the DatasetProcessor Class

The `DatasetProcessor` class in `src/dataset_processor.py` provides comprehensive functionality for processing GitHub repositories into training datasets.

### Example Usage
```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
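
If you want to sanity-check the result before training, you can inspect the fields documented under Output Format below. A minimal sketch, assuming the returned dataset supports integer indexing (true for Hugging Face `Dataset` objects and plain lists alike):

```python
# Peek at one processed sample; the field names are those listed
# in the Output Format section below.
first = dataset[0]
print(first["repo_name"], first["file_path"], first["language"])
print(first["text"][:200])  # preview the first 200 characters
```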
## Using the DatasetProcessorSynthetic Class

The `DatasetProcessorSynthetic` class in `src/dataset_processor_synthetic.py` processes GitHub repositories into training datasets in QA ChatML format, using a locally hosted AI model (via Ollama) to generate the question-and-answer pairs.

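Because the QA pairs are generated through Ollama, the server has to be reachable before processing starts. A minimal readiness check, assuming Ollama's default address of `http://localhost:11434` (adjust if you run it elsewhere):

```python
import urllib.request

# Ollama listens on port 11434 by default; this address is an assumption,
# adjust it if your server is configured differently.
try:
    urllib.request.urlopen("http://localhost:11434", timeout=2)
    print("Ollama server is reachable")
except OSError:
    print("Start Ollama first (for example with `ollama serve`)")
```
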
### Example Usage
```python
from src.dataset_processor_synthetic import DatasetProcessorSynthetic
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessorSynthetic()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
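
As with the standard processor, you can inspect a sample before training; here the interesting field is `messages` (see Output Format below). Assuming integer indexing works on the returned dataset:

```python
# Print the role and a short preview of each message in the first sample.
for message in dataset[0]["messages"]:
    print(f"{message['role']}: {message['content'][:80]}")
```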
## Saving and Loading Datasets

Both dataset processors support saving and loading datasets to/from disk to avoid reprocessing:

```python
# Save dataset
processor.save_dataset(dataset, "./my_processed_dataset")

# Load dataset
loaded_dataset = processor.load_dataset("./my_processed_dataset")
```
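
A common pattern is to load the saved dataset when it exists and only reprocess otherwise. A minimal sketch using the calls shown above; the `os.path.isdir` check assumes `save_dataset` writes a directory (use `os.path.exists` if it writes a single file):

```python
import os

dataset_path = "./my_processed_dataset"

if os.path.isdir(dataset_path):  # assumes save_dataset writes a directory
    dataset = processor.load_dataset(dataset_path)
else:
    dataset = processor.process_github_repos(
        repo_urls=repo_urls,
        config=config,
        github_token=None
    )
    processor.save_dataset(dataset, dataset_path)
```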

The main script also supports saving/loading datasets via command-line arguments:

```bash
# First run: process the repositories and save the dataset to ./my_dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset

# Later runs with the same --dataset_path: load the saved dataset and train without reprocessing
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset
```

## Using the Example Script

You can run the example script directly:

```bash
python example_dataset_processing.py
```

This will process the example repository and show information about the processed dataset.

## Using in Google Colab

The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:

1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)

## Supported File Types

The `DatasetProcessor` supports the following file types:

- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
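
For orientation, the extension-to-language mapping behind this list could be expressed as below. This is a hypothetical sketch, not the actual code in `src/dataset_processor.py`; the dictionary name and language labels are illustrative:

```python
from pathlib import Path

# Hypothetical mapping mirroring the list above; the real implementation
# in DatasetProcessor may use different names and language labels.
EXTENSION_TO_LANGUAGE = {
    ".py": "python", ".js": "javascript", ".ts": "typescript",
    ".java": "java", ".cpp": "cpp", ".hpp": "cpp", ".c": "c", ".h": "c",
    ".cs": "csharp", ".php": "php", ".rb": "ruby", ".go": "go",
    ".rs": "rust", ".swift": "swift", ".kt": "kotlin", ".scala": "scala",
    ".sql": "sql", ".sh": "bash", ".yaml": "yaml", ".yml": "yaml",
    ".json": "json", ".xml": "xml", ".html": "html", ".css": "css",
    ".md": "markdown",
}

def detect_language(path: str) -> str | None:
    """Return the language label for a file, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
```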
## Configuration

The dataset processing can be configured through the `DatasetConfig` class:

```python
from src.config import DatasetConfig

dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]      # Patterns to exclude
)
```
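
For reference, a fully filled-in configuration might look like this. Every value below is an illustrative assumption rather than a documented default of `DatasetConfig`:

```python
# Illustrative values only; none of these are documented defaults.
dataset_config = DatasetConfig(
    min_file_size=10,
    max_file_size=10000,
    supported_languages=["python", "javascript", "markdown"],
    exclude_patterns=["node_modules/", ".git/", "__pycache__/", "*.min.js"]
)
```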
## Output Format

The processed dataset contains the following fields for each sample:

For the standard `DatasetProcessor`:

- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file

For the `DatasetProcessorSynthetic`:

- `messages`: List of messages in ChatML format (system, user, assistant)
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
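
To make the ChatML shape concrete, a single `DatasetProcessorSynthetic` sample could look like the record below. All field values are invented for illustration:

```python
# A hypothetical sample; every value here is invented for illustration.
sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "What does train.py in this repository do?"},
        {"role": "assistant", "content": "It implements the training loop for the model."},
    ],
    "language": "python",
    "file_path": "train.py",
    "repo_name": "nanoGPT",
    "file_size": 1024,
    "line_count": 42,
}
```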