Dataset Processing from GitHub Repositories
This guide explains how to fetch GitHub repositories and process them into training datasets using the provided tools.
Prerequisites
Make sure you have installed the required dependencies:
pip install -r requirements.txt
Using the DatasetProcessor Class
The DatasetProcessor class in src/dataset_processor.py provides comprehensive functionality for processing GitHub repositories into training datasets.
Example Usage
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig
# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)
print(f"Dataset processed successfully with {len(dataset)} samples")
Using the DatasetProcessorSynthetic Class
The DatasetProcessorSynthetic class in src/dataset_processor_synthetic.py provides functionality for processing GitHub repositories into training datasets in QA ChatML format using a local AI model (Ollama).
Example Usage
from src.dataset_processor_synthetic import DatasetProcessorSynthetic
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig
# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessorSynthetic()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)
print(f"Dataset processed successfully with {len(dataset)} samples")
Saving and Loading Datasets
Both dataset processors support saving and loading datasets to/from disk to avoid reprocessing:
# Save dataset
processor.save_dataset(dataset, "./my_processed_dataset")
# Load dataset
loaded_dataset = processor.load_dataset("./my_processed_dataset")
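A common pattern is to load a cached dataset when one exists and process the repositories only otherwise. This is a minimal sketch, assuming save_dataset and load_dataset work with a directory path as shown above:
import os

dataset_path = "./my_processed_dataset"
if os.path.isdir(dataset_path):
    # Reuse the previously processed dataset instead of reprocessing
    dataset = processor.load_dataset(dataset_path)
else:
    # Process from scratch, then cache the result on disk
    dataset = processor.process_github_repos(repo_urls=repo_urls, config=config)
    processor.save_dataset(dataset, dataset_path)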
The main script also supports saving and loading datasets via the --dataset_path command-line argument. The same command covers both cases: the first run processes the repositories and saves the dataset to the given path, while later runs load the existing dataset instead of reprocessing it:
# First run: process the repositories and save the dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset
# Later runs: load the existing dataset and train
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset
Using the Example Script
You can run the example script directly:
python example_dataset_processing.py
This will process the example repository and show information about the processed dataset.
Using in Google Colab
The ai_trainer_t4_colab.ipynb notebook includes sections for processing GitHub repositories:
- Simple repository processing (Section 5)
- Advanced dataset processing (Section 5.1)
Supported File Types
The DatasetProcessor supports the following file types:
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
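For illustration, file selection by extension can be expressed as a simple lookup. The set below mirrors the list above, but the names are illustrative and not the processor's actual internals:
from pathlib import Path

# Illustrative only: the real extension handling lives inside DatasetProcessor
SUPPORTED_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".cpp", ".hpp", ".c", ".h", ".cs",
    ".php", ".rb", ".go", ".rs", ".swift", ".kt", ".scala", ".sql",
    ".sh", ".yaml", ".yml", ".json", ".xml", ".html", ".css", ".md",
}

def is_supported(path: str) -> bool:
    # True if the file extension is one of the supported types listed above
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS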
Configuration
The dataset processing can be configured through the DatasetConfig class:
dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]      # Patterns to exclude
)
Output Format
The processed dataset contains the following fields for each sample:
For the standard DatasetProcessor:
- text: The content of the code file
- language: The programming language detected
- file_path: Relative path to the file within the repository
- repo_name: Name of the repository
- file_size: Size of the file in characters
- line_count: Number of lines in the file
For the DatasetProcessorSynthetic:
- messages: List of messages in ChatML format (system, user, assistant)
- language: The programming language detected
- file_path: Relative path to the file within the repository
- repo_name: Name of the repository
- file_size: Size of the file in characters
- line_count: Number of lines in the file
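For reference, a single synthetic sample might look like the record below. The message contents are invented placeholders, and the role/content key layout is an assumption based on the ChatML format; the top-level fields match the list above:
# Illustrative sample record (contents are placeholders)
{
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "What does this function do?"},
        {"role": "assistant", "content": "It implements ..."}
    ],
    "language": "python",
    "file_path": "model.py",
    "repo_name": "nanoGPT",
    "file_size": 1234,
    "line_count": 42
}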