# Dataset Processing from GitHub Repositories
This guide explains how to fetch and process datasets from GitHub repositories using the provided tools.
## Prerequisites

Make sure you have installed the required dependencies:

```bash
pip install -r requirements.txt
```
## Using the DatasetProcessor Class

The `DatasetProcessor` class in `src/dataset_processor.py` provides comprehensive functionality for processing GitHub repositories into training datasets.

### Example Usage
```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig(),
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None,  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
## Using the Example Script

You can run the example script directly:

```bash
python example_dataset_processing.py
```
This will process the example repository and show information about the processed dataset.
## Using in Google Colab

The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:
- Simple repository processing (Section 5)
- Advanced dataset processing (Section 5.1)
## Supported File Types

The `DatasetProcessor` supports the following file types:
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
## Configuration

The dataset processing can be configured through the `DatasetConfig` class:
```python
dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...],     # Patterns to exclude
)
```
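To apply the custom settings, pass them through `AppConfig` the same way as in the earlier example. A short sketch, assuming `AppConfig` accepts the customized `DatasetConfig` in its `dataset` field:

```python
# Reuse processor and repo_urls from the earlier example
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=dataset_config,  # use the customized dataset settings
    memory=MemoryConfig(),
)

dataset = processor.process_github_repos(repo_urls=repo_urls, config=config)
```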
## Output Format

The processed dataset contains the following fields for each sample:
- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
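These fields make it easy to slice the dataset after processing. For example, here is a hedged sketch that keeps only Python samples; the exact label `"python"` is an assumption about how the language detector tags files:

```python
# Filter for Python files (assumes the detector labels them "python")
python_samples = [s for s in dataset if s["language"] == "python"]
total_lines = sum(s["line_count"] for s in python_samples)
print(f"{len(python_samples)} Python files, {total_lines} lines in total")
```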