# Dataset Processing from GitHub Repositories

This guide explains how to fetch and process datasets from GitHub repositories using the provided tools.

## Prerequisites

Make sure you have installed the required dependencies:

```bash
pip install -r requirements.txt
```

## Using the DatasetProcessor Class

The `DatasetProcessor` class in `src/dataset_processor.py` provides comprehensive functionality for processing GitHub repositories into training datasets.

### Example Usage

```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
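
To process private repositories, pass a GitHub personal access token instead of `None`. A minimal sketch, assuming the token is stored in a `GITHUB_TOKEN` environment variable (the variable name is an illustrative choice, not something the tools require):

```python
import os

# Read a personal access token from the environment rather than
# hard-coding it. os.environ.get returns None when the variable is
# unset, which matches the public-repository default above.
dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=os.environ.get("GITHUB_TOKEN")
)
```
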
## Using the Example Script

You can run the example script directly:

```bash
python example_dataset_processing.py
```

This will process the example repository and show information about the processed dataset.

## Using in Google Colab

The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:

1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)

## Supported File Types

The `DatasetProcessor` supports the following file types:

- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
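
Since each language above is paired with its file extensions, detection presumably amounts to an extension lookup. A minimal sketch of what such a mapping might look like (illustrative only, not the library's actual code):

```python
from pathlib import Path

# Illustrative extension-to-language lookup; the real DatasetProcessor
# may organize its detection differently.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".rs": "rust",
    ".md": "markdown",
    # ... and so on for the remaining extensions listed above
}

def detect_language(file_path: str):
    """Return the language label for a path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(file_path).suffix.lower())

print(detect_language("src/main.rs"))  # -> "rust"
```
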
## Configuration

The dataset processing can be configured through the `DatasetConfig` class:

```python
dataset_config = DatasetConfig(
    min_file_size=10,          # Minimum file size in characters
    max_file_size=10000,       # Maximum file size in characters
    supported_languages=[...], # List of supported programming languages
    exclude_patterns=[...]     # Patterns to exclude
)
```
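
For example, a configuration restricted to small-to-medium Python and Markdown files might look like the following; the concrete values and patterns here are illustrative guesses, not defaults from the library:

```python
# Hypothetical values for illustration; see DatasetConfig in
# src/config.py for the actual defaults and pattern syntax.
dataset_config = DatasetConfig(
    min_file_size=50,
    max_file_size=20000,
    supported_languages=["python", "markdown"],
    exclude_patterns=["node_modules", ".git", "tests"]
)
```
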
## Output Format

The processed dataset contains the following fields for each sample:

- `text`: The content of the code file
- `language`: The detected programming language
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
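
Once processing finishes, individual samples can be inspected directly. A small example using the field names above (the loop itself is just an illustration; adapt it to however you consume the dataset):

```python
# Print a brief summary of the first three samples.
for sample in list(dataset)[:3]:
    print(f"{sample['repo_name']}/{sample['file_path']} "
          f"({sample['language']}, {sample['line_count']} lines, "
          f"{sample['file_size']} chars)")
    print(sample["text"][:200])  # preview the first 200 characters
```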