# Google Maps Review Crawler

A Python-based automation tool that fetches customer reviews from the Google Business Profile API and synchronizes them to a PostgreSQL database.

## Features

- **OAuth 2.0 Authentication**: Connects directly to your Google Business Profile to fetch *all* reviews (bypassing the standard Places API 5-review limit).
- **Multi-Outlet Support**: Automatically queries the `master_outlet` table for active `google_business_id`s and iterates through all your locations.
- **Original Language Extraction**: Strips automated "Translated by Google" annotations to ensure only authentic, original review text is saved.
- **Rolling Window Filter**: Only processes reviews published within the last 3 months (90 days) to reduce API calls.
- **UPSERT Logic**: Safely updates existing database records (e.g., if a customer changes their rating or text) without creating duplicates.
- **Automated Scheduler**: Includes a background daemon script (`schedule_crawler.py`) to run the crawler automatically every hour.
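The rolling-window filter and the UPSERT step can be sketched roughly as follows. This is a minimal illustration, not the actual `crawler.py` code: the column names follow the `google_review` schema documented below, while the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOW_DAYS = 90  # the rolling 3-month window described above

def is_within_window(publish_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a review falls inside the rolling 90-day window."""
    now = now or datetime.now(timezone.utc)
    return publish_time >= now - timedelta(days=WINDOW_DAYS)

# UPSERT keyed on the unique review_id: edited ratings or text update the
# existing row in place, so re-crawling never creates duplicates.
UPSERT_SQL = """
INSERT INTO google_review (review_id, original_text, author_display_name,
                           publish_time, rating, outlet_code, language)
VALUES (%(review_id)s, %(original_text)s, %(author_display_name)s,
        %(publish_time)s, %(rating)s, %(outlet_code)s, %(language)s)
ON CONFLICT (review_id) DO UPDATE SET
    original_text = EXCLUDED.original_text,
    rating        = EXCLUDED.rating,
    updated_at    = NOW();
"""
```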
---
## Prerequisites

- Python 3.9+
- A PostgreSQL database
- A Google Cloud project with the **Google My Business API (v4)** enabled
- OAuth 2.0 client credentials downloaded as `client_secret.json`
---
## 1. Installation & Environment Setup

1. **Clone the repository** (or navigate to the folder).
2. **Create a virtual environment**:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

4. **Configure Database Credentials**:

   Create a `.env` file in the root directory:

   ```env
   DB_HOST=192.169.0.10
   DB_PORT=5432
   DB_NAME=your_db_name
   DB_USER=your_db_user
   DB_PASSWORD=your_db_password
   ```
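How `database.py` consumes these settings isn't shown here. As one possible sketch, assuming the values end up in `os.environ` (e.g. via `python-dotenv`), a libpq-style connection string could be assembled like this:

```python
import os

def build_dsn(env=os.environ) -> str:
    """Assemble a libpq-style connection string from the .env settings."""
    return (
        "host={host} port={port} dbname={name} user={user} password={pw}"
    ).format(
        host=env["DB_HOST"],
        port=env.get("DB_PORT", "5432"),  # default PostgreSQL port
        name=env["DB_NAME"],
        user=env["DB_USER"],
        pw=env["DB_PASSWORD"],
    )

# e.g. psycopg2.connect(build_dsn()) once the .env file has been loaded
```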
---
## 2. Authentication (One-Time Setup)

The Google Business Profile API requires explicit permission to read your reviews.

1. Ensure your `client_secret.json` from the Google Cloud Console is in the project folder.
2. Run the authorization script:

   ```bash
   python authorize.py
   ```

3. A browser window will open. Log in with the Google account that manages your Business Profiles.
4. Once authorized, a `token.json` file will be created. The crawler will automatically use and refresh this token moving forward.
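`authorize.py` likely wraps Google's standard installed-app consent flow. A hedged sketch follows; the scope is an assumption (it is the usual Business Profile management scope), and the file names simply mirror the steps above:

```python
# Scope granting management access to Business Profile data, including reviews.
SCOPES = ["https://www.googleapis.com/auth/business.manage"]

def authorize(client_secret: str = "client_secret.json",
              token_path: str = "token.json"):
    """Run the one-time browser consent flow and cache the resulting token."""
    # Imported lazily; requires `pip install google-auth-oauthlib`.
    from google_auth_oauthlib.flow import InstalledAppFlow

    flow = InstalledAppFlow.from_client_secrets_file(client_secret, SCOPES)
    creds = flow.run_local_server(port=0)  # opens the browser window (step 3)
    with open(token_path, "w") as fh:
        fh.write(creds.to_json())  # step 4: the crawler refreshes this later
    return creds
```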
---
## 3. Database Setup & Mapping

The crawler maps Google locations to your database using `google_business_id`.

1. **Find your Location IDs**:

   Run the helper script to list all active stores managed by your Google account:

   ```bash
   python list_locations.py
   ```

   *Note: this will output the Store Code and the long numeric Location ID.*

2. **Update your Database**:

   Insert the Location ID into the `google_business_id` column of your `master_outlet` table.
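The mapping itself is a single UPDATE per outlet. A sketch, where `popcorn_code` is taken from the schema section below and the parameter values are placeholders:

```python
# Parameterised UPDATE linking one Google location to one outlet row.
# Values are passed separately so the DB driver handles quoting.
MAP_LOCATION_SQL = """
UPDATE master_outlet
SET google_business_id = %(location_id)s
WHERE popcorn_code = %(store_code)s;
"""

def map_location_params(location_id: str, store_code: str) -> dict:
    """Bundle the two identifiers for cursor.execute()."""
    return {"location_id": location_id, "store_code": store_code}
```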
---
## 4. Running the Crawler

### Manual Execution

To run the crawler once immediately:

```bash
python crawler.py
```

### Automated (Hourly Scheduler)

To run the crawler continuously in the background (runs once every hour):

```bash
chmod +x run_hourly.sh
./run_hourly.sh
```

- **Scheduler Logs**: `tail -f scheduler.log` (monitors the hourly heartbeat)
- **Crawler Logs**: `tail -f crawler.log` (monitors the specific reviews being upserted)

To stop the background scheduler:

```bash
pkill -f schedule_crawler.py
```
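Conceptually, `schedule_crawler.py` presumably boils down to a loop along these lines (a sketch only; the real script also emits the heartbeat seen in `scheduler.log`):

```python
import subprocess
import time

INTERVAL_SECONDS = 60 * 60  # one hour between crawler runs

def run_crawler_once() -> int:
    """Invoke the crawler as a subprocess and return its exit code."""
    return subprocess.call(["python", "crawler.py"])

def main() -> None:
    # Endless hourly loop; `pkill -f schedule_crawler.py` ends it.
    while True:
        run_crawler_once()
        time.sleep(INTERVAL_SECONDS)
```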
---
## Database Schema (`google_review`)

Built automatically by `database.py`:

- `id` (SERIAL PRIMARY KEY)
- `review_id` (TEXT UNIQUE)
- `place_id` (TEXT) - *Legacy column, nullable*
- `original_text` (TEXT) - *The clean, untranslated review text*
- `author_display_name` (TEXT)
- `publish_time` (TIMESTAMP)
- `rating` (INTEGER)
- `outlet_code` (VARCHAR) - *Foreign key linked to `master_outlet.popcorn_code`*
- `language` (VARCHAR)
- `created_at` (TIMESTAMP)
- `updated_at` (TIMESTAMP)
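For reference, the column list above corresponds to DDL along these lines. This is a reconstruction from the bullets, not the literal output of `database.py`; constraint details (defaults, NOT NULLs, the foreign-key clause) may differ.

```python
# DDL reconstructed from the column list above; the exact constraints
# emitted by database.py may differ.
CREATE_GOOGLE_REVIEW = """
CREATE TABLE IF NOT EXISTS google_review (
    id                  SERIAL PRIMARY KEY,
    review_id           TEXT UNIQUE,
    place_id            TEXT,                -- legacy, nullable
    original_text       TEXT,
    author_display_name TEXT,
    publish_time        TIMESTAMP,
    rating              INTEGER,
    outlet_code         VARCHAR REFERENCES master_outlet (popcorn_code),
    language            VARCHAR,
    created_at          TIMESTAMP,
    updated_at          TIMESTAMP
);
"""
```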