Google Sites Importer
Minimal importer for the MSD Soft Matter Lab Google Sites page.
Requirements
- Python 3.12+
- Dependencies (install via pip install -r requirements.txt):
  - requests: HTTP client
  - beautifulsoup4: HTML parsing
  - markdownify: HTML to Markdown conversion (optional, falls back to text extraction)
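The optional markdownify dependency and its text fallback could be wired up roughly as follows. This is a minimal sketch, not the script's actual code: the names convert and html_to_text are hypothetical, and the fallback uses the standard library's html.parser rather than beautifulsoup4 so the example stays self-contained.

```python
from html.parser import HTMLParser

try:
    # Optional dependency: use it when it is installed.
    from markdownify import markdownify as to_markdown
except ImportError:
    to_markdown = None


class _TextExtractor(HTMLParser):
    """Collects non-empty text nodes, ignoring all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Plain-text fallback: join text nodes with blank lines."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    return "\n\n".join(extractor.parts)


def convert(html: str) -> str:
    """Convert HTML to Markdown if markdownify is available, else to plain text."""
    if to_markdown is not None:
        return to_markdown(html)
    return html_to_text(html)
```

The fallback assumes scripts and styles have already been stripped from the HTML; otherwise their contents would show up as text.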
Installation
# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Usage
# Import all discovered pages (with 2s delay between requests)
python3 scripts/import_google_site.py
# Import specific pages by slug
python3 scripts/import_google_site.py --pages home about contact
# Force re-import even if cached
python3 scripts/import_google_site.py --force
# Custom delay between requests
python3 scripts/import_google_site.py --delay 1.0
# Show help
python3 scripts/import_google_site.py --help
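For orientation, the flags above could be declared with argparse along these lines. This is a sketch under assumptions: only the flag names and the 2-second default come from this README, while build_parser and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags listed in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Import pages from a Google Sites page."
    )
    parser.add_argument(
        "--pages",
        nargs="+",
        metavar="SLUG",
        default=None,
        help="import only these page slugs (default: all discovered pages)",
    )
    parser.add_argument(
        "--force",
        action="store_true",
        help="re-import even if a cached copy exists",
    )
    parser.add_argument(
        "--delay",
        type=float,
        default=2.0,
        help="seconds to wait between requests",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["--pages", "home", "about", "--delay", "1.0"])
print(args.pages, args.force, args.delay)
```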
Features
- Per-page fetch with caching: Downloads are cached in .cache/google_site/ to avoid repeated requests
- Polite delays: Configurable delay between requests (default: 2 seconds)
- HTML cleaning: Strips scripts, styles, and unwanted elements
- Image downloads: Images are downloaded to assets/img/imported/<slug>/ with root-relative links
- Markdown conversion: Converts HTML to Markdown using markdownify (or plain text fallback)
- YAML front matter: Each page includes title, permalink, source_url, and last_imported timestamp
- Idempotent writes: Only updates files if content has changed
- Page filtering: Import specific pages using the --pages flag
- Force mode: Re-fetch and re-write with the --force flag
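The caching, polite-delay, force, and idempotent-write behaviors can be illustrated with a short sketch. It rests on assumptions: fetch_cached and write_if_changed are hypothetical names, the cache file naming (<slug>.html) is a guess, and the network call is passed in as a function so the example runs without any HTTP request.

```python
import time
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".cache/google_site")  # cache location from the feature list


def fetch_cached(
    slug: str,
    fetch: Callable[[str], str],
    *,
    cache_dir: Path = CACHE_DIR,
    force: bool = False,
    delay: float = 2.0,
) -> str:
    """Return the HTML for a page, reusing the cached copy unless force is set."""
    cache_file = cache_dir / f"{slug}.html"  # hypothetical file naming
    if cache_file.exists() and not force:
        return cache_file.read_text(encoding="utf-8")
    time.sleep(delay)  # polite pause before each real request
    html = fetch(slug)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html


def write_if_changed(path: Path, content: str) -> bool:
    """Write content only when it differs from what is on disk.

    Returns True if the file was written, False if it was already up to date.
    """
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

Skipping the delay on cache hits keeps repeated runs fast while still spacing out requests that actually reach the network.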
Output Structure
pages/
home.md
about.md
contact.md
...
assets/img/imported/
home/
image1.jpg
image2.png
about/
photo.jpg
...
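The layout above implies a simple mapping from a page slug to its output paths. The sketch below shows one way to express it; the helper names are hypothetical, and only the directory names come from this README.

```python
from pathlib import Path, PurePosixPath

PAGES_DIR = Path("pages")
IMAGES_DIR = Path("assets/img/imported")


def page_path(slug: str) -> Path:
    """Markdown file for a page, e.g. pages/home.md."""
    return PAGES_DIR / f"{slug}.md"


def image_path(slug: str, filename: str) -> Path:
    """Local file for a downloaded image, grouped by page slug."""
    return IMAGES_DIR / slug / filename


def image_url(slug: str, filename: str) -> str:
    """Root-relative URL for use in Markdown, always with forward slashes."""
    return "/" + PurePosixPath("assets/img/imported", slug, filename).as_posix()


print(page_path("home"))
print(image_url("home", "image1.jpg"))
```

Building the URL with PurePosixPath keeps forward slashes on every platform, whereas the filesystem paths use the native separator.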
Example Output
---
title: "Research"
permalink: /research/
source_url: https://sites.google.com/view/msdsoftmatter/research
last_imported: 2025-11-12T10:30:00Z
---
# Research
Our research focuses on...
Notes
- Google Sites uses JavaScript for navigation, so automatic page discovery may not find all pages
- Manually specify pages with --pages if needed
- The script is macOS-friendly and uses pathlib for cross-platform compatibility
- Network requests only occur at runtime (no external calls in test environment)
- Root-relative links ensure images work from any URL path