Chen Team

Google Sites Importer

Minimal importer for the MSD Soft Matter Lab Google Sites page.

Requirements

  • Python 3.12+
  • Dependencies (install via pip install -r requirements.txt):
    • requests - HTTP client
    • beautifulsoup4 - HTML parsing
    • markdownify - HTML to Markdown conversion (optional; without it the importer falls back to plain-text extraction)
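
An optional dependency like markdownify is commonly handled with a guarded import. The sketch below shows the general pattern only; it is not taken from scripts/import_google_site.py. The name convert_html is hypothetical, and the fallback here uses the standard library's html.parser as a stand-in for whatever text extraction the script actually performs.

```python
from html.parser import HTMLParser

try:
    # Optional dependency: used when installed.
    from markdownify import markdownify as _to_markdown
except ImportError:
    _to_markdown = None


class _TextExtractor(HTMLParser):
    """Stand-in fallback (an assumption): collects text nodes, drops tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def convert_html(html: str) -> str:
    """Return Markdown if markdownify is available, else plain text."""
    if _to_markdown is not None:
        return _to_markdown(html)
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```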

Installation

# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Usage

# Import all discovered pages (with 2s delay between requests)
python3 scripts/import_google_site.py

# Import specific pages by slug
python3 scripts/import_google_site.py --pages home about contact

# Force re-import even if cached
python3 scripts/import_google_site.py --force

# Custom delay between requests
python3 scripts/import_google_site.py --delay 1.0

# Show help
python3 scripts/import_google_site.py --help
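
For readers who want to see how flags like these fit together, here is a minimal argparse sketch. Only the flag names and the 2-second default come from this README; the function name build_parser, the help strings, and the choice of nargs are assumptions, and the real script may declare its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented under Usage."""
    parser = argparse.ArgumentParser(
        description="Import pages from a Google Sites page."
    )
    parser.add_argument(
        "--pages",
        nargs="+",
        metavar="SLUG",
        default=None,
        help="slugs to import; omit to import all discovered pages",
    )
    parser.add_argument(
        "--force",
        action="store_true",
        help="re-import even if a cached copy exists",
    )
    parser.add_argument(
        "--delay",
        type=float,
        default=2.0,
        help="seconds to wait between requests",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--pages", "home", "about", "--delay", "1.0"])
```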

Features

  • Per-page fetch with caching: Downloads are cached in .cache/google_site/ to avoid repeated requests
  • Polite delays: Configurable delay between requests (default: 2 seconds)
  • HTML cleaning: Strips scripts, styles, and unwanted elements
  • Image downloads: Images are downloaded to assets/img/imported/<slug>/ with root-relative links
  • Markdown conversion: Converts HTML to Markdown using markdownify (or plain text fallback)
  • YAML front matter: Each page includes title, permalink, source_url, and last_imported timestamp
  • Idempotent writes: Files are rewritten only when their content has changed
  • Page filtering: Import specific pages using the --pages flag
  • Force mode: The --force flag re-fetches pages and rewrites files even when a cached copy exists
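
The caching, delay, and idempotent-write behaviour listed above can be sketched as two small helpers. This is an illustration under stated assumptions, not the script's actual code: the function names, the <slug>.html cache file naming, and the injected fetch callable are all hypothetical. Only the .cache/google_site/ directory and the 2-second default are taken from this README.

```python
import time
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(".cache/google_site")  # cache location named in this README


def cached_fetch(
    slug: str,
    fetch: Callable[[str], str],
    *,
    cache_dir: Path = CACHE_DIR,
    force: bool = False,
    delay: float = 2.0,
) -> str:
    """Return HTML for `slug`, calling `fetch` only on a cache miss or when forced."""
    cache_file = cache_dir / f"{slug}.html"  # file naming is an assumption
    if cache_file.exists() and not force:
        return cache_file.read_text(encoding="utf-8")
    time.sleep(delay)  # polite pause before touching the network
    html = fetch(slug)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html


def write_if_changed(path: Path, content: str) -> bool:
    """Write `content` only if it differs from the file; return True if written."""
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

One design point to keep in mind: because each page carries a last_imported timestamp, a byte-for-byte comparison like this would treat every run as a change unless the timestamp is left out of the comparison or reused when the body is unchanged. This README does not say how the script handles that.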

Output Structure

pages/
  home.md
  about.md
  contact.md
  ...

assets/img/imported/
  home/
    image1.jpg
    image2.png
  about/
    photo.jpg
  ...
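
The layout above implies a simple mapping from a page slug to its output locations. The following sketch shows one way to express it with pathlib; the helper names are hypothetical, and the directory names are the ones shown in this section.

```python
from pathlib import Path, PurePosixPath

PAGES_DIR = Path("pages")
IMAGES_DIR = Path("assets/img/imported")


def page_path(slug: str) -> Path:
    """Filesystem path of the Markdown file for a page."""
    return PAGES_DIR / f"{slug}.md"


def image_path(slug: str, filename: str) -> Path:
    """Filesystem path where a downloaded image is stored."""
    return IMAGES_DIR / slug / filename


def image_link(slug: str, filename: str) -> str:
    """Root-relative URL for Markdown; forward slashes on every platform."""
    return "/" + PurePosixPath("assets/img/imported", slug, filename).as_posix()
```

Keeping the URL helper separate from the filesystem helpers avoids backslashes leaking into links when the script runs on Windows.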

Example Output

---
title: "Research"
permalink: /research/
source_url: https://sites.google.com/view/msdsoftmatter/research
last_imported: 2025-11-12T10:30:00Z
---

# Research

Our research focuses on...
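
A front matter block like the one above could be generated along these lines. This is a sketch, not the script's implementation: build_front_matter is a hypothetical name, the /<slug>/ permalink pattern is inferred from the single example shown, and json.dumps is used here simply as a convenient way to produce a double-quoted title.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Base URL taken from the example output in this README.
SITE_URL = "https://sites.google.com/view/msdsoftmatter"


def build_front_matter(
    title: str, slug: str, imported_at: Optional[datetime] = None
) -> str:
    """Return a YAML front matter block for one imported page."""
    when = imported_at or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"permalink: /{slug}/",  # pattern assumed from the example
        f"source_url: {SITE_URL}/{slug}",
        f"last_imported: {stamp}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```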

Notes

  • Google Sites uses JavaScript for navigation, so automatic page discovery may not find all pages
  • If discovery misses pages, list them explicitly with --pages
  • The script runs on macOS and uses pathlib for file paths, which keeps it portable across platforms
  • Network requests occur only when the script is run; the test environment makes no external calls
  • Root-relative links ensure images work from any URL path
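
To see why root-relative links matter, compare how a browser would resolve the two link styles from a nested page. The snippet uses urllib.parse.urljoin from the standard library; the example.org host and the fig.png filename are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://example.org/research/"  # hypothetical deployed page URL

# A plain relative link is resolved against the current page's path.
relative = urljoin(page_url, "assets/img/imported/research/fig.png")

# A root-relative link (leading slash) is resolved against the site root.
root_relative = urljoin(page_url, "/assets/img/imported/research/fig.png")

print(relative)
# https://example.org/research/assets/img/imported/research/fig.png
print(root_relative)
# https://example.org/assets/img/imported/research/fig.png
```

Only the second form points at the assets/ directory regardless of which page includes the image.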