---
project: production-deploy-march-2026
type: session-notes
status: completed
tags:
- production
- docker
- traefik
- ansible
- cloudflare
- gitea
- n8n
created: 2026-03-29
updated: 2026-03-29
path: PBS/Tech/Projects/
---
# Production Deploy - March 29, 2026
## Overview
Deployed infrastructure updates to production Linode via Ansible, including
new container configurations, database schema updates, and Gitea setup.
Used Cloudflare Worker for maintenance mode during deploy.
## Accomplishments
### Cloudflare Worker Maintenance Page
- Created `pbs-maintain` Worker with PBS-branded maintenance page (Sunnie
themed)
- Added IP bypass so Travis can access the site while maintenance page is
active for public
- Worker URL: https://pbs-maintain.tjherbranson.workers.dev
- Toggle on/off by adding/removing a Workers Route for
`plantbasedsoutherner.com/*`
- Worker uses `CF-Connecting-IP` header to check allowed IPs
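
The bypass decision can be sketched outside the Worker runtime. The Worker itself is not written in Python; this is only an illustration of the allowlist check, and the `ALLOWED_IPS` value and the `should_bypass` name are hypothetical (the real allowed IPs are not recorded in these notes).

```python
from ipaddress import ip_address

# Hypothetical allowlist; a documentation-range address stands in for the real one.
ALLOWED_IPS = {"203.0.113.10"}


def should_bypass(headers: dict[str, str], allowed: set[str] = ALLOWED_IPS) -> bool:
    """Return True when the request should skip the maintenance page."""
    # Cloudflare puts the visitor's address in CF-Connecting-IP.
    # A plain dict lookup is case-sensitive, unlike real HTTP header handling.
    raw = headers.get("CF-Connecting-IP", "")
    try:
        client = ip_address(raw.strip())
    except ValueError:
        # Missing or malformed header: show the maintenance page.
        return False
    # Compare parsed addresses so equivalent spellings of one IP still match.
    return any(client == ip_address(entry) for entry in allowed)
```

Parsing the addresses instead of comparing strings means a differently formatted IPv6 address still matches its allowlist entry.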
### n8n Document Pipeline Fix
- Identified race condition: two emails processed simultaneously caused
Gitea ref lock errors
- Root cause: Gitea Contents API creates a commit on every file
create/update — two simultaneous API calls create competing commits on
`main`
- Fix: Added Loop node before Gitea create/update nodes to serialize file
processing
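- Key distinction: serialization comes from looping one item at a time. In
current n8n the Loop node is Loop Over Items, the renamed Split In Batches
node, so the batch size must stay at 1 for each item to finish its Gitea call
before the next one starts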
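
Outside n8n, the same serialization idea can be sketched in Python. This is a minimal sketch, not the pipeline's actual implementation: `GITEA_URL`, `REPO`, and the function names are hypothetical, and the token is assumed to be a Gitea access token. Each call waits for Gitea's response before the next file is sent, so commits on `main` never overlap.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable

# Hypothetical values for illustration; none of these come from the notes.
GITEA_URL = "https://gitea.example.com"
REPO = "owner/repo"


def build_create_request(path: str, text: str, token: str) -> urllib.request.Request:
    """Build a Contents API request that creates one file, i.e. one commit on main."""
    body = {
        # The Contents API expects the file body base64-encoded.
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "message": f"Add {path}",
        "branch": "main",
    }
    return urllib.request.Request(
        url=f"{GITEA_URL}/api/v1/repos/{REPO}/contents/{urllib.parse.quote(path)}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"token {token}",
        },
        method="POST",
    )


def send_request(request: urllib.request.Request) -> int:
    """Send one request and return the HTTP status once the response arrives."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def upload_serially(
    files: Iterable[tuple[str, str]],
    token: str,
    send: Callable[[urllib.request.Request], int] = send_request,
) -> list[str]:
    """Create files one at a time so no two commits race for the branch ref."""
    created = []
    for path, text in files:
        # Blocks until Gitea answers, so the next commit starts from the new ref.
        send(build_create_request(path, text, token))
        created.append(path)
    return created
```

The `send` parameter exists only so the loop can be exercised without a live server. Updating an existing file would use `PUT` with the file's current `sha` instead of `POST`; that case is left out of the sketch.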
### Production Deploy Process
- Cloned the live Linode before making changes, leaving the original untouched
as a rollback point
- Ran Ansible playbook against the clone
- Deployed MySQL schema updates via phpMyAdmin (copy/paste with
`IF NOT EXISTS`)
- Updated DNS in Cloudflare to point to new clone server
- Clone became the new production server; original kept as rollback backup
### Container Fixes During Deploy
- **pbs-api healthcheck**: Replaced curl-based healthcheck with Python
(`urllib.request`) since curl is not installed in the container
- **Missing README.md**: pbs-api build failed because `pyproject.toml`
referenced a README.md that didn't exist — created empty file
- **MySQL memory limits**: Added deploy block with 768M limit and tuning
flags to compose
- **WordPress memory limit**: Added 2000M deploy limit
- **Portainer stale container reference**: Restarted Portainer to clear
cached container IDs from pre-deploy
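
A curl-free healthcheck along the lines of the pbs-api fix might look like the sketch below. It is a sketch under stated assumptions, not the deployed check: the `HEALTH_URL` port and path are hypothetical, since the notes do not record the real endpoint.

```python
import urllib.error
import urllib.request

# Hypothetical endpoint; the notes do not record pbs-api's port or health path.
HEALTH_URL = "http://localhost:8000/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx (HTTPError).
        return False


def main() -> int:
    """Map health to a process exit code: 0 is healthy, 1 is unhealthy."""
    return 0 if is_healthy() else 1

# As a script entry point, finish with: raise SystemExit(main())
```

Docker treats exit code 0 as healthy and 1 as unhealthy, so the script only needs to map the HTTP result onto the exit code.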
### Gitea Production Setup
- Added DNS record for `gitea.plantbasedsoutherner.com` in Cloudflare
- Waited for DNS propagation before Traefik could issue Let's Encrypt cert
- Removed healthcheck from Gitea container (the healthcheck returns 404 before
the setup wizard completes, and Traefik won't route to unhealthy containers)
- Completed setup wizard and created admin user
- Deferred setting `LANDING_PAGE = login` in `app.ini` to a follow-up task
### WordPress Staging Redirect Issue
- After Ansible deploy, site redirected to staging domain
- Root cause: One Traefik router label had the staging URL; WordPress picked
it up and wrote it to the `wp_options` table
- Fix: Updated `siteurl` and `home` in `wp_options` via phpMyAdmin, flushed
Redis, purged Cloudflare cache
- Lesson: WordPress can auto-update `wp_options` URLs based on incoming
request hostname
## Key Learnings
- **Cloudflare Worker IP bypass**: `return fetch(request)` re-fetches
through Cloudflare's network, so WordPress sees a Cloudflare edge IP
instead of your real IP — can trigger Wordfence lockouts
- **Cloudflare caches 301 redirects**: Always purge Cloudflare cache after
fixing redirect issues
- **Gitea API creates implicit commits**: Every file create/update via the
Contents API is a git commit — serialize multiple file operations to avoid
ref lock errors
- **Ansible staging drift**: Fixes applied directly on staging don't make
it back to Ansible automatically — fix on staging to unblock, but
immediately update Ansible too
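- **DNS propagation for Let's Encrypt**: Traefik can't issue certs until
DNS propagates. Once it has, recreate the affected service's container
(`docker compose up -d --force-recreate <service>`, where `<service>` is a
placeholder for the compose service name) to trigger a retry without
restarting all of Traefik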
- **Wildcard DNS records**: `*.staging` in Cloudflare catches all
subdomains under staging automatically — convenient for staging, avoid on
production
- **`IF NOT EXISTS` MySQL warning**: The "less efficient" warning about
index generation during table creation is negligible for small schemas —
keep the safety of `IF NOT EXISTS`
## Still To Do
- [ ] n8n and database configuration on production
- [ ] Set Gitea landing page to login in `app.ini`
- [ ] Configure Gitea email settings (deferred)
- [ ] Add Gitea healthcheck back with wget-based check after setup is stable
- [ ] Delete old Linode backup server after stability verification (~1 week)
- [ ] Continue work on per-container Ansible playbook
- [ ] Update Ansible with any fixes applied directly to production during
this session