Katalyst: Determining root cause of a Katalyst crash

Description

Katalyst suddenly becomes unavailable or crashes.
Users cannot access the web UI or core services; existing sessions are dropped.
There is concern about stability and risk of recurrence, and a clear root cause is required.

Solution

Collect technical evidence immediately after the incident:
1. Application logs
  - Gather Katalyst server logs (and logs from any other Katalyst components, if applicable) covering 30–60 minutes before and after the crash.
  - Note the precise time, including the time zone, at which users first experienced the outage.
2. Infrastructure and OS logs
  - Retrieve system logs from the Katalyst servers (e.g., OS event logs, syslog, journald).
  - Include logs from load balancers, reverse proxies, and any WAF/security devices in front of Katalyst.
3. Database logs
  - Collect Spectrus DB / PostgreSQL logs for the same period and check them for connection errors, out‑of‑memory conditions, and deadlocks.
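
Narrowing each collected log to the incident window makes the later analysis faster. The following is a minimal Python sketch, assuming every relevant log line begins with an ISO 8601 timestamp such as 2024-05-01 09:45:12. The function names and the timestamp layout are illustrative assumptions, not part of Katalyst; adjust the parsing to the format your logs actually use.

```python
from datetime import datetime, timedelta


def parse_timestamp(line):
    """Return the leading ISO 8601 timestamp of a log line, or None.

    Assumption: the first 19 characters look like 'YYYY-MM-DD HH:MM:SS'
    (or with a 'T' separator). Lines without one, such as stack-trace
    continuation lines, yield None.
    """
    try:
        return datetime.fromisoformat(line[:19])
    except ValueError:
        return None


def lines_in_window(lines, crash_time, minutes=60):
    """Yield lines stamped within +/- `minutes` of `crash_time`.

    Lines without a parseable timestamp are skipped, so multi-line
    entries lose their continuation lines in this simple version.
    """
    delta = timedelta(minutes=minutes)
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None and abs(ts - crash_time) <= delta:
            yield line


if __name__ == "__main__":
    crash = datetime(2024, 5, 1, 10, 0, 0)  # example outage time
    sample = [
        "2024-05-01 08:30:00 INFO service started",
        "2024-05-01 09:45:12 WARN slow query",
        "2024-05-01T10:20:00 ERROR service unavailable",
    ]
    for entry in lines_in_window(sample, crash, minutes=60):
        print(entry)
```

All sources must be converted to the same time zone before filtering, otherwise the window will be offset for some of them.
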
Analyze for typical root causes:
- Out‑of‑memory or CPU exhaustion on the Katalyst host.
- DB connection failures or timeouts.
- Proxy/load‑balancer health check failures or misconfiguration.
- External dependency outages (identity provider, materials/optimization services) causing cascading errors.
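
A first pass over the collected logs can be automated by counting lines that match known error signatures for each category above. The sketch below is a rough starting point: the patterns are illustrative examples of messages commonly seen in Linux, PostgreSQL, and proxy logs, not a list of messages that Katalyst is known to emit, and the category names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical signatures per root-cause category. Replace them with
# the messages that actually appear in your environment's logs.
SIGNATURES = {
    "resource_exhaustion": re.compile(
        r"out of memory|oom-killer|OutOfMemoryError", re.IGNORECASE
    ),
    "database": re.compile(
        r"too many connections|connection refused|deadlock detected",
        re.IGNORECASE,
    ),
    "proxy": re.compile(
        r"health check failed|bad gateway|upstream timed out", re.IGNORECASE
    ),
    "external_dependency": re.compile(
        r"identity provider|service unavailable", re.IGNORECASE
    ),
}


def count_signatures(lines):
    """Count how many lines match each category's pattern.

    A line can count toward more than one category if it matches
    several patterns.
    """
    counts = Counter()
    for line in lines:
        for cause, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[cause] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: Out of memory: Killed process 1234 (java)",
        "ERROR:  deadlock detected",
        "502 Bad Gateway",
    ]
    for cause, n in count_signatures(sample).most_common():
        print(f"{cause}: {n}")
```

A high count only points to where to look first. The earliest matching entry before the outage is usually more informative than the most frequent one, because later errors are often consequences of the original failure.
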
Once a probable root cause is identified:
- Implement remediations (e.g., increase resources, tune DB, adjust timeouts, fix proxy routing, correct configuration).
- If needed, schedule a controlled test in a non‑production environment to verify the remediation.
- Document the incident timeline, findings, and corrective actions for future reference.
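
For the incident timeline, it can help to merge the key events from each log source into a single chronological list. The following sketch assumes the events have already been extracted as (timestamp, message) pairs per source; the function names and source labels are illustrative.

```python
from datetime import datetime


def build_timeline(events_by_source):
    """Merge {source: [(timestamp, message), ...]} into one sorted list.

    Assumption: all timestamps are datetime objects in the same time zone.
    """
    merged = [
        (ts, source, message)
        for source, events in events_by_source.items()
        for ts, message in events
    ]
    return sorted(merged, key=lambda event: event[0])


def format_timeline(timeline):
    """Render the timeline as one line per event."""
    return "\n".join(
        f"{ts.isoformat()} [{source}] {message}"
        for ts, source, message in timeline
    )


if __name__ == "__main__":
    events = {
        "database": [(datetime(2024, 5, 1, 9, 58), "connection errors begin")],
        "proxy": [(datetime(2024, 5, 1, 10, 1), "health check failed")],
        "application": [(datetime(2024, 5, 1, 10, 0), "request timeouts")],
    }
    print(format_timeline(build_timeline(events)))
```
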
