Description
- Katalyst suddenly becomes unavailable or crashes.
- Users cannot access the web UI or core services; existing sessions are dropped.
- There is concern about stability and risk of recurrence, and a clear root cause is required.
Solution
- Collect technical evidence immediately after the incident:
  - Application logs
    - Gather Katalyst server logs (and logs from any other Katalyst components, if applicable) covering at least 30–60 minutes before and after the crash.
    - Note the precise time at which users experienced the outage.
  - Infrastructure and OS logs
    - Retrieve system logs from the Katalyst servers (e.g., OS event logs, syslog, journald).
    - Include logs from load balancers, reverse proxies, and any WAF/security devices in front of Katalyst.
  - Database logs
    - Collect Spectrus DB / PostgreSQL logs covering the same period (look for connection errors, out‑of‑memory conditions, deadlocks, etc.).
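The collection window described above can be sketched in Python. This is a minimal illustration, not a Katalyst tool: it assumes each log entry starts with an ISO 8601 timestamp without a time zone offset (the actual Katalyst log format may differ), and the function names `evidence_window` and `lines_in_window` are hypothetical.

```python
from datetime import datetime, timedelta


def evidence_window(outage_time, margin_minutes=60):
    """Return (start, end) covering margin_minutes before and after the outage."""
    margin = timedelta(minutes=margin_minutes)
    return outage_time - margin, outage_time + margin


def lines_in_window(lines, start, end):
    """Yield log lines whose entry timestamp falls within [start, end].

    Assumption: an entry begins with an ISO 8601 timestamp followed by a space.
    Lines without a parseable timestamp (for example stack trace lines) are
    treated as continuations and follow the entry that precedes them.
    """
    keep = False
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            keep = start <= datetime.fromisoformat(stamp) <= end
        except ValueError:
            pass  # continuation line: reuse the decision made for the last entry
        if keep:
            yield line
```

For example, `evidence_window(datetime(2024, 5, 1, 10, 0), 30)` gives a window from 09:30 to 10:30, which can then be passed to `lines_in_window` together with the lines of each collected log file. If the logs carry time zone offsets, the outage time should be given with an offset as well so that the comparison is valid.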
- Analyze for typical root causes:
  - Out‑of‑memory or CPU exhaustion on the Katalyst host.
  - DB connection failures or timeouts.
  - Proxy/load‑balancer health check failures or misconfiguration.
  - External dependency outages (identity provider, materials/optimization services) causing cascading errors.
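A first pass over the collected logs can be automated by counting lines that match known failure signatures. The sketch below is illustrative only: the patterns are assumptions based on common Linux kernel, PostgreSQL, and HTTP proxy messages, the external dependency keywords are placeholders, and none of them are taken from Katalyst documentation. Adjust `SIGNATURES` to the messages your environment actually produces.

```python
import re
from collections import Counter

# Assumed signatures per root-cause category; extend or replace as needed.
SIGNATURES = {
    "memory exhaustion": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "database connection": re.compile(
        r"too many clients|connection refused|connection timed out|deadlock detected",
        re.IGNORECASE,
    ),
    "proxy or health check": re.compile(r"\b50[234]\b|health check", re.IGNORECASE),
    # Placeholder keywords for identity provider or other upstream failures.
    "external dependency": re.compile(r"identity provider|upstream", re.IGNORECASE),
}


def classify(lines):
    """Count, per category, the lines that match that category's pattern.

    A line that matches several categories is counted once in each of them.
    """
    counts = Counter()
    for line in lines:
        for cause, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[cause] += 1
    return counts
```

The counts only point to where to look first. A high count in one category is a lead to verify against the incident time, not proof of the root cause.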
- Once a probable root cause is identified:
  - Implement remediations (e.g., increase resources, tune DB, adjust timeouts, fix proxy routing, correct configuration).
  - If needed, schedule a controlled test in a non‑production environment to validate the remediation.
  - Document the incident timeline, findings, and corrective actions for future reference.
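When documenting the incident, it helps to merge observations from the different log sources into one chronological timeline. The following is a minimal sketch under the assumption that events are recorded by hand or extracted from the logs; the `Event` class and `render_timeline` function are hypothetical names, not part of Katalyst.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    when: datetime  # time of the observation
    source: str     # e.g. "application", "database", "proxy"
    note: str       # short description of what was observed


def render_timeline(events):
    """Return the events as text, one per line, ordered by time."""
    ordered = sorted(events, key=lambda e: e.when)
    return "\n".join(
        f"{e.when:%Y-%m-%d %H:%M:%S}  [{e.source}]  {e.note}" for e in ordered
    )
```

The rendered text can be pasted into the incident record next to the findings and corrective actions.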