Engagement Summary (Read Carefully)
We are hiring a senior DevOps / Backend Reliability consultant for a temporary, high-priority engagement to stabilize our production system and ensure it does not go down.
This is not feature work and not a full-time role.
Your mandate is simple and non-negotiable:
The server must remain stable under normal and abnormal conditions, with clear visibility into failures and fast recovery if anything degrades.
Why We’re Hiring
We have experienced:
Backend crashes
Login failures
Database connection pool exhaustion
Performance degradation under very light usage
Systems becoming less stable after partial fixes
Taken together, these symptoms point to gaps in architecture, configuration, and operational reliability, not isolated bugs.
We need an expert who can diagnose, stabilize, and harden the system correctly, then advise us on ongoing safeguards.
Primary Objective
By the end of this engagement:
The backend does not crash
Resource exhaustion is prevented, not patched
Failures are observable and explainable
The system can recover gracefully without manual intervention
We are confident that onboarding new users will not create instability
Scope of Work
Phase 1 – Root Cause & Diagnosis (Immediate)
Review backend architecture, infra, and deployment setup
Analyze logs, metrics, and recent failure patterns
Identify exact causes of:
DB connection pool exhaustion
Server crashes or lockups
Performance degradation
Validate whether issues stem from:
Application lifecycle management
Database usage patterns
Infrastructure configuration
Concurrency, timeouts, or memory leaks
Phase 2 – Stabilization & Fixes
Implement correct fixes, not workarounds:
Proper DB connection lifecycle handling
Safe connection limits and pooling strategy
Timeouts, retries, and circuit-breaking where appropriate
Server configuration tuned for stability
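As an illustration of what "proper connection lifecycle handling" and "safe connection limits" can mean in a Node.js backend, here is a minimal in-memory sketch. It is not tied to any real database driver: `BoundedPool`, `Conn`, `withConnection`, and the timeout value are hypothetical names chosen for this example. The two ideas it shows are a hard cap on connections with a bounded wait, and a `finally` block that returns the connection even when the query throws.

```typescript
// Hypothetical sketch: a capped pool whose connections are always released.
// "Conn" stands in for a real driver connection.
type Conn = { id: number };

type Waiter = {
  resolve: (conn: Conn) => void;
  timer: ReturnType<typeof setTimeout>;
};

class BoundedPool {
  private idle: Conn[] = [];
  private waiters: Waiter[] = [];
  private created = 0;

  constructor(
    private readonly max: number,              // hard cap on open connections
    private readonly acquireTimeoutMs: number, // fail fast instead of hanging
  ) {}

  get inUse(): number { return this.created - this.idle.length; }
  get waiting(): number { return this.waiters.length; }

  acquire(): Promise<Conn> {
    const reusable = this.idle.pop();
    if (reusable) return Promise.resolve(reusable);
    if (this.created < this.max) {
      this.created += 1;
      return Promise.resolve({ id: this.created });
    }
    // Pool is at its cap: queue the caller, but only for a bounded time.
    return new Promise<Conn>((resolve, reject) => {
      const waiter: Waiter = {
        resolve,
        timer: setTimeout(() => {
          this.waiters = this.waiters.filter((w) => w !== waiter);
          reject(new Error(`acquire timed out after ${this.acquireTimeoutMs} ms`));
        }, this.acquireTimeoutMs),
      };
      this.waiters.push(waiter);
    });
  }

  release(conn: Conn): void {
    const next = this.waiters.shift();
    if (next) {
      clearTimeout(next.timer);
      next.resolve(conn); // hand the connection straight to the next waiter
    } else {
      this.idle.push(conn);
    }
  }

  // The lifecycle guarantee: release runs on success and on failure.
  async withConnection<T>(work: (conn: Conn) => Promise<T>): Promise<T> {
    const conn = await this.acquire();
    try {
      return await work(conn);
    } finally {
      this.release(conn);
    }
  }
}
```

In a real service the same shape would usually wrap the driver's own pool rather than replace it. The point of the sketch is that callers never hold a connection outside `withConnection`, so a thrown error cannot leak one.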
Ensure the system remains stable through:
Restarts
Deployments
Light-to-moderate load
Phase 3 – Reliability & Safeguards
Add or refine:
Monitoring and alerting
Health checks
Error visibility and logging
Define:
What “healthy” looks like
What triggers alerts
How failures should degrade safely
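To make the three definitions above concrete, one option is to express "healthy" as a small pure function over a metrics snapshot, so the same rule drives both the health check endpoint and alerting. The sketch below is only an assumption about what such a rule might look like: the field names (`PoolSnapshot`, `Sample`) and every threshold in `THRESHOLDS` are illustrative and would need to be set from the system's real baseline.

```typescript
// Hypothetical sketch: classify a metrics snapshot as healthy, degraded, or unhealthy.
interface PoolSnapshot { max: number; inUse: number; waiting: number; }
interface Sample {
  pool: PoolSnapshot;
  p95LatencyMs: number;
  errorRate: number; // fraction of failed requests, 0..1
}
type Status = "healthy" | "degraded" | "unhealthy";
interface Verdict { status: Status; reasons: string[]; }

// Illustrative thresholds only; tune them against observed normal behavior.
const THRESHOLDS = {
  poolWarn: 0.8,
  latencyWarnMs: 500,
  errorWarn: 0.01,
  errorCritical: 0.05,
};

function evaluate(sample: Sample): Verdict {
  const critical: string[] = [];
  const warnings: string[] = [];
  const { pool } = sample;
  const utilization = pool.max > 0 ? pool.inUse / pool.max : 1;

  if (utilization >= 1 && pool.waiting > 0) {
    // Exhaustion is reported explicitly so it cannot go unnoticed.
    critical.push(`pool exhausted, ${pool.waiting} callers waiting`);
  } else if (utilization >= THRESHOLDS.poolWarn) {
    warnings.push(`pool utilization at ${(utilization * 100).toFixed(0)}%`);
  }

  if (sample.errorRate >= THRESHOLDS.errorCritical) {
    critical.push(`error rate ${sample.errorRate}`);
  } else if (sample.errorRate >= THRESHOLDS.errorWarn) {
    warnings.push(`error rate ${sample.errorRate}`);
  }

  if (sample.p95LatencyMs >= THRESHOLDS.latencyWarnMs) {
    warnings.push(`p95 latency ${sample.p95LatencyMs} ms`);
  }

  const status: Status =
    critical.length > 0 ? "unhealthy" : warnings.length > 0 ? "degraded" : "healthy";
  return { status, reasons: [...critical, ...warnings] };
}
```

Under this sketch, "degraded" would map to a warning-level alert and "unhealthy" to a page, with `reasons` carried into the alert text so the failure is explainable at a glance.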
Ensure no single failure can cascade into a full outage
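One common way to keep a failing dependency from taking everything else down with it is a circuit breaker: after repeated failures, calls are short-circuited to a fallback for a cooldown period instead of piling up against the broken dependency. The following is a minimal sketch of that pattern, not a prescription for this system. `CircuitBreaker`, its parameters, and the injectable `now` clock are hypothetical, and the version shown is synchronous for brevity; a version guarding real database calls would await the operation.

```typescript
// Hypothetical sketch of a circuit breaker with three states.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number, // consecutive failures before opening
    private readonly cooldownMs: number,       // how long to stay open
    private readonly now: () => number = Date.now, // injectable clock for testing
  ) {}

  currentState(): BreakerState {
    // After the cooldown, allow a single trial call through.
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  call<T>(operation: () => T, fallback: () => T): T {
    if (this.currentState() === "open") {
      return fallback(); // fail fast without touching the dependency
    }
    try {
      const result = operation();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

The fallback is where "degrade safely" gets decided per call site, for example returning cached data or a clear error response instead of letting requests queue until the server locks up.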
Deliverables
Clear written explanation of:
Root causes
Fixes applied
Remaining risks (if any)
Confirmation that:
DB connection pool exhaustion cannot occur silently
Server crashes are prevented or safely handled
Optional: recommendations for long-term reliability best practices
Technical Environment
AWS (EC2 / RDS / related services)
Node.js backend
Relational database (Postgres or MySQL)
Docker / CI-CD pipelines (if applicable)
You do not need to rewrite the system — you need to make it stable and reliable.
Who This Is For
Senior DevOps, SRE, or Backend Infrastructure Engineer
You have:
Fixed real production outages
Solved DB connection pool exhaustion before
Stabilized systems others “patched”
You think in:
Failure modes
Load behavior
Graceful degradation
You can explain why something broke and why it won’t break again
Who This Is NOT For
Junior DevOps engineers
Developers who mainly do features
Anyone who “tunes until it works” without root cause analysis
Anyone uncomfortable owning production stability
Engagement Details
Type: Temporary / Contract / Consulting
Initial time commitment: 5–15 hours
Start: Immediate
Ongoing: Advisory support as needed (optional)
Goal: Production stability and confidence by early next week
How to Apply (Required)
Please include:
A production system you stabilized and what was failing
Your approach to preventing DB connection pool exhaustion
Experience with monitoring and alerting
Availability in the next 48–72 hours
Whether you’re comfortable pairing live via Zoom / screen-share
Final Note
We care far more about systems that don’t break than features that ship fast.