Senior SRE Engineer (Azure, Observability, Healthcare SaaS)
2025/11/10
At AppRecode, we are passionate about building software that solves problems. We count on our DevOps Engineers to empower our users with a rich feature set, high availability, and stellar performance to pursue their missions.
As we expand our customer deployments, we seek an experienced Senior SRE / DevOps Engineer to ensure reliability, observability, and operational excellence. Specifically, we are searching for someone who brings a unique and informed viewpoint, enjoys collaborating with a cross-functional team, helps develop real-world solutions and positive user experiences at every interaction, and has a strong track record of bringing products to life.
Join us, and you will work with great engineers, CEOs, CTOs, and other mature operators in a dynamic but still laid-back team (yes, you can combine the two), using agile development practices and the stack and approaches you choose to get the job done.
About the Position
The client is a US-based SaaS company in the healthcare and financial services domain that has scaled from 15 to 100 employees. Their previous SRE recently departed, creating an urgent need to rebuild the SRE practice, ensure production reliability, and maintain strict HIPAA and PCI compliance.
Duration: Long-term engagement (2+ years)
Start: ASAP, latest January 2026
Company
A US-based healthcare and financial services SaaS company with 50+ developers across 6 development teams. They were recently affected by an Azure outage, which highlighted the need for robust SRE practices, multi-cloud resilience, and operational excellence.
Key Responsibilities
- Rebuild the SRE practice after the previous SRE's departure – ensure reliability and uptime of production healthcare systems, define and monitor SLIs/SLOs/SLAs
- Manage and enhance observability infrastructure: Grafana dashboards and alerting, Loki log aggregation, Azure Application Insights (or evaluate Sentry as an alternative)
- Implement Grafana AI capabilities for automated dashboard creation and enable developers to independently access logs, metrics, and traces
- Incident response, on-call rotation, root cause analysis, post-incident reviews, and error budget management
- Manage Azure PaaS infrastructure: App Services, Azure SQL, API Management, App Gateway, Front Door, Traffic Manager, virtual machines
- HIPAA and PCI compliance implementation and maintenance – security monitoring, key rotation automation (Azure Key Vault), audit logging, compliance reporting
- Design and implement a multi-cloud resilience strategy after the recent Azure outage (evaluate Google Cloud for redundancy, disaster recovery planning and testing)
- Support migration from Azure App Services to Azure Kubernetes Service (AKS) – design reliability patterns for containerized .NET applications
- Service mesh evaluation and implementation (Istio, Linkerd) for compliance encryption requirements
- Automate operational tasks, create runbooks, facilitate retrospectives, establish operational metrics and FinOps considerations
Reporting & Collaboration
Reports to: Client’s DevOps & Infrastructure Manager
Collaborates with: 4 DevOps team members, 6 development teams (~50 developers), CTO
Technologies
Must-have: Site Reliability Engineering (production SRE with proven reliability track record), Microsoft Azure (App Services, Azure SQL, API Management, App Gateway, Front Door, Traffic Manager), Grafana (dashboard creation, alerting, visualization), Loki (log aggregation and analysis), Azure Application Insights, Incident Management (on-call, root cause analysis, post-mortems), HIPAA Compliance, PCI Compliance, Azure Key Vault
Nice-to-have: Azure Kubernetes Service (AKS), Service Mesh (Istio, Linkerd), Sentry, Terraform, Multi-Cloud (GCP, AWS), Docker, Prometheus, Chaos Engineering, .NET troubleshooting, PowerShell/Bash/Python, FinOps, Azure Certifications
Soft Skills
- Upper-Intermediate or higher level of English
- Calm under pressure – handle production incidents with clear thinking
- Excellent communicator who can explain complex issues during incidents
- Proactive mindset, preventing incidents through monitoring and analysis
- Collaborative team player who works with developers on reliability
- Highly self-managed, working independently while coordinating with teams
- Resilient in on-call rotation and high-pressure production situations
Challenges & Milestones
Months 1-3: Rebuild SRE practice, assess observability stack, establish on-call rotation, review Azure outage mitigation, document reliability posture
Months 3-6: Optimize Grafana dashboards, evaluate App Insights vs Sentry, implement Grafana AI, define SLIs/SLOs, design multi-cloud resilience, test disaster recovery
Months 6-12: Design AKS reliability architecture, implement Kubernetes observability, implement service mesh, support AKS migration
Long-Term: Mature SRE practice, multi-cloud resilience operational, AKS migration complete, service mesh operational, incident rates significantly reduced
Working Hours
Central Time (US) preferred but flexible
Full-time (40 hours/week), Remote