Senior SRE / DevOps Engineer (Azure, Observability, Healthcare SaaS)

2025/11/10

At AppRecode, we are passionate about building software that solves problems. We count on our DevOps Engineers to empower our users with a rich feature set, high availability, and stellar performance to pursue their missions.

As we expand our customer deployments, we seek an experienced Senior SRE / DevOps Engineer to ensure reliability, observability, and operational excellence. Specifically, we are looking for someone with a strong track record of bringing products to life who brings a unique and informed viewpoint, enjoys collaborating with a cross-functional team, and helps develop real-world solutions and positive user experiences at every interaction.

Join us, and you will work with great engineers, CEOs, CTOs, and other seasoned operators in a dynamic but still laid-back team (yes, you can combine the two), following agile development practices and using the stack and approaches you choose to get the job done.

About the Position

The client is a US-based SaaS company in the healthcare and financial services domain that has scaled from 15 to 100 employees. Their previous SRE recently departed, creating an urgent need to rebuild the SRE practice, ensure production reliability, and maintain strict HIPAA and PCI compliance.

Duration: Long-term engagement (2+ years)

Start: ASAP, latest January 2026

Company

A US-based healthcare and financial services SaaS company with 50+ developers across 6 development teams. They were recently affected by an Azure outage, which highlighted the need for robust SRE practices, multi-cloud resilience, and operational excellence.

Key Responsibilities

Rebuild the SRE practice after the previous SRE's departure – ensure reliability and uptime of production healthcare systems, define and monitor SLIs/SLOs/SLAs
Manage and enhance observability infrastructure: Grafana dashboards and alerting, Loki log aggregation, Azure Application Insights (or evaluate Sentry as an alternative)
Implement Grafana AI capabilities for automated dashboard creation and enable developers to independently access logs, metrics, and traces
Handle incident response, on-call rotation, root cause analysis, post-incident reviews, and error budget management
Manage Azure PaaS infrastructure: App Services, Azure SQL, API Management, App Gateway, Front Door, Traffic Manager, virtual machines
Implement and maintain HIPAA and PCI compliance – security monitoring, key rotation automation (Azure Key Vault), audit logging, compliance reporting
Design and implement a multi-cloud resilience strategy following the recent Azure outage (evaluate Google Cloud for redundancy, disaster recovery planning and testing)
Support migration from Azure App Services to Azure Kubernetes Service (AKS) – design reliability patterns for containerized .NET applications
Evaluate and implement a service mesh (Istio, Linkerd) to meet compliance encryption requirements
Automate operational tasks, create runbooks, facilitate retrospectives, establish operational metrics, and account for FinOps considerations

Reporting & Collaboration

Reports to: Client’s DevOps & Infrastructure Manager

Collaborates with: 4 DevOps team members, 6 development teams (~50 developers), CTO

Technologies

Must-have: Site Reliability Engineering (production SRE with proven reliability track record), Microsoft Azure (App Services, Azure SQL, API Management, App Gateway, Front Door, Traffic Manager), Grafana (dashboard creation, alerting, visualization), Loki (log aggregation and analysis), Azure Application Insights, Incident Management (on-call, root cause analysis, post-mortems), HIPAA Compliance, PCI Compliance, Azure Key Vault

Nice-to-have: Azure Kubernetes Service (AKS), Service Mesh (Istio, Linkerd), Sentry, Terraform, Multi-Cloud (GCP, AWS), Docker, Prometheus, Chaos Engineering, .NET troubleshooting, PowerShell/Bash/Python, FinOps, Azure Certifications

Soft Skills

Upper-Intermediate English or higher
Calm under pressure – handle production incidents with clear thinking
Excellent communicator, able to explain complex issues clearly during incidents
Proactive mindset – prevent incidents through monitoring and analysis
Collaborative team player who works with developers on reliability
Highly self-managed – work independently while coordinating with teams
Resilient in on-call rotation and high-pressure production situations

Challenges & Milestones

Months 1-3: Rebuild the SRE practice, assess the observability stack, establish on-call rotation, review Azure outage mitigation, document reliability posture

Months 3-6: Optimize Grafana dashboards, evaluate App Insights vs Sentry, implement Grafana AI, define SLIs/SLOs, design multi-cloud resilience, run disaster recovery testing

Months 6-12: Design AKS reliability architecture, implement Kubernetes observability, implement the service mesh, support AKS migration

Long-Term: Mature SRE practice, multi-cloud resilience operational, AKS migration complete, service mesh operational, incident rates significantly reduced

Working Hours

Central Time (US) preferred, but flexible

Full-time (40 hours/week), Remote