Senior Site Reliability Engineer
No salary information
Senior · Full-time · Employment contract (Umowa o pracę) · B2B
#337012 · Added today
Source: theprotocol.it
Tech Stack / Keywords
Terraform, Bicep, Prometheus, Grafana, Python, Bash, GitHub, ArgoCD, GitOps, Kubernetes, Istio, Linkerd, Windows
Company and position
Webellian is a well-established Digital Transformation and IT consulting company committed to creating a positive impact for our clients. We strive to make a meaningful difference in diverse sectors such as insurance, banking, healthcare, retail, and manufacturing. Our passion for cutting-edge and disruptive technologies, as well as our shared values and strong principles, are what motivate us. We are a community of engineers and senior advisors who work with our clients across industries, playing a deep and meaningful role in accelerating and realizing their vision and strategy.
Requirements
- 5+ years of professional experience in site reliability engineering, DevOps, or platform engineering roles.
- Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands-on troubleshooting across production environments.
- Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus.
- Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints.
- Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation.
- Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning.
- Structured incident management experience: on-call ownership, blameless post-incident review, and runbook authorship.
- Scripting and automation proficiency in Python or Bash for toil elimination and operational tooling.
- Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling.
- Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability.
- Clear communicator across global, cross-functional stakeholders; able to translate technical reliability metrics into business impact for non-technical audiences.
- Proactive learner with pragmatic adoption of AI-assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity.
Nice to have:
- Kubernetes certifications: CKA or CKAD.
- Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations.
- Exposure to chaos engineering practices and fault injection testing.
- FinOps experience: reserved capacity planning, resource right-sizing programs, and cost attribution per team or workload.
- Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns.
- Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure-by-default operations are standard practice.
Responsibilities
- Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for hub leadership.
- Serve on the on-call rotation as the infrastructure escalation tier; lead incident response for cluster-level, network-level, and storage failures; chair blameless post-incident reviews.
- Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi-tenancy patterns across spoke namespaces.
- Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability.
- Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components.
- Perform capacity planning and cost-aware resource management: right-size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces.
- Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time.
- Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination.
- Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with the Platform Team on release gates and rollback mechanisms.
- Collaborate with the Run & Change team on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation.
Offer
- Contract under Polish law: B2B or employment contract (Umowa o pracę)
- Benefits such as private medical care, group insurance, Multisport card
- English classes available
- Hybrid work (at least 1 day/week on-site) in Warsaw (Mokotów)
- Opportunity to work with excellent professionals
- High standards of work and focus on the quality of code
- New technologies in use
- Continuous learning and growth
- International team
- Pinball, PlayStation & much more (on-site)
- Sharing the costs of sports activities
- Private medical care
- Life insurance
- Remote work opportunities
- Fruit
- Video games at work
- Coffee / tea
- Drinks
- Parking space for employees
- Leisure zone
- English classes
Health care
Insurance
Sports card
Language courses
Flexible hours