Senior Site Reliability Engineer

No salary information
Senior · Full-time · Employment contract (Umowa o pracę) · B2B
#337012 · Added today
Source: theprotocol.it

Tech Stack / Keywords

Terraform, Bicep, Prometheus, Grafana, Python, Bash, GitHub, ArgoCD, GitOps, Kubernetes, Istio, Linkerd, Windows

Company and position

Webellian is a well-established Digital Transformation and IT consulting company committed to creating a positive impact for our clients. We strive to make a meaningful difference in diverse sectors such as insurance, banking, healthcare, retail, and manufacturing. Our passion for cutting-edge and disruptive technologies, as well as our shared values and strong principles, are what motivate us. We are a community of engineers and senior advisors who work with our clients across industries, playing a deep and meaningful role in accelerating and realizing their vision and strategy.


Requirements

  • 5+ years of professional experience in site reliability engineering, DevOps, or platform engineering roles.
  • Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands-on troubleshooting across production environments.
  • Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus.
  • Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints.
  • Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation.
  • Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning.
  • Structured incident management experience: on-call ownership, blameless post-incident review, and runbook authorship.
  • Scripting and automation proficiency in Python or Bash for toil elimination and operational tooling.
  • Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling.
  • Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability.
  • Clear communicator across global, cross-functional stakeholders; able to translate technical reliability metrics into business impact for non-technical audiences.
  • Proactive learner with pragmatic adoption of AI-assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity.

Nice to have:

  • Kubernetes certifications: CKA or CKAD.
  • Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations.
  • Exposure to chaos engineering practices and fault injection testing.
  • FinOps experience: reserved capacity planning, resource right-sizing programs, and cost attribution per team or workload.
  • Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns.
  • Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure-by-default operations are standard practice.

Responsibilities

  • Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for hub leadership.
  • Serve on the on-call rotation as the infrastructure escalation tier; lead incident response for cluster-level, network-level, and storage failures; chair blameless post-incident reviews.
  • Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi-tenancy patterns across spoke namespaces.
  • Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability.
  • Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components.
  • Perform capacity planning and cost-aware resource management: right-size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces.
  • Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time.
  • Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination.
  • Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with the Platform Team on release gates and rollback mechanisms.
  • Collaborate with the Run & Change team on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation.

Offer

  • Contract under Polish law: B2B or employment contract (Umowa o pracę)
  • Benefits such as private medical care, group insurance, Multisport card
  • English classes available
  • Hybrid work (at least 1 day/week on-site) in Warsaw (Mokotów)
  • Opportunity to work with excellent professionals
  • High standards of work and focus on the quality of code
  • New technologies in use
  • Continuous learning and growth
  • International team
  • Pinball, PlayStation & much more (on-site)
  • Sharing the costs of sports activities
  • Private medical care
  • Life insurance
  • Remote work opportunities
  • Fruit
  • Video games at work
  • Coffee / tea
  • Drinks
  • Parking space for employees
  • Leisure zone
  • English classes
  • Health care
  • Insurance
  • Sports card
  • Language courses
  • Flexible hours