Senior IaaS / Kubernetes Platform Engineer (worldwide remote, work anywhere)

Salary not disclosed

Senior · Full-time

#337219 · Added today

Source: CloudLinux

Tech Stack / Keywords

Kubernetes, Linux, Security, Cloud, Network, Networking, API, ArgoCD

Company and role

CloudLinux is a global remote-first company delivering high-volume, low-cost Linux infrastructure and security products. The Infrastructure Department manages a private cloud and multi-tenant Kubernetes platform powering 500+ VMs across multiple datacenters and serving 20+ engineering teams. The company is transitioning from an OpenNebula-based virtualization platform to a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration.

Requirements

Must have:

5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters.
Production experience with at least 3 of: KubeVirt or similar VM-on-K8s technology, Cluster API (CAPI), Cilium or Calico, Rook-Ceph or other Kubernetes storage operators at scale, ArgoCD or Flux.
Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting.
Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states.
Infrastructure as Code: Terraform/OpenTofu and Ansible at production scale.
Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations.
Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing.
Strong written and verbal English (B2+ minimum).
Proactive mindset: history of identifying problems before incidents and driving improvements.

Nice to have:

Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation).
Crossplane or similar Kubernetes-native infrastructure abstraction.
Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden.
Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor).
SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks.
FinOps: OpenCost, Kubecost, cloud cost optimization.
Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar.
OpenNebula experience.
Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage.
SR-IOV and DPDK experience for hardware-accelerated networking.
Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt.
Grafana LGTM stack (Mimir, Loki, Tempo) for observability.
Compliance environment experience (SOC2, ISO 27001, NIS2).
Go or Python programming for infrastructure tooling.
Experience with Juniper JunOS switch configuration.

What we’re looking for:

Proactive mindset to reduce unplanned work through automation and resilient systems.
Platform-minded approach to replace repetitive support work with scalable solutions.
Ability to work across current and future stack (OpenNebula and Kubernetes-native platform).
Transparent communication including documentation, ADRs, postmortems.
A focus on knowledge sharing and documentation.
Strong English communication skills for documentation and collaboration.

Responsibilities

Kubernetes Platform Engineering (Primary Focus — 40%)

Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.

Storage Engineering (20%)

Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.

Networking (15%)

Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
Maintain IPsec site-to-site connectivity between datacenters.

Reliability and Operations (15%)

Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6-12 month forecasting.
Design and execute chaos engineering experiments to validate system resilience.
Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
Write and maintain runbooks, disaster recovery plan (DRP) documentation, and postmortem analyses.
Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil, then propose and implement solutions.

Infrastructure as Code and Automation (10%)

Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
Write Ansible playbooks for bare-metal server configuration and fleet management.
Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.

What we offer

Focus on professional development.
Interesting and challenging projects.
Fully remote work with flexible working hours.
24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
Compensation for private medical insurance.
Co-working and gym/sports reimbursement.
Budget for education.
Opportunity to receive a reward for the most innovative idea that the company can patent.

Flexible hours

Paid leave

Healthcare

Sports card

Training subsidy

Other information

By applying for this position, you consent to the processing of your personal data as described in the company's Privacy Policy.
