Platform Engineer – AI Supercompute Infrastructure (Networking & Systems) | Cloud & Engineering

No salary information
Senior · Full-time
#320874 · Added a month ago · 35
Source: nofluffjobs.com

Tech Stack / Keywords

Networking, OSI model, BGP, Linux, Kubernetes, Prometheus, Grafana, Zabbix, IaC, Switches, Ethernet

Company and Position

We are a technology consulting firm building and operating next-generation AI supercompute infrastructure for the world's most ambitious organizations. As a repeatedly awarded NVIDIA Consulting Partner of the Year in EMEA, we hold one of the deepest and most recognized NVIDIA partnerships in the region. Our Cloud Engineering teams design and deliver cloud projects for clients in Poland and abroad in areas including cloud development, DevOps, integration, migration, data management, and infrastructure.


Requirements

  • 5-8 years of hands-on experience in infrastructure, networking, or systems engineering.
  • Solid understanding of networking fundamentals: OSI model, switching and routing (BGP, OSPF), VLANs, MTU, and traffic engineering.
  • Working knowledge of high-performance networking technologies: InfiniBand, RDMA, RoCE, or equivalent HPC interconnects.
  • Familiarity with Linux networking: interfaces, bridges, bonding, namespaces, tc/qdisc, and kernel network tuning.
  • Basic hands-on experience with Kubernetes or Slurm.
  • Experience with at least one monitoring stack: Prometheus, Grafana, Zabbix, or similar.
  • Experience with network automation and Infrastructure as Code (IaC).
  • Comfort working directly with physical hardware: servers, switches, cabling, and data centre environments.

Bonus Experience:

  • Exposure to NVIDIA networking products: Mellanox/ConnectX NICs, Quantum InfiniBand switches, Spectrum Ethernet switches.
  • Familiarity with NCCL tuning, collective communication patterns, or distributed training networking requirements.
  • Hands-on experience with DCGM, iperf3, perftest, or ibdiagnet for benchmarking and validation.
  • Exposure to container networking.
  • Experience in consulting or client-facing technical roles.

Responsibilities

Physical Network Configuration:

  • Own the correctness of physical network configuration, including standards, review, and oversight of data centre teams and contractors.

Network Fabric Management:

  • Configure and operate InfiniBand and RoCEv2 fabrics for GPU-to-GPU communication and distributed training.

Network Performance Optimisation:

  • Profile and tune network throughput, latency, and congestion for AI workloads.
  • Work with NCCL, GPUDirect RDMA, NVLink, and NVSwitch.

Cluster Platform Operations:

  • Support deployment, day-2 operations, and troubleshooting of Kubernetes and Slurm clusters.
  • Contribute to OS-level configuration, driver management, and node lifecycle automation.

Monitoring & Observability:

  • Instrument network and cluster health using Prometheus, Grafana, and DCGM Exporter.
  • Build and document dashboards for GPU utilisation, link errors, and fabric saturation.

Offer

  • Flexible working hours.
  • Permanent employment or contract.
  • Medical and health insurance.
  • Multisport and other lifestyle benefits.
  • Language courses.
  • Friendly coworkers and team spirit.
  • Multiple geographies and clients.
  • Work for well-known brands.
  • Exposure to trailblazing business and technology projects.
  • Opportunities to influence business operations.
  • Development path tailored to individual needs.
Deloitte
