Senior HPC Cluster Administrator - Deep Learning Frameworks Infrastructure

221 250 - 383 500 PLN/ mies.Umowa o pracę (brutto)
292 500 - 507 000 PLN/ mies.Umowa o pracę (brutto)
SeniorFull-time·Umowa o pracę
#325155·Dodano 19 dni temu·34
Źródło: NVIDIA
Aplikuj teraz

Tech Stack / Keywords

NetworkingLinuxEmbeddedAnsibleTerraformCI/CDGitLabPrometheus

Firma i stanowisko

NVIDIA's Deep Learning Frameworks (DLFW) Infrastructure team manages large-scale GPU compute clusters supporting deep learning training, inference, and high-performance computing workloads, including DGX/HGX platforms and Grace Blackwell systems.


Wymagania

  • BS/MS in CS, EE, CE, or equivalent hands-on experience
  • 5+ years of experience deploying and administering large-scale HPC or ML training clusters
  • Deep expertise in Linux systems administration at scale
  • Strong scripting and automation skills in Python and/or bash
  • Hands-on experience with Slurm (scheduling, accounting, cgroup configuration)
  • Proficiency with configuration management and IaC (Ansible required; Terraform a plus)
  • Experience with container technologies (Docker, Apptainer/Singularity, Kubernetes)
  • Solid understanding of high-speed networking (InfiniBand, RoCE, RDMA, EFA)
  • Experience with distributed/parallel filesystems and storage architecture
  • Ability to own problems end-to-end and communicate clearly with engineering and management stakeholders

Nice to have:

  • Experience with NVIDIA GPU infrastructure tools (DCGM, nvidia-smi, MIG, NVSwitch diagnostics)
  • Familiarity with cluster management platforms (Colossus, Bright Cluster Manager, xCAT, or similar)
  • Experience supporting large-scale distributed deep learning workloads (PyTorch, JAX, Megatron)
  • Knowledge of BMC/IPMI/Redfish for out-of-band management and hardware lifecycle
  • Background in MLOps tooling or ML platform engineering

Obowiązki

  • Own the full lifecycle of GPU compute clusters including procurement, provisioning, configuration management, monitoring, and deprecation across heterogeneous Linux environments (DGX, HGX, embedded systems)
  • Design and scale storage solutions (NFS, Lustre, WekaFS, or equivalent) with a roadmap for capacity and performance growth
  • Lead automation of infrastructure using IaC tools (Ansible, Terraform) and CI/CD pipelines (GitLab)
  • Manage and optimize job scheduling via Slurm, including fair-share policies, reservation management, and MIG/GPU partitioning strategies
  • Maintain and improve observability stacks (Prometheus, Grafana, DCGM) and drive proactive resolution of hardware and software incidents
  • Collaborate with ML engineers and software teams to tune cluster configuration for large-scale distributed training workloads
  • Evaluate and introduce new technologies such as networking fabrics (InfiniBand, NVLink, EFA/RDMA), storage tiers, and container runtimes to improve performance and reliability
  • Mentor junior engineers and contribute to team-wide engineering standards

Oferta

  • Remote work from Poland
  • Competitive base salary range for Poland: 221,250 PLN - 383,500 PLN for Level 3, and 292,500 PLN - 507,000 PLN for Level 4

Inne informacje

Only candidates located in Poland are considered for this position.

NVIDIA

NVIDIA

30 aktywnych ofert

Zobacz wszystkie oferty
Aplikuj teraz