Senior Solutions Architect, HPC and AI

~24,375 - 42,250 PLN / month
Senior · Full-time
#296232 · Posted 2 months ago · 57
Source: NVIDIA

Tech Stack / Keywords

AI, Python, Profiling, SOLID, CUDA, Cloud, LLM, PyTorch

Company and position

We are seeking a Senior Solutions Architect with strong hands-on experience in deploying, debugging, and optimizing training and inference workloads on large-scale GPU clusters. The role supports customers and partners across Europe in training models on groundbreaking GPU infrastructure, focusing on High Performance Computing and AI workloads.


Requirements

  • BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related engineering field, or equivalent practical experience.
  • 8+ years of experience in accelerated computing technologies at cluster scale, ideally including work with NVIDIA platforms.
  • Strong programming skills in at least one of the following languages: C, C++, or Python.
  • Practical experience identifying and resolving bottlenecks in large-scale training workloads or parallel applications.
  • Hands-on experience in profiling and debugging large parallel applications.
  • Solid understanding of CPU and GPU architectures, CUDA, parallel filesystems, and high-speed interconnects.
  • Experience working with large compute clusters, with an understanding of their internal scheduling and resource management mechanisms (e.g., SLURM or cloud-based clusters).
  • In-depth knowledge of training pipelines and frameworks, including their internal operations and performance characteristics.

Nice to have:

  • Experience debugging training pipelines running on thousands of GPUs in production environments.
  • Hands-on experience with performance profiling and optimization using tools such as Nsight Systems and Nsight Compute.
  • Good understanding of NCCL, MPI, and low-level communication libraries.
  • Ability to debug stability issues across the entire stack: parallel application, training frameworks, runtime libraries, schedulers, and hardware.
  • Solid understanding of the internal workings of LLM frameworks such as PyTorch, Megatron-LM, or NeMo, and their impact on system layers such as CPUs, GPUs, network, and storage.
  • Understanding of inference tools such as vLLM, Dynamo, TensorRT-LLM, Red Hat Inference Server, or SGLang.

Responsibilities

  • Collaborate with NVIDIA’s training framework developers and product teams to stay ahead of the latest features and help partners adopt them effectively.
  • Assist with deploying, debugging, and improving the efficiency of AI workloads on large-scale NVIDIA platforms.
  • Benchmark new framework features, analyze performance, and share actionable insights with customers and internal teams.
  • Work directly with external customers to solve cluster performance and stability issues, identify bottlenecks, and implement effective solutions.
  • Build expertise and guide customers in scaling workloads efficiently and reliably on the latest generation of NVIDIA GPUs.
  • Contribute to Europe’s Sovereign AI initiative by helping customers implement advanced resiliency features within AI training pipelines.

Offer

  • Base salary range for Poland: 292,500 PLN - 507,000 PLN for Level 4, and 375,000 PLN - 650,000 PLN for Level 5.