Oman, Oman

Job Description

Roles & Responsibilities

Lead and mentor a team of highly talented junior ML engineers through:

  • Code reviews, design reviews, and technical direction
  • Enforcement of strong software engineering and ML best practices

Design, deploy, and operate scalable AI systems with a focus on reliability and performance

Lead production deployment of LLMs and multimodal systems (RAG, OCR, voice)

Own model performance end-to-end, combining evaluation, observability, and hardware optimization:

  • Build evaluation pipelines (benchmarks, regression testing, LLM-as-judge)
  • Implement deep observability (tracing, latency, error tracking)
  • Optimize GPU utilization (multi-GPU serving, batching, quantization, memory tuning)
  • Continuously improve throughput, latency, and cost efficiency

Architect and manage GPU infrastructure:

  • Model serving, load balancing, and scaling strategies
  • Hardware-aware deployment and performance tuning

Build and maintain robust MLOps pipelines:

  • Model/version management, CI/CD, automated testing, and rollback strategies
  • Monitoring and feedback loops for continuous improvement

Engage directly with clients and stakeholders to:

  • Gather and clarify business requirements
  • Translate non-technical needs into well-defined technical problems
  • Communicate solutions, trade-offs, and progress through clear documentation, reports, and proposals

Contribute hands-on to system design, implementation, debugging, and production incident resolution
