Overview:
We are seeking a Mid-Level MLOps Engineer to build, operate, and evolve our Kubeflow-based ML platform on Azure. This role focuses on enabling reliable, scalable, and cost-efficient ML workflows by designing CI/CD pipelines, managing Kubernetes-based ML infrastructure, improving platform observability, and supporting MLE and Data Science teams across the model lifecycle.
The ideal candidate is hands-on, comfortable working across infrastructure and ML workflows, and motivated to operationalize best practices in MLOps.
Responsibilities:
Platform & Infrastructure:
- Deploy, configure, and operate Kubeflow components on Azure Kubernetes Service (AKS)
- Support Kubernetes workloads for training, inference, and batch pipelines
- Manage container images, registries, and ML runtime environments
- Assist with Kubeflow and Kubernetes upgrades under senior guidance
CI/CD & Automation:
- Build and maintain CI/CD pipelines for ML workflows and platform services
- Automate model training, validation, and deployment pipelines
- Implement reproducibility and versioning for data, models, and pipelines
Observability & Reliability:
- Implement logging, monitoring, and alerting at the platform level
- Diagnose and resolve workflow, pipeline, and infrastructure failures
- Support SLAs and reliability objectives for ML platforms
Collaboration & Enablement:
- Work closely with ML Engineers (MLEs) and Data Scientists to onboard workflows onto Kubeflow
- Provide best practices, templates, and documentation for ML teams
- Collaborate with Infra and Security teams on access control and compliance needs
Cost Awareness & Optimization:
- Assist with collecting and reporting costs at the Kubeflow namespace or workflow level
- Identify optimization opportunities related to compute usage and scheduling
Qualifications:
- 3–6 years of experience in MLOps, DevOps, or Platform Engineering
- Hands-on experience with Kubeflow, Kubernetes, Terraform (IaC), and containerized ML workloads
- Strong experience with Azure cloud services (AKS, ACR, Storage, Networking, IAM, AD Groups)
- Proficiency in Python and familiarity with ML frameworks (TensorFlow, PyTorch, scikit-learn)
- Experience building CI/CD pipelines (GitHub Actions, Azure DevOps, Argo, etc.)
- Understanding of ML lifecycle management (training, inference, monitoring, retraining)
- Familiarity with observability tools (Prometheus, Grafana, Azure Monitor, Datadog)
- Strong collaboration and communication skills
What makes us different?
- Hybrid work model: a combination of remote work and collaborative office time to enable innovation
- Entrepreneurial environment in a leading international company
- Professional growth possibilities & learning opportunities
- Variety of benefits to support your physical, emotional and financial wellbeing
- Volunteering opportunities to help external communities
About PepsiCo
We believe that culture should be the cornerstone of everything we do at PepsiCo. We are agile, innovative and not afraid of failure. We want our team to come to work every day excited to explore new ways to bring enjoyment, refreshment and fun to the world.
PepsiCo Positive (pep+) is the future of our organization – a strategic end-to-end transformation, with sustainability at the center of how we will create growth and value by operating within planetary boundaries and inspiring positive change for the planet and people.
So, if you’re ready to be a part of a playground for those who think big, we’d love to chat.
We encourage diversity among applicants across gender, age, ethnicity, nationality, sexual orientation, social background, religion or belief, and disability.