Senior DevOps Engineer
  • England, London, City of London
  • Full Time, Permanent
  • £80,000 - £130,000 per annum
Job Description:
Senior DevOps Engineer – AI & Cloud Infrastructure
Type: Permanent / Full-Time (Employment or Contract considered)
Location: Remote or Hybrid
Time Zones: UK, Europe, North America–friendly


The Opportunity
We’re working with a high-growth tech start-up building a next-generation AI cloud platform focused on fast, reliable inference for large language models and other compute-intensive workloads.
The platform combines modern cloud infrastructure, Kubernetes, GPU clusters, and developer-first tooling to support mission-critical AI systems operating across multiple regions.
They’re now looking for a Senior DevOps Engineer to take ownership of the infrastructure backbone — someone who enjoys operating complex systems at scale and working closely with infrastructure, ML, and product engineering teams.


What You’ll Be Doing
AI Cloud Infrastructure
*Design, build, and operate highly available, secure infrastructure supporting AI inference, fine-tuning, and data processing workloads
*Manage multi-region Kubernetes clusters, including GPU-heavy environments
*Implement autoscaling strategies across heterogeneous compute fleets
Infrastructure as Code & Automation
*Own and evolve infrastructure-as-code using tools such as Terraform and Helm
*Automate provisioning of compute, networking, and storage
*Build tooling to spin environments up and down for experiments, benchmarks, and customer deployments
CI/CD & Release Engineering
*Design and maintain CI/CD pipelines across backend, infrastructure, and ML components
*Implement safe deployment strategies (e.g. blue/green, canary releases)
*Partner with engineers to improve build speed, test reliability, and deployment confidence
Observability, Reliability & SRE
*Build and operate observability stacks (metrics, logging, tracing)
*Define and monitor SLOs / SLAs for latency, availability, and reliability
*Create runbooks, playbooks, and incident response processes for production systems
Security & Best Practices
*Implement best practices around secrets management, access control, and network security
*Support secure, multi-tenant environments for enterprise customers
*Help foster a culture of operational excellence, ownership, and reliability


What They’re Looking For
Essential
*4–8+ years’ experience in DevOps, SRE, Platform, or Infrastructure Engineering
*Strong experience running production systems on major cloud platforms (AWS, GCP, or Azure)
*Deep hands-on experience with Kubernetes in production
*Strong Infrastructure-as-Code skills (Terraform or equivalent)
*Proficiency in at least one scripting or programming language (e.g. Python, Go, Bash)
*Solid understanding of networking, security fundamentals, and distributed systems
*Proven experience building reliable, observable, automated systems
Nice to Have
*Experience supporting GPU-based workloads or ML infrastructure
*Exposure to AI / ML platforms, inference systems, or data pipelines
*Familiarity with modern CI/CD tooling and GitOps approaches
*Experience with observability tooling (metrics, logs, tracing)
*Background in cloud platforms, AI infrastructure, or high-scale SaaS environments


Why Join
*Work on core infrastructure powering cutting-edge AI systems
*High impact and ownership over architecture and tooling decisions
*Collaboration with senior engineers across infrastructure, ML, and product
*Competitive compensation, equity, and long-term growth potential
*Flexible remote / hybrid working
Job number 3389161

Company Details:
True North Group