Official source · 100% Remote · CLT

DevOps / Platform Engineer (4631)

Keep Simple · Posted about 1 month ago

Verified on 18/05/2026 · Click to apply.

About the job

Come work for a large global financial and insurance products company! This is your chance! Start a successful career in a renowned company in the international market. Great opportunity! A global insurance and asset management company seeks a responsible, organized, dynamic, and team-oriented person.

Responsibilities and duties

Role Summary

We are seeking a Senior DevOps / Platform Engineer to design, build, and operate the cloud infrastructure, CI/CD pipelines, and developer platform that underpin our AI and digital innovation initiatives. This is a cloud-agnostic role: you will architect infrastructure and platform capabilities that work across AWS, Azure, and GCP, ensuring our engineering teams can build, deploy, and operate AI-powered applications with speed, security, and reliability.

A distinguishing aspect of this role is the MLOps dimension. You will build and maintain the infrastructure for AI/ML model lifecycle management: training environments, model serving, experiment tracking, automated evaluation, and production monitoring. You will ensure that deploying an AI model to production is as reliable, repeatable, and observable as deploying a traditional software service.

Key Responsibilities

CI/CD Pipeline Engineering

Design and maintain end-to-end CI/CD pipelines for all engineering workstreams: application code, infrastructure-as-code, AI/ML models, data pipelines, and automation scripts;
Build multi-stage deployment pipelines with automated testing gates: unit tests, integration tests, security scans (SAST/DAST/SCA), AI model evaluation, and infrastructure validation;
Implement deployment strategies (blue/green, canary, rolling updates, and feature flags) for both traditional services and AI model endpoints;
Design and maintain artifact management: container registries, model registries, package repositories, and versioned infrastructure modules;
Build pipeline observability: deployment frequency tracking, lead time for changes, change failure rate, and mean time to recovery (DORA metrics);
Implement GitOps workflows using ArgoCD, Flux, or equivalent for declarative infrastructure and application deployment.

Cloud Infrastructure (Cloud-Agnostic)

Design and maintain cloud infrastructure across AWS, Azure, and/or GCP, with emphasis on portability and avoiding deep vendor lock-in where practical;
Implement infrastructure-as-code using Terraform (primary), Pulumi, or CloudFormation/Bicep with modular, reusable, and well-tested infrastructure modules;
Design and operate Kubernetes clusters (EKS, AKS, GKE) for containerized workloads, including AI model serving, API services, and batch processing;
Build and manage serverless compute infrastructure (Lambda, Azure Functions, Cloud Functions) for event-driven workflows and lightweight AI inference;
Implement cloud cost optimization: right-sizing, reserved capacity planning, spot/preemptible instance strategies, and automated cost monitoring and alerting;
Design multi-environment strategies (development, staging, production) with proper isolation, data governance, and promotion workflows.

Security & Compliance Infrastructure

Implement security-as-code: infrastructure security policies (Checkov, tfsec, Sentinel), container image scanning (Trivy, Snyk), and runtime security monitoring;
Design and enforce zero-trust networking: service mesh (Istio, Linkerd), network policies, private endpoints, and API gateway security;
Implement secrets management using HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or equivalent;
Build and maintain identity and access management: service accounts, workload identity, least-privilege IAM policies, and RBAC for Kubernetes and cloud resources;
Ensure infrastructure compliance with SOC 2, ISO 27001, GDPR, and industry-specific regulations;
Implement audit logging, security alerting, and automated compliance scanning across all infrastructure.

MLOps & AI Infrastructure

Design and build ML training infrastructure: GPU/TPU compute provisioning, distributed training support, and experiment tracking (MLflow, Weights & Biases);
Build model serving infrastructure: containerized model endpoints, auto-scaling (including GPU-based scaling), A/B testing, and model routing;
Implement model registry and lifecycle management: model versioning, staging, approval workflows, and automated deployment pipelines;
Build AI-specific monitoring: model latency, throughput, error rates, input/output drift detection, and token usage cost tracking;
Design and operate vector database infrastructure for RAG systems: deployment, scaling, backup, and disaster recovery;
Implement LLM gateway/proxy infrastructure: centralized API routing, rate limiting, cost controls, caching, and provider failover.

Reliability & Observability

Design and implement a comprehensive observability stack: metrics (Prometheus/Grafana, Datadog), logs (ELK, Loki, CloudWatch), traces (Jaeger, OpenTelemetry), and AI-specific monitoring;
Build and maintain alerting systems with proper escalation policies, runbooks, and automated remediation where possible;
Implement SLI/SLO frameworks for all production services, including AI model endpoints, with error budget tracking;
Design disaster recovery and business continuity plans: multi-region deployment, data replication, backup strategies, and failover testing;
Build chaos engineering practices: fault injection, game days, and resilience testing for both infrastructure and AI systems;
Maintain incident management processes: on-call rotations, incident response playbooks, and post-incident review facilitation.

Developer Experience & Platform

Build and maintain an Internal Developer Platform (IDP) that enables self-service infrastructure provisioning, environment management, and deployment;
Design developer workflows: local development environments (dev containers, Codespaces), preview environments, and rapid feedback loops;
Build and maintain developer documentation: architecture decision records (ADRs), runbooks, onboarding guides, and platform usage guidelines;
Implement platform abstractions that reduce cognitive load on application developers while maintaining flexibility for power users;
Design and operate shared services: database provisioning, cache infrastructure, message queue clusters, and monitoring stack.

Requirements and qualifications

Required Qualifications / Skills

6+ years of experience in DevOps, SRE, or platform engineering, with at least 2 years supporting AI/ML workloads in production;
Expert-level experience with infrastructure-as-code: Terraform (primary), with exposure to Pulumi, CloudFormation, or Bicep;
Production experience with Kubernetes (EKS, AKS, or GKE): cluster management, Helm charts, operators, auto-scaling, and troubleshooting;
Deep experience with CI/CD pipeline design: GitHub Actions, GitLab CI, Azure DevOps Pipelines, or Jenkins, including multi-stage pipelines with automated quality gates;
Strong cloud infrastructure experience across at least two of AWS, Azure, and GCP, with hands-on skills in networking, compute, storage, identity, and security services;
Proficiency in scripting and automation: Python, Bash, PowerShell, and at least one of Go or TypeScript;
Experience building observability stacks: Prometheus, Grafana, Datadog, ELK, OpenTelemetry, and alerting/on-call systems (PagerDuty, Opsgenie);
Strong understanding of security engineering: secrets management, network security, IAM, container security, and compliance automation;
Experience with GitOps practices and tools: ArgoCD, Flux, or equivalent;
Fluent English, both written and spoken;
Proven experience in international projects, including collaboration with global and multicultural teams;
Strong communication, stakeholder management, and problem-solving skills;
Previous experience mentoring engineers or acting as a technical lead is strongly preferred.

Preferred Qualifications

Hands-on MLOps experience: model serving (vLLM, TensorRT, Triton Inference Server, SageMaker Endpoints, Azure ML), model registries (MLflow, Weights & Biases), and GPU infrastructure management;
Experience building LLM gateway/proxy infrastructure: LiteLLM, AI Gateway, or custom routing layers;
Familiarity with platform engineering tools: Backstage, Port, Humanitec, or custom developer portals;
Experience with service mesh technologies: Istio, Linkerd, or Consul Connect;
Knowledge of FinOps practices: cloud cost management, tagging strategies, showback/chargeback models;
Experience in insurance, financial services, or other regulated industries with strict compliance requirements;
Certifications: CKA/CKAD (Kubernetes), AWS Solutions Architect / DevOps Engineer, Azure DevOps Engineer Expert, HashiCorp Terraform Associate;
Experience with chaos engineering tools: Chaos Monkey, Litmus, Gremlin;
Familiarity with edge/hybrid deployment patterns for AI models;
Experience building and operating data platform infrastructure: Spark clusters, Kafka, Airflow/Prefect deployments.

Base Requirements

DevOps Experience: All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
Infrastructure as Code: Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles, not just DevOps;
Cloud Platforms: Working knowledge of at least one major cloud provider (AWS, Azure, or GCP);
Version Control & Collaboration: Git-based workflows, code review practices, and collaborative development are expected of every team member.

Education

Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field is preferred.

Additional information

Contract type: PJ
Work arrangement: 100% Remote
