Vice President, DevOps Engineer

BlackRock, Inc. · Posted 4 months ago

Location

Gurugram, India

Experience

5–9 years

Required Skills

AWS, Azure, GCP, Ansible, Helm, Kubernetes, Splunk, Python, Bash, Git, transformers, Terraform, CloudFormation, ArgoCD, Istio, cert-manager, HashiCorp Vault, Prometheus, Grafana, AlertManager, GitOps, ELK Stack, Datadog, FinOps, CI/CD, SRE, MLflow, Ray, model serving frameworks, LLM fine-tuning, chatbot implementations, LangChain, Open Policy Agent, Kyverno

About the Role

As a Data Platform Cloud/DevOps Engineer in the Data Engineering team at BlackRock, your role will involve designing, building, and maintaining the cloud-native infrastructure that powers Aladdin's Enterprise Data Platform. You will play a crucial part in enabling data engineers, AI engineers, and application developers by providing scalable, reliable, and cost-efficient infrastructure for data processing, AI/ML workloads, and analytics services.

Key Responsibilities:

• Infrastructure and Cloud Engineering

- Design, deploy, and manage cloud-native infrastructure across AWS, Azure, and private clouds
- Implement Infrastructure as Code (IaC) using Terraform, Ansible, and CloudFormation for repeatable, auditable deployments
- Manage Kubernetes clusters for scalable, reliable, and secure application and data workloads
- Deploy and configure service mesh, HashiCorp Vault, cert-manager, and other Kubernetes-native frameworks
- Design and implement network architectures including VPCs, load balancers, and ingress/egress controls
- Deploy and configure LLM serving platforms and related components, such as MCP/agent orchestrators, chatbots, vector embedding services, and secure API gateways for generative AI applications

• CI/CD and Automation

- Build and maintain CI/CD pipelines using ArgoCD, Azure DevOps, Jenkins, and GitHub Actions
- Implement GitOps workflows for automated, auditable infrastructure and application deployments
- Automate repetitive operational tasks using Python and Bash to improve team efficiency and reduce manual errors
- Develop self-service infrastructure provisioning capabilities for engineering teams
- Maintain version control best practices and collaborative development workflows
- Build and maintain MLOps CI/CD pipelines for automated model deployment to production environments

• Site Reliability Engineering (SRE)

- Implement monitoring, logging, and observability solutions using Prometheus, Grafana, ELK Stack, and Datadog
- Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for data platform services
- Build automated alerting systems to proactively detect infrastructure issues and performance degradation
- Perform capacity planning and performance tuning for production infrastructure
- Conduct reliability analysis and implement preventive measures to improve system uptime
- Collaborate with operational teams on incident escalation and system reliability improvements
- Implement chaos engineering practices to test infrastructure resilience and fault tolerance

• Cloud Cost Optimization and FinOps

- Monitor and optimize cloud infrastructure costs across AWS, Azure, and private cloud environments
- Right-size compute, storage, and networking resources based on utilization metrics and cost-performance analysis
- Develop cost dashboards and reports to provide visibility into infrastructure spending trends
- Collaborate with finance and engineering teams on cloud budget planning and forecasting
- Evaluate and recommend cost-effective architectural alternatives (e.g., spot instances, reserved capacity, serverless options)

Qualifications:

• Cloud and Infrastructure

- Expert-level experience with AWS, Azure, or GCP cloud platforms and services
- Proficiency with Infrastructure as Code tools (Terraform, Ansible, CloudFormation)
- Experience with templating in Helm, ArgoCD, Ansible, and Terraform
- Deep knowledge of Kubernetes (K8s) APIs, controllers, operators, and stateful workloads
- Understanding of the K8s Operator Pattern
- Comfort building on K8s-native frameworks, including service mesh, secrets management, log management, and observability

• CI/CD and Automation

- Hands-on experience with CI/CD platforms
- Proficiency in scripting languages (Python, Bash) for automation and infrastructure tooling
- Experience implementing GitOps principles and workflows
- Version control expertise

• Site Reliability Engineering (SRE)

- Experience implementing monitoring and observability solutions
- Knowledge of SRE principles including SLOs, SLIs, error budgets, and reliability engineering
- Performance tuning and capacity planning experience for production systems
- Experience with chaos engineering and reliability testing

• FinOps and Cost Management

- Experience with cloud cost optimization strategies and FinOps practices
- Ability to analyze infrastructure costs and identify optimization opportunities
- Knowledge of cost allocation, tagging strategies, and chargeback mechanisms
- Familiarity with cloud cost management tools

If you find this role suitable, you will need to submit a copy of your most recent resume and cover letter. BlackRock offers a wide range of benefits, including a strong retirement plan, tuition reimbursement, comprehensive healthcare, and Flexible Time Off (FTO). The hybrid work model at BlackRock aims to foster collaboration…