Job Location: Gurgaon/Gurugram
Responsibilities
- Maintain and enhance the monitoring framework (data collection, alert aggregation, dashboarding) and implement and enhance alerting logic (framework).
- Enable proactive incident alerting and resolution by leveraging knowledge scripts.
- Identify and detect repetitive incidents (stability, reliability) and develop solutions to fix problems.
- Lead and mentor team members.
- Work on technical resolution for incidents and identify technical root cause
- Ensure tool standards are met and exploit tool capabilities to fine-tune product reliability.
- Integrate incident, release, monitoring, alerting tools into overall ecosystem
- Measure and report SLI and MTTx in periodic reviews, analyze deviations, and drive actions to closure.
- Update runbooks with changes to process / tools
- Drive Postmortems to arrive at remedial actions.
- Participate in On-Call Incident Technical Support
- Ensure production release guidelines (entry/exit) and implementation are adhered to for changes to Production.
- Support CI/CD pipeline implementation and its integration with quality and security.
- Help define standards and adopt new technologies.
- Participate in reviews of your own work and lead reviews of colleagues' work.
- Engage with key stakeholders (Business, Markets, Devs, Vendors)
- Ensure team code complies with code quality standards.
- Highlight tech debt and ensure it is addressed in the roadmap.
Required skills
- At least 8 years of overall IT experience, with 5 years in a relevant area (Development / SRE).
- Strong awareness and experience of working with Site Reliability Engineering principles.
- Good understanding of public cloud offerings such as AWS, including components like EC2, IAM, RDS, and CloudWatch.
- Conceptual knowledge around ETL
- Good understanding of Big data (Hadoop / Hive / Spark)
- Hands-on experience with enterprise toolsets such as Grafana, Instana, Prometheus, and the ELK Stack.
- Has exposure to networking concepts (SSH, FTP, TCP/IP, DNS, Load balancing, CDN etc.).
- Has experience in any scripting language (Bash / Python / Perl).
- Experience operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring into your code, tweaking dashboards, defining alerts.
- Knowledge of Agile software development principles including using JIRA.
- Experience in 24/7 high availability production environment.
- Excellent organizational, verbal and written communication skills.
- Aptitude to be a good team player and the desire to learn and implement new technologies.
- Knowledge of ITIL processes.
Nice to Have
- Good experience with CI/CD pipelines, including Bitbucket and Jenkins.
- Knowledge of server-side technologies such as Kubernetes, Node.js, Docker, and Java.
- Experience building REST APIs, API integrations, and web services is preferred.
- Knowledge of messaging and streaming frameworks such as RabbitMQ / Kafka.
- Databases (Redis, RDS, DynamoDB).
- Exposure to languages such as TypeScript and Node.js.
- ITIL V4 Foundation certified.