Description:
We are looking for a Site Reliability Engineer. As a Site Reliability Engineer, you will solve interesting technical challenges by defining, designing, deploying, and troubleshooting key Cloud resilience, security, and performance capabilities. The role involves software engineering, systems engineering, automation, network operations, data engineering, and DevOps. You should be comfortable building complex distributed systems that handle huge amounts of data: collecting metrics, building data pipelines, and running analytics. You will incorporate the ethos of software engineering and apply it to large-scale operational problems. Your primary goals are to create highly reliable and scalable services, platforms, and infrastructure, and to build ultra-scalable software systems that manage the operations of our services. When not working on operations, you will work on software engineering tasks such as the design and development of systems that increase reliability and scalability and reduce operational overhead through automation. You should value simplicity and scale, work comfortably in a collaborative, agile environment, and be excited to learn.
Desired Qualifications:
- 5+ years of experience in a DevOps role with hands-on experience in Ansible, Jenkins, Docker, Kubernetes, and Linux (TCP/IP, DNS, load balancing technologies)
- Ability to lead the development and operation of large-scale distributed services and applications
- Strong grasp of fault-tolerant, highly available, high-throughput, distributed, scalable systems
- Thorough knowledge of working in an operational environment with mission-critical tier one services with associated pager duty
- Deep understanding of service metrics and alarms, gained through the development of dashboards, service KPIs, and alarming systems
- Thorough understanding of Linux operating systems and Linux system administration
- Knowledge of Linux internals, TCP/IP, DNS, Load balancing technologies
- Experience with Oracle DB, SQL, PL/SQL, Oracle Application Express (APEX)
- Ability to monitor systems using SNMP, Syslog, telemetry, and REST APIs
- Experience with orchestration and configuration management tools such as Terraform and Kubernetes
- Exposure to Grafana, Prometheus or another TSDB, Kafka, Elasticsearch, and other distributed platforms
- Linux
○ System Administration
○ User Mgmt
○ Troubleshooting
- DBs
○ Experience with Oracle DB, SQL, PL/SQL, Oracle Application Express (APEX)
- Languages
○ Proficiency in programming and scripting languages such as Java, Python, Perl, and Bash
- Others
○ Knowledge of Scrum & Agile Methodologies
○ Ability to work well in a team and willingness to learn and implement new Cloud technologies as needed
○ Systematic problem-solving approach, strong communication skills, a sense of ownership and drive
○ Excellent organizational, verbal and written communication skills, and global team working skills