Systems Operations Manager V
Strong analytical skills: able to triage an issue and own a problem through resolution; prior helpdesk experience would help.
AWS: applied cloud experience; Dynatrace and Splunk.
Must have excellent communication skills and work well independently: a self-directed self-starter, able to lead and drive ambiguous data to root cause analysis, and also able to work collaboratively.
Video conference interview/technical screen.
Description:
The Ally IT Production Operations team is looking for an experienced technician to serve as a technical lead focused on operational stability, driving IT operations readiness through continuous improvement of our products. This role involves working closely with architecture, development teams, and business partners, coaching junior engineers, and implementing enhanced monitoring and alerting capabilities for our distributed platforms. The role also aids in the triage of major incidents and takes ownership of problem management activities, driving deep root cause analysis and corrective action. The ideal candidate will have significant experience managing production environments. We are looking for a high-energy team player with an innovative mindset who is interested in joining a group of IT professionals dedicated to enhancing IT operations. This position reports to a Manager of Production Operations. Passion for technology and problem solving is a must.
Collaborates with Agile squads/developers, other production operations teams, and business partners, and contributes significantly to developing specifications that resolve problems and address enhancement needs, focusing on logging, monitoring, and metrics for operational readiness.
Uses technical knowledge, creativity, and company practices to drive down the occurrence of incidents through the development of proactive alerting and monitoring.
Provides continuous feedback to development teams on system stability, defect analysis, and system enhancements
Contributes to runbooks and patterns to sustain applications in a production environment
Serves as a mentor to junior IT Engineers
Triages and analyzes runtime problems to isolate root cause and identify a resolution
Participates in technical discussions and drives operations readiness activities with the development teams, third-party service providers, and business partners.
Leads RCA and SWAT investigations
Provides guidance in resolving performance-related issues and recommends technical solutions.
Skills:
Holds BS (preferably MS) in Computer Science or related field.
5+ years of experience in a similar role and good knowledge of triage-related processes
Shows deep knowledge and understanding of enterprise-scale platforms and architectures
Possesses strong analytical and problem-solving skills and exhibits strong leadership skills
Experience coordinating between upstream applications to resolve incidents
Grasps new technologies and can adapt to rapid shifts in priorities
Applied AWS/Cloud experience preferred
Applied experience with as many of the following as possible: Unix and Windows platforms, Java EE, JavaScript, Spring, Spring Boot, REST APIs/microservices, shell scripting, Python, SQL and databases, specifically Oracle
Previous DevOps experience with tools such as Python, Terraform, Jenkins, GitLab, and Docker preferred
Experience with Ansible Tower or other automation tools is a bonus
Experience with Dynatrace, Splunk, or similar monitoring tools for creating dashboards, alerts, and reports
Ability to correlate environment conditions and metrics to application events
Experience debugging problems in on-prem/cloud/hybrid distributed systems