At Ally, you get a startup feel, but experience the benefits of a company that’s worked out the kinks and is fulfilling its purpose. We’re always evolving and see that as a good thing. From owning our work to seeing its impact in the real world, our team is relentless in finding new ways technology can help make experiences better and help people. We are problem solvers, we value diverse thinking, we support one another, and we challenge ourselves to think bigger in the journey to deliver customer-obsessed tech solutions. To read more about what our tech team does, be sure to visit our tech blog at ally.tech.
Are you passionate about ensuring the reliability and scalability of complex systems? Do you thrive on implementing efficient solutions to prevent and resolve incidents? We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our dynamic team.
The Work
Collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems
Monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability
Develop and maintain automated tools and processes to streamline operational tasks and reduce manual interventions
Participate in incident response and post-mortems, contributing to continuous improvement efforts
Work closely with development teams and architects to advocate for reliability best practices during the application development lifecycle
Implement and manage monitoring and alerting systems to provide real-time visibility into system health and performance
Conduct capacity planning and resource optimization to handle growing demands on our infrastructure
Continuously research and evaluate new technologies and practices to enhance the reliability and efficiency of our systems
The Skills You Bring
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
Strong technical documentation and verbal and written communication skills
Ability to collaborate effectively in a team environment and communicate technical concepts to non-technical stakeholders
5+ years' proven experience as a Site Reliability Engineer or in a similar role in a production environment
5+ years' experience with AWS services (ASG, Fargate, Lambda, Aurora, DynamoDB, ALB/NLB)
5+ years' working experience with CI/CD pipelines and developing infrastructure-as-code (GitLab, Terraform, Ansible, etc.)
Strong knowledge of Linux/Unix systems and network protocols
Experience with distributed systems and microservices architecture
Proficiency in programming or scripting languages such as Python, Java, or Bash
Hands-on experience with monitoring and logging tools (Dynatrace, CloudWatch, Prometheus, Grafana, etc.)
Working experience designing observability for enterprise applications
Familiarity with cybersecurity best practices and principles