Job Title: Site Reliability Engineer (SRE) / Infrastructure Operations (Mid-Level)
Role Overview
Responsible for managing day-to-day infrastructure operations, including monitoring, alerting, and driving stability improvements across the environment.
Key Responsibilities
- Monitor overall infrastructure health and system performance
- Track key performance metrics such as CPU, memory, and disk utilization
- Tune alerts to improve signal-to-noise ratio and reduce alert fatigue
- Support disaster recovery (DR) rehearsals and readiness activities
- Maintain and update runbooks, documentation, and operational reports
Required Experience
- 4–6 years of experience in Site Reliability Engineering (SRE) or infrastructure operations
- Hands-on experience with VMware environments
- Experience with monitoring tools such as PRTG, Datadog, or similar platforms
- Strong incident management experience, including response and resolution processes
Core Skills & Competencies
- Solid understanding of infrastructure performance metrics (CPU, memory, disk, etc.)
- Experience with alert tuning and optimization
- Ability to proactively detect and troubleshoot performance issues
- Strong incident management and operational response capabilities
Screening Signals
Look for candidates who:
- Understand VMware CPU Ready thresholds and how high ready time degrades VM performance
- Have hands-on experience tuning alerts to reduce noise
- Can proactively identify and resolve performance bottlenecks
- Demonstrate strong incident management experience in production environments
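For screeners unfamiliar with the CPU Ready signal, the sketch below shows the kind of reasoning a strong candidate should be able to explain. It converts a vCenter CPU Ready summation value (milliseconds) into a percentage of the sampling interval and compares it with alert thresholds. The function names are hypothetical, the 20-second interval assumes real-time performance charts, and the 5% and 10% per-vCPU thresholds are a commonly cited rule of thumb, not values stated in this posting.

```python
# Illustrative sketch only: function names and thresholds are assumptions.

def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into an average percent per vCPU.

    interval_s defaults to 20 s, the sampling interval of vCenter
    real-time charts (assumed here); historical rollups use longer intervals.
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    # Share of the interval the vCPUs spent waiting to be scheduled.
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


def classify_cpu_ready(percent: float, warn: float = 5.0, crit: float = 10.0) -> str:
    """Map a CPU Ready percentage to a severity using assumed thresholds."""
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # A 4-vCPU VM reporting 4400 ms of ready time in one 20 s sample.
    pct = cpu_ready_percent(4400, interval_s=20.0, vcpus=4)
    print(f"{pct:.1f}% -> {classify_cpu_ready(pct)}")
```

A candidate who can walk through this conversion, and explain why sustained ready time points to host CPU contention or oversized VMs, most likely has the hands-on VMware experience listed above.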
Company: Insight Global
Location: Rio de Janeiro, State of Rio de Janeiro