Job Title: Site Reliability Engineer (SRE) / Infrastructure Operations (Mid-Level)
Role Overview
Responsible for managing day-to-day infrastructure operations, including monitoring and alerting, and for driving stability improvements across the environment.
Key Responsibilities
Monitor overall infrastructure health and system performance
Track key performance metrics such as CPU, memory, and disk utilization
Tune alerts to improve signal-to-noise ratio and reduce alert fatigue
Support disaster recovery (DR) rehearsals and readiness activities
Maintain and update runbooks, documentation, and operational reports
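Illustration (alert tuning): one common way to improve signal-to-noise ratio is to alert only when a metric stays above its threshold for several consecutive samples, rather than on every single breach. The sketch below is a minimal example under assumed values. The function name, the 90% threshold, and the three-sample window are hypothetical and are not taken from PRTG, Datadog, or any specific environment.

```python
def sustained_breaches(samples, threshold, min_consecutive):
    """Return the sample indices at which a sustained-breach alert would fire.

    An alert fires once per streak, at the moment the metric has exceeded
    `threshold` for `min_consecutive` samples in a row.
    """
    fired = []
    streak = 0
    for i, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == min_consecutive:
                fired.append(i)
        else:
            streak = 0  # any healthy sample resets the streak
    return fired


# Hypothetical CPU utilization samples (percent), one per polling interval.
cpu_percent = [50, 95, 60, 92, 93, 94, 96, 70, 91]

# Naive rule: alert on every sample above 90%.
naive = [i for i, v in enumerate(cpu_percent) if v > 90]

# Tuned rule: alert only after three consecutive samples above 90%.
tuned = sustained_breaches(cpu_percent, threshold=90, min_consecutive=3)

print(len(naive), "naive alerts;", len(tuned), "tuned alert(s)")
```

With these sample values the naive rule fires on six samples, while the tuned rule fires once, at the point where the breach has actually persisted.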
Required Experience
4–6 years of experience in Site Reliability Engineering (SRE) or infrastructure operations
Hands-on experience with VMware environments
Experience with monitoring tools such as PRTG, Datadog, or similar platforms
Strong incident management experience, including response and resolution processes
Core Skills & Competencies
Solid understanding of infrastructure performance metrics (CPU, memory, disk, etc.)
Experience with alert tuning and optimization
Ability to proactively detect and troubleshoot performance issues
Strong incident management and operational response capabilities
Screening Signals
Look for candidates who:
Understand CPU Ready thresholds and their impact on performance
Have hands-on experience tuning alerts to reduce noise
Can proactively identify and resolve performance bottlenecks
Demonstrate strong incident management experience in production environments
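Illustration (CPU Ready): a candidate who understands CPU Ready should be able to convert the raw vSphere "ready" summation value (milliseconds) into a percentage before comparing it to a threshold. The sketch below uses the commonly documented conversion, assuming the 20-second real-time chart interval. The 5% per-vCPU warning level is a widely used rule of thumb and an assumption here, not a figure from this posting. The function and constant names are hypothetical.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU Ready summation value (ms) to a per-vCPU percentage.

    ready_ms   -- summed ready time over the sampling interval, in ms
    interval_s -- chart sampling interval in seconds (20 s assumed, real-time)
    vcpus      -- number of vCPUs the summation was aggregated across
    """
    return ready_ms / (interval_s * 1000 * vcpus) * 100


# Assumed rule-of-thumb warning level per vCPU; tune for the environment.
WARN_PERCENT = 5.0

# Hypothetical reading: 4000 ms of ready time across a 2-vCPU VM in 20 s.
pct = cpu_ready_percent(4000, interval_s=20, vcpus=2)
print(f"CPU Ready: {pct:.1f}% per vCPU, warn={pct >= WARN_PERCENT}")
```

The same raw millisecond value means very different things at different chart intervals, which is why the interval is an explicit parameter rather than a hidden constant.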
Site Reliability Engineer
INSIGHT GLOBAL
Primavera do Leste, Mato Grosso