Job Title: Site Reliability Engineer (SRE) / Infrastructure Operations, Mid-Level

Role Overview

Responsible for managing day-to-day infrastructure operations, including monitoring, alerting, and driving stability improvements across the environment.

Key Responsibilities

- Monitor overall infrastructure health and system performance
- Track key performance metrics such as CPU, memory, and disk utilization
- Tune alerts to improve signal-to-noise ratio and reduce alert fatigue
- Support disaster recovery (DR) rehearsals and readiness activities
- Maintain and update runbooks, documentation, and operational reports

Required Experience

- 4–6 years of experience in Site Reliability Engineering (SRE) or infrastructure operations
- Hands-on experience with VMware environments
- Experience with monitoring tools such as PRTG, Datadog, or similar platforms
- Strong incident management experience, including response and resolution processes

Core Skills & Competencies

- Solid understanding of infrastructure performance metrics (CPU, memory, disk, etc.)
- Experience with alert tuning and optimization
- Ability to proactively detect and troubleshoot performance issues
- Strong incident management and operational response capabilities

Screening Signals

Look for candidates who:

- Understand CPU Ready thresholds and their impact on performance
- Have hands-on experience tuning alerts to reduce noise
- Can proactively identify and resolve performance bottlenecks
- Demonstrate strong incident management experience in production environments

Site Reliability Engineer
INSIGHT GLOBAL
Conselheiro Lafaiete, Minas Gerais