SEGAS-00011 Infrastructure utilisation monitoring

Last updated: 20 September 2023
Relates to (tags): Observability, Monitoring, Infrastructure, SRE

Monitoring infrastructure utilisation increases the reliability and performance of Home Office services by improving:

Automation of infrastructure scaling
Predictability of workloads
Trend analysis and capacity planning
Cost optimisation
Detecting, identifying and remediating issues
Assuring the reliability of services

Monitoring infrastructure utilisation without also monitoring other signals of service performance is not enough to ensure a high-quality service. Teams should look to our patterns for monitoring (for example, Monitoring-as-code) to meet this standard and to complement other service monitoring.

Requirements

Infrastructure MUST be observable relative to defined service level expectations
CPU utilisation MUST be observable
Memory utilisation MUST be observable
Disk utilisation MUST be observable
Network utilisation MUST be observable
Historical infrastructure monitoring metrics MUST be retained for analysis

Infrastructure MUST be observable relative to defined service level expectations

Infrastructure utilisation should be baselined so that Service Level Objectives (SLOs) can be defined for infrastructure measures. These SLOs can then act as triggers for automated, proactive measures.
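As an illustration of baselining, the sketch below derives an alert threshold from historical utilisation samples and checks a new reading against it. The function names, the 95th percentile and the 20% headroom are assumptions chosen for the example, not values set by this standard.

```python
from statistics import quantiles


def baseline_threshold(samples, percentile=95, headroom=1.2, ceiling=100.0):
    """Return a threshold: the chosen percentile of past samples plus headroom.

    The percentile (1 to 99) and headroom are illustrative assumptions;
    each team would pick values that match its own SLOs.
    """
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cut_points = quantiles(samples, n=100, method="inclusive")
    baseline = cut_points[percentile - 1]
    # Utilisation percentages cannot exceed the ceiling.
    return min(baseline * headroom, ceiling)


def breaches(current, threshold):
    """True when the current reading exceeds the baselined threshold."""
    return current > threshold


# Hypothetical CPU utilisation history, in percent.
history = [35.0, 40.0, 42.0, 38.0, 45.0, 50.0, 47.0, 41.0, 39.0, 44.0]
threshold = baseline_threshold(history)
```

A reading that breaches the threshold could then feed an alert or an automated scaling action.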

CPU utilisation MUST be observable

CPU utilisation by applications, services, systems or pods is to be monitored so that effective measures such as scaling out can be triggered in periods of saturation.
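One way to observe CPU utilisation on a Linux host is to compare two readings of the aggregate "cpu" line in /proc/stat. The sketch below is a minimal example under that assumption; the function names are hypothetical, and most teams would rely on an existing monitoring agent rather than code like this.

```python
def parse_proc_stat_cpu_line(line):
    """Parse the aggregate 'cpu' line of Linux /proc/stat into (busy, total) ticks."""
    fields = [int(value) for value in line.split()[1:]]
    # Only the first eight fields are summed (user, nice, system, idle,
    # iowait, irq, softirq, steal); guest time is already counted in user/nice.
    core = fields[:8]
    idle = core[3] + (core[4] if len(core) > 4 else 0)  # idle + iowait
    total = sum(core)
    return total - idle, total


def cpu_utilisation_percent(previous, current):
    """Utilisation between two (busy, total) readings, as a percentage."""
    busy_delta = current[0] - previous[0]
    total_delta = current[1] - previous[1]
    if total_delta <= 0:
        return 0.0  # no time elapsed between readings
    return 100.0 * busy_delta / total_delta


# Two hypothetical readings taken some seconds apart.
first = parse_proc_stat_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
second = parse_proc_stat_cpu_line("cpu  200 0 100 1600 100 0 0 0 0 0")
utilisation = cpu_utilisation_percent(first, second)
```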

Memory utilisation MUST be observable

Memory utilisation by applications, services, systems or pods is to be monitored so that effective measures such as scaling out can be triggered in periods of saturation.
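As a minimal sketch, memory utilisation on a Linux host can be derived from the MemTotal and MemAvailable fields of /proc/meminfo. The function name is hypothetical, and the sample text stands in for the real file contents.

```python
def memory_utilisation_percent(meminfo_text):
    """Compute used memory as a percentage from /proc/meminfo style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical excerpt of /proc/meminfo.
sample = (
    "MemTotal:        8000000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:    2000000 kB\n"
)
utilisation = memory_utilisation_percent(sample)
```

On a real host the text would come from reading /proc/meminfo directly.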

Disk utilisation MUST be observable

Disk utilisation by applications, services, systems or pods is to be monitored so that effective measures such as scaling out can be triggered in periods of saturation.
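For disk capacity, a simple reading can be taken with the Python standard library, as sketched below. The function name and default path are assumptions for the example. This covers space used only; disk throughput and I/O wait would need other sources.

```python
import shutil


def disk_utilisation_percent(path="/"):
    """Return used space on the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no size
    return 100.0 * usage.used / usage.total


# Utilisation of the filesystem holding the current working directory.
current = disk_utilisation_percent(".")
```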

Network utilisation MUST be observable

Network utilisation by applications, services, systems or pods is to be monitored so that effective measures such as scaling out can be triggered in periods of saturation.
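Network utilisation is commonly expressed as throughput relative to link capacity. The sketch below assumes two readings of an interface byte counter and a known link speed; the function and parameter names are hypothetical.

```python
def network_utilisation_percent(bytes_before, bytes_after, interval_seconds,
                                link_capacity_bits_per_second):
    """Throughput over the interval as a percentage of link capacity."""
    if interval_seconds <= 0 or link_capacity_bits_per_second <= 0:
        raise ValueError("interval and link capacity must be positive")
    bits_transferred = (bytes_after - bytes_before) * 8
    throughput = bits_transferred / interval_seconds  # bits per second
    return 100.0 * throughput / link_capacity_bits_per_second


# Hypothetical example: 125 MB transferred in 10 seconds on a 1 Gbit/s link.
utilisation = network_utilisation_percent(0, 125_000_000, 10, 1_000_000_000)
```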

Historical infrastructure monitoring metrics MUST be retained for analysis

To allow for trend analysis and capacity planning, infrastructure monitoring metrics must be retained for a time period appropriate to the usage profile of the service.
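The sketch below illustrates applying a retention period to stored samples. The 400 day window is an assumption for a hypothetical service with an annual peak, where more than a year of history is needed to compare one peak with the next; it is not a value set by this standard.

```python
from datetime import datetime, timedelta, timezone


def prune_expired(samples, retention, now):
    """Keep only (timestamp, value) samples that fall within the retention window."""
    cutoff = now - retention
    return [(ts, value) for ts, value in samples if ts >= cutoff]


# Hypothetical retention period and sample data.
retention = timedelta(days=400)
now = datetime(2023, 9, 20, tzinfo=timezone.utc)
samples = [
    (now - timedelta(days=500), 61.0),  # older than the window, dropped
    (now - timedelta(days=399), 72.5),  # last year's peak, kept
    (now - timedelta(days=1), 48.0),    # recent reading, kept
]
retained = prune_expired(samples, retention, now)
```

In practice, retention is usually configured in the metrics store itself rather than implemented in application code.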
