Efficient IT Operations

Goal:

Modernize IT Operations by adopting Agile Practices / SRE / DevOps
Deliver superior client experience and minimize operational risk

Pillars:

SLOs & Error Budgets
Monitoring / Observability
Reduce Toil

proposed transition to agile

evaluating the current state

Frontline support personnel are critical in managing firm risk. We can help if any of the below are priorities:
Superior and consistent client experience

Do you see scope for improving reliability and client satisfaction while minimizing operational risk?
Are you striking the right balance between rapid product deployment, stability and scalability?
Do you solicit and leverage client feedback to continuously improve your support process?
Do you have the right mix of Engineering and Support skills to continuously reduce toil?
Are you leveraging all the latest tools to improve observability of your platform?
Are you looking at Production Management as a risk management problem?
Are you asking the right questions?

defining SLIs, SLOs, SLAs

Service Level Indicator (SLI) – A key aspect of the level of service

Availability – fraction of time a service is usable
Request Latency – time taken to return a response to a request
Error Rate – fraction of requests received that fail
System Throughput – requests per second
Durability – likelihood that data will be retained over a period of time
Correctness – whether the right data was returned
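
A minimal sketch of how a few of these indicators could be computed from raw request records. The `Request` record, its field names, and the sample values are hypothetical. Availability is approximated here by the fraction of successful requests, which is an assumption and differs from the time-based definition above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    latency_ms: float
    ok: bool


def compute_slis(requests, window_seconds):
    """Aggregate raw request records into a handful of SLIs for one window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")
    failed = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile of latency.
    rank = max(1, math.ceil(0.95 * total))
    return {
        "error_rate": failed / total,
        "success_ratio": (total - failed) / total,  # request-based proxy for availability
        "p95_latency_ms": latencies[rank - 1],
        "throughput_rps": total / window_seconds,
    }


# Made-up sample: four requests observed over a two-second window.
sample = [Request(120, True), Request(80, True), Request(950, False), Request(100, True)]
slis = compute_slis(sample, window_seconds=2)
print(slis)
```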

Service Level Objective (SLO) – Target level for reliability of service (usually expressed in nines)

Signed off by stakeholders as reasonable uptime targets (e.g., 99.9% availability)
SLO targets are simple and measurable, with a process in place to continuously improve them
Evaluate Expected Value vs. Reliability Costs (cost of increasing reliability vs return on investment)
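
To make the "nines" concrete, the sketch below converts an availability target into the downtime it allows. The 30-day period and the function name are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Minutes of downtime an availability target tolerates over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

Each additional nine cuts the allowed downtime by a factor of ten: a 99.9% target leaves roughly 43 minutes per 30 days, which is why the cost of reliability rises steeply with each step.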

Service Level Agreement (SLA) – Contract with users that specifies the consequences of not meeting SLOs

What is the penalty if an SLO is not met?
SLAs are closely tied to business and product decisions

defining error budgets

Error Budget – Accepted level of unreliability

Calculation: 100% minus the agreed Availability Target, i.e., the budget that can be allocated
For a service with 1 million monthly requests and an SLO target of 99.9%, the error budget is 0.1%, i.e., 1,000 errors per month
If a given outage results in 500 errors, 50% of error budget is consumed
Error Budget Burn Rate indicates how fast the Budget is being consumed
Policies define the consequences of consuming the error budget too fast (high burn rate)
Alerts are configured to notify users when error rate is high for a given look-back window
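
The arithmetic above can be sketched as follows, using the figures from this slide (1 million requests, 99.9% target, 500 errors). The function names and the "10 days into a 30-day period" figure are assumptions added for illustration. Burn rate is taken here as the share of budget consumed divided by the share of the period elapsed, so 1.0 means the budget lasts exactly the period.

```python
def error_budget(total_requests, slo):
    """Number of failed requests the SLO tolerates in the period."""
    # round() guards against floating-point noise in (1 - slo).
    return round(total_requests * (1 - slo))


def budget_consumed(errors, budget):
    """Fraction of the error budget used up by the observed errors."""
    return errors / budget


def burn_rate(consumed_fraction, elapsed_fraction):
    """How fast the budget is being spent relative to an even spend."""
    return consumed_fraction / elapsed_fraction


budget = error_budget(1_000_000, 0.999)   # 1000 errors per month
consumed = budget_consumed(500, budget)   # 0.5, i.e. 50% of the budget
# Assumption: the outage happened 10 days into a 30-day period.
rate = burn_rate(consumed, 10 / 30)       # about 1.5: spending faster than sustainable
print(budget, consumed, rate)
```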

support as risk management

Defining Risk

100% uptime is prohibitively expensive
Quantify risk tolerance as acceptable uptime/downtime with the Business
What is the minimum uptime required for customer satisfaction and retention?
Can you measure your achieved uptime against the target uptime across multiple windows?
Can you engineer greater reliability and Manage Risk based on these metrics?
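
A minimal sketch of measuring achieved uptime against a target across multiple windows. The daily downtime figures, the 7- and 30-day windows, and the 99.9% target are made-up inputs for illustration.

```python
def availability(downtime_minutes_per_day, window_days):
    """Achieved availability over the most recent window_days."""
    recent = downtime_minutes_per_day[-window_days:]
    total_minutes = len(recent) * 24 * 60
    return 1 - sum(recent) / total_minutes


def uptime_report(downtime_minutes_per_day, target, windows=(7, 30)):
    """Map each window length to (achieved availability, target met?)."""
    report = {}
    for w in windows:
        achieved = availability(downtime_minutes_per_day, w)
        report[w] = (achieved, achieved >= target)
    return report


# Hypothetical history: 30 days, one 10-minute outage early on
# and one 15-minute outage within the last week.
daily = [0] * 30
daily[5] = 10
daily[28] = 15

report = uptime_report(daily, target=0.999)
for window, (achieved, met) in report.items():
    print(f"{window:>2}-day window: {achieved:.5f} target met: {met}")
```

With these inputs the 30-day window meets the target while the 7-day window does not, which is the kind of divergence a single window would hide.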

Managing Risk

Evaluate last 6 months of production stability (using tickets) and derive acceptable SLO
Design metrics that determine the error budget burn rate across multiple time windows
Do you have policies in place to manage risk when error budgets are being consumed?
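
One way to turn ticket history into a starting SLO, sketched under stated assumptions: outage durations are taken from hypothetical tickets, six months is approximated as 182 days, and the candidate targets are an illustrative list. The suggestion is simply the tightest candidate that the historical record would have met.

```python
CANDIDATE_SLOS = (99.0, 99.5, 99.9, 99.95, 99.99)  # illustrative targets, in percent


def achieved_availability_percent(outage_minutes, period_days):
    """Availability implied by total outage minutes over the period."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - sum(outage_minutes) / period_minutes)


def suggest_starting_slo(outage_minutes, period_days=182):
    """Return (achieved %, tightest candidate SLO that history would have met)."""
    achieved = achieved_availability_percent(outage_minutes, period_days)
    met = [slo for slo in CANDIDATE_SLOS if slo <= achieved]
    return achieved, (max(met) if met else None)


# Hypothetical outage durations (minutes) pulled from six months of tickets.
ticket_outage_minutes = [45, 12, 90, 30, 8]
achieved, starting_slo = suggest_starting_slo(ticket_outage_minutes)
print(f"achieved {achieved:.3f}%, suggested starting SLO {starting_slo}%")
```

Tickets tend to under-report short or unnoticed outages, so a figure derived this way is better treated as an upper bound to be agreed with the Business than as a final target.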

managing by Error Budgets

Big Picture

Start with what users care about; think stability and reliability from a business viewpoint
Note that Request / Response, Data Processing, Storage applications have different SLO requirements
Start with desired Objectives (SLOs), and work backwards to choose specific Indicators (SLIs)
Set meaningful SLOs and alert when there are actionable threats to the Error Budget
Weigh the expected value against the cost of increasing reliability when calibrating SLOs

Summary:

SLIs are useful when raw measurements are aggregated across different time periods
Keep SLOs as few and as simple as possible – choose those that provide good coverage of your system
Start with loose targets and tighten as you go

monitoring

Monitoring
Definition: Collecting, Processing, Aggregating, and Displaying real-time data about the system

Monitor for the Golden Signals – Latency, Traffic (throughput), Errors, and Saturation
Mine all available sources like Metrics, Text Logging, Distributed Traces, Structured Event logging, Database updates
Combine white-box monitoring with black-box monitoring to derive cause and effect
Store data as time series with drill down capabilities for individual metrics and trends
Create APIs and Dashboards to display data in graphs, heatmaps, histograms
Minimize the need for constant Eyes on Glass; email alerts alone are of limited value
Alert only on conditions that require attention and are actionable
Standardize on instrumentation tools such as Prometheus (metrics collection and rule evaluation), Alertmanager (alert routing), and Grafana (dashboards)

alerting

Alerting on SLOs

Definition: Notification intended to be read by a human (what’s broken and why)

Design alerting strategy based on the specific needs of the Business
Generate Alerts from monitoring SLIs and Error Budget Burn Rate
Pay special attention to look-back window and burn-rate thresholds
Configure multi-window, multi-burn-rate alerts to minimize false positives
Start with loose targets and tighten SLOs as you learn about the system's behaviour
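
A minimal sketch of multi-window, multi-burn-rate alerting. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m, 1 over 3d/6h) are commonly cited example values rather than figures from this deck, and the observed error rates are made up. Burn rate here is the observed error rate divided by the error rate the SLO allows. An alert fires only when both the long and the short window exceed the threshold, which is what suppresses false positives from brief spikes and from incidents that have already recovered.

```python
def burn_rate(error_rate, slo):
    """Observed error rate relative to the error rate the SLO allows."""
    return error_rate / (1 - slo)


# (severity, long window, short window, burn-rate threshold): assumed example values.
RULES = [
    ("page", "1h", "5m", 14.4),
    ("page", "6h", "30m", 6.0),
    ("ticket", "3d", "6h", 1.0),
]


def evaluate_alerts(error_rates, slo):
    """Return the rules whose long AND short windows both exceed the threshold."""
    fired = []
    for severity, long_w, short_w, threshold in RULES:
        long_burn = burn_rate(error_rates[long_w], slo)
        short_burn = burn_rate(error_rates[short_w], slo)
        if long_burn > threshold and short_burn > threshold:
            fired.append((severity, long_w, threshold))
    return fired


# Hypothetical error rates measured over each look-back window.
observed = {"5m": 0.02, "30m": 0.012, "1h": 0.016, "6h": 0.004, "3d": 0.0012}
alerts = evaluate_alerts(observed, slo=0.999)
print(alerts)
```

In this sample the 1h/5m page and the 3d/6h ticket fire, while the 6h/30m rule stays quiet because the 6-hour burn rate is below its threshold.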