Efficient IT Operations

Goal:

  • Modernize IT Operations by adopting Agile Practices / SRE / DevOps
  • Deliver superior client experience and minimize operational risk

 

Pillars:

  • SLOs & Error Budgets
  • Monitoring / Observability
  • Reduce Toil

proposed transition to agile

evaluating the current state

Frontline support personnel are critical to managing the firm's risk. We can help if any of the following are priorities:
Superior and consistent client experience

  • Do you see scope for improving reliability and client satisfaction while minimizing operational risk?
  • Are you striking the right balance between rapid product deployment, stability and scalability?
  • Do you solicit and leverage client feedback to continuously improve your support process?
  • Do you have the right mix of Engineering and Support skills to continuously reduce toil?
  • Are you leveraging all the latest tools to improve observability of your platform?
  • Are you looking at Production Management as a risk management problem?
  • Are you asking the right questions?

defining SLIs, SLOs, SLAs

Service Level Indicator (SLI) – A key aspect of the level of service

  • Availability – fraction of time a service is usable
  • Request Latency – time taken to return a response to a request
  • Error Rate – fraction of received requests that fail
  • System Throughput – requests per sec
  • Durability – likelihood that data will be retained over a period of time
  • Correctness – was the right data retrieved
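As a rough illustration of how some of these indicators could be computed from raw request records, here is a minimal sketch. The `Request` record, its fields, and the function names are assumptions made for this example, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to answer the request (hypothetical field)
    ok: bool           # True if the request was served successfully


def error_rate(requests):
    """Fraction of received requests that failed."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank latency percentile in milliseconds (pct in 1..100)."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput(requests, window_seconds):
    """Requests per second over the observation window."""
    return len(requests) / window_seconds


sample = [Request(100, True), Request(200, True),
          Request(300, False), Request(400, True)]
print(error_rate(sample))              # 0.25
print(latency_percentile(sample, 50))  # 200
print(throughput(sample, 2))           # 2.0
```

Percentiles are used for latency rather than averages because a small number of very slow requests can hide behind a healthy-looking mean.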

Service Level Objective (SLO) – Target level for reliability of service (usually expressed in nines)

  • Signed off by stakeholders as reasonable uptime targets (99.9% availability)
  • SLO targets are simple and measurable, with a process in place to improve them continuously
  • Evaluate Expected Value vs. Reliability Costs (cost of increasing reliability vs return on investment)
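To make targets expressed in nines more concrete, the sketch below converts an availability target into the downtime it allows. The function name and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of downtime permitted by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min per 30 days")
# 99.0% -> 432.00 min per 30 days
# 99.9% -> 43.20 min per 30 days
# 99.99% -> 4.32 min per 30 days
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the cost of extra reliability has to be weighed against its return.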

Service Level Agreement (SLA) – Contract with users that sets out the consequences of not meeting SLOs

  • What is the penalty if an SLO is not met?
  • SLAs are closely tied to business and product decisions

defining error budgets

Error Budgets — Accepted level of unreliability

  • Calculation: 100% minus the agreed Availability Target, i.e., the budget that can be allocated
  • For a service with 1 million monthly requests and an SLO target of 99.9%, the error budget is 0.1%, i.e., 1,000 errors per month
  • If a given outage results in 500 errors, 50% of error budget is consumed
  • Error Budget Burn Rate indicates how fast the Budget is being consumed
  • Policies define the consequences of consuming the error budget too fast (high burn rate)
  • Alerts are configured to notify users when the error rate is high over a given look-back window
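The arithmetic above can be sketched in a few lines. The function names are assumptions for this example; burn rate is taken here as the observed error rate divided by the error rate the SLO allows, so a value of 1.0 spends the budget exactly over the SLO period.

```python
def _check_slo(slo):
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")


def error_budget(total_requests, slo):
    """Number of failed requests the SLO tolerates in the period."""
    _check_slo(slo)
    return total_requests * (1 - slo)


def budget_consumed(errors, total_requests, slo):
    """Fraction of the error budget used up by the given number of errors."""
    return errors / error_budget(total_requests, slo)


def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    _check_slo(slo)
    return observed_error_rate / (1 - slo)


print(round(error_budget(1_000_000, 0.999)))             # 1000
print(f"{budget_consumed(500, 1_000_000, 0.999):.0%}")   # 50%
print(f"{burn_rate(0.002, 0.999):.1f}")                  # 2.0
```

A burn rate of 2.0 means the monthly budget would be gone in roughly half a month if the error rate persisted.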

support as risk management

Defining Risk

  • 100% uptime is prohibitively expensive
  • Agree with the Business on risk tolerance, quantified as acceptable uptime/downtime
  • What is the minimum uptime required for customer satisfaction and retention?
  • Can you measure your achieved uptime against target uptime across multiple windows?
  • Can you engineer greater reliability and Manage Risk based on these metrics?

 

Managing Risk

  • Evaluate the last 6 months of production stability (using tickets) and derive an acceptable SLO
  • Design metrics that determine the error budget burn rate across multiple time windows
  • Do you have policies in place to manage risk when error budgets are being consumed?
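One possible way to turn ticket history into a starting SLO is sketched below. It assumes each incident ticket records minutes of downtime and that these have already been summed per month; the month labels, numbers, and function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def availability(downtime_minutes, window_days):
    """Achieved availability as a fraction of the window."""
    return 1 - downtime_minutes / (window_days * MINUTES_PER_DAY)


def review_history(downtime_by_month, target, days_per_month=30):
    """Return {month: (achieved availability, met target?)}."""
    report = {}
    for month, downtime in downtime_by_month.items():
        achieved = availability(downtime, days_per_month)
        report[month] = (achieved, achieved >= target)
    return report


def overall_availability(downtime_by_month, days_per_month=30):
    """Availability across the whole history, a loose first SLO candidate."""
    if not downtime_by_month:
        raise ValueError("no history to evaluate")
    total_days = days_per_month * len(downtime_by_month)
    return availability(sum(downtime_by_month.values()), total_days)


# Hypothetical downtime minutes taken from incident tickets.
history = {"month-1": 432, "month-2": 0}
for month, (achieved, met) in review_history(history, target=0.995).items():
    print(f"{month}: {achieved:.3%} {'met' if met else 'missed'}")
print(f"overall: {overall_availability(history):.3%}")
```

The overall figure is only a starting point; the target still needs to be agreed with the Business and tightened over time.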

managing by Error Budgets

Big Picture

  • Start with what users care about; think stability and reliability from a business viewpoint
  • Note that Request / Response, Data Processing, Storage applications have different SLO requirements
  • Start with desired Objectives (SLOs), and work backwards to choosing specific Indicators (SLIs)
  • Set meaningful SLOs and alert when there are actionable threats to the Error Budget
  • Weigh Expected Value against the Cost of Increasing Reliability when calibrating the SLOs

Summary:

  • SLIs are useful when raw measurements are aggregated across different time periods
  • Keep SLOs as few and as simple as possible – choose those that provide good coverage of your system
  • Start with loose targets and tighten as you go

monitoring

Monitoring
Definition: Collecting, Processing, Aggregating, and Displaying real-time data about the system

  • Monitor for The Golden Signals – Latency, Throughput, Errors, and Saturation
  • Mine all available sources like Metrics, Text Logging, Distributed Traces, Structured Event logging, Database updates
  • Combine white-box monitoring with black-box to derive cause and effect
  • Store data as time series with drill down capabilities for individual metrics and trends
  • Create APIs and Dashboards to display data in graphs, heatmaps, histograms
  • Minimize the need for constant Eyes on Glass; email alerts are of limited value
  • Alert only on conditions that require attention and are actionable
  • Standardize on instrumentation tools like Prometheus (evaluating rules), Alertmanager (alerts), Grafana (dashboarding)
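As a small illustration of storing raw events as a time series, the sketch below groups request events into fixed-size time buckets with a request count and an error count per bucket. The event shape `(timestamp_seconds, ok)` and the function name are assumptions for this example; in practice a tool such as Prometheus would do this aggregation.

```python
from collections import defaultdict


def bucket_events(events, bucket_seconds=60):
    """Aggregate (timestamp_seconds, ok) events into fixed-size time buckets.

    Returns a list of (bucket_start, total_requests, errors), sorted by time.
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket_start -> [total, errors]
    for ts, ok in events:
        start = int(ts // bucket_seconds) * bucket_seconds
        buckets[start][0] += 1
        if not ok:
            buckets[start][1] += 1
    return [(start, total, errors)
            for start, (total, errors) in sorted(buckets.items())]


events = [(0, True), (30, False), (61, True), (119.5, False), (120, True)]
print(bucket_events(events))
# [(0, 2, 1), (60, 2, 1), (120, 1, 0)]
```

Keeping totals and errors per bucket allows throughput and error rate to be derived later for any look-back window that is a multiple of the bucket size.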

alerting

Alerting on SLOs

Definition: Notification intended to be read by a human (what’s broken and why)

  • Design alerting strategy based on the specific needs of the Business
  • Generate Alerts from monitoring SLIs and Error Budget Burn Rate
  • Pay special attention to look-back window and burn-rate thresholds
  • Configure multi-window, multi-burn rate alerts to minimize false positives
  • Start with loose targets and tighten SLOs as you learn about the system’s behaviour
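A multi-window, multi-burn-rate check could look roughly like the sketch below. The window pairs and thresholds (14.4, 6, 1) are the example values suggested in the Google SRE Workbook for a 30-day SLO, not values taken from this deck, and the function names and input shape are assumptions.

```python
def burn_rate(error_rate, slo):
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo)


# (long window, short window, burn-rate threshold, severity)
# Example values from the Google SRE Workbook; tune them to your own needs.
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_rates, slo):
    """Return the severity of the first rule that fires, or None.

    error_rates maps a window label (e.g. "1h") to the error rate observed
    over that window. A rule fires only if BOTH its long and short windows
    exceed the threshold, which filters out incidents that have already ended.
    """
    for long_w, short_w, threshold, severity in RULES:
        if (burn_rate(error_rates[long_w], slo) > threshold
                and burn_rate(error_rates[short_w], slo) > threshold):
            return severity
    return None


ongoing = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.02, "3d": 0.02}
recovered = {"5m": 0.0005, "30m": 0.001, "1h": 0.02, "6h": 0.003, "3d": 0.0005}
print(evaluate(ongoing, slo=0.999))    # page
print(evaluate(recovered, slo=0.999))  # None
```

The short window is what keeps a spike that has already subsided from paging anyone, while the long window avoids alerting on brief blips.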

toil management

strategies to manage toil

agile operations in action