Infrastructure Monitoring and Operations Best Practices, 2025 Edition - From Observability to Automation
A practical guide to modern infrastructure monitoring, covering Prometheus, Grafana, the ELK stack, distributed tracing, incident response, and SRE practices.
Key points
A hands-on guide to modern infrastructure monitoring: Prometheus, Grafana, the ELK stack, distributed tracing, incident response, and SRE practices, illustrated throughout with concrete configuration and code examples.
Introduction
Infrastructure operations in 2025 are undergoing a paradigm shift from plain monitoring to observability. This article walks through effective approaches to infrastructure monitoring and operations using current tools and best practices.
The Three Pillars of Observability
Integrating Metrics, Logs, and Traces
graph TB
  subgraph "Observability Stack"
    A[Applications] --> M[Metrics]
    A --> L[Logs]
    A --> T[Traces]
    M --> P[Prometheus]
    L --> E[Elasticsearch]
    T --> J[Jaeger]
    P --> G[Grafana]
    E --> G
    J --> G
    G --> D[Dashboards]
    G --> AL[Alerts]
  end
  subgraph "Data Flow"
    OT[OpenTelemetry] --> M
    OT --> L
    OT --> T
  end
Implementing OpenTelemetry
// telemetry/setup.go
package telemetry
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/exporters/prometheus"
"go.opentelemetry.io/otel/metric"
"go.opentelemetry.io/otel/propagation"
sdkmetric "go.opentelemetry.io/otel/sdk/metric"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)
type TelemetryConfig struct {
ServiceName string
ServiceVersion string
Environment string
OTLPEndpoint string
}
func InitTelemetry(cfg TelemetryConfig) (*sdktrace.TracerProvider, *sdkmetric.MeterProvider, error) {
ctx := context.Background()
// Resource definition
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceName(cfg.ServiceName),
semconv.ServiceVersion(cfg.ServiceVersion),
semconv.DeploymentEnvironment(cfg.Environment),
),
)
if err != nil {
return nil, nil, err
}
// Trace configuration
traceExporter, err := otlptrace.New(
ctx,
otlptracegrpc.NewClient(
otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
otlptracegrpc.WithInsecure(),
),
)
if err != nil {
return nil, nil, err
}
tracerProvider := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(traceExporter),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
otel.SetTracerProvider(tracerProvider)
otel.SetTextMapPropagator(
propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
),
)
// Metrics configuration
promExporter, err := prometheus.New()
if err != nil {
return nil, nil, err
}
meterProvider := sdkmetric.NewMeterProvider(
sdkmetric.WithResource(res),
sdkmetric.WithReader(promExporter),
)
otel.SetMeterProvider(meterProvider)
return tracerProvider, meterProvider, nil
}
// Custom metric definitions
type MetricsCollector struct {
meter metric.Meter
requestCount metric.Int64Counter
requestDuration metric.Float64Histogram
activeConnections metric.Int64UpDownCounter
}
func NewMetricsCollector(provider *sdkmetric.MeterProvider) (*MetricsCollector, error) {
meter := provider.Meter("app.metrics")
requestCount, err := meter.Int64Counter(
"http_requests_total",
metric.WithDescription("Total number of HTTP requests"),
metric.WithUnit("1"),
)
if err != nil {
return nil, err
}
requestDuration, err := meter.Float64Histogram(
"http_request_duration_seconds",
metric.WithDescription("HTTP request duration in seconds"),
metric.WithUnit("s"),
)
if err != nil {
return nil, err
}
activeConnections, err := meter.Int64UpDownCounter(
"http_active_connections",
metric.WithDescription("Number of active HTTP connections"),
metric.WithUnit("1"),
)
if err != nil {
return nil, err
}
return &MetricsCollector{
meter: meter,
requestCount: requestCount,
requestDuration: requestDuration,
activeConnections: activeConnections,
}, nil
}
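NewMetricsCollector only creates the instruments; values are recorded at request time. Below is a minimal sketch of an HTTP middleware, assumed to live in the same telemetry package, that updates all three instruments (the attribute names are illustrative):
// telemetry/middleware.go (usage sketch)
package telemetry

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// Instrument wraps an http.Handler and records request count, duration,
// and in-flight connections using the instruments defined above.
func (c *MetricsCollector) Instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		c.activeConnections.Add(ctx, 1)
		defer c.activeConnections.Add(ctx, -1)

		start := time.Now()
		next.ServeHTTP(w, r)

		attrs := metric.WithAttributes(
			attribute.String("http.method", r.Method),
			attribute.String("http.route", r.URL.Path),
		)
		c.requestCount.Add(ctx, 1, attrs)
		c.requestDuration.Record(ctx, time.Since(start).Seconds(), attrs)
	})
}
Capturing the response status (needed for status-based alert expressions) would additionally require wrapping the ResponseWriter; it is omitted here for brevity.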
Metrics Collection with Prometheus
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'ap-northeast-1'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Rule files
rule_files:
- "alerts/*.yml"
- "recording_rules/*.yml"
# Scrape configurations
scrape_configs:
# Kubernetes service discovery
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Node Exporter
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:9100'
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# Custom application
- job_name: 'custom-app'
static_configs:
- targets: ['app-1:8080', 'app-2:8080', 'app-3:8080']
metrics_path: '/metrics'
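For the custom-app job (and the annotation-based Kubernetes discovery) to find anything, each instance has to serve /metrics. With the OpenTelemetry Prometheus exporter from the earlier setup, which registers with the default Prometheus registry unless configured otherwise, exposing promhttp.Handler() is enough. A minimal sketch; the :8080 port simply mirrors the targets above:
// cmd/app/main.go (sketch)
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry, which the OTel Prometheus exporter feeds.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}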
Defining Alert Rules
# alerts/application.yml
groups:
- name: application_alerts
interval: 30s
rules:
# Error rate alert
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
) > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate detected"
description: "Service {{ $labels.service }} has error rate of {{ $value | humanizePercentage }}"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
# Response time alert
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
) > 1
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "High response time detected"
description: "95th percentile response time for {{ $labels.service }} is {{ $value }}s"
# Memory usage alert
- alert: HighMemoryUsage
expr: |
(
container_memory_working_set_bytes{pod!=""}
/
container_spec_memory_limit_bytes{pod!=""}
) > 0.8
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High memory usage detected"
description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
- name: infrastructure_alerts
interval: 30s
rules:
# Disk usage alert
- alert: DiskSpaceRunningOut
expr: |
(
node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}
/
node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}
) < 0.1
for: 15m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Disk space running out"
description: "Node {{ $labels.instance }} has only {{ $value | humanizePercentage }} disk space left on {{ $labels.mountpoint }}"
# CPU usage alert
- alert: HighCPUUsage
expr: |
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 15m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage detected"
description: "Node {{ $labels.instance }} CPU usage is {{ $value }}%"
Visualization with Grafana
Dashboard Definition
{
"dashboard": {
"title": "Application Performance Dashboard",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
}
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
}
},
{
"title": "Response Time (95th percentile)",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
"legendFormat": "{{ service }}"
}
],
"type": "graph",
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 8
}
}
]
}
}
Grafana as Code
# terraform/grafana.tf
resource "grafana_dashboard" "application_performance" {
config_json = jsonencode({
title = "Application Performance Dashboard"
uid = "app-performance"
panels = [
{
id = 1
title = "Request Rate"
type = "timeseries"
gridPos = {
h = 8
w = 12
x = 0
y = 0
}
targets = [
{
expr = "sum(rate(http_requests_total[5m])) by (service)"
refId = "A"
datasource = "Prometheus"
legendFormat = "{{ service }}"
}
]
fieldConfig = {
defaults = {
unit = "reqps"
color = {
mode = "palette-classic"
}
}
}
}
]
})
}
resource "grafana_alert_rule" "high_error_rate" {
title = "High Error Rate Alert"
uid = "high-error-rate"
folder_uid = grafana_folder.alerts.uid
condition = "C"
data {
ref_id = "A"
query_type = "prometheus"
model = jsonencode({
expr = "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)"
refId = "A"
})
}
data {
ref_id = "B"
query_type = "prometheus"
model = jsonencode({
expr = "sum(rate(http_requests_total[5m])) by (service)"
refId = "B"
})
}
data {
ref_id = "C"
query_type = "math"
model = jsonencode({
expression = "$A / $B"
refId = "C"
})
}
no_data_state = "NoData"
exec_err_state = "Alerting"
for = "5m"
annotations = {
summary = "Service {{ $labels.service }} has high error rate"
description = "Error rate is {{ $values.C }}%"
}
}
Log Management with the ELK Stack
Elasticsearch Configuration
# elasticsearch.yml
cluster.name: production-logging
node.name: es-node-1
# Network settings
network.host: 0.0.0.0
http.port: 9200
# Discovery settings
discovery.seed_hosts:
- es-node-1
- es-node-2
- es-node-3
cluster.initial_master_nodes:
- es-node-1
- es-node-2
- es-node-3
# Index settings
action.auto_create_index: ".monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
# Performance settings
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: 1000
# Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
Logstash Pipeline
# logstash/pipeline/application.conf
input {
beats {
port => 5044
ssl => true
ssl_certificate_authorities => ["/etc/logstash/certs/ca.crt"]
ssl_certificate => "/etc/logstash/certs/logstash.crt"
ssl_key => "/etc/logstash/certs/logstash.key"
}
kafka {
bootstrap_servers => "kafka-1:9092,kafka-2:9092,kafka-3:9092"
topics => ["application-logs"]
group_id => "logstash-consumer"
codec => "json"
}
}
filter {
# Parse JSON logs
if [message] =~ /^\{/ {
json {
source => "message"
target => "parsed"
}
mutate {
add_field => {
"[@metadata][target_index]" => "app-logs-%{[parsed][service]}-%{+YYYY.MM.dd}"
}
}
}
# Parse access logs
if [type] == "nginx-access" {
grok {
match => {
"message" => '%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}" "%{DATA:http_x_forwarded_for}" %{NUMBER:request_time}'
}
}
date {
match => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
target => "@timestamp"
}
mutate {
convert => {
"status" => "integer"
"body_bytes_sent" => "integer"
"request_time" => "float"
}
}
# GeoIP resolution
geoip {
source => "remote_addr"
target => "geoip"
}
}
# Analyze error logs
if [level] == "ERROR" or [level] == "FATAL" {
mutate {
add_tag => ["alert"]
}
# Fingerprint stack traces for aggregation
if [stack_trace] {
fingerprint {
source => ["stack_trace"]
target => "[@metadata][fingerprint]"
method => "SHA256"
}
}
}
# Extract metrics
if [parsed][metrics] {
ruby {
code => '
metrics = event.get("[parsed][metrics]")
metrics.each do |key, value|
event.set("metric_#{key}", value)
end
'
}
}
}
output {
elasticsearch {
hosts => ["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"]
index => "%{[@metadata][target_index]}"
template_name => "application-logs"
template => "/etc/logstash/templates/application-logs.json"
template_overwrite => true
# Security settings
ssl => true
ssl_certificate_verification => true
cacert => "/etc/logstash/certs/ca.crt"
user => "${ELASTIC_USER}"
password => "${ELASTIC_PASSWORD}"
}
# Output for alerts
if "alert" in [tags] {
http {
url => "http://alertmanager:9093/api/v1/alerts"
http_method => "post"
format => "json"
mapping => {
"alerts" => [
{
"labels" => {
"alertname" => "ApplicationError"
"service" => "%{[parsed][service]}"
"severity" => "critical"
}
"annotations" => {
"summary" => "Application error detected"
"description" => "%{[message]}"
}
}
]
}
}
}
}
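The JSON branch of the filter above assumes applications emit structured logs carrying fields such as service and level. A minimal sketch of the application side using Go's log/slog; the field names are assumptions chosen to match the pipeline, and shipping the output via Beats or a container log agent is out of scope here:
// logging.go (sketch)
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON logs on stdout; the "service" field drives the %{[parsed][service]} index name.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	})).With(slog.String("service", "api-gateway"))

	logger.Info("request handled",
		slog.Int("status", 200),
		slog.Float64("duration_seconds", 0.042),
	)
	logger.Error("upstream call failed",
		slog.String("error", "connection refused"),
	)
}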
Kibana Dashboard
{
"version": "8.0.0",
"objects": [
{
"id": "application-logs-dashboard",
"type": "dashboard",
"attributes": {
"title": "Application Logs Dashboard",
"panels": [
{
"version": "8.0.0",
"type": "visualization",
"gridData": {
"x": 0,
"y": 0,
"w": 24,
"h": 15
},
"panelConfig": {
"title": "Log Volume Over Time",
"type": "line",
"params": {
"index": "app-logs-*",
"query": {
"match_all": {}
},
"aggs": {
"time_buckets": {
"date_histogram": {
"field": "@timestamp",
"interval": "5m"
},
"aggs": {
"log_levels": {
"terms": {
"field": "level.keyword"
}
}
}
}
}
}
}
},
{
"version": "8.0.0",
"type": "visualization",
"gridData": {
"x": 0,
"y": 15,
"w": 12,
"h": 15
},
"panelConfig": {
"title": "Top Errors",
"type": "data_table",
"params": {
"index": "app-logs-*",
"query": {
"bool": {
"filter": [
{
"term": {
"level.keyword": "ERROR"
}
}
]
}
},
"aggs": {
"error_messages": {
"terms": {
"field": "error.message.keyword",
"size": 10
}
}
}
}
}
}
]
}
}
]
}
Distributed Tracing
Configuring and Deploying Jaeger
# k8s/jaeger-deployment.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger-production
spec:
strategy: production
collector:
replicas: 3
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 1
memory: 2Gi
options:
kafka:
producer:
topic: jaeger-spans
brokers: kafka-1:9092,kafka-2:9092
storage:
type: elasticsearch
options:
es:
server-urls: https://elasticsearch:9200
tls:
ca-cert: /es-certs/ca.crt
username: jaeger
password: ${JAEGER_ES_PASSWORD}
query:
replicas: 2
resources:
limits:
cpu: 1
memory: 2Gi
requests:
cpu: 500m
memory: 1Gi
options:
query:
max-clock-skew-adjustment: 30s
ingester:
replicas: 2
options:
kafka:
consumer:
topic: jaeger-spans
brokers: kafka-1:9092,kafka-2:9092
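On the application side, spans reach this Jaeger installation through the OTLP exporter configured in InitTelemetry earlier; nothing Jaeger-specific is needed in the code. A minimal sketch of creating and annotating spans (the tracer name, attributes, and payment call are illustrative):
// orders/process.go (sketch)
package orders

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("orders")

func ProcessOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	if err := chargePayment(ctx, orderID); err != nil {
		// Mark the span as failed so it is visible in error-based queries.
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment failed")
		return err
	}
	return nil
}

func chargePayment(ctx context.Context, orderID string) error {
	_, span := tracer.Start(ctx, "chargePayment")
	defer span.End()
	// ... call the payment service, propagating ctx so the trace continues ...
	return nil
}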
Implementing Trace Analysis
# trace_analysis.py
import requests
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
class TraceAnalyzer:
def __init__(self, jaeger_url):
self.jaeger_url = jaeger_url
self.base_url = f"{jaeger_url}/api/traces"
def get_traces(self, service, operation=None, start_time=None, end_time=None, limit=1000):
"""Jaegerからトレースを取得"""
if not start_time:
start_time = datetime.now() - timedelta(hours=1)
if not end_time:
end_time = datetime.now()
params = {
'service': service,
'start': int(start_time.timestamp() * 1000000),
'end': int(end_time.timestamp() * 1000000),
'limit': limit
}
if operation:
params['operation'] = operation
response = requests.get(self.base_url, params=params)
return response.json()['data']
def analyze_performance(self, traces):
"""トレースからパフォーマンスメトリクスを抽出"""
durations = []
span_counts = []
error_counts = []
for trace in traces:
# Duration of the entire trace
trace_duration = max(span['startTime'] + span['duration']
for span in trace['spans']) - \
min(span['startTime'] for span in trace['spans'])
durations.append(trace_duration)
# Number of spans
span_counts.append(len(trace['spans']))
# Number of error spans
error_count = sum(1 for span in trace['spans']
if any(tag['key'] == 'error' and tag['value']
for tag in span.get('tags', [])))
error_counts.append(error_count)
df = pd.DataFrame({
'duration_us': durations,
'span_count': span_counts,
'error_count': error_counts
})
return {
'avg_duration_ms': df['duration_us'].mean() / 1000,
'p50_duration_ms': df['duration_us'].quantile(0.5) / 1000,
'p95_duration_ms': df['duration_us'].quantile(0.95) / 1000,
'p99_duration_ms': df['duration_us'].quantile(0.99) / 1000,
'avg_span_count': df['span_count'].mean(),
'error_rate': (df['error_count'] > 0).mean()
}
def find_bottlenecks(self, trace):
"""トレース内のボトルネックを特定"""
spans = trace['spans']
# Compute each span's self time
span_self_times = {}
for span in spans:
span_id = span['spanID']
total_time = span['duration']
# Subtract time spent in child spans
child_time = sum(
child['duration']
for child in spans
if any(ref['spanID'] == span_id
for ref in child.get('references', []))
)
span_self_times[span_id] = {
'operation': span['operationName'],
'service': span['process']['serviceName'],
'self_time': total_time - child_time,
'total_time': total_time
}
# Sort by self time
bottlenecks = sorted(
span_self_times.values(),
key=lambda x: x['self_time'],
reverse=True
)[:5]
return bottlenecks
def detect_anomalies(self, traces, threshold_percentile=95):
"""異常なトレースを検出"""
durations = [
max(span['startTime'] + span['duration'] for span in trace['spans']) -
min(span['startTime'] for span in trace['spans'])
for trace in traces
]
threshold = np.percentile(durations, threshold_percentile)
anomalous_traces = [
trace for trace, duration in zip(traces, durations)
if duration > threshold
]
return anomalous_traces
Incident Response
Integration with PagerDuty
# alertmanager.yml
global:
resolve_timeout: 5m
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: pagerduty-critical
continue: true
- match:
severity: warning
receiver: slack-warnings
- match_re:
service: database-.*
receiver: database-team
receivers:
- name: 'default'
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: '${PAGERDUTY_SERVICE_KEY}'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
alerts: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'slack-warnings'
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#warnings'
send_resolved: true
- name: 'database-team'
email_configs:
- to: 'database-team@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: '${SMTP_USERNAME}'
auth_password: '${SMTP_PASSWORD}'
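Routing rules like these are easy to get subtly wrong, so it helps to push a synthetic alert and watch where it lands. Alertmanager's v2 API accepts a JSON array of alerts; the sketch below sends one whose labels should match the pagerduty-critical route (endpoint and label values mirror the configuration above):
// cmd/testalert/main.go (sketch)
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

func main() {
	alerts := []alert{{
		Labels: map[string]string{
			"alertname": "HighErrorRate",
			"service":   "api-gateway",
			"severity":  "critical", // should be routed to pagerduty-critical
		},
		Annotations: map[string]string{
			"description": "synthetic alert for routing verification",
		},
		StartsAt: time.Now(),
	}}

	body, err := json.Marshal(alerts)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://alertmanager:9093/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("alertmanager response:", resp.Status)
}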
Incident Response Automation
# incident_automation.py
import os
import json
import requests
from datetime import datetime
from typing import Dict, List, Optional
import asyncio
import aiohttp
class IncidentResponder:
def __init__(self):
self.pagerduty_token = os.getenv('PAGERDUTY_TOKEN')
self.slack_token = os.getenv('SLACK_TOKEN')
self.jira_url = os.getenv('JIRA_URL')
self.jira_auth = (os.getenv('JIRA_USER'), os.getenv('JIRA_TOKEN'))
async def handle_incident(self, alert: Dict):
"""インシデント対応の自動化"""
# インシデントの作成
incident_id = await self.create_pagerduty_incident(alert)
# Create a Slack channel
channel_id = await self.create_incident_channel(incident_id, alert)
# Run initial diagnostics
diagnostics = await self.run_diagnostics(alert)
# Post diagnostic results to Slack
await self.post_to_slack(channel_id, diagnostics)
# Create a Jira ticket
jira_ticket = await self.create_jira_ticket(incident_id, alert, diagnostics)
# Attempt automated remediation
if alert.get('auto_remediate', False):
await self.attempt_auto_remediation(alert)
return {
'incident_id': incident_id,
'slack_channel': channel_id,
'jira_ticket': jira_ticket
}
async def create_pagerduty_incident(self, alert: Dict) -> str:
"""PagerDutyインシデントの作成"""
async with aiohttp.ClientSession() as session:
incident_data = {
'incident': {
'type': 'incident',
'title': alert['title'],
'service': {
'id': alert['service_id'],
'type': 'service_reference'
},
'body': {
'type': 'incident_body',
'details': alert['description']
},
'urgency': 'high' if alert['severity'] == 'critical' else 'low'
}
}
headers = {
'Authorization': f'Token token={self.pagerduty_token}',
'Content-Type': 'application/json'
}
async with session.post(
'https://api.pagerduty.com/incidents',
json=incident_data,
headers=headers
) as response:
result = await response.json()
return result['incident']['id']
async def create_incident_channel(self, incident_id: str, alert: Dict) -> str:
"""Slackインシデントチャンネルの作成"""
channel_name = f"incident-{incident_id[:8]}-{datetime.now().strftime('%Y%m%d')}"
async with aiohttp.ClientSession() as session:
# Create the channel
create_response = await session.post(
'https://slack.com/api/conversations.create',
headers={'Authorization': f'Bearer {self.slack_token}'},
json={
'name': channel_name,
'is_private': False
}
)
channel_data = await create_response.json()
channel_id = channel_data['channel']['id']
# Set the topic
await session.post(
'https://slack.com/api/conversations.setTopic',
headers={'Authorization': f'Bearer {self.slack_token}'},
json={
'channel': channel_id,
'topic': f"Incident: {alert['title']}"
}
)
# Invite stakeholders
stakeholders = self.get_stakeholders(alert)
if stakeholders:
await session.post(
'https://slack.com/api/conversations.invite',
headers={'Authorization': f'Bearer {self.slack_token}'},
json={
'channel': channel_id,
'users': ','.join(stakeholders)
}
)
return channel_id
async def run_diagnostics(self, alert: Dict) -> Dict:
"""自動診断の実行"""
diagnostics = {
'timestamp': datetime.now().isoformat(),
'service': alert['service'],
'checks': []
}
# Service-specific checks
if alert['service'] == 'api-gateway':
diagnostics['checks'].extend([
await self.check_api_health(),
await self.check_upstream_services(),
await self.check_rate_limits()
])
elif alert['service'] == 'database':
diagnostics['checks'].extend([
await self.check_database_connections(),
await self.check_query_performance(),
await self.check_replication_lag()
])
# Common checks
diagnostics['checks'].extend([
await self.check_recent_deployments(),
await self.check_resource_usage(),
await self.check_error_logs()
])
return diagnostics
async def attempt_auto_remediation(self, alert: Dict) -> bool:
"""自動修復の試行"""
remediation_actions = {
'high_memory_usage': self.restart_service,
'connection_pool_exhausted': self.increase_connection_pool,
'rate_limit_exceeded': self.adjust_rate_limits,
'unhealthy_instances': self.replace_unhealthy_instances
}
action = remediation_actions.get(alert['type'])
if action:
try:
result = await action(alert)
await self.log_remediation(alert, result)
return result['success']
except Exception as e:
await self.log_remediation_failure(alert, str(e))
return False
return False
async def restart_service(self, alert: Dict) -> Dict:
"""サービスの再起動"""
service_name = alert['service']
# Rolling restart via the Kubernetes API
async with aiohttp.ClientSession() as session:
async with session.patch(
f'https://k8s-api/apis/apps/v1/namespaces/default/deployments/{service_name}',
json={
'spec': {
'template': {
'metadata': {
'annotations': {
'restartedAt': datetime.now().isoformat()
}
}
}
}
}
) as response:
if response.status == 200:
return {'success': True, 'action': 'service_restarted'}
else:
return {'success': False, 'error': await response.text()}
SRE Practices
SLI/SLO Definitions
# slo_definitions.yaml
slos:
- name: api-availability
description: "API Gateway availability"
sli:
query: |
(
sum(rate(http_requests_total{service="api-gateway"}[5m]))
-
sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
)
/
sum(rate(http_requests_total{service="api-gateway"}[5m]))
target: 0.999 # 99.9% availability
window: 30d
- name: api-latency
description: "API response time"
sli:
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m]))
by (le)
)
target: 0.95 # 95% of requests under threshold
threshold: 0.3 # 300ms
window: 30d
- name: error-budget
description: "Error budget consumption"
sli:
query: |
1 - (
sum(increase(http_requests_total{status!~"5.."}[30d])) /
sum(increase(http_requests_total[30d]))
)
target: 0.001 # 0.1% error budget
window: 30d
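These definitions can be loaded straight into the SLO struct used by the error-budget monitor in the next section. A minimal sketch using gopkg.in/yaml.v3; it ignores extra fields such as threshold and only understands day-based windows, both assumptions made for brevity:
// sre/sloconfig.go (sketch)
package sre

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"

	"gopkg.in/yaml.v3"
)

type sloFile struct {
	SLOs []struct {
		Name string `yaml:"name"`
		SLI  struct {
			Query string `yaml:"query"`
		} `yaml:"sli"`
		Target float64 `yaml:"target"`
		Window string  `yaml:"window"` // e.g. "30d"
	} `yaml:"slos"`
}

// LoadSLOs reads slo_definitions.yaml and converts it into SLO values.
func LoadSLOs(path string) ([]SLO, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var f sloFile
	if err := yaml.Unmarshal(raw, &f); err != nil {
		return nil, err
	}
	slos := make([]SLO, 0, len(f.SLOs))
	for _, s := range f.SLOs {
		w, err := parseDays(s.Window)
		if err != nil {
			return nil, err
		}
		slos = append(slos, SLO{Name: s.Name, Query: s.SLI.Query, Target: s.Target, Window: w})
	}
	return slos, nil
}

// parseDays handles "30d"-style windows, which time.ParseDuration does not accept.
func parseDays(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil {
			return 0, err
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	return 0, fmt.Errorf("unsupported window %q", s)
}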
Error Budget Monitoring
// error_budget.go
package sre
import (
"context"
"fmt"
"time"
"github.com/prometheus/client_golang/api"
v1 "github.com/prometheus/client_golang/api/prometheus/v1"
"github.com/prometheus/common/model"
)
type ErrorBudgetMonitor struct {
client api.Client
slos []SLO
}
type SLO struct {
Name string
Query string
Target float64
Window time.Duration
}
func (m *ErrorBudgetMonitor) CalculateErrorBudget(ctx context.Context, slo SLO) (*ErrorBudget, error) {
v1api := v1.NewAPI(m.client)
// Compute the SLI
result, _, err := v1api.Query(ctx, slo.Query, time.Now())
if err != nil {
return nil, err
}
// Parse the result
var currentSLI float64
switch v := result.(type) {
case model.Vector:
if len(v) > 0 {
currentSLI = float64(v[0].Value)
}
default:
return nil, fmt.Errorf("unexpected result type: %T", result)
}
// Compute the error budget
errorBudget := &ErrorBudget{
SLO: slo.Name,
Target: slo.Target,
Current: currentSLI,
BudgetRemaining: (slo.Target - (1 - currentSLI)) / slo.Target * 100,
TimeWindow: slo.Window,
}
// Compute the burn rate
errorBudget.BurnRate = m.calculateBurnRate(ctx, slo)
// Projection
if errorBudget.BurnRate > 0 {
errorBudget.TimeToExhaustion = time.Duration(
float64(errorBudget.BudgetRemaining) / errorBudget.BurnRate * float64(time.Hour),
)
}
return errorBudget, nil
}
func (m *ErrorBudgetMonitor) calculateBurnRate(ctx context.Context, slo SLO) float64 {
// Compute the error rate over the past hour.
// Note: this assumes slo.Query holds a bare metric name such as
// http_requests_total rather than the full SLI expression.
query := fmt.Sprintf(`
increase(%s{status=~"5.."}[1h]) /
increase(%s[1h])
`, slo.Query, slo.Query)
v1api := v1.NewAPI(m.client)
result, _, err := v1api.Query(ctx, query, time.Now())
if err != nil {
return 0
}
switch v := result.(type) {
case model.Vector:
if len(v) > 0 {
hourlyErrorRate := float64(v[0].Value)
// Burn rate = current error rate / allowed error rate
return hourlyErrorRate / (1 - slo.Target)
}
}
return 0
}
type ErrorBudget struct {
SLO string
Target float64
Current float64
BudgetRemaining float64
BurnRate float64
TimeToExhaustion time.Duration
TimeWindow time.Duration
}
// Alert generation
func (m *ErrorBudgetMonitor) GenerateAlerts(budget *ErrorBudget) []Alert {
var alerts []Alert
// Alerts based on remaining budget
if budget.BudgetRemaining < 25 {
alerts = append(alerts, Alert{
Name: "ErrorBudgetCritical",
Severity: "critical",
Message: fmt.Sprintf("Error budget for %s is critically low: %.2f%%", budget.SLO, budget.BudgetRemaining),
})
} else if budget.BudgetRemaining < 50 {
alerts = append(alerts, Alert{
Name: "ErrorBudgetWarning",
Severity: "warning",
Message: fmt.Sprintf("Error budget for %s is low: %.2f%%", budget.SLO, budget.BudgetRemaining),
})
}
// Alerts based on burn rate
if budget.BurnRate > 10 {
alerts = append(alerts, Alert{
Name: "HighBurnRate",
Severity: "critical",
Message: fmt.Sprintf("Burn rate for %s is very high: %.2fx", budget.SLO, budget.BurnRate),
})
} else if budget.BurnRate > 2 {
alerts = append(alerts, Alert{
Name: "ElevatedBurnRate",
Severity: "warning",
Message: fmt.Sprintf("Burn rate for %s is elevated: %.2fx", budget.SLO, budget.BurnRate),
})
}
return alerts
}
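Wiring the monitor up is mostly plumbing: build a Prometheus API client, evaluate each SLO on an interval, and hand the resulting alerts to a notifier. A minimal sketch in the same sre package; it only logs alerts, and forwarding them to Alertmanager or Slack is left out:
// sre/run.go (sketch)
package sre

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
)

// Run evaluates every SLO once per interval and logs the generated alerts.
func Run(ctx context.Context, promAddr string, slos []SLO, interval time.Duration) error {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return err
	}
	monitor := &ErrorBudgetMonitor{client: client, slos: slos}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			for _, slo := range monitor.slos {
				budget, err := monitor.CalculateErrorBudget(ctx, slo)
				if err != nil {
					log.Printf("slo %s: %v", slo.Name, err)
					continue
				}
				for _, a := range monitor.GenerateAlerts(budget) {
					log.Printf("[%s] %s: %s", a.Severity, a.Name, a.Message)
				}
			}
		}
	}
}
The SLO slice can come from the LoadSLOs helper sketched in the previous section.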
Automation and Infrastructure as Code
Managing Infrastructure with Terraform
# monitoring/main.tf
module "prometheus" {
source = "./modules/prometheus"
namespace = "monitoring"
storage_size = "100Gi"
retention_days = 30
scrape_configs = [
{
job_name = "kubernetes-pods"
kubernetes_sd_configs = [{
role = "pod"
}]
}
]
alert_rules = [
for f in fileset("${path.module}/alerts", "*.yml") :
file("${path.module}/alerts/${f}")
]
}
module "grafana" {
source = "./modules/grafana"
namespace = "monitoring"
admin_password = var.grafana_admin_password
datasources = [
{
name = "Prometheus"
type = "prometheus"
url = "http://prometheus:9090"
},
{
name = "Elasticsearch"
type = "elasticsearch"
url = "http://elasticsearch:9200"
}
]
dashboards = {
for f in fileset("${path.module}/dashboards", "*.json") :
basename(f) => file("${path.module}/dashboards/${f}")
}
}
module "elasticsearch" {
source = "./modules/elasticsearch"
namespace = "logging"
cluster_name = "production-logs"
node_count = 3
node_resources = {
cpu = "2"
memory = "8Gi"
storage = "500Gi"
}
index_lifecycle_policies = {
logs = {
hot = {
min_age = "0ms"
actions = {
rollover = {
max_size = "50GB"
max_age = "7d"
}
}
}
warm = {
min_age = "7d"
actions = {
shrink = {
number_of_shards = 1
}
forcemerge = {
max_num_segments = 1
}
}
}
delete = {
min_age = "30d"
actions = {
delete = {}
}
}
}
}
}
Configuration Management with Ansible
# ansible/monitoring-playbook.yml
---
- name: Configure monitoring infrastructure
hosts: monitoring_servers
become: yes
vars:
prometheus_version: "2.45.0"
grafana_version: "10.0.0"
node_exporter_version: "1.6.0"
tasks:
- name: Install monitoring dependencies
package:
name:
- curl
- tar
- python3-pip
state: present
- name: Create monitoring user
user:
name: monitoring
system: yes
shell: /bin/false
home: /var/lib/monitoring
createhome: yes
- name: Install Prometheus
block:
- name: Download Prometheus
unarchive:
src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
dest: /opt
remote_src: yes
owner: monitoring
group: monitoring
- name: Configure Prometheus
template:
src: prometheus.yml.j2
dest: /opt/prometheus-{{ prometheus_version }}.linux-amd64/prometheus.yml
owner: monitoring
group: monitoring
notify: restart prometheus
- name: Create Prometheus systemd service
template:
src: prometheus.service.j2
dest: /etc/systemd/system/prometheus.service
notify: restart prometheus
- name: Configure log rotation
template:
src: logrotate.conf.j2
dest: /etc/logrotate.d/monitoring
- name: Setup monitoring alerts
copy:
src: "{{ item }}"
dest: /opt/prometheus-{{ prometheus_version }}.linux-amd64/rules/
owner: monitoring
group: monitoring
with_fileglob:
- files/alerts/*.yml
notify: reload prometheus
- name: Configure firewall rules
firewalld:
port: "{{ item }}/tcp"
permanent: yes
state: enabled
immediate: yes
loop:
- 9090 # Prometheus
- 9093 # Alertmanager
- 3000 # Grafana
- 9100 # Node Exporter
handlers:
- name: restart prometheus
systemd:
name: prometheus
state: restarted
daemon_reload: yes
enabled: yes
- name: reload prometheus
systemd:
name: prometheus
state: reloaded
Summary
The key points for modern infrastructure monitoring and operations:
- Observability - integrating metrics, logs, and traces
- Automation - incident response and self-healing
- SRE practices - reliability management through SLIs and SLOs
- Infrastructure as Code - reproducible environment provisioning
- Proactive monitoring - predictive alerting and anomaly detection
Putting these into practice enables highly stable system operations and rapid problem resolution.