Infrastructure Monitoring and Operations Best Practices, 2025 Edition - From Observability to Automation

A practical guide to modern infrastructure monitoring, covering Prometheus, Grafana, the ELK stack, distributed tracing, incident response, and SRE practices.


Introduction

In 2025, infrastructure operations are shifting from plain monitoring to observability. This article explains effective approaches to infrastructure monitoring and operations using current tools and best practices.

The Three Pillars of Observability

Integrating Metrics, Logs, and Traces

graph TB
    subgraph "Observability Stack"
        A[Applications] --> M[Metrics]
        A --> L[Logs]
        A --> T[Traces]
        
        M --> P[Prometheus]
        L --> E[Elasticsearch]
        T --> J[Jaeger]
        
        P --> G[Grafana]
        E --> G
        J --> G
        
        G --> D[Dashboards]
        G --> AL[Alerts]
    end
    
    subgraph "Data Flow"
        OT[OpenTelemetry] --> M
        OT --> L
        OT --> T
    end

Implementing OpenTelemetry

// telemetry/setup.go
package telemetry

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/exporters/prometheus"
    "go.opentelemetry.io/otel/metric"
    "go.opentelemetry.io/otel/propagation"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric" // aliased to avoid clashing with the metric API package
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

type TelemetryConfig struct {
    ServiceName    string
    ServiceVersion string
    Environment    string
    OTLPEndpoint   string
}

func InitTelemetry(cfg TelemetryConfig) (*sdktrace.TracerProvider, *sdkmetric.MeterProvider, error) {
    ctx := context.Background()
    
    // Resource definition (shared service metadata)
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(cfg.ServiceName),
            semconv.ServiceVersion(cfg.ServiceVersion),
            semconv.DeploymentEnvironment(cfg.Environment),
        ),
    )
    if err != nil {
        return nil, nil, err
    }
    
    // Trace pipeline: export spans over OTLP/gRPC
    traceExporter, err := otlptrace.New(
        ctx,
        otlptracegrpc.NewClient(
            otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
            otlptracegrpc.WithInsecure(),
        ),
    )
    if err != nil {
        return nil, nil, err
    }
    
    tracerProvider := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(traceExporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()), // samples everything; consider ParentBased/TraceIDRatioBased for high-traffic services
    )
    
    otel.SetTracerProvider(tracerProvider)
    otel.SetTextMapPropagator(
        propagation.NewCompositeTextMapPropagator(
            propagation.TraceContext{},
            propagation.Baggage{},
        ),
    )
    
    // Metrics pipeline: expose via the Prometheus exporter
    promExporter, err := prometheus.New()
    if err != nil {
        return nil, nil, err
    }
    
    meterProvider := sdkmetric.NewMeterProvider(
        sdkmetric.WithResource(res),
        sdkmetric.WithReader(promExporter),
    )
    
    otel.SetMeterProvider(meterProvider)
    
    return tracerProvider, meterProvider, nil
}

// Custom metric definitions
type MetricsCollector struct {
    meter             metric.Meter
    requestCount      metric.Int64Counter
    requestDuration   metric.Float64Histogram
    activeConnections metric.Int64UpDownCounter
}

func NewMetricsCollector(provider *sdkmetric.MeterProvider) (*MetricsCollector, error) {
    meter := provider.Meter("app.metrics")
    
    requestCount, err := meter.Int64Counter(
        "http_requests_total",
        metric.WithDescription("Total number of HTTP requests"),
        metric.WithUnit("1"),
    )
    if err != nil {
        return nil, err
    }
    
    requestDuration, err := meter.Float64Histogram(
        "http_request_duration_seconds",
        metric.WithDescription("HTTP request duration in seconds"),
        metric.WithUnit("s"),
    )
    if err != nil {
        return nil, err
    }
    
    activeConnections, err := meter.Int64UpDownCounter(
        "http_active_connections",
        metric.WithDescription("Number of active HTTP connections"),
        metric.WithUnit("1"),
    )
    if err != nil {
        return nil, err
    }
    
    return &MetricsCollector{
        meter:             meter,
        requestCount:      requestCount,
        requestDuration:   requestDuration,
        activeConnections: activeConnections,
    }, nil
}
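
The collector above only defines the instruments; something still has to record values into them. Below is a minimal usage sketch, not part of the original setup, of an HTTP middleware that drives all three instruments (status-code attributes are omitted since they would require wrapping the ResponseWriter):

// middleware.go - a usage sketch for the MetricsCollector above
package telemetry

import (
    "net/http"
    "time"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

// Middleware records request count, latency, and in-flight connections.
func (mc *MetricsCollector) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()
        attrs := metric.WithAttributes(
            attribute.String("http.method", r.Method),
            attribute.String("http.route", r.URL.Path),
        )

        mc.activeConnections.Add(ctx, 1, attrs)
        defer mc.activeConnections.Add(ctx, -1, attrs)

        start := time.Now()
        next.ServeHTTP(w, r)

        mc.requestCount.Add(ctx, 1, attrs)
        mc.requestDuration.Record(ctx, time.Since(start).Seconds(), attrs)
    })
}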

Metrics Collection with Prometheus

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'ap-northeast-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "alerts/*.yml"
  - "recording_rules/*.yml"

# Scrape configuration
scrape_configs:
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
  
  # Node Exporter
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
  
  # Custom application targets
  - job_name: 'custom-app'
    static_configs:
      - targets: ['app-1:8080', 'app-2:8080', 'app-3:8080']
    metrics_path: '/metrics'
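
For the static custom-app targets above, each instance has to serve /metrics itself. A minimal sketch of the serving side using promhttp; note that the OpenTelemetry Prometheus exporter shown earlier registers with the default registry unless configured otherwise, so its metrics are exposed here too:

// main.go - expose metrics on the port scraped by the 'custom-app' job
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Serve the default Prometheus registry at the path and port
    // referenced by the scrape config above.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}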

Defining Alert Rules

# alerts/application.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # Error-rate alert
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Service {{ $labels.service }} has error rate of {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      
      # Response-time alert
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time for {{ $labels.service }} is {{ $value }}s"
      
      # Memory-usage alert
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_working_set_bytes{pod!=""}
            / 
            container_spec_memory_limit_bytes{pod!=""}
          ) > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage detected"
          description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      # Disk-usage alert
      - alert: DiskSpaceRunningOut
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}
            / 
            node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}
          ) < 0.1
        for: 15m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Disk space running out"
          description: "Node {{ $labels.instance }} has only {{ $value | humanizePercentage }} disk space left on {{ $labels.mountpoint }}"
      
      # CPU-usage alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage detected"
          description: "Node {{ $labels.instance }} CPU usage is {{ $value }}%"

Visualization with Grafana

Dashboard Definition

{
  "dashboard": {
    "title": "Application Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph",
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        }
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph",
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        }
      },
      {
        "title": "Response Time (95th percentile)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph",
        "gridPos": {
          "h": 8,
          "w": 24,
          "x": 0,
          "y": 8
        }
      }
    ]
  }
}

Grafana as Code

# terraform/grafana.tf
resource "grafana_dashboard" "application_performance" {
  config_json = jsonencode({
    title = "Application Performance Dashboard"
    uid   = "app-performance"
    panels = [
      {
        id    = 1
        title = "Request Rate"
        type  = "timeseries"
        gridPos = {
          h = 8
          w = 12
          x = 0
          y = 0
        }
        targets = [
          {
            expr         = "sum(rate(http_requests_total[5m])) by (service)"
            refId        = "A"
            datasource   = "Prometheus"
            legendFormat = "{{ service }}"
          }
        ]
        fieldConfig = {
          defaults = {
            unit = "reqps"
            color = {
              mode = "palette-classic"
            }
          }
        }
      }
    ]
  })
}

# NOTE: the alert-rule schema varies across Grafana provider versions
# (newer providers manage alerting via grafana_rule_group), so treat
# this resource as a sketch.
resource "grafana_alert_rule" "high_error_rate" {
  title      = "High Error Rate Alert"
  uid        = "high-error-rate"
  folder_uid = grafana_folder.alerts.uid
  
  condition = "C"
  data {
    ref_id = "A"
    query_type = "prometheus"
    model = jsonencode({
      expr = "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)"
      refId = "A"
    })
  }
  
  data {
    ref_id = "B"
    query_type = "prometheus"
    model = jsonencode({
      expr = "sum(rate(http_requests_total[5m])) by (service)"
      refId = "B"
    })
  }
  
  data {
    ref_id = "C"
    query_type = "math"
    model = jsonencode({
      expression = "$A / $B"
      refId = "C"
    })
  }
  
  no_data_state  = "NoData"
  exec_err_state = "Alerting"
  for            = "5m"
  
  annotations = {
    summary     = "Service {{ $labels.service }} has high error rate"
    description = "Error rate is {{ $values.C }}%"
  }
}

Log Management with the ELK Stack

Elasticsearch Configuration

# elasticsearch.yml
cluster.name: production-logging
node.name: es-node-1

# Network settings
network.host: 0.0.0.0
http.port: 9200

# Discovery settings
discovery.seed_hosts:
  - es-node-1
  - es-node-2
  - es-node-3

cluster.initial_master_nodes:
  - es-node-1
  - es-node-2
  - es-node-3

# Index settings
action.auto_create_index: ".monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"

# Performance settings
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: 1000

# Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate

Logstash Pipeline

# logstash/pipeline/application.conf
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate_authorities => ["/etc/logstash/certs/ca.crt"]
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key => "/etc/logstash/certs/logstash.key"
  }
  
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092,kafka-3:9092"
    topics => ["application-logs"]
    group_id => "logstash-consumer"
    codec => "json"
  }
}

filter {
  # Parse JSON logs
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    
    mutate {
      add_field => {
        "[@metadata][target_index]" => "app-logs-%{[parsed][service]}-%{+YYYY.MM.dd}"
      }
    }
  }
  
  # Parse access logs
  if [type] == "nginx-access" {
    grok {
      match => {
        "message" => '%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}" "%{DATA:http_x_forwarded_for}" %{NUMBER:request_time}'
      }
    }
    
    date {
      match => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
      target => "@timestamp"
    }
    
    mutate {
      convert => {
        "status" => "integer"
        "body_bytes_sent" => "integer"
        "request_time" => "float"
      }
    }
    
    # GeoIP lookup
    geoip {
      source => "remote_addr"
      target => "geoip"
    }
  }
  
  # Handle error logs
  if [level] == "ERROR" or [level] == "FATAL" {
    mutate {
      add_tag => ["alert"]
    }
    
    # Fingerprint stack traces for aggregation
    if [stack_trace] {
      fingerprint {
        source => ["stack_trace"]
        target => "[@metadata][fingerprint]"
        method => "SHA256"
      }
    }
  }
  
  # Extract metrics
  if [parsed][metrics] {
    ruby {
      code => '
        metrics = event.get("[parsed][metrics]")
        metrics.each do |key, value|
          event.set("metric_#{key}", value)
        end
      '
    }
  }
}

output {
  elasticsearch {
    hosts => ["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"]
    index => "%{[@metadata][target_index]}"
    template_name => "application-logs"
    template => "/etc/logstash/templates/application-logs.json"
    template_overwrite => true
    
    # Security settings
    ssl => true
    ssl_certificate_verification => true
    cacert => "/etc/logstash/certs/ca.crt"
    user => "${ELASTIC_USER}"
    password => "${ELASTIC_PASSWORD}"
  }
  
  # Alert output
  if "alert" in [tags] {
    http {
      # Note: Alertmanager's /api/v1/alerts endpoint expects a JSON array
      # of alerts, so treat the mapping below as schematic.
      url => "http://alertmanager:9093/api/v1/alerts"
      http_method => "post"
      format => "json"
      mapping => {
        "alerts" => [
          {
            "labels" => {
              "alertname" => "ApplicationError"
              "service" => "%{[parsed][service]}"
              "severity" => "critical"
            }
            "annotations" => {
              "summary" => "Application error detected"
              "description" => "%{[message]}"
            }
          }
        ]
      }
    }
  }
}
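
The pipeline above assumes structured JSON on the producer side: the "service" field selects the target index and error-level entries get tagged for alerting. A minimal sketch of a matching producer, assuming Go 1.21+'s log/slog; the field names are chosen to line up with the filter, not taken from the original article:

// logger.go - emit JSON logs in the shape the Logstash filter expects
package main

import (
    "log/slog"
    "os"
)

func main() {
    // Every record carries a "service" field for index routing.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With(
        slog.String("service", "api-gateway"),
    )

    logger.Info("request handled", slog.Int("status", 200))

    // ERROR-level records are the ones the pipeline tags for alerting.
    logger.Error("upstream timeout", slog.String("error", "context deadline exceeded"))
}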

Kibana Dashboards

{
  "version": "8.0.0",
  "objects": [
    {
      "id": "application-logs-dashboard",
      "type": "dashboard",
      "attributes": {
        "title": "Application Logs Dashboard",
        "panels": [
          {
            "version": "8.0.0",
            "type": "visualization",
            "gridData": {
              "x": 0,
              "y": 0,
              "w": 24,
              "h": 15
            },
            "panelConfig": {
              "title": "Log Volume Over Time",
              "type": "line",
              "params": {
                "index": "app-logs-*",
                "query": {
                  "match_all": {}
                },
                "aggs": {
                  "time_buckets": {
                    "date_histogram": {
                      "field": "@timestamp",
                      "interval": "5m"
                    },
                    "aggs": {
                      "log_levels": {
                        "terms": {
                          "field": "level.keyword"
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          {
            "version": "8.0.0",
            "type": "visualization",
            "gridData": {
              "x": 0,
              "y": 15,
              "w": 12,
              "h": 15
            },
            "panelConfig": {
              "title": "Top Errors",
              "type": "data_table",
              "params": {
                "index": "app-logs-*",
                "query": {
                  "bool": {
                    "filter": [
                      {
                        "term": {
                          "level.keyword": "ERROR"
                        }
                      }
                    ]
                  }
                },
                "aggs": {
                  "error_messages": {
                    "terms": {
                      "field": "error.message.keyword",
                      "size": 10
                    }
                  }
                }
              }
            }
          }
        ]
      }
    }
  ]
}

Distributed Tracing

Configuring and Deploying Jaeger

# k8s/jaeger-deployment.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
spec:
  strategy: production
  
  collector:
    replicas: 3
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
    options:
      kafka:
        producer:
          topic: jaeger-spans
          brokers: kafka-1:9092,kafka-2:9092
    
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
        tls:
          ca-cert: /es-certs/ca.crt
        username: jaeger
        password: ${JAEGER_ES_PASSWORD}
    
  query:
    replicas: 2
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    options:
      query:
        max-clock-skew-adjustment: 30s
    
  ingester:
    replicas: 2
    options:
      kafka:
        consumer:
          topic: jaeger-spans
          brokers: kafka-1:9092,kafka-2:9092
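
On the application side, spans only reach this Jaeger installation if services create them. A minimal sketch that reuses the tracer provider registered by InitTelemetry earlier in this article; the service, operation, and attribute names are illustrative:

// orders.go - creating spans with the globally registered tracer
package orders

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func handleOrder(ctx context.Context, orderID string) error {
    ctx, span := otel.Tracer("order-service").Start(ctx, "handleOrder")
    defer span.End()

    span.SetAttributes(attribute.String("order.id", orderID))

    // Child spans are linked automatically as long as ctx is propagated.
    return processPayment(ctx, orderID)
}

func processPayment(ctx context.Context, orderID string) error {
    _, span := otel.Tracer("order-service").Start(ctx, "processPayment")
    defer span.End()
    // Payment logic omitted in this sketch.
    return nil
}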

Implementing Trace Analysis

# trace_analysis.py
import requests
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

class TraceAnalyzer:
    def __init__(self, jaeger_url):
        self.jaeger_url = jaeger_url
        self.base_url = f"{jaeger_url}/api/traces"
    
    def get_traces(self, service, operation=None, start_time=None, end_time=None, limit=1000):
        """Fetch traces from Jaeger"""
        if not start_time:
            start_time = datetime.now() - timedelta(hours=1)
        if not end_time:
            end_time = datetime.now()
        
        params = {
            'service': service,
            'start': int(start_time.timestamp() * 1000000),
            'end': int(end_time.timestamp() * 1000000),
            'limit': limit
        }
        
        if operation:
            params['operation'] = operation
        
        response = requests.get(self.base_url, params=params)
        return response.json()['data']
    
    def analyze_performance(self, traces):
        """Extract performance metrics from traces"""
        durations = []
        span_counts = []
        error_counts = []
        
        for trace in traces:
            # End-to-end duration of the trace
            trace_duration = max(span['startTime'] + span['duration'] 
                               for span in trace['spans']) - \
                           min(span['startTime'] for span in trace['spans'])
            durations.append(trace_duration)
            
            # Number of spans
            span_counts.append(len(trace['spans']))
            
            # Number of spans tagged with errors
            error_count = sum(1 for span in trace['spans'] 
                            if any(tag['key'] == 'error' and tag['value'] 
                                  for tag in span.get('tags', [])))
            error_counts.append(error_count)
        
        df = pd.DataFrame({
            'duration_us': durations,
            'span_count': span_counts,
            'error_count': error_counts
        })
        
        return {
            'avg_duration_ms': df['duration_us'].mean() / 1000,
            'p50_duration_ms': df['duration_us'].quantile(0.5) / 1000,
            'p95_duration_ms': df['duration_us'].quantile(0.95) / 1000,
            'p99_duration_ms': df['duration_us'].quantile(0.99) / 1000,
            'avg_span_count': df['span_count'].mean(),
            'error_rate': (df['error_count'] > 0).mean()
        }
    
    def find_bottlenecks(self, trace):
        """Identify bottlenecks within a trace"""
        spans = trace['spans']
        
        # Compute each span's self time
        span_self_times = {}
        for span in spans:
            span_id = span['spanID']
            total_time = span['duration']
            
            # Subtract time spent in child spans
            child_time = sum(
                child['duration'] 
                for child in spans 
                if any(ref['spanID'] == span_id 
                      for ref in child.get('references', []))
            )
            
            span_self_times[span_id] = {
                'operation': span['operationName'],
                'service': span['process']['serviceName'],
                'self_time': total_time - child_time,
                'total_time': total_time
            }
        
        # Sort by self time, descending
        bottlenecks = sorted(
            span_self_times.values(), 
            key=lambda x: x['self_time'], 
            reverse=True
        )[:5]
        
        return bottlenecks
    
    def detect_anomalies(self, traces, threshold_percentile=95):
        """Detect anomalously slow traces"""
        durations = [
            max(span['startTime'] + span['duration'] for span in trace['spans']) -
            min(span['startTime'] for span in trace['spans'])
            for trace in traces
        ]
        
        threshold = np.percentile(durations, threshold_percentile)
        
        anomalous_traces = [
            trace for trace, duration in zip(traces, durations)
            if duration > threshold
        ]
        
        return anomalous_traces

Incident Response

PagerDuty Integration

# alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    - match:
        severity: warning
      receiver: slack-warnings
    - match_re:
        service: database-.*
      receiver: database-team

receivers:
  - name: 'default'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_SERVICE_KEY}'  # Events API v2 key, matching the global v2 enqueue URL
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          alerts: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#warnings'
        send_resolved: true

  - name: 'database-team'
    email_configs:
      - to: 'database-team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: '${SMTP_USERNAME}'
        auth_password: '${SMTP_PASSWORD}'

Automating Incident Response

# incident_automation.py
import os
from datetime import datetime
from typing import Dict

import aiohttp

class IncidentResponder:
    def __init__(self):
        self.pagerduty_token = os.getenv('PAGERDUTY_TOKEN')
        self.slack_token = os.getenv('SLACK_TOKEN')
        self.jira_url = os.getenv('JIRA_URL')
        self.jira_auth = (os.getenv('JIRA_USER'), os.getenv('JIRA_TOKEN'))
    
    async def handle_incident(self, alert: Dict):
        """Automated incident response workflow"""
        # Create the PagerDuty incident
        incident_id = await self.create_pagerduty_incident(alert)
        
        # Create a dedicated Slack channel
        channel_id = await self.create_incident_channel(incident_id, alert)
        
        # Run initial diagnostics
        diagnostics = await self.run_diagnostics(alert)
        
        # Post diagnostic results to Slack
        await self.post_to_slack(channel_id, diagnostics)
        
        # Create a Jira ticket
        jira_ticket = await self.create_jira_ticket(incident_id, alert, diagnostics)
        
        # Attempt automatic remediation
        if alert.get('auto_remediate', False):
            await self.attempt_auto_remediation(alert)
        
        return {
            'incident_id': incident_id,
            'slack_channel': channel_id,
            'jira_ticket': jira_ticket
        }
    
    async def create_pagerduty_incident(self, alert: Dict) -> str:
        """Create a PagerDuty incident"""
        async with aiohttp.ClientSession() as session:
            incident_data = {
                'incident': {
                    'type': 'incident',
                    'title': alert['title'],
                    'service': {
                        'id': alert['service_id'],
                        'type': 'service_reference'
                    },
                    'body': {
                        'type': 'incident_body',
                        'details': alert['description']
                    },
                    'urgency': 'high' if alert['severity'] == 'critical' else 'low'
                }
            }
            
            headers = {
                'Authorization': f'Token token={self.pagerduty_token}',
                'Content-Type': 'application/json',
                # PagerDuty's REST API also requires a 'From' header with
                # a valid user email when creating incidents.
            }
            
            async with session.post(
                'https://api.pagerduty.com/incidents',
                json=incident_data,
                headers=headers
            ) as response:
                result = await response.json()
                return result['incident']['id']
    
    async def create_incident_channel(self, incident_id: str, alert: Dict) -> str:
        """Create a Slack incident channel"""
        channel_name = f"incident-{incident_id[:8]}-{datetime.now().strftime('%Y%m%d')}"
        
        async with aiohttp.ClientSession() as session:
            # Create the channel
            create_response = await session.post(
                'https://slack.com/api/conversations.create',
                headers={'Authorization': f'Bearer {self.slack_token}'},
                json={
                    'name': channel_name,
                    'is_private': False
                }
            )
            channel_data = await create_response.json()
            channel_id = channel_data['channel']['id']
            
            # Set the channel topic
            await session.post(
                'https://slack.com/api/conversations.setTopic',
                headers={'Authorization': f'Bearer {self.slack_token}'},
                json={
                    'channel': channel_id,
                    'topic': f"Incident: {alert['title']}"
                }
            )
            
            # Invite stakeholders
            stakeholders = self.get_stakeholders(alert)
            if stakeholders:
                await session.post(
                    'https://slack.com/api/conversations.invite',
                    headers={'Authorization': f'Bearer {self.slack_token}'},
                    json={
                        'channel': channel_id,
                        'users': ','.join(stakeholders)
                    }
                )
            
            return channel_id
    
    async def run_diagnostics(self, alert: Dict) -> Dict:
        """Run automated diagnostics"""
        diagnostics = {
            'timestamp': datetime.now().isoformat(),
            'service': alert['service'],
            'checks': []
        }
        
        # Service-specific checks
        if alert['service'] == 'api-gateway':
            diagnostics['checks'].extend([
                await self.check_api_health(),
                await self.check_upstream_services(),
                await self.check_rate_limits()
            ])
        elif alert['service'] == 'database':
            diagnostics['checks'].extend([
                await self.check_database_connections(),
                await self.check_query_performance(),
                await self.check_replication_lag()
            ])
        
        # Common checks
        diagnostics['checks'].extend([
            await self.check_recent_deployments(),
            await self.check_resource_usage(),
            await self.check_error_logs()
        ])
        
        return diagnostics
    
    async def attempt_auto_remediation(self, alert: Dict) -> bool:
        """Attempt automatic remediation"""
        remediation_actions = {
            'high_memory_usage': self.restart_service,
            'connection_pool_exhausted': self.increase_connection_pool,
            'rate_limit_exceeded': self.adjust_rate_limits,
            'unhealthy_instances': self.replace_unhealthy_instances
        }
        
        action = remediation_actions.get(alert['type'])
        if action:
            try:
                result = await action(alert)
                await self.log_remediation(alert, result)
                return result['success']
            except Exception as e:
                await self.log_remediation_failure(alert, str(e))
                return False
        
        return False
    
    async def restart_service(self, alert: Dict) -> Dict:
        """Restart a service"""
        service_name = alert['service']
        
        # Rolling restart via the Kubernetes API (auth headers and the
        # strategic-merge-patch content type are omitted in this sketch)
        async with aiohttp.ClientSession() as session:
            async with session.patch(
                f'https://k8s-api/apis/apps/v1/namespaces/default/deployments/{service_name}',
                json={
                    'spec': {
                        'template': {
                            'metadata': {
                                'annotations': {
                                    'restartedAt': datetime.now().isoformat()
                                }
                            }
                        }
                    }
                }
            ) as response:
                if response.status == 200:
                    return {'success': True, 'action': 'service_restarted'}
                else:
                    return {'success': False, 'error': await response.text()}

SRE Practices

SLI/SLO Definitions

# slo_definitions.yaml
slos:
  - name: api-availability
    description: "API Gateway availability"
    sli:
      query: |
        sum(rate(http_requests_total{service="api-gateway"}[5m])) - 
        sum(rate(http_requests_total{service="api-gateway",status=~"5.."}[5m]))
        / 
        sum(rate(http_requests_total{service="api-gateway"}[5m]))
    target: 0.999  # 99.9% availability
    window: 30d
    
  - name: api-latency
    description: "API response time"
    sli:
      query: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])) 
          by (le)
        )
    target: 0.95  # 95% of requests under threshold
    threshold: 0.3  # 300ms
    window: 30d
    
  - name: error-budget
    description: "Error budget consumption"
    sli:
      query: |
        1 - (
          sum(increase(http_requests_total{status!~"5.."}[30d])) /
          sum(increase(http_requests_total[30d]))
        )
    target: 0.001  # 0.1% error budget
    window: 30d

Error Budget Monitoring

// error_budget.go
package sre

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

type ErrorBudgetMonitor struct {
    client api.Client
    slos   []SLO
}

type SLO struct {
    Name   string
    Metric string // base metric name, used by the burn-rate query
    Query  string // full PromQL expression for the SLI
    Target float64
    Window time.Duration
}

func (m *ErrorBudgetMonitor) CalculateErrorBudget(ctx context.Context, slo SLO) (*ErrorBudget, error) {
    v1api := v1.NewAPI(m.client)
    
    // Evaluate the SLI query
    result, _, err := v1api.Query(ctx, slo.Query, time.Now())
    if err != nil {
        return nil, err
    }
    
    // Parse the query result
    var currentSLI float64
    switch v := result.(type) {
    case model.Vector:
        if len(v) > 0 {
            currentSLI = float64(v[0].Value)
        }
    default:
        return nil, fmt.Errorf("unexpected result type: %T", result)
    }
    
    // Compute the error budget. The budget is the allowed error fraction
    // (1 - target), so the remaining share is (SLI - target) / (1 - target).
    errorBudget := &ErrorBudget{
        SLO:             slo.Name,
        Target:          slo.Target,
        Current:         currentSLI,
        BudgetRemaining: (currentSLI - slo.Target) / (1 - slo.Target) * 100,
        TimeWindow:      slo.Window,
    }
    
    // Compute the burn rate
    errorBudget.BurnRate = m.calculateBurnRate(ctx, slo)
    
    // Forecast: at burn rate X the whole budget lasts Window/X, so scale
    // the window by the remaining fraction and divide by the burn rate.
    if errorBudget.BurnRate > 0 {
        errorBudget.TimeToExhaustion = time.Duration(
            errorBudget.BudgetRemaining / 100 * float64(slo.Window) / errorBudget.BurnRate,
        )
    }
    
    return errorBudget, nil
}

func (m *ErrorBudgetMonitor) calculateBurnRate(ctx context.Context, slo SLO) float64 {
    // Compute the error rate over the past hour. This uses the base metric
    // name (slo.Metric) because slo.Query is a full PromQL expression and
    // cannot be wrapped in increase().
    query := fmt.Sprintf(`
        sum(increase(%s{status=~"5.."}[1h])) /
        sum(increase(%s[1h]))
    `, slo.Metric, slo.Metric)
    
    v1api := v1.NewAPI(m.client)
    result, _, err := v1api.Query(ctx, query, time.Now())
    if err != nil {
        return 0
    }
    
    switch v := result.(type) {
    case model.Vector:
        if len(v) > 0 {
            hourlyErrorRate := float64(v[0].Value)
            // Burn rate = current error rate / allowed error rate
            return hourlyErrorRate / (1 - slo.Target)
        }
    }
    
    return 0
}

type ErrorBudget struct {
    SLO              string
    Target           float64
    Current          float64
    BudgetRemaining  float64
    BurnRate         float64
    TimeToExhaustion time.Duration
    TimeWindow       time.Duration
}

// Alert is the minimal alert payload emitted by the monitor.
type Alert struct {
    Name     string
    Severity string
    Message  string
}

// Alert generation
func (m *ErrorBudgetMonitor) GenerateAlerts(budget *ErrorBudget) []Alert {
    var alerts []Alert
    
    // Alerts based on remaining budget
    if budget.BudgetRemaining < 25 {
        alerts = append(alerts, Alert{
            Name:     "ErrorBudgetCritical",
            Severity: "critical",
            Message:  fmt.Sprintf("Error budget for %s is critically low: %.2f%%", budget.SLO, budget.BudgetRemaining),
        })
    } else if budget.BudgetRemaining < 50 {
        alerts = append(alerts, Alert{
            Name:     "ErrorBudgetWarning",
            Severity: "warning",
            Message:  fmt.Sprintf("Error budget for %s is low: %.2f%%", budget.SLO, budget.BudgetRemaining),
        })
    }
    
    // Alerts based on burn rate
    if budget.BurnRate > 10 {
        alerts = append(alerts, Alert{
            Name:     "HighBurnRate",
            Severity: "critical",
            Message:  fmt.Sprintf("Burn rate for %s is very high: %.2fx", budget.SLO, budget.BurnRate),
        })
    } else if budget.BurnRate > 2 {
        alerts = append(alerts, Alert{
            Name:     "ElevatedBurnRate",
            Severity: "warning",
            Message:  fmt.Sprintf("Burn rate for %s is elevated: %.2fx", budget.SLO, budget.BurnRate),
        })
    }
    
    return alerts
}
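
A sketch of how the monitor might be wired up, assuming a Prometheus endpoint at http://prometheus:9090 and a one-minute evaluation interval (both assumptions; "log" must also be added to the imports above):

// RunMonitorLoop evaluates every SLO once a minute and logs the alerts.
func RunMonitorLoop(ctx context.Context) error {
    client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
    if err != nil {
        return err
    }

    monitor := &ErrorBudgetMonitor{
        client: client,
        slos: []SLO{{
            Name:   "api-availability",
            Metric: "http_requests_total",
            Query:  `(sum(rate(http_requests_total[5m])) - sum(rate(http_requests_total{status=~"5.."}[5m]))) / sum(rate(http_requests_total[5m]))`,
            Target: 0.999,
            Window: 30 * 24 * time.Hour,
        }},
    }

    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for {
        for _, slo := range monitor.slos {
            budget, err := monitor.CalculateErrorBudget(ctx, slo)
            if err != nil {
                log.Printf("slo %s: %v", slo.Name, err)
                continue
            }
            for _, alert := range monitor.GenerateAlerts(budget) {
                log.Printf("[%s] %s", alert.Severity, alert.Message)
            }
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}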

Automation and Infrastructure as Code

Infrastructure Management with Terraform

# monitoring/main.tf
module "prometheus" {
  source = "./modules/prometheus"
  
  namespace      = "monitoring"
  storage_size   = "100Gi"
  retention_days = 30
  
  scrape_configs = [
    {
      job_name = "kubernetes-pods"
      kubernetes_sd_configs = [{
        role = "pod"
      }]
    }
  ]
  
  # file() does not expand globs; collect the rule files explicitly.
  alert_rules = [
    for f in fileset("${path.module}/alerts", "*.yml") :
    file("${path.module}/alerts/${f}")
  ]
}

module "grafana" {
  source = "./modules/grafana"
  
  namespace     = "monitoring"
  admin_password = var.grafana_admin_password
  
  datasources = [
    {
      name = "Prometheus"
      type = "prometheus"
      url  = "http://prometheus:9090"
    },
    {
      name = "Elasticsearch"
      type = "elasticsearch"
      url  = "http://elasticsearch:9200"
    }
  ]
  
  dashboards = {
    for f in fileset("${path.module}/dashboards", "*.json") :
    basename(f) => file("${path.module}/dashboards/${f}")
  }
}

module "elasticsearch" {
  source = "./modules/elasticsearch"
  
  namespace    = "logging"
  cluster_name = "production-logs"
  node_count   = 3
  
  node_resources = {
    cpu    = "2"
    memory = "8Gi"
    storage = "500Gi"
  }
  
  index_lifecycle_policies = {
    logs = {
      hot = {
        min_age = "0ms"
        actions = {
          rollover = {
            max_size = "50GB"
            max_age  = "7d"
          }
        }
      }
      warm = {
        min_age = "7d"
        actions = {
          shrink = {
            number_of_shards = 1
          }
          forcemerge = {
            max_num_segments = 1
          }
        }
      }
      delete = {
        min_age = "30d"
        actions = {
          delete = {}
        }
      }
    }
  }
}

Configuration Management with Ansible

# ansible/monitoring-playbook.yml
---
- name: Configure monitoring infrastructure
  hosts: monitoring_servers
  become: yes
  vars:
    prometheus_version: "2.45.0"
    grafana_version: "10.0.0"
    node_exporter_version: "1.6.0"
  
  tasks:
    - name: Install monitoring dependencies
      package:
        name:
          - curl
          - tar
          - python3-pip
        state: present
    
    - name: Create monitoring user
      user:
        name: monitoring
        system: yes
        shell: /bin/false
        home: /var/lib/monitoring
        createhome: yes
    
    - name: Install Prometheus
      block:
        - name: Download Prometheus
          unarchive:
            src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
            dest: /opt
            remote_src: yes
            owner: monitoring
            group: monitoring
        
        - name: Configure Prometheus
          template:
            src: prometheus.yml.j2
            dest: /opt/prometheus-{{ prometheus_version }}.linux-amd64/prometheus.yml
            owner: monitoring
            group: monitoring
          notify: restart prometheus
        
        - name: Create Prometheus systemd service
          template:
            src: prometheus.service.j2
            dest: /etc/systemd/system/prometheus.service
          notify: restart prometheus
    
    - name: Configure log rotation
      template:
        src: logrotate.conf.j2
        dest: /etc/logrotate.d/monitoring
    
    - name: Setup monitoring alerts
      copy:
        src: "{{ item }}"
        dest: /opt/prometheus-{{ prometheus_version }}.linux-amd64/rules/
        owner: monitoring
        group: monitoring
      with_fileglob:
        - files/alerts/*.yml
      notify: reload prometheus
    
    - name: Configure firewall rules
      firewalld:
        port: "{{ item }}/tcp"
        permanent: yes
        state: enabled
        immediate: yes
      loop:
        - 9090  # Prometheus
        - 9093  # Alertmanager
        - 3000  # Grafana
        - 9100  # Node Exporter
  
  handlers:
    - name: restart prometheus
      systemd:
        name: prometheus
        state: restarted
        daemon_reload: yes
        enabled: yes
    
    - name: reload prometheus
      systemd:
        name: prometheus
        state: reloaded

Summary

Key points for modern infrastructure monitoring and operations:

  1. Observability - integrate metrics, logs, and traces
  2. Automation - automated incident response and self-healing
  3. SRE practices - reliability management with SLIs and SLOs
  4. Infrastructure as Code - reproducible environment provisioning
  5. Proactive monitoring - predictive alerting and anomaly detection

Putting these practices in place enables highly stable system operations and rapid problem resolution.