Skip to content

モニタリングワークフロー

このガイドでは、Probeを使用して包括的な監視システムを構築する方法を説明します。サービスヘルスのチェック、パフォーマンスの監視、問題のアラートを行うワークフローの作成方法を学びます。

基本的なサービス監視

シンプルなヘルスチェック

基本的なヘルスチェックワークフローから始めます:

yaml
name: Basic Service Health Check
description: 必須サービスエンドポイントを監視

vars:
  API_BASE_URL: https://api.yourcompany.com
  HEALTH_ENDPOINT: /health
  TIMEOUT: 30s

jobs:
- name: Service Health Check
  defaults:
    http:
      timeout: "{{vars.TIMEOUT}}"
      headers:
        User-Agent: "Probe Monitor v1.0"
  steps:
    - name: API Health Check
      uses: http
      with:
        url: "{{vars.API_BASE_URL}}{{vars.HEALTH_ENDPOINT}}"
      test: res.code == 200
      outputs:
        api_healthy: res.code == 200
        response_time: res.time
        api_version: res.body.json.version

    - name: Health Status Report
      echo: |
        🏥 Health Check Results:
        
        API Status: {{outputs.api_healthy ? "✅ Healthy" : "❌ Down"}}
        Response Time: {{outputs.response_time}}ms
        API Version: {{outputs.api_version}}
        Timestamp: {{unixtime()}}

使用方法:

bash
probe health-check.yml

マルチサービスヘルス監視

複数のサービスを並列で監視します:

yaml
name: Multi-Service Health Monitor
description: すべての重要なサービスのヘルスをチェック

vars:
  USER_SERVICE_URL: https://users.api.yourcompany.com
  ORDER_SERVICE_URL: https://orders.api.yourcompany.com
  PAYMENT_SERVICE_URL: https://payments.api.yourcompany.com
  NOTIFICATION_SERVICE_URL: https://notifications.api.yourcompany.com

jobs:
- name: User Service Health
  steps:
    - name: User Service Check
      uses: http
      with:
        url: "{{vars.USER_SERVICE_URL}}/health"
      test: res.code == 200
      outputs:
        healthy: res.code == 200
        response_time: res.time
        user_count: res.body.json.active_users

- name: Order Service Health
  steps:
    - name: Order Service Check
      uses: http
      with:
        url: "{{vars.ORDER_SERVICE_URL}}/health"
      test: res.code == 200
      outputs:
        healthy: res.code == 200
        response_time: res.time
        pending_orders: res.body.json.pending_orders

- name: Payment Service Health
  steps:
    - name: Payment Service Check
      uses: http
      with:
        url: "{{vars.PAYMENT_SERVICE_URL}}/health"
      test: res.code == 200
      outputs:
        healthy: res.code == 200
        response_time: res.time
        transaction_queue: res.body.json.queue_length

- name: Notification Service Health
  steps:
    - name: Notification Service Check
      uses: http
      with:
        url: "{{vars.NOTIFICATION_SERVICE_URL}}/health"
      test: res.code == 200
      outputs:
        healthy: res.code == 200
        response_time: res.time
        queue_size: res.body.json.notification_queue

- name: Health Summary
  needs: [user-service-health, order-service-health, payment-service-health, notification-service-health]
  steps:
    - name: Generate Health Report
      echo: |
        🎯 Multi-Service Health Report
        ===============================
        
        User Service: {{outputs.user-service-health.healthy ? "✅" : "❌"}} ({{outputs.user-service-health.response_time}}ms)
          Active Users: {{outputs.user-service-health.user_count}}
        
        Order Service: {{outputs.order-service-health.healthy ? "✅" : "❌"}} ({{outputs.order-service-health.response_time}}ms)
          Pending Orders: {{outputs.order-service-health.pending_orders}}
        
        Payment Service: {{outputs.payment-service-health.healthy ? "✅" : "❌"}} ({{outputs.payment-service-health.response_time}}ms)
          Transaction Queue: {{outputs.payment-service-health.transaction_queue}}
        
        Notification Service: {{outputs.notification-service-health.healthy ? "✅" : "❌"}} ({{outputs.notification-service-health.response_time}}ms)
          Notification Queue: {{outputs.notification-service-health.queue_size}}
        
        Overall System Status: {{
          outputs.user-service-health.healthy && 
          outputs.order-service-health.healthy && 
          outputs.payment-service-health.healthy && 
          outputs.notification-service-health.healthy ? 
          "🟢 ALL SYSTEMS OPERATIONAL" : "🔴 ISSUES DETECTED"
        }}
        
        Timestamp: {{unixtime()}}

データベースとインフラストラクチャ監視

データベースヘルス監視

yaml
name: Database Health Monitor
description: データベース接続とパフォーマンスを監視

vars:
  DB_HOST: db.yourcompany.com
  DB_PORT: 5432
  DB_NAME: production
  REDIS_HOST: redis.yourcompany.com
  REDIS_PORT: 6379

jobs:
- name: Database Connectivity
  steps:
    - name: PostgreSQL Connection Test
      uses: http
      with:
        url: "{{vars.DB_API_URL}}/ping"
      test: res.code == 200
      outputs:
        db_connected: res.code == 200
        connection_time: res.time
        active_connections: res.body.json.active_connections
        max_connections: res.body.json.max_connections

    - name: Database Performance Check
      uses: http
      with:
        url: "{{vars.DB_API_URL}}/stats"
      test: res.code == 200 && res.body.json.query_performance.avg_ms < 100
      outputs:
        avg_query_time: res.body.json.query_performance.avg_ms
        slow_queries: res.body.json.slow_queries.count
        db_size_mb: res.body.json.database_size_mb

- name: Cache System Health
  steps:
    - name: Redis Connection Test
      uses: http
      with:
        url: "{{vars.CACHE_API_URL}}/ping"
      test: res.code == 200
      outputs:
        cache_connected: res.code == 200
        cache_response_time: res.time
        memory_usage_percent: res.body.json.memory.usage_percent
        keys_count: res.body.json.keys.total

    - name: Cache Performance Check
      uses: http
      with:
        url: "{{vars.CACHE_API_URL}}/stats"
      test: res.code == 200 && res.body.json.hit_rate > 0.8
      outputs:
        hit_rate: res.body.json.hit_rate
        miss_rate: res.body.json.miss_rate
        evicted_keys: res.body.json.evicted_keys

- name: Infrastructure Report
  needs: [database-connectivity, cache-system-health]
  steps:
    - name: Infrastructure Health Summary
      echo: |
        🏗️ Infrastructure Health Report
        =================================
        
        Database Status:
        Connection: {{outputs.database-connectivity.db_connected ? "✅ Connected" : "❌ Failed"}}
        Response Time: {{outputs.database-connectivity.connection_time}}ms
        Active Connections: {{outputs.database-connectivity.active_connections}}/{{outputs.database-connectivity.max_connections}}
        Average Query Time: {{outputs.database-connectivity.avg_query_time}}ms
        Slow Queries: {{outputs.database-connectivity.slow_queries}}
        Database Size: {{outputs.database-connectivity.db_size_mb}}MB
        
        Cache Status:
        Connection: {{outputs.cache-system-health.cache_connected ? "✅ Connected" : "❌ Failed"}}
        Response Time: {{outputs.cache-system-health.cache_response_time}}ms
        Memory Usage: {{outputs.cache-system-health.memory_usage_percent}}%
        Total Keys: {{outputs.cache-system-health.keys_count}}
        Hit Rate: {{(outputs.cache-system-health.hit_rate * 100)}}%
        Miss Rate: {{(outputs.cache-system-health.miss_rate * 100)}}%
        
        Performance Alerts:
        {{outputs.database-connectivity.avg_query_time > 100 ? "⚠️ Database queries are slow (>" + outputs.database-connectivity.avg_query_time + "ms)" : ""}}
        {{outputs.database-connectivity.slow_queries > 10 ? "⚠️ High number of slow queries (" + outputs.database-connectivity.slow_queries + ")" : ""}}
        {{outputs.cache-system-health.memory_usage_percent > 80 ? "⚠️ Cache memory usage high (" + outputs.cache-system-health.memory_usage_percent + "%)" : ""}}
        {{outputs.cache-system-health.hit_rate < 0.8 ? "⚠️ Cache hit rate low (" + (outputs.cache-system-health.hit_rate * 100) + "%)" : ""}}

包括的システム監視

フルスタック監視ワークフロー

yaml
name: Full-Stack System Monitor
description: すべてのシステムコンポーネントの包括的監視

vars:
  # サービスURL
  FRONTEND_URL: https://app.yourcompany.com
  API_GATEWAY_URL: https://api.yourcompany.com
  
  # 監視しきい値
  MAX_RESPONSE_TIME: 2000
  MIN_SUCCESS_RATE: 0.95
  MAX_ERROR_RATE: 0.05

jobs:
  # Tier 1: インフラストラクチャ層
- name: Infrastructure Health Check
  steps:
    - name: Load Balancer Health
      uses: http
      with:
        url: "{{vars.LOAD_BALANCER_URL}}/health"
      test: res.code == 200
      outputs:
        lb_healthy: res.code == 200
        active_backends: res.body.json.active_backends
        total_backends: res.body.json.total_backends

    - name: CDN Performance
      uses: http
      with:
        url: "{{vars.CDN_URL}}/health"
      test: res.code == 200 && res.time < 500
      outputs:
        cdn_healthy: res.code == 200
        cdn_response_time: res.time
        cache_hit_ratio: res.body.json.cache_hit_ratio

  # Tier 2: アプリケーション層
- name: Application Health Check
  needs: [infrastructure-health-check]
  steps:
    - name: Frontend Health
      uses: http
      with:
        url: "{{vars.FRONTEND_URL}}/health"
      test: res.code == 200
      outputs:
        frontend_healthy: res.code == 200
        frontend_version: res.body.json.version
        frontend_build: res.body.json.build

    - name: API Gateway Health
      uses: http
      with:
        url: "{{vars.API_GATEWAY_URL}}/health"
      test: res.code == 200
      outputs:
        gateway_healthy: res.code == 200
        gateway_version: res.body.json.version
        registered_services: res.body.json.services.length

  # Tier 3: ビジネスロジック層
- name: Business Logic Health
  needs: [application-health-check]
  steps:
    - name: User Service Functional Test
      uses: http
      with:
        url: "{{vars.API_GATEWAY_URL}}/users/health-check"
      test: res.code == 200 && res.body.json.functional_test_passed == true
      outputs:
        user_service_functional: res.body.json.functional_test_passed
        active_sessions: res.body.json.active_sessions

    - name: Order Service Functional Test
      uses: http
      with:
        url: "{{vars.API_GATEWAY_URL}}/orders/health-check"
      test: res.code == 200 && res.body.json.functional_test_passed == true
      outputs:
        order_service_functional: res.body.json.functional_test_passed
        processing_queue_length: res.body.json.queue_length

  # Tier 4: パフォーマンス検証
- name: Performance Validation
  needs: [business-logic-health]
  steps:
    - name: End-to-End Performance Test
      uses: http
      with:
        url: "{{vars.API_GATEWAY_URL}}/performance/e2e-test"
        method: POST
        body: |
          {
            "test_type": "quick_validation",
            "max_duration_seconds": 30
          }
      test: |
        res.code == 200 && 
        res.body.json.success_rate >= {{vars.MIN_SUCCESS_RATE}} &&
        res.body.json.avg_response_time <= {{vars.MAX_RESPONSE_TIME}}
      outputs:
        success_rate: res.body.json.success_rate
        avg_response_time: res.body.json.avg_response_time
        p95_response_time: res.body.json.p95_response_time
        error_rate: res.body.json.error_rate

  # Tier 5: セキュリティとコンプライアンス
- name: Security Health Checks
  needs: [performance-validation]
  steps:
    - name: SSL Certificate Check
      uses: http
      with:
        url: "{{vars.SECURITY_API_URL}}/ssl-check"
        method: POST
        body: |
          {
            "domains": [
              "{{vars.FRONTEND_URL}}",
              "{{vars.API_GATEWAY_URL}}"
            ]
          }
      test: res.code == 200 && res.body.json.all_certificates_valid == true
      outputs:
        ssl_valid: res.body.json.all_certificates_valid
        cert_expiry_days: res.body.json.min_days_to_expiry

    - name: Security Headers Check
      uses: http
      with:
        url: "{{vars.SECURITY_API_URL}}/headers-check"
        method: POST
        body: |
          {
            "url": "{{vars.FRONTEND_URL}}"
          }
      test: res.code == 200 && res.body.json.security_score >= 0.8
      outputs:
        security_score: res.body.json.security_score
        missing_headers: res.body.json.missing_headers

  # 最終レポート
- name: System Health Report
  needs: [infrastructure-health-check, application-health-check, business-logic-health, performance-validation, security-health-checks]
  steps:
    - name: Generate Comprehensive Report
      echo: |
        🌐 Full-Stack System Health Report
        ===================================
        Generated: {{unixtime()}}
        
        📊 INFRASTRUCTURE LAYER
        Load Balancer: {{outputs.infrastructure-health-check.lb_healthy ? "✅ Healthy" : "❌ Issues"}}
          Backends: {{outputs.infrastructure-health-check.active_backends}}/{{outputs.infrastructure-health-check.total_backends}} active
        CDN: {{outputs.infrastructure-health-check.cdn_healthy ? "✅ Healthy" : "❌ Issues"}} ({{outputs.infrastructure-health-check.cdn_response_time}}ms)
          Cache Hit Ratio: {{(outputs.infrastructure-health-check.cache_hit_ratio * 100)}}%
        
        🖥️ APPLICATION LAYER
        Frontend: {{outputs.application-health-check.frontend_healthy ? "✅ Healthy" : "❌ Issues"}}
          Version: {{outputs.application-health-check.frontend_version}} (Build: {{outputs.application-health-check.frontend_build}})
        API Gateway: {{outputs.application-health-check.gateway_healthy ? "✅ Healthy" : "❌ Issues"}}
          Version: {{outputs.application-health-check.gateway_version}}
          Services: {{outputs.application-health-check.registered_services}} registered
        
        🏢 BUSINESS LOGIC LAYER
        User Service: {{outputs.business-logic-health.user_service_functional ? "✅ Functional" : "❌ Issues"}}
          Active Sessions: {{outputs.business-logic-health.active_sessions}}
        Order Service: {{outputs.business-logic-health.order_service_functional ? "✅ Functional" : "❌ Issues"}}
          Processing Queue: {{outputs.business-logic-health.processing_queue_length}} items
        
        ⚡ PERFORMANCE METRICS
        Success Rate: {{(outputs.performance-validation.success_rate * 100)}}%
        Average Response Time: {{outputs.performance-validation.avg_response_time}}ms
        95th Percentile: {{outputs.performance-validation.p95_response_time}}ms
        Error Rate: {{(outputs.performance-validation.error_rate * 100)}}%
        
        🔒 SECURITY STATUS
        SSL Certificates: {{outputs.security-health-checks.ssl_valid ? "✅ Valid" : "❌ Issues"}}
          Expiry: {{outputs.security-health-checks.cert_expiry_days}} days minimum
        Security Headers: Score {{(outputs.security-health-checks.security_score * 100)}}%
        {{outputs.security-health-checks.missing_headers ? "Missing Headers: " + outputs.security-health-checks.missing_headers : ""}}
        
        🎯 OVERALL SYSTEM STATUS
        {{
          outputs.infrastructure-health-check.lb_healthy &&
          outputs.infrastructure-health-check.cdn_healthy &&
          outputs.application-health-check.frontend_healthy &&
          outputs.application-health-check.gateway_healthy &&
          outputs.business-logic-health.user_service_functional &&
          outputs.business-logic-health.order_service_functional &&
          outputs.performance-validation.success_rate >= vars.MIN_SUCCESS_RATE &&
          outputs.performance-validation.avg_response_time <= vars.MAX_RESPONSE_TIME &&
          outputs.security-health-checks.ssl_valid &&
          outputs.security-health-checks.security_score >= 0.8
          ? "🟢 ALL SYSTEMS OPERATIONAL" 
          : "🔴 ISSUES REQUIRE ATTENTION"
        }}
        
        ⚠️ ALERTS
        {{outputs.infrastructure-health-check.active_backends != outputs.infrastructure-health-check.total_backends ? "• Load balancer has inactive backends" : ""}}
        {{outputs.infrastructure-health-check.cdn_response_time > 1000 ? "• CDN response time is high" : ""}}
        {{outputs.performance-validation.success_rate < vars.MIN_SUCCESS_RATE ? "• Success rate below threshold" : ""}}
        {{outputs.performance-validation.avg_response_time > vars.MAX_RESPONSE_TIME ? "• Average response time exceeds threshold" : ""}}
        {{outputs.security-health-checks.cert_expiry_days < 30 ? "• SSL certificates expiring soon" : ""}}
        {{outputs.security-health-checks.security_score < 0.8 ? "• Security headers need improvement" : ""}}

アラートと通知の統合

メールアラート付き監視

yaml
name: Monitoring with Email Alerts
description: 自動メール通知付きヘルス監視

vars:
  # SMTP設定
  SMTP_HOST: smtp.gmail.com
  SMTP_PORT: 587
  SMTP_USERNAME: alerts@yourcompany.com
  ALERT_RECIPIENTS: ["ops@yourcompany.com", "dev-team@yourcompany.com"]
  
  # 監視設定
  CRITICAL_SERVICES: ["user-service", "payment-service", "order-service"]

jobs:
- name: Health Monitoring
  steps:
    - name: User Service Check
      id: user-service
      uses: http
      with:
        url: "{{vars.USER_SERVICE_URL}}/health"
      test: res.code == 200
      continue_on_error: true
      outputs:
        healthy: res.code == 200
        status_code: res.code
        response_time: res.time

    - name: Payment Service Check
      id: payment-service
      uses: http
      with:
        url: "{{vars.PAYMENT_SERVICE_URL}}/health"
      test: res.code == 200
      continue_on_error: true
      outputs:
        healthy: res.code == 200
        status_code: res.code
        response_time: res.time

    - name: Order Service Check
      id: order-service
      uses: http
      with:
        url: "{{vars.ORDER_SERVICE_URL}}/health"
      test: res.code == 200
      continue_on_error: true
      outputs:
        healthy: res.code == 200
        status_code: res.code
        response_time: res.time

- name: Alert Processing
  needs: [health-monitoring]
  steps:
    - name: Critical Service Alert
      if: "!outputs.user-service.healthy || !outputs.payment-service.healthy || !outputs.order-service.healthy"
      uses: smtp
      with:
        host: "{{vars.SMTP_HOST}}"
        port: "{{vars.SMTP_PORT}}"
        username: "{{vars.SMTP_USERNAME}}"
        password: "{{vars.SMTP_PASSWORD}}"
        from: "{{vars.SMTP_USERNAME}}"
        to: "{{vars.ALERT_RECIPIENTS}}"
        subject: "🚨 CRITICAL: Service Health Alert - {{unixtime()}}"
        body: |
          CRITICAL SERVICE HEALTH ALERT
          =============================
          
          Time: {{unixtime()}}
          Environment: {{vars.ENVIRONMENT || "Production"}}
          
          Service Status:
          User Service: {{outputs.user-service.healthy ? "✅ Healthy" : "❌ DOWN (HTTP " + outputs.user-service.status_code + ")"}}
          Payment Service: {{outputs.payment-service.healthy ? "✅ Healthy" : "❌ DOWN (HTTP " + outputs.payment-service.status_code + ")"}}
          Order Service: {{outputs.order-service.healthy ? "✅ Healthy" : "❌ DOWN (HTTP " + outputs.order-service.status_code + ")"}}
          
          Response Times:
          User Service: {{outputs.user-service.response_time}}ms
          Payment Service: {{outputs.payment-service.response_time}}ms
          Order Service: {{outputs.order-service.response_time}}ms
          
          IMMEDIATE ACTION REQUIRED
          
          Please investigate the failing services immediately.
          
          Monitoring Dashboard: {{vars.DASHBOARD_URL}}
          Incident Management: {{vars.INCIDENT_URL}}

    - name: Performance Warning Alert
      if: |
        (outputs.user-service.healthy && outputs.user-service.response_time > 2000) ||
        (outputs.payment-service.healthy && outputs.payment-service.response_time > 2000) ||
        (outputs.order-service.healthy && outputs.order-service.response_time > 2000)
      uses: smtp
      with:
        host: "{{vars.SMTP_HOST}}"
        port: "{{vars.SMTP_PORT}}"
        username: "{{vars.SMTP_USERNAME}}"
        password: "{{vars.SMTP_PASSWORD}}"
        from: "{{vars.SMTP_USERNAME}}"
        to: "{{vars.ALERT_RECIPIENTS}}"
        subject: "⚠️ WARNING: Performance Degradation Detected"
        body: |
          PERFORMANCE WARNING
          ===================
          
          Time: {{unixtime()}}
          Environment: {{vars.ENVIRONMENT || "Production"}}
          
          Performance Issues Detected:
          {{outputs.user-service.response_time > 2000 ? "• User Service: " + outputs.user-service.response_time + "ms (threshold: 2000ms)" : ""}}
          {{outputs.payment-service.response_time > 2000 ? "• Payment Service: " + outputs.payment-service.response_time + "ms (threshold: 2000ms)" : ""}}
          {{outputs.order-service.response_time > 2000 ? "• Order Service: " + outputs.order-service.response_time + "ms (threshold: 2000ms)" : ""}}
          
          While services are responding, performance degradation may impact user experience.
          Please investigate at your earliest convenience.

    - name: All Clear Notification
      if: outputs.user-service.healthy && outputs.payment-service.healthy && outputs.order-service.healthy && outputs.user-service.response_time <= 2000 && outputs.payment-service.response_time <= 2000 && outputs.order-service.response_time <= 2000
      echo: |
        ✅ All Services Healthy
        
        All critical services are operating normally:
        • User Service: {{outputs.user-service.response_time}}ms
        • Payment Service: {{outputs.payment-service.response_time}}ms  
        • Order Service: {{outputs.order-service.response_time}}ms
        
        No alerts sent - system is healthy.

環境固有の監視

マルチ環境設定

base-monitoring.yml:

yaml
name: Service Health Monitor
description: すべての環境用のベース監視ワークフロー

jobs:
- name: Service Health Check
  defaults:
    http:
      headers:
        User-Agent: "Probe Monitor"
        Accept: "application/json"
  steps:
    - name: API Health
      uses: http
      with:
        url: "{{vars.API_URL}}/health"
      test: res.code == 200
      outputs:
        api_healthy: res.code == 200
        response_time: res.time

    - name: Database Health
      uses: http
      with:
        url: "{{vars.DB_API_URL}}/ping"
      test: res.code == 200
      outputs:
        db_healthy: res.code == 200
        db_response_time: res.time

- name: Monitoring Report
  needs: [service-health-check]
  steps:
    - name: Status Report
      echo: |
        Environment: {{vars.ENVIRONMENT}}
        API: {{outputs.service-health-check.api_healthy ? "✅" : "❌"}} ({{outputs.service-health-check.response_time}}ms)
        Database: {{outputs.service-health-check.db_healthy ? "✅" : "❌"}} ({{outputs.service-health-check.db_response_time}}ms)

development.yml:

yaml
vars:
  ENVIRONMENT: development
  API_URL: http://localhost:3000
  DB_API_URL: http://localhost:5432

jobs:
- name: Service Health Check
  defaults:
    http:
      timeout: 60s  # 開発環境ではより寛容

production.yml:

yaml
vars:
  ENVIRONMENT: production
  API_URL: https://api.yourcompany.com
  DB_API_URL: https://db-api.yourcompany.com

jobs:
- name: Service Health Check
  defaults:
    http:
      timeout: 10s  # プロダクション用の厳密なタイムアウト

# プロダクション固有のセキュリティ監視を追加
- name: Security Monitoring
  needs: [service-health-check]
  steps:
    - name: SSL Certificate Check
      uses: http
      with:
        url: "{{vars.SECURITY_API_URL}}/ssl-status"
      test: res.code == 200 && res.body.json.all_valid == true
      outputs:
        ssl_valid: res.body.json.all_valid
        days_to_expiry: res.body.json.min_days_to_expiry

使用方法:

bash
# 開発環境監視
probe base-monitoring.yml,development.yml

# プロダクション環境監視(セキュリティチェック含む)
probe base-monitoring.yml,production.yml

ベストプラクティス

1. 監視戦略

  • 監視を階層化: インフラストラクチャ → アプリケーション → ビジネスロジック
  • 適切なタイムアウトを設定: プロダクション用は厳密、開発用は寛容
  • continue_on_errorを使用: 重要でないチェックに対して
  • 段階的アラート: 情報 → 警告 → 重大

2. アラート疲れの防止

yaml
# 良い例: 条件付きアラート
- name: Smart Alerting
  if: errors.count > 5 && duration > 300  # 持続的な問題のみでアラート
  uses: smtp
  # ...

# 避ける: すべての問題でアラート
- name: Noisy Alerting
  if: any_error_detected
  uses: smtp
  # アラート疲れを引き起こす

3. パフォーマンスの考慮事項

yaml
# 良い例: 独立したチェックを並列実行
jobs:
- name: service-a-check    # 並列実行
- name: service-b-check    # 並列実行
- name: service-c-check    # 並列実行

# 良い例: 効率的な出力
outputs:
  service_healthy: res.code == 200  # ブール値フラグ
  response_time: res.time            # 特定のメトリクス
  # 避ける: レスポンス全体の保存: full_response: res.body.json

4. 文書化と保守

yaml
name: Well-Documented Monitor
description: |
  E-commerceプラットフォーム用の監視ワークフロー。
  
  チェック:
  - ユーザーサービスのヘルスとパフォーマンス
  - 注文処理サービス
  - ペイメントゲートウェイの接続性
  - データベースパフォーマンス
  
  アラート:
  - 重大: サービス完全停止
  - 警告: パフォーマンス劣化
  - 情報: すべてのシステム正常
  
  予想実行時間: 30-60秒
  
  保守:
  - しきい値を月次レビュー
  - サービス移行時にサービスURLを更新
  - アラートチャネルを四半期毎にテスト

一般的な問題のトラブルシューティング

1. サービス発見の問題

yaml
- name: Service Discovery Check
  uses: http
  with:
    url: "{{vars.SERVICE_REGISTRY_URL}}/services"
  test: res.code == 200 && res.body.json.services.length > 0
  outputs:
    available_services: res.body.json.services.map(s -> s.name)
    service_count: res.body.json.services.length

2. ネットワーク接続の問題

yaml
- name: Network Connectivity Test
  uses: http
  with:
    url: "{{vars.EXTERNAL_HEALTH_CHECK_URL}}"
    timeout: 5s
  test: res.code == 200
  continue_on_error: true
  outputs:
    external_connectivity: res.code == 200

3. 認証の問題

yaml
- name: Authentication Health Check
  uses: http
  with:
    url: "{{vars.AUTH_SERVICE_URL}}/health"
    headers:
      Authorization: "Bearer {{vars.HEALTH_CHECK_TOKEN}}"
  test: res.code == 200
  continue_on_error: true
  outputs:
    auth_service_healthy: res.code == 200

次のステップ

監視ワークフローの構築ができるようになったので、次を探索してください:

監視は信頼性の高いシステムの基盤です。これらのパターンを使用して、ユーザーに影響を与える前に問題をキャッチする包括的な監視を構築しましょう。

Released under the MIT License.