Metrics Module
The Metrics module is the monitoring and observability system for the Comdeall platform. It provides real-time application performance monitoring, request tracking, and operational insights. It integrates with Prometheus for metrics collection and Grafana for visualization, and it implements automated alerting on critical performance indicators. The module tracks API performance, user behavior patterns, and system health, and it provides analytics for optimization and troubleshooting.
Table of Contents
- Module Structure
- Metrics Endpoints
- Core Features
- Prometheus Integration
- Metrics Collection
- Performance Monitoring
- User Analytics
- Alerting System
- Integration Points
- Technical Implementation
- Best Practices
- Conclusion
Module Structure
The Metrics module follows a middleware-based architecture with Prometheus integration:
@Module({
  imports: [
    PrometheusModule.register({
      defaultMetrics: {
        enabled: true,
      },
      path: `${RouteNames.METRICS}`,
    }),
  ],
  providers: [MetricsService],
  controllers: [MetricsController],
  exports: [MetricsService],
})
export class MetricsModule {}
Core Components:
- Controller Layer (metrics.controller.ts): Exposes the Prometheus metrics endpoint for scraping and monitoring
- Service Layer (metrics.service.ts): Implements custom metrics collection with counters, gauges, and histograms
- Middleware Layer (metrics.middleware.ts): Automatically collects request metrics for all API endpoints
- APM Configuration (apm/): Prometheus configuration, alerting rules, and Grafana dashboards
- Integration Layer: Integration with the NestJS application lifecycle and monitoring infrastructure
Metrics Endpoints
| Endpoint | Method | Description | Auth Type |
|---|---|---|---|
| /metrics | GET | Prometheus metrics scraping endpoint | None (Public) |
Metrics Endpoint Features:
- Content-Type: text/plain for Prometheus compatibility
- Real-time Data: Live metrics collection and aggregation
- Performance Optimized: Efficient metrics generation without blocking requests
- Comprehensive Coverage: Application, system, and custom metrics
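To make the text/plain response concrete, the following is a minimal sketch of how one counter could be rendered in the Prometheus text exposition format. It is illustrative only: the module presumably delegates this to its Prometheus client library, and the `CounterSample` shape and the `renderCounter` and `escapeLabelValue` helpers are hypothetical names introduced here.

```typescript
// Hypothetical sample shape for illustration; not taken from the module's source.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Escape label values the way the text exposition format requires:
// backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render HELP and TYPE comment lines followed by one line per labelled sample.
function renderCounter(metricName: string, help: string, samples: CounterSample[]): string {
  const lines: string[] = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} counter`];
  for (const sample of samples) {
    const labelText = Object.keys(sample.labels)
      .map((key) => `${key}="${escapeLabelValue(sample.labels[key])}"`)
      .join(',');
    lines.push(
      labelText ? `${metricName}{${labelText}} ${sample.value}` : `${metricName} ${sample.value}`,
    );
  }
  return lines.join('\n') + '\n';
}

const exposition = renderCounter('api_requests_total', 'Total API requests', [
  { labels: { method: 'GET', route: '/users', status: '200' }, value: 42 },
]);
console.log(exposition);
```

Each sample becomes one line of the form `name{labels} value`, which is what a Prometheus scraper parses on every scrape.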
Core Features
Application Performance Monitoring
The module provides comprehensive application performance tracking with detailed request analytics:
Request Metrics:
- Total Requests: Complete count of all HTTP requests processed
- Request Duration: Histogram-based latency tracking with percentile calculations
- Concurrent Requests: Real-time gauge of active request processing
- Error Rates: Failure tracking with detailed error categorization
Performance Analytics:
- Response Time Tracking: End-to-end request processing duration
- Throughput Monitoring: Requests per second and volume analytics
- Resource Utilization: System resource consumption patterns
- Bottleneck Identification: Slowest endpoints and performance optimization insights
User Behavior Analytics
Advanced user behavior tracking provides insights into application usage patterns:
User Agent Analysis:
- Browser Detection: Automatic browser family identification using UAParser
- Platform Tracking: Mobile vs web request categorization
- Device Analytics: User agent string analysis for device insights
Traffic Source Analysis:
- Referer Tracking: Source domain identification for traffic analytics
- Direct vs Referred Traffic: User acquisition channel analysis
- Invalid URL Handling: Graceful error handling for malformed referer data
System Health Monitoring
Real-time system health tracking ensures optimal application performance:
System Metrics:
- Active Users: Unique IP address tracking for user activity monitoring
- Request Load: Peak traffic identification and capacity planning
- Error Distribution: Error pattern analysis across different endpoints
- Performance Trends: Historical data analysis for trend identification
Prometheus Integration
Metrics Collection Configuration
The module integrates with Prometheus using industry-standard metrics types:
Prometheus Metrics Types:
- Counters: Monotonically increasing values for totals (requests, errors)
- Gauges: Current values that can increase or decrease (concurrent requests, active users)
- Histograms: Distribution tracking with configurable buckets (request duration)
Default Metrics Integration:
PrometheusModule.register({
  defaultMetrics: {
    enabled: true, // Node.js runtime metrics
  },
  path: '/metrics', // Scraping endpoint
})
Custom Metrics Implementation
The service implements custom business logic metrics:
Business Metrics:
- api_requests_total: Total API requests with method, route, and status labels
- api_request_duration_seconds: Request processing time distribution
- api_request_errors_total: Error tracking with detailed categorization
- concurrent_http_requests: Real-time concurrent request monitoring
User Analytics Metrics:
- api_requests_by_user_agent: Browser and device usage analytics
- api_requests_by_referer: Traffic source and referral analytics
- total_mobile_requests: Mobile platform usage tracking
- total_web_requests: Web platform usage monitoring
Scraping Configuration
Prometheus scraping is configured for optimal performance and reliability:
Scraping Parameters:
- Scrape Interval: 15 seconds for application metrics
- Scrape Timeout: 10 seconds to prevent blocking
- Metrics Path: /api/metrics for NestJS application
- Target Configuration: Docker-compatible networking with host.docker.internal
Metrics Collection
Automatic Request Tracking
The MetricsMiddleware automatically collects comprehensive request data:
Request Lifecycle Tracking:
- Request Start: Performance timer initiation and concurrent request increment
- Request Processing: User agent, referer, and mobile detection
- Request Completion: Duration calculation, status tracking, and metrics recording
- Error Handling: Automatic error categorization and failure tracking
Middleware Implementation:
// Key tracking points
this.metricsService.incrementHttpRequests();
this.metricsService.observeRequestDuration(method, route, status, duration);
this.metricsService.incrementApiRequestCounter(method, route, status);
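The lifecycle described above can be sketched without any framework dependency. This is an assumption-laden illustration, not the module's actual middleware: `RequestLike`, `ResponseLike`, `MetricsRecorder`, and `createMetricsMiddleware` are hypothetical names, `decrementHttpRequests` is an assumed counterpart to `incrementHttpRequests`, and the numeric `status` type is a guess.

```typescript
// Hypothetical minimal shapes so the sketch runs without NestJS or Express.
interface RequestLike {
  method: string;
  path: string;
}
interface ResponseLike {
  statusCode: number;
  on(event: 'finish', listener: () => void): void;
}
// Method names mirror the calls quoted above; decrementHttpRequests is an assumption.
interface MetricsRecorder {
  incrementHttpRequests(): void;
  decrementHttpRequests(): void;
  observeRequestDuration(method: string, route: string, status: number, seconds: number): void;
  incrementApiRequestCounter(method: string, route: string, status: number): void;
}

function createMetricsMiddleware(metrics: MetricsRecorder, now: () => number = Date.now) {
  return (req: RequestLike, res: ResponseLike, next: () => void): void => {
    const startedAt = now(); // request start: begin timing
    metrics.incrementHttpRequests(); // one more request in flight
    res.on('finish', () => {
      // request completion: record duration in seconds and the final status
      const seconds = (now() - startedAt) / 1000;
      metrics.decrementHttpRequests();
      metrics.observeRequestDuration(req.method, req.path, res.statusCode, seconds);
      metrics.incrementApiRequestCounter(req.method, req.path, res.statusCode);
    });
    next(); // hand over to the rest of the pipeline immediately
  };
}

// Usage with a fake clock and a fake response object.
const calls: string[] = [];
const recorder: MetricsRecorder = {
  incrementHttpRequests: () => { calls.push('inc'); },
  decrementHttpRequests: () => { calls.push('dec'); },
  observeRequestDuration: (m, r, s, d) => { calls.push(`duration ${m} ${r} ${s} ${d}`); },
  incrementApiRequestCounter: (m, r, s) => { calls.push(`count ${m} ${r} ${s}`); },
};
let clock = 1000;
const finishListeners: Array<() => void> = [];
const fakeRes: ResponseLike = {
  statusCode: 200,
  on: (_event, listener) => { finishListeners.push(listener); },
};
const middleware = createMetricsMiddleware(recorder, () => clock);
middleware({ method: 'GET', path: '/users' }, fakeRes, () => { calls.push('next'); });
clock = 1250; // 250 ms later the response finishes
finishListeners.forEach((listener) => listener());
console.log(calls);
```

Recording inside the `finish` listener keeps the measurement off the request's critical path: `next()` runs before any duration or counter work happens.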
User Agent Processing
Advanced user agent analysis provides detailed client insights:
Browser Detection:
- UAParser Integration: Accurate browser family identification
- Fallback Handling: Unknown user agent graceful processing
- Mobile Detection: Device type categorization for analytics
Referer Analysis:
- URL Parsing: Domain extraction from referer headers
- Error Handling: Invalid URL graceful processing
- Unknown Source Tracking: Direct traffic identification
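The referer handling described above might look like the following sketch. It assumes that missing or malformed headers are bucketed under an 'unknown' label; both that label and the `refererDomain` helper name are illustrative rather than taken from the service.

```typescript
// Returns the hostname of a Referer header value, or 'unknown' when the header
// is missing or is not a valid absolute URL. The 'unknown' label is an assumption.
function refererDomain(referer: string | undefined): string {
  if (!referer) {
    return 'unknown'; // direct traffic: no Referer header was sent
  }
  try {
    return new URL(referer).hostname;
  } catch (_error) {
    return 'unknown'; // malformed header: the URL constructor threw
  }
}

console.log(refererDomain('https://www.google.com/search?q=therapy'));
console.log(refererDomain('not a url'));
console.log(refererDomain(undefined));
```

Reducing the header to its hostname also keeps the label's cardinality manageable, since full referer URLs are effectively unbounded.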
Performance Monitoring
Request Duration Analysis
Histogram-based request duration tracking provides detailed performance insights:
Duration Buckets:
buckets: [0.1, 0.3, 0.5, 1, 1.5, 2, 5, 10] // Seconds
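Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts all observations. The sketch below illustrates that bookkeeping for the buckets listed above; the `observe` helper and the sample durations are made up for the example.

```typescript
const buckets = [0.1, 0.3, 0.5, 1, 1.5, 2, 5, 10]; // seconds, as configured above

// counts[i] holds observations <= buckets[i]; the extra final slot is +Inf.
function observe(counts: number[], seconds: number): void {
  for (let i = 0; i < buckets.length; i++) {
    if (seconds <= buckets[i]) {
      counts[i] += 1;
    }
  }
  counts[buckets.length] += 1; // +Inf counts every observation
}

const counts: number[] = buckets.map(() => 0).concat([0]);
[0.05, 0.4, 0.4, 1.2, 12].forEach((seconds) => observe(counts, seconds));
console.log(counts); // [1, 1, 3, 3, 4, 4, 4, 4, 5]
```

The 12-second request lands only in the +Inf slot, which is why requests slower than the largest bucket (10 s) cannot be resolved any further by quantile queries.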
Performance Calculations:
- Average Latency: rate(api_request_duration_seconds_sum[5m]) / rate(api_request_duration_seconds_count[5m])
- P99 Latency: histogram_quantile(0.99, rate(api_request_duration_seconds_bucket[5m]))
- Slowest APIs: topk(5, rate(api_request_duration_seconds_sum[5m]) / rate(api_request_duration_seconds_count[5m]))
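For intuition about what histogram_quantile returns, here is a simplified sketch of the estimate: find the first bucket whose cumulative count reaches the target rank, then interpolate linearly inside it. It assumes positive, ascending bucket bounds and 0 < q <= 1, and it ignores the rate() step; the `histogramQuantile` function and the sample counts are illustrative.

```typescript
// upperBounds: finite bucket bounds in ascending order.
// cumulative: cumulative counts per bound, plus one trailing entry for +Inf.
function histogramQuantile(q: number, upperBounds: number[], cumulative: number[]): number {
  const total = cumulative[cumulative.length - 1];
  if (total === 0) {
    return NaN; // no observations, no estimate
  }
  const rank = q * total;
  let index = 0;
  while (index < upperBounds.length && cumulative[index] < rank) {
    index++;
  }
  if (index === upperBounds.length) {
    // The rank falls in the +Inf bucket: report the highest finite bound.
    return upperBounds[upperBounds.length - 1];
  }
  const lowerBound = index === 0 ? 0 : upperBounds[index - 1];
  const countBelow = index === 0 ? 0 : cumulative[index - 1];
  const countInBucket = cumulative[index] - countBelow;
  // Linear interpolation within the selected bucket.
  return lowerBound + (upperBounds[index] - lowerBound) * ((rank - countBelow) / countInBucket);
}

const bounds = [0.1, 0.3, 0.5, 1, 1.5, 2, 5, 10];
const cumulativeCounts = [50, 80, 90, 95, 97, 98, 99, 100, 100];
console.log(histogramQuantile(0.5, bounds, cumulativeCounts));
console.log(histogramQuantile(0.99, bounds, cumulativeCounts));
```

With these made-up counts the median is estimated at the first bucket's upper bound (0.1 s) and the P99 at roughly 5 s. The estimate can never be finer than the bucket layout, so bucket boundaries should sit near the latency thresholds that matter, such as the 2-second alert threshold.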
Error Rate Monitoring
Comprehensive error tracking enables proactive issue resolution:
Error Metrics:
- Error Rate Percentage: (sum(rate(api_request_errors_total[5m])) / sum(rate(api_requests_total[5m]))) * 100
- Error Distribution: Error categorization by endpoint, method, and status code
- Error Trends: Historical error pattern analysis for trend identification
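As a worked example of the error rate percentage, the sketch below computes it from two snapshots of the counters taken at the start and end of a window. It is a simplification of what rate() does (no extrapolation, no counter-reset handling), and the `CounterSnapshot` type, the `errorRatePercent` helper, and the choice to return 0 for an idle window are assumptions made for illustration.

```typescript
// Hypothetical snapshot of the two counters at one point in time.
interface CounterSnapshot {
  requests: number; // value of api_requests_total
  errors: number; // value of api_request_errors_total
}

// Percentage of requests that failed between two snapshots.
function errorRatePercent(earlier: CounterSnapshot, later: CounterSnapshot): number {
  const requestDelta = later.requests - earlier.requests;
  const errorDelta = later.errors - earlier.errors;
  if (requestDelta <= 0) {
    return 0; // no traffic in the window (or a counter reset): nothing to report
  }
  return (errorDelta * 100) / requestDelta;
}

const windowStart: CounterSnapshot = { requests: 1000, errors: 10 };
const windowEnd: CounterSnapshot = { requests: 1200, errors: 22 };
console.log(errorRatePercent(windowStart, windowEnd)); // 6
```

Here 12 of the 200 requests in the window failed, giving 6%, which would exceed a 5% threshold.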
Throughput Analytics
Request volume monitoring supports capacity planning and optimization:
Throughput Metrics:
- Requests Per Second: rate(api_requests_total[1m])
- Request Volume: rate(api_requests_total[5m])
- Peak Load Analysis: Maximum concurrent request tracking
- Traffic Patterns: User activity pattern identification
User Analytics
Platform Usage Analysis
Detailed platform usage analytics support product decision-making:
Platform Metrics:
- Mobile vs Web Ratio: (sum(rate(total_mobile_requests[1m])) / sum(rate(total_web_requests[1m])))
- Browser Usage: topk(5, api_requests_by_user_agent)
- Device Distribution: Mobile and web platform usage patterns
Traffic Source Analytics
Comprehensive traffic source analysis provides marketing insights:
Source Analytics:
- Top Referers: topk(5, api_requests_by_referer)
- Direct Traffic: Users accessing application directly
- Referral Traffic: External website traffic analysis
- Traffic Quality: User engagement patterns by source
Alerting System
Critical Performance Alerts
Automated alerting ensures rapid response to performance degradation:
High Error Rate Alert:
alert: HighErrorRate
expr: nestjs:api_error_rate_percent > 5
for: 5m
severity: critical
High Latency Alert:
alert: HighP99Latency
expr: nestjs:api_latency_p99 > 2
for: 5m
severity: warning
System Health Alerts
System resource monitoring prevents service degradation:
Concurrent Request Alert:
alert: HighConcurrentRequests
expr: concurrent_http_requests > 500
for: 2m
severity: warning
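The `for` clause in these rules means an alert fires only after its expression has been true continuously for that long; a single dip below the threshold resets the wait. The following sketch models that behavior for the concurrent request rule (threshold 500, `for: 2m`). The `Sample` type, the `isFiring` helper, and the sample values are hypothetical, and real Prometheus evaluates this incrementally at each rule evaluation rather than over a stored array.

```typescript
interface Sample {
  timestampSeconds: number;
  value: number;
}

// True once the value has stayed above the threshold, without interruption,
// for at least forSeconds as of the last sample.
function isFiring(samples: Sample[], threshold: number, forSeconds: number): boolean {
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (breachStart === null) {
        breachStart = sample.timestampSeconds; // condition just became true
      }
    } else {
      breachStart = null; // any dip resets the pending period
    }
  }
  if (breachStart === null || samples.length === 0) {
    return false;
  }
  return samples[samples.length - 1].timestampSeconds - breachStart >= forSeconds;
}

// Nine samples, 15 seconds apart, covering two minutes.
const sustained: Sample[] = [];
const interrupted: Sample[] = [];
for (let i = 0; i <= 8; i++) {
  sustained.push({ timestampSeconds: i * 15, value: 520 });
  interrupted.push({ timestampSeconds: i * 15, value: i === 1 ? 480 : 530 });
}
console.log(isFiring(sustained, 500, 120)); // true
console.log(isFiring(interrupted, 500, 120)); // false
```

The interrupted series dips to 480 at the 15-second mark, so its pending period restarts at 30 seconds and only 90 seconds have elapsed by the last sample.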
Alert Integration:
- Severity Levels: Critical, warning, and informational alerts
- Team Assignment: Backend team notification and escalation
- Detailed Descriptions: Actionable alert messages with context
Integration Points
Application Middleware Integration
Seamless integration with NestJS middleware pipeline:
Middleware Registration:
// Global middleware application
consumer.apply(MetricsMiddleware).forRoutes('*');
Performance Considerations:
- Non-blocking Operations: Metrics collection doesn't impact request processing
- Efficient Processing: Optimized metric calculation and storage
- Memory Management: Proper metric lifecycle and cleanup
External Monitoring Integration
Integration with external monitoring and alerting systems:
Prometheus Integration:
- Standard Metrics Format: Industry-standard Prometheus exposition format
- Label-based Querying: Flexible metric filtering and aggregation
- Time Series Storage: Historical data retention and analysis
Grafana Dashboard Integration:
- Real-time Visualization: Live metric dashboard and alerting
- Custom Dashboards: Business-specific metric visualization
- Alert Management: Visual alert status and management interface
Technical Implementation
Metric Types and Usage
Strategic use of different Prometheus metric types:
Counter Implementation:
// Monotonically increasing values
private readonly apiRequestCounter: Counter<string>;
this.apiRequestCounter.labels(method, route, status).inc();
Gauge Implementation:
// Current state values
private readonly concurrentRequests: Gauge<string>;
this.concurrentRequests.inc(); // Increment
this.concurrentRequests.dec(); // Decrement
Histogram Implementation:
// Distribution tracking
private readonly apiRequestDuration: Histogram<string>;
this.apiRequestDuration.labels(method, route, status).observe(duration);
Label Strategy
Comprehensive labeling strategy for flexible querying:
Request Labels:
- method: HTTP method (GET, POST, PUT, DELETE)
- route: API endpoint path for specific endpoint analysis
- status: HTTP status code for error categorization
User Analytics Labels:
- browser_family: Browser type for client analysis
- referer_domain: Traffic source for marketing analytics
- mobile_request/web_request: Platform categorization
Performance Optimization
Metrics collection is optimized for minimal performance impact:
Optimization Strategies:
- Asynchronous Processing: Metrics recorded on response completion
- Efficient Labeling: Strategic label usage to prevent cardinality explosion
- Memory Management: Proper metric cleanup and resource management
- Sampling Strategy: Configurable sampling for high-traffic scenarios
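One common way to keep label cardinality under control is to normalize the `route` label so that paths differing only by an identifier share one value. The sketch below shows the idea; whether the module does this is not stated in this document, and the `normalizeRoute` helper and the `:id` placeholder are hypothetical.

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_PATTERN = /^\d+$/;

// Drop the query string and replace identifier-like path segments with ':id',
// so /users/123 and /users/456 produce the same label value.
function normalizeRoute(path: string): string {
  const withoutQuery = path.split('?')[0];
  return withoutQuery
    .split('/')
    .map((segment) =>
      NUMERIC_PATTERN.test(segment) || UUID_PATTERN.test(segment) ? ':id' : segment,
    )
    .join('/');
}

console.log(normalizeRoute('/users/123/sessions?page=2')); // /users/:id/sessions
console.log(normalizeRoute('/children/3f2b8c1e-9a4d-4e7b-8f21-0c5d6e7a8b9c')); // /children/:id
console.log(normalizeRoute('/metrics')); // /metrics
```

Without normalization of this kind, every distinct identifier would create a new time series for each method and status combination.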
Best Practices
Metric Design Principles
Strategic metric design ensures actionable insights:
Metric Selection:
- Business Relevance: Metrics aligned with business objectives and SLAs
- Actionable Data: Metrics that drive operational decisions and improvements
- Performance Impact: Minimal overhead metric collection and processing
- Cardinality Management: Controlled label combinations to prevent storage issues
Monitoring Strategy
Comprehensive monitoring approach for proactive system management:
Monitoring Hierarchy:
- Real-time Alerts: Critical issues requiring immediate attention
- Trend Analysis: Long-term performance pattern identification
- Capacity Planning: Resource utilization forecasting and optimization
- User Experience: Client-side performance impact assessment
Alert Management
Effective alerting strategy prevents alert fatigue and ensures rapid response:
Alert Design:
- Threshold Tuning: Data-driven alert threshold configuration
- Alert Grouping: Related alert consolidation to prevent notification flooding
- Escalation Policies: Clear escalation paths for different severity levels
- Documentation: Detailed runbooks for alert response and resolution
Conclusion
The Metrics module provides a comprehensive monitoring and observability foundation for the Comdeall platform. Key strengths include:
Comprehensive Coverage:
- Application Performance: End-to-end request tracking with detailed analytics
- User Behavior: Client usage patterns and traffic source analysis
- System Health: Real-time system resource and performance monitoring
- Business Metrics: Custom metrics aligned with business objectives
Production-Ready Integration:
- Prometheus Compatibility: Industry-standard metrics format and collection
- Grafana Dashboard Support: Real-time visualization and alerting capabilities
- Automated Alerting: Proactive issue detection and notification systems
- Performance Optimized: Minimal overhead metric collection and processing
Operational Excellence:
- Real-time Insights: Live performance monitoring and issue detection
- Historical Analysis: Trend identification and capacity planning support
- Actionable Alerts: Targeted notifications with clear resolution guidance
- Scalable Architecture: Supports high-traffic scenarios with efficient processing
The module's architecture supports data-driven decision making and proactive system management. Its monitoring and alerting capabilities are essential for running Comdeall, a child development and therapy management platform, as a production-grade application.