What Real User Monitoring Measures#

Real User Monitoring (RUM) collects performance and behavior data from actual users interacting with your application in their real browsers, on their real networks, with their real hardware. Unlike synthetic monitoring, which tests a controlled scenario from a known location, RUM captures the full spectrum of user experience – including the user on a slow 3G connection in rural Brazil using a 4-year-old phone.

RUM answers questions that no amount of server-side monitoring can: How fast does the page actually load for users? Which JavaScript errors are users hitting in production? Where do users abandon a workflow? Which geographic regions experience worse performance?

Server-side metrics tell you your API responded in 50ms. RUM tells you the user waited 4 seconds because DNS resolution took 800ms, the TLS handshake took 600ms, the 2MB JavaScript bundle took 1.5 seconds to download, and the browser spent 1.1 seconds parsing and executing it. The 50ms API response time was the least of the user’s problems.

Core Web Vitals#

Core Web Vitals are Google’s standardized metrics for measuring user experience. They influence search ranking and provide a shared vocabulary for discussing frontend performance.

Largest Contentful Paint (LCP)#

LCP measures how long it takes for the largest visible content element (image, video, or text block) to render. It represents when the user perceives the page as “loaded.”

Good:       LCP <= 2.5 seconds
Needs work: 2.5s < LCP <= 4.0 seconds
Poor:       LCP > 4.0 seconds

Common causes of poor LCP:

  • Large unoptimized images as the hero element.
  • Render-blocking CSS or JavaScript that delays the main content.
  • Server response time (TTFB) is slow, pushing everything later.
  • Client-side rendering that requires JavaScript execution before any content appears.
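
To see which element the browser chose as the LCP element, largest-contentful-paint entries can be observed directly (supported in Chromium-based browsers). A minimal sketch:

// Log each LCP candidate as the browser reports it; the last entry
// before user input is the one that counts toward the metric
const lcpObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log('LCP candidate:', entry.element, 'at', entry.startTime, 'ms');
  }
});
lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });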

Interaction to Next Paint (INP)#

INP measures the responsiveness of a page to user interactions. It tracks the latency from a user input (click, tap, keypress) to the next visual update, and the reported value is approximately the worst interaction latency observed across the entire page visit. INP replaced First Input Delay (FID) as a Core Web Vital in March 2024.

Good:       INP <= 200 milliseconds
Needs work: 200ms < INP <= 500 milliseconds
Poor:       INP > 500 milliseconds

Common causes of poor INP:

  • Long-running JavaScript tasks that block the main thread (see the sketch after this list).
  • Expensive DOM manipulation triggered by user interactions.
  • Third-party scripts (analytics, ads) competing for main thread time.
  • Layout recalculations triggered by dynamic content insertion.
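
A common mitigation for the first cause is to break long tasks into chunks and yield to the main thread between them, so pending input events can be handled. A minimal sketch (the item array and processing callback are illustrative):

// Process a large array without blocking input handling: run one batch,
// then yield to the event loop so the browser can service pending input.
async function processInChunks(items, processItem, chunkSize = 50) {
  for (let i = 0; i < items.length; i += chunkSize) {
    items.slice(i, i + chunkSize).forEach(processItem);
    // Yield to the main thread between chunks (improves INP)
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}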

Cumulative Layout Shift (CLS)#

CLS measures visual stability – how much the page content shifts unexpectedly during loading. A CLS of 0 means nothing moved. Users experience layout shift when they try to click a button and the button moves because an ad loaded above it.

Good:       CLS <= 0.1
Needs work: 0.1 < CLS <= 0.25
Poor:       CLS > 0.25

Common causes of poor CLS:

  • Images without explicit width and height attributes (browser does not know the size until the image loads).
  • Ads, embeds, or iframes injected dynamically without reserved space.
  • Web fonts that cause text to reflow when they replace fallback fonts (FOIT/FOUT).
  • Content dynamically inserted above existing content (banners, cookie consent modals).
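
To find which elements are shifting, layout-shift entries can be observed directly; each entry exposes the shifted nodes via its sources attribute (supported in Chromium-based browsers). A minimal sketch:

const clsObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Shifts within 500ms of user input are excluded from CLS
    if (entry.hadRecentInput) continue;
    for (const source of entry.sources || []) {
      console.log('Shifted element:', source.node, 'shift score:', entry.value);
    }
  }
});
clsObserver.observe({ type: 'layout-shift', buffered: true });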

Collecting Performance Metrics#

The web-vitals Library#

Google’s web-vitals JavaScript library is the standard way to measure Core Web Vitals in the browser.

import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

function sendToAnalytics(metric) {
  const body = JSON.stringify({
    name: metric.name,
    value: metric.value,
    rating: metric.rating,    // "good", "needs-improvement", "poor"
    delta: metric.delta,
    id: metric.id,
    navigationType: metric.navigationType,
    url: window.location.href,
    userAgent: navigator.userAgent,
    timestamp: new Date().toISOString()
  });

  // Use sendBeacon for reliability (survives page unload)
  if (navigator.sendBeacon) {
    navigator.sendBeacon('/api/v1/rum/metrics', body);
  } else {
    fetch('/api/v1/rum/metrics', {
      method: 'POST',
      body: body,
      keepalive: true
    });
  }
}

onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);
onFCP(sendToAnalytics);
onTTFB(sendToAnalytics);

Use navigator.sendBeacon instead of fetch for metric reporting. sendBeacon is designed for analytics data – it survives page navigation and unload events, ensuring metrics are not lost when the user clicks away.
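
One server-side detail: sendBeacon sends a string body with a text/plain content type, so the receiving endpoint must parse it as text rather than JSON. A minimal Express sketch of the collection endpoint used above (storage is left out; ingestRumMetric is an assumed app-defined function):

const express = require('express');
const app = express();

// sendBeacon posts strings as text/plain, so use the text body parser
// here rather than express.json()
app.post('/api/v1/rum/metrics', express.text(), (req, res) => {
  const metric = JSON.parse(req.body);
  ingestRumMetric(metric);  // assumed app-defined: forward to your pipeline
  res.status(204).end();
});

app.listen(3000);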

Performance Observer API#

For metrics beyond Core Web Vitals, use the Performance Observer API directly.

// Observe long tasks (>50ms) that block the main thread
const longTaskObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    sendToAnalytics({
      name: 'long-task',
      value: entry.duration,
      startTime: entry.startTime,
      url: window.location.href
    });
  }
});
longTaskObserver.observe({ type: 'longtask', buffered: true });

// Observe resource loading (images, scripts, stylesheets)
const resourceObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.duration > 1000) {  // Only report slow resources
      sendToAnalytics({
        name: 'slow-resource',
        resourceName: entry.name,
        duration: entry.duration,
        transferSize: entry.transferSize,
        initiatorType: entry.initiatorType
      });
    }
  }
});
resourceObserver.observe({ type: 'resource', buffered: true });

The Navigation Timing API provides a detailed breakdown of the page load waterfall.

window.addEventListener('load', () => {
  // Defer one tick after the load event so loadEventEnd is populated
  setTimeout(() => {
    const timing = performance.getEntriesByType('navigation')[0];
    sendToAnalytics({
      name: 'navigation-timing',
      dns: timing.domainLookupEnd - timing.domainLookupStart,
      tcp: timing.connectEnd - timing.connectStart,
      tls: timing.secureConnectionStart > 0
        ? timing.connectEnd - timing.secureConnectionStart
        : 0,
      ttfb: timing.responseStart - timing.requestStart,
      download: timing.responseEnd - timing.responseStart,
      domParsing: timing.domInteractive - timing.responseEnd,
      domContentLoaded: timing.domContentLoadedEventEnd - timing.fetchStart,
      fullLoad: timing.loadEventEnd - timing.fetchStart
    });
  }, 0);
});

This breakdown reveals where time is spent. If DNS resolution consistently takes 500ms, switch to a faster DNS resolver or implement DNS prefetching. If TTFB is high, the server or CDN configuration needs attention. If DOM parsing is slow, the HTML document is too large or contains render-blocking resources.

Frontend Error Tracking#

Performance metrics tell you the application is slow. Error tracking tells you it is broken.

Capturing JavaScript Errors#

// App-defined reporting helper (endpoint is illustrative). Named to avoid
// shadowing the built-in window.reportError(), which would re-dispatch the
// error event and recurse through the handler below.
function reportClientError(error) {
  navigator.sendBeacon('/api/v1/rum/errors', JSON.stringify(error));
}

// Global error handler for uncaught exceptions
window.addEventListener('error', (event) => {
  reportClientError({
    type: 'uncaught_exception',
    message: event.message,
    filename: event.filename,
    lineno: event.lineno,
    colno: event.colno,
    stack: event.error?.stack,
    url: window.location.href,
    timestamp: new Date().toISOString()
  });
});

// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
  reportClientError({
    type: 'unhandled_rejection',
    message: event.reason?.message || String(event.reason),
    stack: event.reason?.stack,
    url: window.location.href,
    timestamp: new Date().toISOString()
  });
});

// Network errors (failed fetch/XHR)
const originalFetch = window.fetch;
window.fetch = async function (...args) {
  try {
    const response = await originalFetch.apply(this, args);
    if (!response.ok) {
      reportClientError({
        type: 'http_error',
        status: response.status,
        url: args[0],
        pageUrl: window.location.href
      });
    }
    return response;
  } catch (error) {
    reportClientError({
      type: 'network_error',
      message: error.message,
      url: args[0],
      pageUrl: window.location.href
    });
    throw error;
  }
};

Source Maps for Production Errors#

Production JavaScript is minified and bundled, making stack traces unreadable without source maps. Upload source maps to your error tracking service so that a frame like main.abc123.js:1:45678 resolves back to getAccountDetails in UserProfile.tsx:42.

# Upload source maps to Sentry during CI/CD
sentry-cli releases new "$VERSION"
sentry-cli releases files "$VERSION" upload-sourcemaps ./dist \
  --url-prefix '~/static/js'
sentry-cli releases finalize "$VERSION"

Keep source maps out of production bundles (do not serve them to users) but upload them to your error tracking service. This gives you readable stack traces without exposing your source code.
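
One way to achieve this split, assuming a webpack build, is the hidden-source-map devtool: it emits .map files your CI can upload but omits the sourceMappingURL comment, so browsers never request them. A sketch:

// webpack.config.js (sketch): produce source maps without referencing
// them from the bundle; exclude the .map files from the deploy artifact.
module.exports = {
  mode: 'production',
  devtool: 'hidden-source-map',
  // ...rest of your build configuration
};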

Error Grouping and Prioritization#

Raw error counts are overwhelming. Group errors by:

  • Error message, after normalizing dynamic values like IDs and URLs (see the sketch after this list).
  • Affected page/route: errors on the checkout page are higher priority than errors on the about page.
  • User impact: an error that blocks the user from proceeding (broken form submission) is higher priority than a cosmetic error (tooltip fails to render).
  • Frequency trend: a new error that appeared after the latest deployment needs immediate attention.
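
A minimal sketch of the normalization step from the first bullet (the regexes here are illustrative, not exhaustive):

// Replace per-occurrence values so "User 4821 not found" and
// "User 7790 not found" land in the same group.
function normalizeErrorMessage(message) {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<uuid>')
    .replace(/https?:\/\/\S+/g, '<url>')
    .replace(/\b\d+\b/g, '<num>');
}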

Session Replay#

Session replay records user interactions (mouse movements, clicks, scrolls, DOM changes) and reconstructs them as a video-like playback. It bridges the gap between metrics (“error rate is 5%”) and understanding (“here is what the user was doing when the error occurred”).

Tools#

Sentry Session Replay: Integrated with Sentry’s error tracking. When an error occurs, the session replay is automatically linked. Privacy controls mask sensitive content by default.

Grafana Faro: Open-source web SDK from Grafana Labs. Collects RUM data, errors, and session recordings, and sends them to a Grafana backend. Integrates naturally with Grafana dashboards, Loki, and Tempo.

LogRocket, FullStory, Datadog RUM: Commercial options with polished UIs, advanced search, and team collaboration features. Higher cost, lower operational burden.

Privacy Considerations#

Session replay captures what users see and do. This raises significant privacy concerns.

// Sentry session replay with privacy controls
Sentry.init({
  dsn: 'https://...@sentry.io/...',
  integrations: [
    Sentry.replayIntegration({
      // Mask all text content by default
      maskAllText: true,
      // Block recording of specific elements
      blockSelector: '.sensitive-data, [data-private]',
      // Mask all user input
      maskAllInputs: true,
    }),
  ],
  // Record 10% of sessions, 100% of sessions with errors
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,
});

Mask by default, unmask selectively. Start with all text and inputs masked. Unmask non-sensitive elements only after confirming they contain no PII. This is safer than the reverse approach.

Sample sessions. Recording every session generates enormous data volume. Record 100% of sessions that contain errors (these are the most valuable for debugging) and a small percentage of normal sessions (for understanding typical user flows).

Comply with regulations. GDPR and CCPA may require user consent before session recording. Consult your legal team. Do not store session replays longer than necessary – 30 days is typical.

Integration with Backend Observability#

The most powerful RUM setup connects frontend events to backend traces, creating a complete picture from user click to database query.

Frontend-to-Backend Trace Propagation#

The OpenTelemetry JavaScript SDK can inject trace context into outgoing HTTP requests, so the backend continues the same trace that started in the browser.

import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { registerInstrumentations } from '@opentelemetry/instrumentation';

const provider = new WebTracerProvider();
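// Note: to also export browser spans, add a span processor with an exporter
// (e.g. OTLP) to this provider; without one, spans are created but never
// exported, though trace headers are still injected into outgoing requests.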
provider.register({
  contextManager: new ZoneContextManager(),
  propagator: new W3CTraceContextPropagator(),
});

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Propagate trace context to these backend URLs
      propagateTraceHeaderCorsUrls: [
        /https:\/\/api\.example\.com\/.*/,
      ],
    }),
  ],
});

With this setup, a user clicking “Submit Order” generates a trace that spans:

  1. Browser: button click event, form validation, fetch request.
  2. API gateway: request processing, authentication check.
  3. Order service: order creation, inventory check.
  4. Database: INSERT query.

Every step is connected by the same trace ID. When investigating a slow checkout, you see the entire path in a single trace view.
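
For the browser to send the traceparent header to a different origin, the API's CORS configuration must allow it, or the preflight will reject the request. A minimal sketch using Express with the cors package (origin and endpoint details are illustrative):

const express = require('express');
const cors = require('cors');

const app = express();
app.use(cors({
  origin: 'https://app.example.com',
  // Allow the W3C Trace Context headers injected by the browser SDK
  allowedHeaders: ['Content-Type', 'traceparent', 'tracestate'],
}));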

Grafana Faro Integration#

Grafana Faro connects RUM data to the Grafana observability stack.

import { initializeFaro } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';

const faro = initializeFaro({
  url: 'https://faro-collector.example.com/collect',
  app: {
    name: 'my-frontend',
    version: '1.0.0',
    environment: 'production',
  },
  instrumentations: [
    new TracingInstrumentation({
      instrumentationOptions: {
        propagateTraceHeaderCorsUrls: [
          /https:\/\/api\.example\.com\/.*/,
        ],
      },
    }),
  ],
});

Faro sends performance metrics, errors, and traces to a Grafana Alloy or OpenTelemetry Collector endpoint, which routes them to Prometheus (metrics), Loki (logs and errors), and Tempo (traces). The result is a unified view in Grafana where you can jump from a frontend LCP measurement to the backend trace that served the page.

Synthetic Monitoring vs. Real User Monitoring#

These are complementary approaches, not alternatives. Understanding their differences guides when to use each.

                          Synthetic Monitoring          Real User Monitoring
Coverage                  Predefined test paths         All user interactions
Traffic dependency        Runs regardless of traffic    Requires real users
Environment diversity     Controlled, consistent        Real devices, networks, locations
Measurement consistency   High (same test, same env)    Variable (real-world conditions)
Issue detection           Outages, availability         Performance degradation, errors
User behavior insight     None                          Full (flows, abandonment, errors)
Geographic coverage       Limited to probe locations    Wherever users are
Cost driver               Probe count and frequency     Number of users and session volume
Alerting suitability      Excellent for uptime          Good for trends, poor for instant detection
Baseline comparison       Strong (controlled env)       Weak (too many variables)

When to Use Synthetic Monitoring#

  • Uptime monitoring: Is the service reachable right now? Synthetic probes answer this question 24/7 regardless of whether users are active.
  • SLA measurement: Controlled, consistent measurements from known locations provide defensible SLA data.
  • Pre-production testing: Run synthetic checks against staging to catch performance regressions before deployment.
  • Third-party dependency monitoring: Probe APIs and services you depend on to detect their outages before they affect your users.
  • Off-hours coverage: At 3 AM when user traffic is near zero, synthetic probes are the only signal.

When to Use Real User Monitoring#

  • Performance budgets: Are real users experiencing the performance you designed for? Synthetic probes on a fast data center connection do not represent a user on mobile data.
  • Error discovery: Which JavaScript errors are users hitting that you never saw in testing? RUM captures errors from browser versions, extensions, and network conditions you cannot reproduce in a lab.
  • User behavior analysis: Where do users drop off? Which pages are frustratingly slow? RUM provides the data to prioritize performance work by actual user impact.
  • Geographic performance: Users in Southeast Asia may experience 3x worse latency than users in North America. RUM reveals these disparities.
  • Impact measurement: After a performance optimization, did real users actually experience improvement? Synthetic tests from a data center might show improvement that users on real networks do not experience.

Using Both Together#

The strongest observability posture uses both. Synthetic monitoring provides the alert: “The checkout page is down.” RUM provides the context: “Users in Europe started experiencing 5-second load times 30 minutes ago, and 15% are abandoning the checkout flow.”

Set up synthetic probes for critical user paths (login, checkout, dashboard) with tight alerting thresholds. Instrument RUM for all pages, reporting Core Web Vitals and errors. When a synthetic alert fires, look at the RUM data for the same time window to understand the scope and severity of the issue from the user’s perspective.

Practical Setup Checklist#

1. INSTRUMENT CORE WEB VITALS
   - Add web-vitals library to your frontend build
   - Report LCP, INP, CLS, FCP, TTFB to your backend
   - Set up dashboards with p75 and p95 breakdowns by page

2. IMPLEMENT ERROR TRACKING
   - Global error and unhandledrejection handlers
   - Source map upload in CI/CD pipeline
   - Error grouping and priority rules

3. ENABLE TRACE PROPAGATION
   - OpenTelemetry JS SDK or Grafana Faro
   - W3C Trace Context headers on fetch/XHR to your API
   - CORS configuration on backend to accept traceparent header

4. CONFIGURE SESSION REPLAY (if applicable)
   - Privacy-first: mask all text and inputs by default
   - 100% error session recording, 1-10% normal session sampling
   - Retention policy aligned with privacy requirements

5. BUILD DASHBOARDS
   - Core Web Vitals overview (p75 by page, trend over time)
   - Error rate by page and error type
   - Performance by geography and device type
   - Frontend-to-backend trace correlation view

6. SET UP ALERTS
   - LCP p75 > 4s for any page (sustained 15 minutes)
   - New error type detected with > 100 occurrences in 1 hour
   - CLS > 0.25 after a deployment (regression detection)
   - JavaScript error rate > 1% of page views

This checklist is ordered by implementation priority. Core Web Vitals and error tracking provide the most immediate value. Trace propagation and session replay add depth but require more setup effort. Start with steps 1 and 2, and add the rest incrementally.