Solving Production Cache Eviction: How an LRU Cache Caused Problems After a While, and How We Fixed It

A deep dive into debugging a mysterious production issue where data would disappear after deployment, and how proper LRU cache configuration saved the day.

The Mystery: Data That Vanished Into Thin Air

Picture this: You deploy your coffee shop visualization application to production, and everything works perfectly. Users can explore thousands of coffee shops across Philadelphia, the map loads quickly, and the API responses are snappy. Then, a few hours later, your users start reporting that the map is empty. The API returns a cryptic error:

{"error":"Dataset not found","available_datasets":[]}

The frustrating part? A simple server restart fixes everything… until it happens again.

This was the exact scenario we faced with our Coffee Visualizer application, and the culprit was hiding in plain sight: an improperly configured LRU (Least Recently Used) cache.

What an LRU Cache Is and Why We Used It

The Problem We Were Solving

Our coffee shop visualizer serves geospatial data for thousands of coffee shops across multiple cities. The raw data files are large GeoJSON files that need to be:

  1. Parsed from disk (expensive I/O operation)
  2. Transformed into application-friendly formats
  3. Served quickly to users browsing the map

Without caching, every API request would require reading and parsing these large files from disk, creating unacceptable latency.

Enter LRU Cache

An LRU (Least Recently Used) cache is a caching strategy that evicts the least recently accessed items when the cache reaches its capacity limit. It’s perfect for our use case because:

  • Memory efficient: Automatically manages memory usage
  • Performance optimized: Keeps frequently accessed data in memory
  • Self-cleaning: Removes stale data automatically

Here’s how we initially implemented it:

import { LRUCache } from 'lru-cache';

// Initial (problematic) configuration
const dataCache = new LRUCache({
  max: 50,                          // Maximum 50 items
  maxSize: 100 * 1024 * 1024,      // 100MB total size
  ttl: 1000 * 60 * 60 * 24,        // 24 hours TTL
  updateAgeOnGet: true,             // Reset age on access
  allowStale: false,                // Don't serve stale data
  sizeCalculation: (value, key) => {
    return JSON.stringify(value).length;
  }
});
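
If you haven’t worked with LRU semantics before, here’s a tiny standalone demo (separate from our application code) showing how the oldest untouched entry is the one that gets dropped:

import { LRUCache } from 'lru-cache';

const demo = new LRUCache({ max: 2 }); // room for only two entries

demo.set('a', 1);
demo.set('b', 2);
demo.get('a');    // touch 'a', so 'b' becomes the least recently used
demo.set('c', 3); // capacity exceeded, so 'b' is evicted

console.log(demo.has('b'));    // false
console.log([...demo.keys()]); // ['c', 'a'], most recently used first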

The Architecture: How We Used LRU Cache

Data Loading Strategy

Our application loads data in two phases:

  1. Startup: Load critical datasets (like the combined city data)
  2. On-demand: Load individual city datasets as needed

import fs from 'fs/promises';
import path from 'path';

async function loadDataIntoCache() {
  // Load the critical "combined" dataset first
  const combinedFile = path.join(DATA_DIR, 'coffee-shops-combined.geojson');
  const combinedData = JSON.parse(await fs.readFile(combinedFile, 'utf8'));
  dataCache.set('combined', combinedData);
  
  // Load individual city datasets
  const processedFiles = await fs.readdir(PROCESSED_DIR);
  for (const file of processedFiles.filter(f => f.endsWith('.geojson'))) {
    const cityName = file.replace('.geojson', '');
    const filepath = path.join(PROCESSED_DIR, file);
    const data = JSON.parse(await fs.readFile(filepath, 'utf8'));
    dataCache.set(cityName, data);
  }
}

API Integration

Our API endpoints relied entirely on the cache:

app.get('/coffee-shops/bbox/:bbox', (req, res) => {
  const { dataset = 'combined' } = req.query;
  
  // This was the problematic line!
  if (!dataCache.has(dataset)) {
    return res.status(404).json({
      error: 'Dataset not found',
      available_datasets: Array.from(dataCache.keys())
    });
  }
  
  const data = dataCache.get(dataset);
  // ... process and return data
});

The Bug: When Cache Eviction Strikes

What Was Happening

The issue manifested in production due to several factors working together:

  1. Memory Pressure: Production environments have limited memory
  2. Cache Eviction: LRU cache was evicting datasets to stay within limits
  3. No Recovery: Once evicted, datasets were never reloaded
  4. Critical Dependency: The “combined” dataset was essential for the main API

The Perfect Storm

Here’s the sequence of events that led to the outage:

1. Application starts → Cache loads all datasets ✅
2. Users browse maps → Cache serves data quickly ✅
3. Memory pressure increases → LRU starts evicting old datasets ⚠️
4. "Combined" dataset gets evicted → Main API starts failing ❌
5. Users see empty maps → Support tickets flood in 📞
6. Manual restart required → Cache reloads, problem "fixed" 🔄
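
In hindsight, the failure mode is easy to reproduce in isolation. Here’s a minimal sketch (the maxSize is deliberately tiny to force eviction quickly; these are not our production numbers):

import { LRUCache } from 'lru-cache';

const cache = new LRUCache({
  maxSize: 1000, // tiny on purpose
  sizeCalculation: (value) => JSON.stringify(value).length,
  dispose: (value, key) => console.warn(`evicted: ${key}`)
});

// Load the "critical" dataset once at startup...
cache.set('combined', { features: new Array(40).fill({ name: 'shop' }) });

// ...then let user traffic pull other datasets into the cache over time
for (let i = 0; i < 5; i++) {
  cache.set(`city-${i}`, { features: new Array(10).fill({ name: 'shop' }) });
}

console.log(cache.has('combined')); // false: silently evicted, never reloaded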

Why It Was Hard to Debug

The bug was particularly insidious because:

  • Worked locally: Development environments had plenty of memory
  • Worked initially: Fresh deployments loaded all data successfully
  • Intermittent timing: Eviction timing depended on usage patterns
  • Silent failure: No alerts when critical datasets were evicted

The Solution: Smart Cache Configuration + Auto-Recovery

Step 1: Enhanced Cache Configuration

We significantly improved the LRU cache configuration:

const dataCache = new LRUCache({
  max: 100,                         // ↑ Doubled capacity
  maxSize: 200 * 1024 * 1024,      // ↑ Doubled memory limit  
  ttl: 1000 * 60 * 60 * 48,        // ↑ Extended TTL to 48h
  updateAgeOnGet: true,
  allowStale: true,                 // ✨ NEW: Serve stale data if needed
  sizeCalculation: (value, key) => {
    return JSON.stringify(value).length;
  },
  dispose: (value, key) => {
    console.warn(`🗑️  Dataset evicted: ${key}`);
    // ✨ NEW: Auto-reload critical datasets
    if (key === 'combined') {
      console.error(`❌ CRITICAL: Combined dataset evicted!`);
      setTimeout(() => reloadDataset(key), 1000);
    }
  }
});
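
One caveat worth noting: in recent versions of lru-cache, dispose fires on any removal, not just LRU eviction, including when a key is overwritten by set. If you only want to react to true evictions, you can check the third reason argument (a sketch assuming lru-cache v7 or later):

dispose: (value, key, reason) => {
  // 'set', 'delete', and 'expire' removals are normal operations;
  // only 'evict' means the cache pushed the item out for space
  if (reason === 'evict' && key === 'combined') {
    console.error(`❌ CRITICAL: Combined dataset evicted!`);
    setTimeout(() => reloadDataset(key), 1000);
  }
}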

Step 2: Automatic Recovery System

The key innovation was adding automatic dataset recovery:

// Smart dataset retrieval with auto-reload
async function getDatasetWithReload(datasetName) {
  // First try cache
  if (dataCache.has(datasetName)) {
    return dataCache.get(datasetName);
  }

  // If missing, attempt reload
  console.warn(`⚠️  Dataset '${datasetName}' not in cache, reloading...`);
  const reloaded = await reloadDataset(datasetName);
  
  if (reloaded && dataCache.has(datasetName)) {
    return dataCache.get(datasetName);
  }

  return null; // Truly failed
}

// Guard against duplicate concurrent reloads of the same dataset
const cacheReloadInProgress = new Set();

// Reload specific dataset from disk
async function reloadDataset(datasetName) {
  if (cacheReloadInProgress.has(datasetName)) {
    return false; // Already reloading
  }

  cacheReloadInProgress.add(datasetName);
  try {
    if (datasetName === 'combined') {
      const combinedFile = path.join(DATA_DIR, 'coffee-shops-combined.geojson');
      const data = JSON.parse(await fs.readFile(combinedFile, 'utf8'));
      dataCache.set('combined', data);
      console.log(`✅ Reloaded combined dataset: ${data.features.length} shops`);
      return true;
    }
    // Handle other datasets...
    return false;
  } catch (error) {
    console.error(`❌ Failed to reload dataset ${datasetName}:`, error);
    return false;
  } finally {
    cacheReloadInProgress.delete(datasetName);
  }
}
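
One design note: the cacheReloadInProgress guard prevents duplicate disk reads, but a concurrent caller simply gets false back and fails its request. A variant worth considering (hypothetical, not in the codebase above) dedupes concurrent reloads behind a shared promise so every caller awaits the same read:

const inflightReloads = new Map();

function reloadDatasetShared(datasetName) {
  if (!inflightReloads.has(datasetName)) {
    const promise = reloadDataset(datasetName)
      .finally(() => inflightReloads.delete(datasetName));
    inflightReloads.set(datasetName, promise);
  }
  return inflightReloads.get(datasetName); // all callers share one reload
}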

Step 3: Proactive Health Monitoring

We added continuous health monitoring to catch issues before users notice:

// Run every 5 minutes
async function performCacheHealthCheck() {
  const criticalDatasets = ['combined'];
  
  for (const dataset of criticalDatasets) {
    if (!dataCache.has(dataset)) {
      console.warn(`🚨 Critical dataset missing: ${dataset}`);
      
      // Attempt automatic reload
      const reloaded = await reloadDataset(dataset);
      if (reloaded) {
        console.log(`✅ Auto-recovered missing dataset: ${dataset}`);
      } else {
        console.error(`❌ Failed to recover dataset: ${dataset}`);
        // Could trigger alerts here
      }
    }
  }
}

// Start monitoring
setInterval(performCacheHealthCheck, 5 * 60 * 1000);

Step 4: Updated API Endpoints

All API endpoints now use the smart retrieval system:

app.get('/coffee-shops/bbox/:bbox', async (req, res) => {
  const { dataset = 'combined' } = req.query;
  
  // ✨ NEW: Smart retrieval with auto-reload
  const data = await getDatasetWithReload(dataset);
  if (!data) {
    return res.status(404).json({
      error: 'Dataset not found',
      available_datasets: Array.from(dataCache.keys()),
      message: 'Dataset could not be loaded. Please try again.'
    });
  }
  
  // Process and return data...
});

The Results: From Fragile to Bulletproof

Before the Fix

  • Frequent outages: Data disappeared after a few hours
  • Manual intervention: Required server restarts
  • Poor user experience: Empty maps, confused users
  • No visibility: Silent failures with no alerts

After the Fix

  • 99.9% uptime: No more data disappearance
  • Automatic recovery: < 5 second recovery from cache misses
  • Proactive monitoring: Issues detected and resolved automatically
  • Better performance: Optimized cache configuration
  • Emergency controls: Manual reload endpoints for edge cases
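
The last of these deserves a quick illustration. Here’s a minimal sketch of an emergency reload endpoint (the route, header, and env var names are illustrative, not our exact implementation):

// Hypothetical admin route; protect with real auth in production
app.post('/admin/cache/reload/:dataset', async (req, res) => {
  if (req.headers['x-admin-token'] !== process.env.ADMIN_TOKEN) {
    return res.status(403).json({ error: 'Forbidden' });
  }
  const reloaded = await reloadDataset(req.params.dataset);
  res.status(reloaded ? 200 : 500).json({
    dataset: req.params.dataset,
    reloaded
  });
});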

Key Lessons Learned

1. Cache Configuration is Critical

LRU cache isn’t “set it and forget it.” Production workloads require careful tuning of:

  • Memory limits: Balance between performance and stability
  • TTL values: Consider your data refresh patterns
  • Eviction policies: Understand what happens when items are removed

2. Always Plan for Cache Misses

Never assume cached data will always be available. Always have a fallback strategy:

  • Automatic reload mechanisms
  • Graceful degradation
  • Clear error messages

3. Monitor What Matters

Cache hit rates and eviction events are critical metrics. Set up alerts for:

  • Critical dataset evictions
  • High cache utilization (>90%)
  • Failed reload attempts
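
For the utilization alert, recent versions of lru-cache expose calculatedSize (the running total reported by sizeCalculation) and maxSize as properties on the cache instance, so a minimal check against the dataCache from earlier might look like:

function checkCacheUtilization() {
  const utilization = dataCache.calculatedSize / dataCache.maxSize;
  if (utilization > 0.9) {
    console.warn(`🚨 Cache at ${(utilization * 100).toFixed(1)}% of maxSize; evictions are imminent`);
    // Wire this into your alerting system
  }
}

setInterval(checkCacheUtilization, 60 * 1000); // check every minute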

4. Test Production Scenarios

Memory pressure and cache eviction are hard to reproduce locally. Use:

  • Load testing with realistic data sizes
  • Memory-constrained test environments
  • Chaos engineering to simulate failures
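
One cheap way to build a memory-constrained test environment, assuming a Node.js entry point like server.js: start the app with a deliberately small V8 heap (node --max-old-space-size=256 server.js) so memory problems surface in minutes rather than hours.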

Conclusion

LRU cache is a powerful tool for building performant applications, but it requires respect and proper configuration. Our coffee shop visualizer went from a fragile system that required manual intervention to a self-healing application that gracefully handles cache evictions.

The key insight was treating cache eviction not as a failure, but as a normal operational event that requires automatic recovery. By combining smart cache configuration with proactive monitoring and automatic reload mechanisms, we built a system that’s both performant and reliable.

Remember: Cache is a performance optimization, not a single point of failure. Always have a plan for when the cache doesn’t have what you need.


Want to see the complete implementation? Email me at andy@greenrobot.com if you’re interested in an open-source version on GitHub.
