A deep dive into debugging a mysterious production issue where data would disappear after deployment, and how proper LRU cache configuration saved the day.
The Mystery: Data That Vanished Into Thin Air
Picture this: You deploy your coffee shop visualization application to production, and everything works perfectly. Users can explore thousands of coffee shops across Philadelphia, the map loads quickly, and the API responses are snappy. Then, a few hours later, your users start reporting that the map is empty. The API returns a cryptic error:
{"error":"Dataset not found","available_datasets":[]}
The frustrating part? A simple server restart fixes everything… until it happens again.
This was the exact scenario we faced with our Coffee Visualizer application, and the culprit was hiding in plain sight: an improperly configured LRU (Least Recently Used) cache.
What is LRU Cache and Why We Used It
The Problem We Were Solving
Our coffee shop visualizer serves geospatial data for thousands of coffee shops across multiple cities. The raw data files are large GeoJSON files that need to be:
- Parsed from disk (expensive I/O operation)
- Transformed into application-friendly formats
- Served quickly to users browsing the map
Without caching, every API request would require reading and parsing these large files from disk, creating unacceptable latency.
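To make that cost concrete, here is roughly what the uncached path would look like (a simplified sketch; the Express `app` and the `DATA_DIR` constant come from the rest of our server setup and are assumed here):

import fs from 'fs/promises';
import path from 'path';

// Without a cache: every request re-reads and re-parses the file from disk.
// For a multi-megabyte GeoJSON file, this parsing alone can dominate response time.
app.get('/coffee-shops/bbox/:bbox', async (req, res) => {
  const file = path.join(DATA_DIR, 'coffee-shops-combined.geojson');
  const data = JSON.parse(await fs.readFile(file, 'utf8')); // expensive on every hit
  // ... filter by bounding box and respond
  res.json(data);
});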
Enter LRU Cache
LRU (Least Recently Used) cache is a caching strategy that evicts the least recently accessed items when the cache reaches its capacity limit. It’s perfect for our use case because:
- Memory efficient: Automatically manages memory usage
- Performance optimized: Keeps frequently accessed data in memory
- Self-cleaning: Removes stale data automatically
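If you haven't seen the eviction behavior in action, here's a minimal standalone demonstration with the lru-cache package:

import { LRUCache } from 'lru-cache';

const cache = new LRUCache({ max: 2 }); // room for only two items

cache.set('a', 1);
cache.set('b', 2);
cache.get('a');    // touching 'a' makes 'b' the least recently used
cache.set('c', 3); // capacity exceeded: 'b' is evicted

console.log(cache.has('b'));    // false — evicted
console.log([...cache.keys()]); // ['c', 'a'] (most recently used first)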
Here’s how we initially implemented it:
import { LRUCache } from 'lru-cache';

// Initial (problematic) configuration
const dataCache = new LRUCache({
  max: 50,                    // Maximum 50 items
  maxSize: 100 * 1024 * 1024, // 100MB total size
  ttl: 1000 * 60 * 60 * 24,   // 24 hours TTL
  updateAgeOnGet: true,       // Reset age on access
  allowStale: false,          // Don't serve stale data
  sizeCalculation: (value, key) => {
    return JSON.stringify(value).length;
  }
});
The Architecture: How We Used LRU Cache
Data Loading Strategy
Our application loads data in two phases:
- Startup: Load critical datasets (like the combined city data)
- On-demand: Load individual city datasets as needed
async function loadDataIntoCache() {
  // Load the critical "combined" dataset
  const combinedFile = path.join(DATA_DIR, 'coffee-shops-combined.geojson');
  const combinedData = JSON.parse(await fs.readFile(combinedFile, 'utf8'));
  dataCache.set('combined', combinedData);

  // Load individual city datasets
  const processedFiles = await fs.readdir(PROCESSED_DIR);
  for (const file of processedFiles.filter(f => f.endsWith('.geojson'))) {
    const cityName = file.replace('.geojson', '');
    const filepath = path.join(PROCESSED_DIR, file);
    const data = JSON.parse(await fs.readFile(filepath, 'utf8'));
    dataCache.set(cityName, data);
  }
}
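This loader runs once at startup, before the server begins accepting traffic. A sketch of the wiring (the `app` and `PORT` names come from the rest of our Express setup and are assumed here):

// Populate the cache before serving requests, and fail fast if the data is missing
loadDataIntoCache()
  .then(() => app.listen(PORT, () => console.log(`🚀 Server listening on port ${PORT}`)))
  .catch((err) => {
    console.error('Failed to load initial datasets:', err);
    process.exit(1);
  });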
API Integration
Our API endpoints relied entirely on the cache:
app.get('/coffee-shops/bbox/:bbox', (req, res) => {
  const { dataset = 'combined' } = req.query;

  // This was the problematic line!
  if (!dataCache.has(dataset)) {
    return res.status(404).json({
      error: 'Dataset not found',
      available_datasets: Array.from(dataCache.keys())
    });
  }

  const data = dataCache.get(dataset);
  // ... process and return data
});
The Bug: When Cache Eviction Strikes
What Was Happening
The issue manifested in production due to several factors working together:
- Memory Pressure: Production environments have limited memory
- Cache Eviction: LRU cache was evicting datasets to stay within limits
- No Recovery: Once evicted, datasets were never reloaded
- Critical Dependency: The “combined” dataset was essential for the main API
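You can reproduce the failure mode in isolation: give the cache a size budget smaller than the working set and watch the oldest dataset vanish without a trace. A minimal sketch (the 500KB budget and payload sizes are invented to force the eviction quickly):

import { LRUCache } from 'lru-cache';

const cache = new LRUCache({
  maxSize: 500 * 1024, // deliberately tiny budget (~500KB)
  sizeCalculation: (value) => JSON.stringify(value).length
});

// 'combined' goes in first, just like at startup (~312KB serialized)
cache.set('combined', { features: new Array(20000).fill({ name: 'shop' }) });

// A later, equally large dataset pushes the total past the budget...
cache.set('philadelphia', { features: new Array(20000).fill({ name: 'shop' }) });

// ...and the least recently used entry is silently dropped — no error, no log
console.log(cache.has('combined')); // false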
The Perfect Storm
Here’s the sequence of events that led to the outage:
1. Application starts → Cache loads all datasets ✅
2. Users browse maps → Cache serves data quickly ✅
3. Memory pressure increases → LRU starts evicting old datasets ⚠️
4. "Combined" dataset gets evicted → Main API starts failing ❌
5. Users see empty maps → Support tickets flood in 📞
6. Manual restart required → Cache reloads, problem "fixed" 🔄
Why It Was Hard to Debug
The bug was particularly insidious because:
- Worked locally: Development environments had plenty of memory
- Worked initially: Fresh deployments loaded all data successfully
- Intermittent timing: Eviction timing depended on usage patterns
- Silent failure: No alerts when critical datasets were evicted
The Solution: Smart Cache Configuration + Auto-Recovery
Step 1: Enhanced Cache Configuration
We significantly improved the LRU cache configuration:
const dataCache = new LRUCache({
  max: 100,                   // ↑ Doubled capacity
  maxSize: 200 * 1024 * 1024, // ↑ Doubled memory limit
  ttl: 1000 * 60 * 60 * 48,   // ↑ Extended TTL to 48h
  updateAgeOnGet: true,
  allowStale: true,           // ✨ NEW: Serve stale data if needed
  sizeCalculation: (value, key) => {
    return JSON.stringify(value).length;
  },
  dispose: (value, key) => {
    console.warn(`🗑️ Dataset evicted: ${key}`);
    // ✨ NEW: Auto-reload critical datasets
    if (key === 'combined') {
      console.error(`❌ CRITICAL: Combined dataset evicted!`);
      setTimeout(() => reloadDataset(key), 1000);
    }
  }
});
Step 2: Automatic Recovery System
The key innovation was adding automatic dataset recovery:
// Smart dataset retrieval with auto-reload
async function getDatasetWithReload(datasetName) {
  // First try the cache
  if (dataCache.has(datasetName)) {
    return dataCache.get(datasetName);
  }

  // If missing, attempt a reload
  console.warn(`⚠️ Dataset '${datasetName}' not in cache, reloading...`);
  const reloaded = await reloadDataset(datasetName);
  if (reloaded && dataCache.has(datasetName)) {
    return dataCache.get(datasetName);
  }

  return null; // Truly failed
}
// Track in-flight reloads so concurrent requests don't duplicate work
const cacheReloadInProgress = new Set();

// Reload a specific dataset from disk
async function reloadDataset(datasetName) {
  if (cacheReloadInProgress.has(datasetName)) {
    return false; // Already reloading
  }

  cacheReloadInProgress.add(datasetName);
  try {
    if (datasetName === 'combined') {
      const combinedFile = path.join(DATA_DIR, 'coffee-shops-combined.geojson');
      const data = JSON.parse(await fs.readFile(combinedFile, 'utf8'));
      dataCache.set('combined', data);
      console.log(`✅ Reloaded combined dataset: ${data.features.length} shops`);
      return true;
    }
    // Handle other datasets...
  } catch (error) {
    console.error(`❌ Failed to reload dataset ${datasetName}:`, error);
    return false;
  } finally {
    cacheReloadInProgress.delete(datasetName);
  }
}
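Worth noting: recent versions of lru-cache can express this read-through pattern natively via the `fetchMethod` option and `cache.fetch()`, which also collapses concurrent loads of the same key. A sketch of that alternative, reusing our file layout:

// Alternative: let lru-cache handle the read-through itself
const fetchingCache = new LRUCache({
  max: 100,
  // Called automatically on a cache miss by fetchingCache.fetch()
  fetchMethod: async (key) => {
    const file = key === 'combined'
      ? path.join(DATA_DIR, 'coffee-shops-combined.geojson')
      : path.join(PROCESSED_DIR, `${key}.geojson`);
    return JSON.parse(await fs.readFile(file, 'utf8'));
  }
});

// Returns the cached value, or loads and caches it on demand
const combined = await fetchingCache.fetch('combined');

We kept the explicit `reloadDataset()` helper because it leaves the logging and health-check hooks visible, but `fetch()` is worth knowing about for simpler setups.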
Step 3: Proactive Health Monitoring
We added continuous health monitoring to catch issues before users notice:
// Run every 5 minutes
async function performCacheHealthCheck() {
  const criticalDatasets = ['combined'];

  for (const dataset of criticalDatasets) {
    if (!dataCache.has(dataset)) {
      console.warn(`🚨 Critical dataset missing: ${dataset}`);

      // Attempt automatic reload
      const reloaded = await reloadDataset(dataset);
      if (reloaded) {
        console.log(`✅ Auto-recovered missing dataset: ${dataset}`);
      } else {
        console.error(`❌ Failed to recover dataset: ${dataset}`);
        // Could trigger alerts here
      }
    }
  }
}

// Start monitoring
setInterval(performCacheHealthCheck, 5 * 60 * 1000);
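It also helps to expose cache state to external monitors. A hypothetical `/health` endpoint (the route name and response shape are our own invention):

// Hypothetical health endpoint so load balancers and dashboards can see cache state
app.get('/health', (req, res) => {
  const combinedLoaded = dataCache.has('combined');
  res.status(combinedLoaded ? 200 : 503).json({
    status: combinedLoaded ? 'ok' : 'degraded',
    cached_datasets: dataCache.size,      // number of entries
    cache_bytes: dataCache.calculatedSize // total computed size
  });
});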
Step 4: Updated API Endpoints
All API endpoints now use the smart retrieval system:
app.get('/coffee-shops/bbox/:bbox', async (req, res) => {
  const { dataset = 'combined' } = req.query;

  // ✨ NEW: Smart retrieval with auto-reload
  const data = await getDatasetWithReload(dataset);
  if (!data) {
    return res.status(404).json({
      error: 'Dataset not found',
      available_datasets: Array.from(dataCache.keys()),
      message: 'Dataset could not be loaded. Please try again.'
    });
  }

  // Process and return data...
});
The Results: From Fragile to Bulletproof
Before the Fix
- ❌ Frequent outages: Data disappeared after a few hours
- ❌ Manual intervention: Required server restarts
- ❌ Poor user experience: Empty maps, confused users
- ❌ No visibility: Silent failures with no alerts
After the Fix
- ✅ 99.9% uptime: No more data disappearance
- ✅ Automatic recovery: < 5 second recovery from cache misses
- ✅ Proactive monitoring: Issues detected and resolved automatically
- ✅ Better performance: Optimized cache configuration
- ✅ Emergency controls: Manual reload endpoints for edge cases
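That last item deserves a sketch. The exact route is our own invention; something like an admin endpoint that forces a reload without a restart:

// Hypothetical emergency endpoint: force a reload without restarting the server.
// In production this should sit behind authentication.
app.post('/admin/cache/reload/:dataset', async (req, res) => {
  const { dataset } = req.params;
  const reloaded = await reloadDataset(dataset);
  if (reloaded) {
    return res.json({ status: 'reloaded', dataset });
  }
  res.status(500).json({ status: 'failed', dataset });
});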
Key Lessons Learned
1. Cache Configuration is Critical
LRU cache isn’t “set it and forget it.” Production workloads require careful tuning of:
- Memory limits: Balance between performance and stability
- TTL values: Consider your data refresh patterns
- Eviction policies: Understand what happens when items are removed
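On that last point: lru-cache can tell you why an entry left the cache. In recent versions the `dispose` callback receives a reason as its third argument, which distinguishes capacity evictions from ordinary expiry; a small sketch:

const cache = new LRUCache({
  max: 100,
  // reason is e.g. 'evict' (capacity), 'expire' (TTL), 'set' (overwrite), 'delete'
  dispose: (value, key, reason) => {
    if (reason === 'evict') {
      console.warn(`Capacity eviction: ${key} — consider raising max/maxSize`);
    }
  }
});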
2. Always Plan for Cache Misses
Never assume cached data will always be available. Always have a fallback strategy:
- Automatic reload mechanisms
- Graceful degradation
- Clear error messages
3. Monitor What Matters
Cache hit rates and eviction events are critical metrics. Set up alerts for:
- Critical dataset evictions
- High cache utilization (>90%)
- Failed reload attempts
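A cheap way to track utilization without extra infrastructure is to sample the cache's own counters on an interval (the 90% threshold mirrors the alert level above):

// Log a warning when the cache is close to its size budget
setInterval(() => {
  const used = dataCache.calculatedSize; // bytes, per sizeCalculation
  const limit = 200 * 1024 * 1024;       // must match the configured maxSize
  const utilization = used / limit;
  if (utilization > 0.9) {
    console.warn(`⚠️ Cache at ${(utilization * 100).toFixed(1)}% of its size budget`);
  }
}, 60 * 1000);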
4. Test Production Scenarios
Memory pressure and cache eviction are hard to reproduce locally. Use:
- Load testing with realistic data sizes
- Memory-constrained test environments
- Chaos engineering to simulate failures
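Eviction itself is easy to fake in a test: delete the critical key and assert that the API still answers. A minimal sketch using Node's built-in test runner (any framework works; the `./server.js` module path and its exports are assumptions):

import test from 'node:test';
import assert from 'node:assert';
// Assumes the server module exports these (hypothetical module path)
import { dataCache, getDatasetWithReload } from './server.js';

test('API survives eviction of the combined dataset', async () => {
  // Simulate an LRU eviction by removing the key directly
  dataCache.delete('combined');

  // The smart retrieval path should reload it from disk transparently
  const data = await getDatasetWithReload('combined');
  assert.ok(data, 'combined dataset should auto-reload after eviction');
});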
Conclusion
LRU cache is a powerful tool for building performant applications, but it requires respect and proper configuration. Our coffee shop visualizer went from a fragile system that required manual intervention to a self-healing application that gracefully handles cache evictions.
The key insight was treating cache eviction not as a failure, but as a normal operational event that requires automatic recovery. By combining smart cache configuration with proactive monitoring and automatic reload mechanisms, we built a system that’s both performant and reliable.
Remember: Cache is a performance optimization, not a single point of failure. Always have a plan for when the cache doesn’t have what you need.
*Want to see the complete implementation? Email me at andy@greenrobot.com if interested in an open source version on GitHub.*