Cache Management

The package caches data at several levels to speed up data access. Understanding how the cache works helps you optimize performance and manage disk space.

How Caching Works

The package implements a three-tier caching system:

┌─────────────────┐
│  Memory Cache   │  ← Fastest: In-RAM storage
├─────────────────┤
│   Query Cache   │  ← Fast: Filtered datasets
├─────────────────┤
│   Bulk Cache    │  ← Large: Complete database files
└─────────────────┘

When you request data:

1. Check memory cache: Is the data already loaded in RAM?
2. Check query cache: Is this exact filtered query cached?
3. Check bulk cache: Is the bulk file already downloaded?
4. Fetch from API: As a last resort, download from the OECD.
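
The lookup order above can be sketched in plain Python. This is an illustration of the general pattern, not the package's internal code: `TieredCache`, `fake_api`, and the JSON file layout are hypothetical names chosen for the example.

```python
import json
import tempfile
from pathlib import Path


class TieredCache:
    """Toy three-tier cache: memory -> query files -> bulk file -> fetch."""

    def __init__(self, root, fetch_from_api):
        self.memory = {}                                    # tier 1: in-RAM results
        self.query_dir = Path(root) / "queries"             # tier 2: filtered results on disk
        self.bulk_file = Path(root) / "bulk" / "all.json"   # tier 3: complete dataset
        self.fetch_from_api = fetch_from_api                # last resort (hypothetical callable)
        self.query_dir.mkdir(parents=True, exist_ok=True)
        self.bulk_file.parent.mkdir(parents=True, exist_ok=True)

    def get(self, year):
        """Return (rows, source), where source names the tier that answered."""
        key = f"year_{year}"
        # 1. Memory cache
        if key in self.memory:
            return self.memory[key], "memory"
        # 2. Query cache
        query_file = self.query_dir / f"{key}.json"
        if query_file.exists():
            rows, source = json.loads(query_file.read_text()), "query"
        else:
            # 3. Bulk cache, otherwise 4. fetch and store the bulk file
            if self.bulk_file.exists():
                source = "bulk"
            else:
                self.bulk_file.write_text(json.dumps(self.fetch_from_api()))
                source = "api"
            all_rows = json.loads(self.bulk_file.read_text())
            rows = [r for r in all_rows if r["year"] == year]
            query_file.write_text(json.dumps(rows))  # fill the query cache
        self.memory[key] = rows                      # fill the memory cache
        return rows, source


def fake_api():
    # Stand-in for a real download
    return [{"year": 2021, "value": 1.0}, {"year": 2022, "value": 2.0}]


root = tempfile.mkdtemp()
cache = TieredCache(root, fake_api)
first = cache.get(2022)[1]       # nothing cached yet
second = cache.get(2022)[1]      # same session, same query
other = cache.get(2021)[1]       # new filter, bulk file already present
fresh = TieredCache(root, fake_api)
persisted = fresh.get(2022)[1]   # new "session", query file already present
print(first, second, other, persisted)
```

Each tier fills the faster tiers above it on the way back, which is why a repeated query gets cheaper over time.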

Setting Up Caching

Set the Data Path

Configure Cache Location

from oda_data import set_data_path

# All cached data goes here
set_data_path("data")

# Creates directory structure:
# data/
# ├── bulk/          # Bulk downloads
# ├── queries/       # Filtered query results
# └── .pydeflate/    # Currency conversion data

Default Cache Location

If you don't set a path, the package uses a default:

# Default: .raw_data/ in your working directory
# Automatically created when needed

Cache Tiers Explained

1. Memory Cache

Fast, temporary storage while your Python session runs.

Memory Cache in Action

from oda_data import OECDClient

client = OECDClient(years=[2022])

# First call: Downloads and caches in memory
data1 = client.get_indicators("DAC1.10.1010")  # Slow

# Second call: Retrieved from memory
data2 = client.get_indicators("DAC1.10.1010")  # Fast!

# Memory cache clears when Python session ends

Characteristics:

Very fast access
Limited by available RAM
Cleared when Python exits
Shared across queries in same session
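
The idea behind a session-scoped memory cache can be sketched with the standard library's `functools.lru_cache`. Here `load_indicator` is a hypothetical stand-in for an expensive download; the package's own memory cache may be implemented differently.

```python
from functools import lru_cache

calls = {"count": 0}


@lru_cache(maxsize=32)  # bounded, so memory use stays limited
def load_indicator(code: str) -> tuple:
    calls["count"] += 1  # counts how often the slow path actually runs
    return (code, 2022, 123.4)


a = load_indicator("DAC1.10.1010")  # slow path runs
b = load_indicator("DAC1.10.1010")  # served from memory
print(calls["count"])
print(load_indicator.cache_info())
```

The cached values live only in the running process, so they disappear when Python exits.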

2. Query Cache

Stores filtered datasets on disk for reuse.

Query Cache Persistence

from oda_data import OECDClient, set_data_path

set_data_path("data")

client = OECDClient(
    years=[2022],
    providers=[4, 12],
    currency="EUR"
)

# First run: Processes and caches result
data = client.get_indicators("DAC1.10.1010")

# Next day, same query: Instant!
# Loads from query cache, no processing needed

Characteristics:

Persists across sessions
Specific to filter combinations
Smaller than bulk files
Automatically managed
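
Because a query cache entry is specific to one filter combination, each combination needs a stable key. One common way to derive such a key is to hash a canonical serialization of the filters. The sketch below assumes that approach; the `query_cache_key` helper is hypothetical and not part of the package.

```python
import hashlib
import json


def query_cache_key(indicator: str, **filters) -> str:
    """Return a stable file name for one indicator + filter combination."""
    # Sort collection-valued filters so [4, 12] and [12, 4] map to the same key
    normalized = {
        name: sorted(value) if isinstance(value, (list, tuple, set, range)) else value
        for name, value in filters.items()
    }
    payload = json.dumps({"indicator": indicator, "filters": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16] + ".parquet"


key_a = query_cache_key("DAC1.10.1010", years=[2022], providers=[4, 12], currency="EUR")
key_b = query_cache_key("DAC1.10.1010", providers=[12, 4], currency="EUR", years=[2022])
key_c = query_cache_key("DAC1.10.1010", years=[2022], providers=[4, 12], currency="USD")
print(key_a == key_b)  # same filters, different order
print(key_a == key_c)  # different currency
```

With a scheme like this, changing any filter (for example the currency) produces a new cache entry instead of reusing a result computed for different settings.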

3. Bulk Cache

Stores complete database downloads.

Bulk Cache for Large Datasets

from oda_data import DAC1Data, set_data_path

set_data_path("data")

dac1 = DAC1Data(years=range(2010, 2024))

# First time: Downloads the entire bulk file (a few minutes)
dac1.download(bulk=True)

# Subsequent queries: Reads from cached file (seconds)
data = dac1.read(using_bulk_download=True)

Characteristics:

Complete database tables
Large files (hundreds of MB)
Persists across sessions
Shared across all queries using that database

Managing Cache

Clear All Caches

Clear All Cached Data

from oda_data import clear_cache

# Removes all cached data
clear_cache()

# Everything will be re-downloaded on next use

Disk Space

After you clear the cache, subsequent queries download the data again. Clear it only when you need to free disk space or force fresh downloads.

Disable Caching Temporarily

Disable Cache for Development

from oda_data import OECDClient, disable_cache, enable_cache

# Turn off caching
disable_cache()

# Now every query fetches fresh data
client = OECDClient(years=[2022])
data = client.get_indicators("DAC1.10.1010")  # Always downloads

# Re-enable caching
enable_cache()

Use cases for disabling cache:

Testing with latest OECD data
Debugging cache-related issues
Development and testing
Ensuring data freshness
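
When caching is switched off temporarily, it is easy to forget to switch it back on if the code in between raises an error. A small context manager guards against that. The sketch below is generic: `cache_disabled` is a hypothetical helper, and the two stand-in functions exist only so the example runs without the package installed.

```python
from contextlib import contextmanager


@contextmanager
def cache_disabled(disable, enable):
    """Run a block with caching off, then turn it back on no matter what."""
    disable()
    try:
        yield
    finally:
        enable()  # runs even if the block raises


# Stand-ins for the real switch functions
state = {"enabled": True}


def fake_disable():
    state["enabled"] = False


def fake_enable():
    state["enabled"] = True


with cache_disabled(fake_disable, fake_enable):
    inside = state["enabled"]  # caching is off inside the block
after = state["enabled"]       # and back on afterwards
print(inside, after)
```

With the package, you would pass `disable_cache` and `enable_cache` from `oda_data` as the two arguments.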

Check Cache Location

Find Your Cache Directory

from oda_data.config import ODAPaths

print(f"Cache directory: {ODAPaths.raw_data}")

Manual Cache Cleanup

You can manually manage cache files:

# View cache size
du -sh data/

# Remove all bulk files (they are downloaded again when next needed)
rm data/bulk/*

# Remove query cache
rm -rf data/queries/
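
If you prefer to check the cache size from Python (for example on a system without `du`), a portable sketch using only `pathlib` could look like this. The helper names are hypothetical, and the demo runs on a temporary directory that mimics the `bulk/` and `queries/` layout.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(path) -> int:
    """Total size of all regular files under path, in bytes."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def human_readable(num_bytes: float) -> str:
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f} TB"


# Demo on a temporary directory that mimics the cache layout
demo = Path(tempfile.mkdtemp())
(demo / "bulk").mkdir()
(demo / "queries").mkdir()
(demo / "bulk" / "a.parquet").write_bytes(b"x" * 2048)
(demo / "queries" / "b.parquet").write_bytes(b"x" * 512)

total = dir_size_bytes(demo)
print(human_readable(total))
```

To measure your real cache, call `dir_size_bytes` on the directory you passed to `set_data_path()`.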

Performance Optimization

Pattern: Use Bulk Downloads for Multiple Queries

Optimize with Bulk Downloads

from oda_data import OECDClient, set_data_path

set_data_path("data")

# Enable bulk download once
client = OECDClient(
    years=range(2015, 2024),
    use_bulk_download=True
)

# First call downloads bulk file (slow initial download)
data1 = client.get_indicators("DAC1.10.1010")

# Subsequent calls are very fast
data2 = client.get_indicators("DAC1.10.1015")
data3 = client.get_indicators("DAC1.10.1210")
# All fast because they use the same cached bulk file

Cache Lifetime and Refresh

When Cache is Invalidated

Cached data is refreshed when:

Bulk file is older than 30 days (stale data check)
You explicitly clear cache
OECD releases new data versions
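
An age-based staleness check like the 30-day rule above can be sketched with file modification times. This is an illustration under stated assumptions, not the package's actual logic: `is_stale` is a hypothetical helper, and the demo backdates a temporary file to simulate an old download.

```python
import os
import tempfile
import time
from pathlib import Path

MAX_AGE_DAYS = 30  # threshold taken from the text above


def is_stale(path, max_age_days: float = MAX_AGE_DAYS, now=None) -> bool:
    """True if the file is missing or was last modified more than max_age_days ago."""
    path = Path(path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    age_days = (now - path.stat().st_mtime) / 86400  # seconds per day
    return age_days > max_age_days


# Demo: a fresh file, then the same file backdated by 31 days
demo_file = Path(tempfile.mkdtemp()) / "bulk.parquet"
demo_file.write_bytes(b"data")
fresh = is_stale(demo_file)

old = time.time() - 31 * 86400
os.utime(demo_file, (old, old))  # set access and modification times
stale = is_stale(demo_file)
print(fresh, stale)
```

Note that modification time only tells you when the file was written locally; it says nothing about whether the OECD has published a newer release since then.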

Force Fresh Data

Force Download of Latest Data

from oda_data import clear_cache, OECDClient

# Clear cache to force fresh download
clear_cache()

# Next query gets latest data
client = OECDClient(years=[2023])
fresh_data = client.get_indicators("DAC1.10.1010")

Check Data Freshness

Verify When Data Was Cached

from datetime import datetime
from oda_data.config import ODAPaths

# Print the last-modified time of each cached bulk file
bulk_dir = ODAPaths.raw_data / "bulk"

if bulk_dir.exists():
    for file in bulk_dir.glob("*.parquet"):
        mod_date = datetime.fromtimestamp(file.stat().st_mtime)
        print(f"{file.name}: Last modified {mod_date}")

Troubleshooting

Issue: Cache taking too much space

Solution: Clear old caches or remove specific bulk files:

from oda_data import clear_cache

# Nuclear option: clear everything
clear_cache()

# Or manually remove large bulk files you don't need
# rm data/bulk/CRS_*.parquet

Issue: Getting old data

Solution: Force refresh:

from oda_data import clear_cache, OECDClient

clear_cache()

# Now get fresh data
client = OECDClient(years=[2023])
data = client.get_indicators("DAC1.10.1010")

Issue: Cache corrupted or causing errors

Solution: Clear and rebuild:

from oda_data import clear_cache, set_data_path

# Clear all caches
clear_cache()

# Re-set path to ensure clean state
set_data_path("data")

# Cache will rebuild correctly on next use

Issue: Working across multiple projects

Solution: Use project-specific cache paths:

# In project 1's code
from oda_data import set_data_path
set_data_path("project1/data")

# In project 2's code
from oda_data import set_data_path
set_data_path("project2/data")

# Each project has its own cache

Advanced: Multi-Process Safety

The package uses file locks to prevent cache corruption in multi-process scenarios:

Safe for Parallel Processing

from multiprocessing import Pool
from oda_data import OECDClient, set_data_path

def fetch_indicator(indicator):
    set_data_path("data")  # Same cache for all processes
    client = OECDClient(years=[2022])
    return client.get_indicators(indicator)

# Safe: Multiple processes can share cache
if __name__ == "__main__":
    indicators = ["DAC1.10.1010", "DAC1.10.1015", "DAC1.10.1210"]
    with Pool(3) as pool:
        results = pool.map(fetch_indicator, indicators)

File locks prevent:

Race conditions during downloads
Corrupted cache files
Duplicate downloads
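
To make the role of a file lock concrete, here is a minimal sketch of one classic technique: atomically creating a lock file with `os.O_EXCL`, then re-checking for the target inside the lock. It is not the package's locking code; `file_lock` and `download_once` are hypothetical helpers, and the package may rely on a different mechanism.

```python
import os
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def file_lock(lock_path, timeout: float = 10.0, poll: float = 0.05):
    """Hold an exclusive lock by atomically creating lock_path."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)  # another process holds the lock; wait and retry
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock


def download_once(target: Path, fetch) -> bool:
    """Write target unless it already exists. Returns True if fetch() was called."""
    with file_lock(str(target) + ".lock"):
        if target.exists():  # re-check inside the lock
            return False
        tmp = target.with_suffix(target.suffix + ".tmp")
        tmp.write_bytes(fetch())
        os.replace(tmp, target)  # atomic rename: readers never see a partial file
        return True


target = Path(tempfile.mkdtemp()) / "bulk.parquet"
first = download_once(target, lambda: b"payload")
second = download_once(target, lambda: b"payload")
print(first, second)
```

The lock serializes writers, the re-check avoids duplicate downloads, and the temporary file plus rename keeps a half-written file from ever appearing under the final name. A simple lock file like this is left behind if a process crashes while holding it, which is one reason production code often uses a dedicated locking library instead.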

Best Practices

Do:

Set a persistent data path for your project
Use bulk downloads for multiple queries
Reuse client configurations
Clear cache periodically to free space
Pre-download bulk files for offline work

Don't:

Clear cache unnecessarily (wastes time re-downloading)
Use different data paths for the same project
Disable caching in production code
Manually edit cache files (can corrupt them)
Keep very old cache files (may be stale)

Summary

The caching system speeds up repeated data access at three levels:

Memory cache: Instant access during your session
Query cache: Fast retrieval of previous filtered queries
Bulk cache: Quick access to complete databases

Manage caches with:

set_data_path() - Configure where cache is stored
clear_cache() - Remove all cached data
disable_cache()/enable_cache() - Control caching behavior

Understanding caching helps you:

Optimize query performance
Manage disk space
Ensure data freshness
Work efficiently offline