EurostatAPI.jl

A Julia package for accessing and processing data from the Eurostat API.

Overview

EurostatAPI.jl provides a simple and robust interface to fetch data from the Eurostat API. It handles all the complexities of working with the Eurostat SDMX (Statistical Data and Metadata eXchange) API, including:

  • Making HTTP requests to Eurostat API endpoints with automatic retries
  • Parsing complex JSON responses from the SDMX format
  • Correctly interpreting dimension references and metadata
  • Converting raw API data to clean, structured DataFrames
  • Handling special values and missing data appropriately

Eurostat is the statistical office of the European Union, providing high-quality statistics on Europe covering areas such as:

  • Economy and finance
  • Population and social conditions
  • Industry, trade and services
  • Agriculture and fisheries
  • Environment and energy
  • Science, technology and digital society

This package provides programmatic access to any Eurostat dataset available through their unified SDMX API.

Features

  • Simple Interface: Fetch any Eurostat dataset with just a dataset ID and year
  • Advanced Filtering: Filter by indicators, geographic regions, and product codes to reduce data size
  • Automatic Fallback: Handles "response too large" errors by automatically applying smart filters
  • Chunked Fetching: Fetch very large datasets in manageable chunks
  • Robust Error Handling: Automatic retries, timeout handling, and informative error messages
  • Efficient Data Processing: Handles large datasets with progress logging and memory-efficient processing
  • Clean Data Output: Automatic conversion to DataFrame format with proper data types
  • Special Value Handling: Proper interpretation of confidential (:C), not available (:) and not applicable (-) values
  • Flexible Time Periods: Support for any time period supported by the underlying dataset
  • Metadata Support: Query dataset dimensions and available codes before fetching
  • Comprehensive Logging: Detailed processing information for transparency and debugging

Installation

using Pkg
Pkg.add("EurostatAPI")

Or from the NILU Julia registry:

using Pkg
Pkg.Registry.add(url="https://git.nilu.no/julia/registry")
Pkg.add("EurostatAPI")

Quick Start

using EurostatAPI
using DataFrames

# Fetch a dataset for a specific year
# Example: European GDP data
df = fetch_dataset("nama_10_gdp", 2022)

# Fetch with filters - only specific countries
df_filtered = fetch_dataset("nama_10_gdp", 2022; geo=["EU27_2020", "DE", "FR"])

# Automatic handling of large datasets
df_auto = fetch_with_fallback("nama_10_gdp", 2022)  # Adds filters if response too large

# Fetch large dataset in chunks
df_chunked = fetch_dataset_chunked("nama_10_gdp", 2022; chunk_by=:geo, chunk_size=10)

# Display the first few rows
first(df, 5)

# Check the structure of the data
describe(df)

# Basic analysis examples
# Count records by country (if geo dimension exists)
if "geo" in names(df)  # names(df) returns strings, so compare against a string
    country_counts = combine(groupby(df, :geo), nrow => :count)
    sort!(country_counts, :count, rev=true)
    println("Top 5 countries by record count:")
    println(first(country_counts, 5))
end

# Find non-missing values
non_missing_data = filter(row -> !ismissing(row.value), df)
println("Records with actual values: $(nrow(non_missing_data))")

Finding Dataset IDs

To use EurostatAPI.jl, you need the ID of the Eurostat dataset you want to fetch. There are several ways to find it:

  1. Browse the Eurostat Data Explorer
  2. Look at the URL or dataset information - the ID is typically shown
  3. Check the Eurostat API documentation

Common dataset patterns:

  • nama_* - National accounts
  • demo_* - Demography and migration
  • env_* - Environment
  • nrg_* - Energy
  • t2020_* - Europe 2020 indicators

Understanding the Data

Eurostat datasets are multi-dimensional, typically organized along dimensions such as:

  • Geographic units (geo): Countries, regions, etc.
  • Time periods (time): Years, quarters, months
  • Statistical indicators (indic_*): What is being measured
  • Economic sectors (nace_*): Industry classifications
  • Demographics (age, sex): Population breakdowns
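
To make the multi-dimensional layout concrete, here is a minimal sketch of how a flat observation index can be mapped back to one code per dimension. It assumes a row-major layout in which the last dimension varies fastest, as in the JSON-stat format; the function name `index_to_codes` and the dimension codes are invented for illustration and are not part of EurostatAPI.jl.

```julia
# Sketch: map a flat, zero-based observation index to one code per dimension,
# assuming a row-major layout where the last dimension varies fastest.
# The dimension names and codes below are made-up illustration data.

function index_to_codes(flat_index::Integer, dims::Vector{Pair{String,Vector{String}}})
    codes = Dict{String,String}()
    remainder = flat_index
    # Walk the dimensions from last to first, peeling off one position each time.
    for (name, labels) in Iterators.reverse(dims)
        remainder, pos = divrem(remainder, length(labels))
        codes[name] = labels[pos + 1]   # Julia arrays are one-based
    end
    return codes
end

dims = [
    "geo"  => ["DE", "FR", "IT"],
    "unit" => ["CP_MEUR", "CLV10_MEUR"],
    "time" => ["2021", "2022"],
]

# Index 7 = 1*(2*2) + 1*2 + 1, i.e. zero-based position 1 in every dimension.
codes = index_to_codes(7, dims)
println(codes)
```

Each row of the resulting DataFrame corresponds to one such combination of dimension codes plus the observed value.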

Special Values

Eurostat data uses special codes for missing or restricted data:

  • :C or :c - Confidential data
  • : - Not available
  • - - Not applicable
  • 0 - Zero or rounded to zero

These special values are converted to missing in the DataFrame, but the original codes are preserved in the original_value column when present.
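
The conversion described above could look roughly like the following sketch. `parse_cell` and `SPECIAL_CODES` are hypothetical names used only here; the package's internal logic may differ, for example in how it treats text that is neither a number nor a known code.

```julia
# Hypothetical helper, not part of EurostatAPI.jl: split a raw cell into
# (value, original_value) following the conventions listed above.
const SPECIAL_CODES = Set([":", ":C", ":c", "-"])

function parse_cell(raw::AbstractString)
    cell = strip(raw)
    if cell in SPECIAL_CODES
        # Special code: no numeric value, but keep the code for later inspection.
        return (value = missing, original_value = String(cell))
    end
    number = tryparse(Float64, cell)
    if number === nothing
        # Assumption: unrecognized text is also treated as missing.
        return (value = missing, original_value = String(cell))
    end
    return (value = number, original_value = missing)
end

for raw in ["1234.5", ":", ":C", "-", "0"]
    println(rpad(raw, 8), parse_cell(raw))
end
```

Note that "0" is an ordinary numeric value in this sketch, so it is kept as 0.0 rather than turned into missing.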

Data Structure

The DataFrame returned by fetch_dataset contains:

Column              Description
dataset             The Eurostat dataset ID
year                The year requested
value               The actual data value (numeric or missing)
original_value      Original string value for special codes
fetch_date          When the data was retrieved
original_key        Internal API key reference
Various dimensions  Depends on dataset (geo, time, indicators, etc.)

Advanced Usage

Filtering Large Datasets

fetch_dataset accepts optional filters that reduce the size of the response:

# Filter by geographic regions
df_germany = fetch_dataset("nama_10_gdp", 2022; geo=["DE"])

# Filter by several countries
df_countries = fetch_dataset("nama_10_gdp", 2022;
                             geo=["DE", "FR"])

# Mix an EU aggregate with individual countries
df_filtered = fetch_dataset("nama_10_gdp", 2022;
                           geo=["EU27_2020", "DE", "FR"])

Automatic Fallback

Use fetch_with_fallback to automatically handle "response too large" errors:

# Automatically adds filters if the response is too large
df = fetch_with_fallback("nama_10_gdp", 2022)
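
The general pattern behind such a fallback can be sketched as follows. This is not the implementation of fetch_with_fallback: the error type `ResponseTooLarge`, the stand-in fetcher `fake_fetch`, and the wrapper `fetch_or_narrow` are all hypothetical and exist only so the example runs without network access.

```julia
# Hypothetical error type and fetcher used only for this illustration.
struct ResponseTooLarge <: Exception
    ncells::Int
end

# Stand-in for a network call: "fails" when no geo filter is given.
function fake_fetch(dataset::String, year::Int; geo::Vector{String} = String[])
    isempty(geo) && throw(ResponseTooLarge(5_000_000))
    return [(dataset = dataset, year = year, geo = g) for g in geo]
end

# Try the unfiltered request first; narrow it only if the response is too large.
function fetch_or_narrow(fetch, dataset, year; fallback_geo = ["EU27_2020"])
    try
        return fetch(dataset, year)
    catch err
        err isa ResponseTooLarge || rethrow()   # other errors are not handled here
        @info "Response too large, retrying with a geo filter" ncells = err.ncells geo = fallback_geo
        return fetch(dataset, year; geo = fallback_geo)
    end
end

rows = fetch_or_narrow(fake_fetch, "nama_10_gdp", 2022)
println(length(rows), " row(s) after fallback")
```

The trade-off of this pattern is that the caller may receive less data than requested, so logging which filters were applied matters.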

Chunked Fetching

For extremely large datasets, fetch data in chunks:

# Get metadata first
metadata = get_dataset_metadata("nama_10_gdp")
println("Total geographic codes: $(length(metadata.geo_codes))")

# Fetch in chunks
df_chunked = fetch_dataset_chunked("nama_10_gdp", 2022;
                                  chunk_by=:geo,
                                  chunk_size=10)
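
Conceptually, chunking by geography means splitting the list of geographic codes into groups and issuing one request per group. The sketch below shows only that splitting step with the standard library; the code list is made up, and how fetch_dataset_chunked partitions internally is an assumption.

```julia
# Made-up list of geo codes; in practice it would come from the dataset metadata.
geo_codes = ["AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "EL", "ES", "FI", "FR"]
chunk_size = 5

# Iterators.partition yields consecutive slices of at most chunk_size elements.
chunks = [collect(part) for part in Iterators.partition(geo_codes, chunk_size)]

for (i, chunk) in enumerate(chunks)
    println("Request $i would cover: ", join(chunk, ", "))
end
```

A smaller chunk size means more requests but smaller responses, which helps when a single response would otherwise exceed the API's size limit.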

Error Handling

using EurostatAPI
using DataFrames  # for nrow
using HTTP        # for the exception types checked below

try
    df = fetch_dataset("nama_10_gdp", 2023)
    println("Successfully fetched $(nrow(df)) records")
catch e
    if e isa HTTP.StatusError
        println("HTTP error: ", e.status)
        if e.status == 404
            println("Dataset not found - check the dataset ID")
        elseif e.status == 400
            println("Bad request - check the year parameter")
        end
    elseif e isa HTTP.TimeoutError
        println("Request timed out - try again or check your connection")
    else
        println("Unexpected error: ", e)
    end
end

Working with Time Series

# Get data for multiple recent years
years = [2020, 2021, 2022, 2023]
all_data = DataFrame()

for year in years
    try
        yearly_data = fetch_dataset("nama_10_gdp", year)
        append!(all_data, yearly_data; cols=:union)  # tolerate columns that differ between years
        println("Added data for $year: $(nrow(yearly_data)) records")
    catch e
        println("Failed to get data for $year: $e")
    end
end

println("Total records collected: $(nrow(all_data))")

Memory Management

For very large datasets:

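# Monitor memory usage
# Base.gc_live_bytes() reports the size of live heap objects
live_mb() = round(Base.gc_live_bytes() / 1024^2, digits=1)

println("Live heap before: $(live_mb()) MB")
df = fetch_dataset("large_dataset_id", 2022)
GC.gc()  # Force garbage collection
println("Live heap after: $(live_mb()) MB")
println("Dataset size: $(nrow(df)) rows, $(round(Base.summarysize(df) / 1024^2, digits=1)) MB")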

Available Years

Most Eurostat datasets provide historical data, but availability varies:

# Check what years might be available (returns conservative estimate)
available_years = get_dataset_years("nama_10_gdp")
println("Potentially available years: $(first(available_years, 5))...$(last(available_years, 5))")

Note: get_dataset_years provides a conservative estimate. The actual available years depend on the specific dataset and are determined by Eurostat's data release schedule.

Performance Considerations

  • Large datasets: Some Eurostat datasets contain millions of observations
  • Network timeouts: The package includes automatic retries with 120-second timeouts
  • Memory usage: Large datasets may require substantial RAM
  • API limits: Eurostat may have rate limiting (though not typically restrictive)
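
For readers who want similar retry behavior around their own calls, Julia's standard library offers `retry` with `ExponentialBackOff`. The sketch below uses a simulated flaky function rather than a real request, and the delay settings are arbitrary; it does not reflect the retry policy or the 120-second timeout used inside the package.

```julia
# Stand-in for a flaky network call: fails twice, then succeeds.
attempts = Ref(0)
function flaky_request()
    attempts[] += 1
    attempts[] < 3 && error("simulated timeout on attempt $(attempts[])")
    return "payload"
end

# Base.retry wraps a function and re-invokes it when it throws.
# ExponentialBackOff(n = 3) allows up to 3 retries with growing delays.
delays = ExponentialBackOff(n = 3, first_delay = 0.1, max_delay = 1.0)
robust_request = retry(flaky_request; delays = delays)

result = robust_request()
println("Got $(result) after $(attempts[]) attempts")
```

Backing off between attempts also keeps the client polite if Eurostat does apply rate limiting.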

Contributing

This package is part of the CirQuant project. Contributions are welcome via the project's GitLab repository.

License

Licensed under the MIT License. See the LICENSE file for details.

Citation

If you use EurostatAPI.jl in your research, please cite:

Boero, R. (2025). EurostatAPI.jl: A Julia package for accessing Eurostat data.

And acknowledge the data source:

Eurostat. European Statistics. https://ec.europa.eu/eurostat
