Data Commons knowledge graph

ONE Data is built on top of Data Commons. At the core of Data Commons is a knowledge graph that integrates data from a wide range of sources into a unified, structured schema. This knowledge graph enables us to seamlessly combine and query data across diverse topics, geographies, and time periods. It serves as the foundation for our analytical workflows, making complex data exploration and insight generation more efficient and scalable.

What is a knowledge graph?

A knowledge graph is a structured representation of real-world entities and the relationships between them, organized as a network of nodes and edges. The Data Commons knowledge graph represents the world in this way, as a directed labeled graph: information is organized as a set of nodes (entities) connected by edges (relationships), and each edge carries a defined label known as a property.

This flexible structure allows Data Commons to capture and link data across a wide range of domains, from time series on demographics and employment, to information about hurricanes, or even proteins. This structure comes from applying a schema or vocabulary to the data, which allows Data Commons to have a consistent way of representing entities and their relationships. This schema is largely derived from Schema.org.

To illustrate at a basic level how a knowledge graph works, consider the following statements:

Zimbabwe is a country
Harare and Bulawayo are cities in Zimbabwe
The latitude of Harare is -17.8292 (that is, 17.8292 degrees south)

These statements can be represented as a set of nodes and edges in a knowledge graph as shown in the diagram below.

flowchart TD
    HRE((" Harare ")) -->|"containedInPlace"| ZIM((" Zimbabwe "))
    BUL(("Bulawayo")) -->|"containedInPlace"| ZIM
    ZIM -->|"typeOf"| COUNTRY((" Country "))
    HRE -->|"latitude"| LAT_HARARE(("-17.8292"))

Key concepts

Below is a brief overview of the key concepts in the Data Commons knowledge graph. The full official documentation is available on the Data Commons documentation site.

Nodes

A node is a uniquely identified entity, concept, or value in the Data Commons knowledge graph. It is represented as a subject and identified by a DCID. Each node is associated with a set of relationships or properties, also known as edges.

Each node includes the following components:

One or more types: such as an entity, event, statistical variable, or statistical observation.
A unique identifier: the DCID.
Various properties: relationships to other nodes or attributes.
Provenance: information about the origin of the data.

As in other knowledge graphs, connections between nodes are expressed as triples, consisting of a subject node, a predicate (or edge), and an object node. The Data Commons knowledge graph is composed of billions of such triples.
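The triple structure described above can be sketched in a few lines of Python. This is a toy in-memory model, not how Data Commons stores or serves its graph; the `Triple` class and the two helper functions are hypothetical names introduced for illustration, and the node labels are the human-readable names from the Zimbabwe example rather than real DCIDs.

```python
from __future__ import annotations

from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The containment and type statements from the Zimbabwe example, as triples.
TRIPLES = [
    Triple("Harare", "containedInPlace", "Zimbabwe"),
    Triple("Bulawayo", "containedInPlace", "Zimbabwe"),
    Triple("Zimbabwe", "typeOf", "Country"),
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Follow outgoing edges: subject --predicate--> ?"""
    return [t.obj for t in TRIPLES if t.subject == subject and t.predicate == predicate]


def subjects_of(predicate: str, obj: str) -> list[str]:
    """Follow incoming edges: ? --predicate--> obj"""
    return [t.subject for t in TRIPLES if t.predicate == predicate and t.obj == obj]


print(objects_of("Zimbabwe", "typeOf"))             # ['Country']
print(subjects_of("containedInPlace", "Zimbabwe"))  # ['Harare', 'Bulawayo']
```

Because edges are directed, "what is Zimbabwe?" and "what is in Zimbabwe?" are different traversals over the same list of triples.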

Type

Every node has at least one type, and types can be subclasses of other types. For entities and events, the type is usually another entity (e.g., the node Harare has the type City). At the root, all types are instances of the Class type.

For statistical variables and observations, the type is always StatisticalVariable and StatVarObservation, respectively.
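As a rough sketch of how types and subclasses combine, the snippet below walks from a node to its direct types and then up through their parent types. The dictionaries are illustrative only: the parent type `Place` and the helper `all_types` are assumptions made for this example, not a copy of the actual Data Commons schema.

```python
from __future__ import annotations

# Direct types of each node (illustrative subset).
TYPE_OF = {"Harare": ["City"], "Zimbabwe": ["Country"]}

# Subclass edges between types. "Place" as the parent is an assumption.
SUBCLASS_OF = {"City": ["Place"], "Country": ["Place"]}


def all_types(node: str) -> set[str]:
    """Return the node's direct types plus every ancestor type."""
    seen: set[str] = set()
    stack = list(TYPE_OF.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # Climb to the parent types of the type just visited.
            stack.extend(SUBCLASS_OF.get(current, []))
    return seen


print(sorted(all_types("Harare")))  # ['City', 'Place']
```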

DCID

Each node has a DCID, a unique identifier used to reference the node within the knowledge graph. DCIDs can be viewed in the Knowledge Graph browser and are used for both entities and statistical variables.

Property

Nodes have properties that describe their characteristics. Each property is represented as an edge to another node, labeled with the property name.

If the object of the property is a primitive value (e.g., a string, number, or date), it is a "leaf" node, referred to as an attribute. Examples include latitude, year, unique identifiers, etc.

Other properties may link the node to other nodes such as entities, events, etc. For instance, the node Addis Ababa has a typeOf property (linked to City) and a containedInPlace property (linked to Ethiopia).

Note: The DCID of a property generally matches its name.
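The difference between attributes and links to other nodes can be made concrete with a small sketch. `NodeRef` and `split_properties` are hypothetical helpers, and the targets use display names rather than DCIDs; the `name` value is included only as an example of a primitive string attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRef:
    """Marks a property value that points at another node."""
    target: str


# Properties of the Addis Ababa node from the example above.
addis_ababa = {
    "typeOf": NodeRef("City"),
    "containedInPlace": NodeRef("Ethiopia"),
    "name": "Addis Ababa",  # primitive string, so a leaf attribute
}


def split_properties(props):
    """Separate links to other nodes from primitive-valued attributes."""
    links = {k: v.target for k, v in props.items() if isinstance(v, NodeRef)}
    attributes = {k: v for k, v in props.items() if not isinstance(v, NodeRef)}
    return links, attributes


links, attributes = split_properties(addis_ababa)
print(links)       # {'typeOf': 'City', 'containedInPlace': 'Ethiopia'}
print(attributes)  # {'name': 'Addis Ababa'}
```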

Provenance

Every node and triple includes important properties that describe the origin of the data.

Provenance: All triples have a provenance, typically the URL of the data provider’s website (e.g., www.abs.gov.au). Entity types also have a provenance, often represented by a DCID (e.g., AustraliaStatistics). For many property types defined by the Data Commons schema, the provenance is always datacommons.org.
Source: A source is a property of a provenance or dataset. It is usually the name of the organization that provides the data or defines the schema. For example, for the provenance www.abs.gov.au, the source is the Australian Bureau of Statistics.
Dataset: A dataset refers to a specific collection of data provided by a source. A single source may provide multiple datasets. For instance, the Australian Bureau of Statistics provides both the Australia Statistics dataset (not to be confused with the provenance DCID) and the Australia Subnational Administrative Boundaries dataset.

A statistical variable may have multiple provenances, since many datasets define the same variables.

Statistical variable

In Data Commons, statistical measurements and time series data are modeled as nodes. A statistical variable (statVar) represents any type of metric, statistic, or measurement that can be taken for an entity at a given time, such as a count, an average, a percentage, etc.

The type of a statistical variable is always the special subclass StatisticalVariable. For example, the metric Median Age of Female Population is a node whose type is StatisticalVariable.

A statistical variable can be simple, such as Total Population, or more complex, such as Hispanic Female Population. Complex variables may or may not be broken down into their constituent parts.

Entity

An entity represents a persistent real-world object or concept. Examples include cities, countries, elections, high schools, or even Earth itself.

While Data Commons contains information about a wide range of entity types, most of the information is currently about places. About 2.9 million places are catalogued. For each place, the metadata includes type, geographic containment, shape, area, and more. See the place types page for a full list of available place types.

Time

Time can be recorded at any date resolution in Data Commons. It is generally the date of measurement and is specified in ISO 8601 format.

Examples:

2011 – the year 2011
2019-06 – June 2019
2019-06-05T17:21:00-06:00 – 5:21 PM on June 5, 2019, at a UTC offset of -06:00
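A minimal sketch of handling these mixed resolutions in Python, using only the standard library. The function name `parse_dc_date` and the resolution labels are hypothetical; the sketch infers the resolution from the string length, which assumes the inputs follow the shapes shown in the examples above.

```python
from datetime import datetime

# String length -> (resolution label, strptime format) for date-only values.
_DATE_FORMATS = {
    4: ("year", "%Y"),
    7: ("month", "%Y-%m"),
    10: ("day", "%Y-%m-%d"),
}


def parse_dc_date(value: str):
    """Return (resolution, datetime) for an ISO 8601 date string."""
    if len(value) in _DATE_FORMATS:
        resolution, fmt = _DATE_FORMATS[len(value)]
        return resolution, datetime.strptime(value, fmt)
    # Anything longer is treated as a full timestamp, optionally with a UTC offset.
    return "timestamp", datetime.fromisoformat(value)


for example in ("2011", "2019-06", "2019-06-05T17:21:00-06:00"):
    resolution, parsed = parse_dc_date(example)
    print(example, "->", resolution, parsed.isoformat())
```

Lower-resolution dates are padded to the first day or month when converted to `datetime`, so it is worth keeping the resolution label alongside the parsed value.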

Observation

An observation is a single measured value for a statistical variable, for a specific entity and time period. Its type is always StatVarObservation.

Time series data for a statistical variable (e.g., population over several years) is represented as a sequence of observations.
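One way to picture this is as a list of records that is filtered and sorted into a series. The `Observation` class and `time_series` function below are hypothetical, and the numeric values are placeholders chosen for illustration, not real statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    variable: str  # statistical variable being measured
    entity: str    # entity the value applies to
    date: str      # ISO 8601 date string
    value: float


# Placeholder values for illustration only; these are not real statistics.
observations = [
    Observation("Total Population", "Zimbabwe", "2021", 3.0),
    Observation("Total Population", "Zimbabwe", "2019", 1.0),
    Observation("Total Population", "Zimbabwe", "2020", 2.0),
    Observation("Total Population", "Harare", "2020", 0.5),
]


def time_series(obs, variable, entity):
    """Select one variable/entity pair and order its observations by date."""
    selected = [o for o in obs if o.variable == variable and o.entity == entity]
    # ISO 8601 strings of the same resolution sort chronologically as text.
    return sorted(selected, key=lambda o: o.date)


series = time_series(observations, "Total Population", "Zimbabwe")
print([(o.date, o.value) for o in series])
# [('2019', 1.0), ('2020', 2.0), ('2021', 3.0)]
```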

Facet

A facet is the metadata that describes how a set of observations was produced and where it came from. Multiple sources might provide data on the same variable but use different measurement methods, cover different time spans, or rely on different underlying predictive models. Data Commons uses facets to tell these versions apart.

A facet can include the following properties:

measurementMethod: The technique used to measure a variable. It describes how the measurement was made (by count, estimate, or another approach) and may name the organization responsible for the measurement, as in WorldHealthOrganizationEstimates. Multiple measurement methods may be associated with a single node.
observationPeriod: The time span over which an observation is made, specified using ISO 8601 duration formatting.
measurementDenominator: The denominator used in a fractional measurement.
scalingFactor: Used with proportion-based variables. It indicates the multiplier applied to the measurementDenominator to produce the final measurement value, particularly when the numerator and denominator are on different scales.
unit: The unit in which the variable is measured. Examples include IndianRupee, kilowatt hours, etc.
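To show why this metadata matters, the sketch below keeps observations from different sources apart by grouping them on their facet. The `Facet` class, the `series_by_facet` helper, the `HypotheticalSurvey` method name, and all values are assumptions made for illustration; only `WorldHealthOrganizationEstimates` comes from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Facet:
    measurement_method: Optional[str] = None
    observation_period: Optional[str] = None  # ISO 8601 duration, e.g. "P1Y"
    unit: Optional[str] = None


who = Facet(measurement_method="WorldHealthOrganizationEstimates", observation_period="P1Y")
survey = Facet(measurement_method="HypotheticalSurvey", observation_period="P1Y")

# (date, value, facet) rows for a single variable and entity. Placeholder values.
rows = [("2019", 10.0, who), ("2020", 11.0, who), ("2020", 12.5, survey)]


def series_by_facet(rows):
    """Group (date, value) pairs by facet so sources are never mixed."""
    grouped = defaultdict(list)
    for date, value, facet in rows:
        grouped[facet].append((date, value))
    return dict(grouped)


for facet, series in series_by_facet(rows).items():
    print(facet.measurement_method, series)
```

Grouping first and choosing a facet second avoids building a single time series that silently switches methodology partway through.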

StatVarGroup

A StatVarGroup is a collection of conceptually related statistical variables.

Example:

Global Health Observatory is a StatVarGroup.
- It contains a child StatVarGroup: Health Expenditure.
  - Which includes variables like Current Health Expenditure (Che) Per Capita in USD.

Groups can also be based on shared characteristics. For instance:

The StatVarGroup Person With Gender = Female includes variables like Female Median Age and Female Median Income.

Groups may also be hierarchical. For instance:

The StatVarGroup Person With Age, Gender = Female is a subgroup of Person With Gender = Female.
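The nesting of groups can be modelled as a simple tree. The sketch below is a hypothetical in-memory structure (the class and its fields are not part of any Data Commons API) that uses the Global Health Observatory example to collect every variable beneath a group.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StatVarGroup:
    name: str
    variables: list[str] = field(default_factory=list)
    children: list[StatVarGroup] = field(default_factory=list)

    def all_variables(self) -> list[str]:
        """Collect variables from this group and every descendant group."""
        found = list(self.variables)
        for child in self.children:
            found.extend(child.all_variables())
        return found


health_expenditure = StatVarGroup(
    "Health Expenditure",
    variables=["Current Health Expenditure (Che) Per Capita in USD"],
)
gho = StatVarGroup("Global Health Observatory", children=[health_expenditure])

print(gho.all_variables())
# ['Current Health Expenditure (Che) Per Capita in USD']
```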

StatVarPeerGroup

A StatVarPeerGroup groups statistical variable nodes that are meaningful peers. These groups are used to organize variables around a broader concept.

Example:

Completion rate, by location and education level (%) is a StatVarPeerGroup. It includes members such as:
- Completion rate [Rural, Primary education]
- Completion rate [Rural, Lower secondary education]

Members of a StatVarPeerGroup are enumerated using the member property.
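In the example above, the two members share a location (Rural) and differ in education level. A small sketch can check which qualifier varies across the members of a peer group. The dictionary layout, the qualifier keys `location` and `education`, and the helper `varying_qualifiers` are all hypothetical; only the `member` property name comes from the text.

```python
from __future__ import annotations

peer_group = {
    "name": "Completion rate, by location and education level (%)",
    "member": [
        {"location": "Rural", "education": "Primary education"},
        {"location": "Rural", "education": "Lower secondary education"},
    ],
}


def varying_qualifiers(members: list[dict[str, str]]) -> set[str]:
    """Return the qualifier names whose values differ across members."""
    keys = set().union(*(m.keys() for m in members))
    return {k for k in keys if len({m.get(k) for m in members}) > 1}


print(varying_qualifiers(peer_group["member"]))  # {'education'}
```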

Topic

A Topic represents a broad conceptual area like economy, poverty, or crime. Topics help organize variables under common themes.

Like StatVarGroups, topics can be nested. For example:

The UN defines 12 thematic areas, one of which is Health.
- Which contains subtopics like Infectious Diseases.
  - Which includes variables such as Number of reported cases of cholera, listed as a relevantVariable.

A variable may belong to multiple topics, allowing for flexible categorization.
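Since a variable can sit under several topics, a useful operation is the reverse lookup: given a variable, list every topic path that leads to it. The nested dictionary below is a hypothetical representation built from the Health example; `subtopics` and `topic_paths` are names invented for this sketch, while `relevantVariable` is the property mentioned above.

```python
TOPICS = {
    "Health": {
        "relevantVariable": [],
        "subtopics": {
            "Infectious Diseases": {
                "relevantVariable": ["Number of reported cases of cholera"],
                "subtopics": {},
            },
        },
    },
}


def topic_paths(variable, topics=TOPICS, prefix=()):
    """Yield every root-to-topic path whose topic lists the variable."""
    for name, topic in topics.items():
        path = prefix + (name,)
        if variable in topic["relevantVariable"]:
            yield path
        # Recurse into nested topics, carrying the path so far.
        yield from topic_paths(variable, topic["subtopics"], path)


print(list(topic_paths("Number of reported cases of cholera")))
# [('Health', 'Infectious Diseases')]
```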

Choosing between StatVarGroup, StatVarPeerGroup and Topic

Each of these concepts serves a different purpose in organizing and exploring data and should be used based on the context:

StatVarPeerGroup is like a single row in a tidy DataFrame—variables differ by just one qualifier (e.g. age). Use it for comparing related indicators side-by-side.
StatVarGroup is like a folder hierarchy—grouping variables by structure or concept. Use it to explore related variables within a dataset or domain.
Topic is like a subject heading—categorizing variables by broad real-world themes (e.g. Health, Poverty). Use it for thematic discovery and navigation.

Event

An event is a real-world occurrence tied to a specific point in time, such as an election, weather disaster, or financial disbursement.

In modelling ODA data, for example, events might represent commitments or disbursements.