Modelling data for the ONE knowledge graph

This page explains the transformation step of the ETL pipeline, which is responsible for structuring the raw data to be loaded into the knowledge graph, and enhancing the data by adding metadata and other information. Understanding how to structure data for a Data Commons knowledge graph is essential to correctly load data into the ONE knowledge graph and make the best use of knowledge graph features.

Custom Data Commons requires that data is provided in a specific schema, format, and file structure. This page focuses on the approach used by ONE to model data for the knowledge graph. For more information about modelling data for the Data Commons knowledge graph, refer to the Data Commons documentation.

ONE's data model

ONE models data using the Explicit Schema, which requires DCIDs to be explicitly defined for all variables (and entity types if needed) as nodes in MCF files. Data must be provided in CSV files referencing these DCIDs. This approach allows for a more flexible and detailed representation of data, enabling the specification of variables in a variable-per-row format and the inclusion of additional properties for variables or entities.

At a top level, the following components need to be provided:

CSV data files: All data must be in CSV format.

JSON configuration file: A JSON configuration file, named config.json, that specifies how to map and resolve the CSV contents to the Data Commons knowledge graph schema.

MCF files: MCF files that describe the statistical variables, variable peer groups, variable groups, topics, custom entities, and other nodes that are needed to model the data.
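As a rough illustration of the second component, the Python sketch below assembles a config.json for a single variable-per-row CSV file. The key names (inputFiles, importType, format, columnMappings, provenance, sources) reflect our reading of the custom Data Commons documentation and are assumptions to verify against the version you deploy. The file path, CSV column names, source name and URLs are hypothetical.

```python
import json


def build_config(csv_path: str, column_mappings: dict, provenance: str,
                 source_name: str, source_url: str, provenance_url: str) -> dict:
    """Build a config dict for one variable-per-row CSV file.

    The key names are assumed from the custom Data Commons documentation;
    check them against the current docs before relying on this.
    """
    return {
        "inputFiles": {
            csv_path: {
                "importType": "observations",
                "format": "variablePerRow",
                # expected column name -> column name actually used in the CSV
                "columnMappings": column_mappings,
                "provenance": provenance,
            }
        },
        "sources": {
            source_name: {
                "url": source_url,
                "provenances": {provenance: provenance_url},
            }
        },
    }


# Hypothetical file, column names and source, for illustration only.
config = build_config(
    csv_path="source1/datafile1.csv",
    column_mappings={"entity": "donor_code", "date": "year",
                     "variable": "indicator", "value": "amount"},
    provenance="Example Provenance",
    source_name="Example Source",
    source_url="https://example.org",
    provenance_url="https://example.org/dataset",
)
print(json.dumps(config, indent=2))
```

Generating the file from code rather than editing it by hand makes it easier to keep the column mappings in step with the pipeline that writes the CSV files.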

Directory structure

You can organize data into as many CSV and MCF files as needed, and the directory can be structured with multiple subdirectories. However, there must be exactly one config.json file, and it must sit in the root directory. Structure the directory clearly and logically so that files are easy to find and maintain. As a rule, group the files for each source into their own subdirectory.

data directory/
├── config.json
├── source1/
│   ├── nodes1.mcf
│   ├── datafile1.csv
│   └── datafile2.csv
└── source2/
    ├── nodes2.mcf
    ├── datafile3.csv
    └── datafile4.csv
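The layout rule can be checked before loading. The following is a minimal sketch, assuming a hypothetical helper named check_data_directory that is not part of any Data Commons tooling; it only verifies that a single config.json exists at the root and lists the CSV and MCF files it finds.

```python
from pathlib import Path


def check_data_directory(root: Path) -> dict:
    """Check the config.json rule and list the CSV and MCF files under root."""
    configs = sorted(root.rglob("config.json"))
    # Exactly one config.json is allowed, and it must be at the root.
    if configs != [root / "config.json"]:
        raise ValueError(
            f"expected exactly one config.json in {root}, found: {configs}"
        )
    return {
        "csv": sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv")),
        "mcf": sorted(p.relative_to(root).as_posix() for p in root.rglob("*.mcf")),
    }
```

Calling check_data_directory(Path("data directory")) on the tree above would return the four CSV paths and two MCF paths, and would raise a ValueError if config.json were missing or duplicated in a subdirectory.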

Data structure

The explicit schema uses a variable-per-row format. Generally, it would be formatted as below:

entity	date	variable	value	unit	measurementMethod
country/FRA	2019	ONE/DAC1_10_1010-GE-USD	12200000000	USDollar	GrantEquivalent
country/DEU	2020	ONE/DAC1_10_1010-GE-USD	28700000000	USDollar	GrantEquivalent
country/FRA	2019	ONE/DAC1_10_1010-ND-USD	10700000000	USDollar	NetDisbursement
country/DEU	2020	ONE/DAC1_10_1010-ND-USD	25700000000	USDollar	NetDisbursement
country/FRA	2020	ONE/DAC1_10_1820-GD-EUR24	1210000000	ConstantEUR_2024	GrossDisbursement
country/DEU	2020	ONE/DAC1_10_1820-GD-EUR24	2760000000	ConstantEUR_2024	GrossDisbursement

While not shown, scalingFactor and observationPeriod can also be added as columns and specified for some or all rows. See facets.

The names and order of the columns aren’t important, as you can map them to the expected columns in the JSON file. However, the entity and variable codes must be valid DCIDs. If such DCIDs don’t already exist in the base Data Commons, you must provide definitions of them in MCF files.

The type of the entities in a single file should be unique; do not mix multiple entity types in the same CSV file. For example, if you have observations for cities and counties, put all the city data in one CSV file and all the county data in another one.
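The rules in this section lend themselves to a quick pre-load check. The sketch below is a hypothetical helper, not Data Commons tooling. It assumes the CSV uses the default column names shown above, treats entity, date, variable and value as required, and infers the entity type from the DCID prefix (for example country in country/FRA). That last step is only a heuristic, since the real type is whatever the knowledge graph records for the entity.

```python
import csv
import io

# Assumed required columns; facet columns such as unit are treated as optional.
REQUIRED_COLUMNS = {"entity", "date", "variable", "value"}


def check_observations(csv_text: str) -> list[str]:
    """Return a list of problems found in a variable-per-row CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    prefixes = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Heuristic: the DCID prefix stands in for the entity type.
        prefixes.add((row["entity"] or "").split("/", 1)[0])
        try:
            float(row["value"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: value {row['value']!r} is not numeric")
    if len(prefixes) > 1:
        problems.append(f"mixed entity types: {sorted(prefixes)}")
    return problems
```

An empty list means the file passed these checks. It does not confirm that the DCIDs exist in the graph, which still has to be verified against the MCF files or base Data Commons.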

Defining Statistical Variables

Nodes in the Data Commons knowledge graph are defined in MCF format. When you define a variable, you must explicitly assign a DCID.

You can define your statistical variables in a single MCF file, or split them into as many separate MCF files as needed. MCF files must have a .mcf suffix.

Here’s an example of defining statistical variables, using the variables shown in the data structure section. Notice that the first two represent the same concept (“Official Development Assistance”), but each node (StatVar) models a specific version of it, with constraints such as grant equivalent or net disbursement. The third models a different indicator, expressed in constant 2024 euros.

Node: ONE/DAC1_10_1010-GE-USD
name: "Official Development Assistance (ODA) [Grant Equivalent]"
typeOf: dcid:StatisticalVariable
description: "DAC1 data for Official Development Assistance (ODA), as Grant Equivalents, in current US Dollars."
shortDisplayName: "Official Development Assistance (ODA)"
statType: dcid:measuredValue
measuredProperty: dcid:value
measurementQualifier: dcid:Nominal
memberOf: dcid:ONE/g/dac1_FlowsByProvider
searchDescription: "Total aid", "total ODA", "Total foreign aid"
populationType: dcid:EconomicActivity
flowType: dcid:ODA
dac1Measure: 1010
aidIndicator: dcid:dc/svpg/DAC1_10_1010

Node: ONE/DAC1_10_1010-ND-USD
name: "Official Development Assistance (ODA) [Net Disbursements]"
typeOf: dcid:StatisticalVariable
description: "DAC1 data for Official Development Assistance (ODA) as net disbursements, in current US Dollars."
shortDisplayName: "Official Development Assistance (ODA)"
statType: dcid:measuredValue
measuredProperty: dcid:value
measurementQualifier: dcid:Nominal
memberOf: dcid:ONE/g/dac1_FlowsByProvider
searchDescription: "Total aid", "total ODA", "Total foreign aid"
populationType: dcid:EconomicActivity
flowType: dcid:ODA
dac1Measure: 1010
aidIndicator: dcid:dc/svpg/DAC1_10_1010

Node: ONE/DAC1_10_1820-GD-EUR24
name: "Refugees in donor countries [Gross Disbursements]"
typeOf: dcid:StatisticalVariable
description: "DAC1 data for Refugees in donor countries as gross disbursements, in constant 2024 Euros."
shortDisplayName: "Refugees in donor countries"
searchDescription: "IDRC", "in-donor refugee costs"
statType: dcid:measuredValue
measuredProperty: dcid:value
measurementQualifier: dcid:RealValue
memberOf: dcid:ONE/g/otherIn-DonorExpenditures
populationType: dcid:EconomicActivity
flowType: dcid:ODA
dac1Measure: 1820
aidIndicator: dcid:dc/svpg/DAC1_10_1820

The following fields are always required:

Node: This is the DCID of the entity you are defining. You should add ONE/ as the prefix to differentiate our custom variables from base DC variables. The prefix acts as a namespace.
typeOf: For statistical variables, this is always dcid:StatisticalVariable.
name: This is the descriptive name of the variable, which is displayed in the Statistical Variable Explorer and various other places in the UI.
populationType: This is the type of thing being measured, and its value must be an existing Class type.
measuredProperty: This is a property of the thing being measured. It must be a domainIncludes property of the populationType you have specified.

We additionally aim to always include:

description: A description and/or definition of the variable. It should include information about the constraints.
memberOf: the StatVarGroup to which the variable belongs.
searchDescription: sentences or strings that would match what users may search for in natural language.
measurementQualifier: This is similar to the observationPeriod field for CSV files, but applies to all observations of the variable. It provides additional qualification to an observation. This is particularly useful when modelling economic flows, where it can be Nominal, RealValue or PurchasingPowerParity, for example. It can also be any string representing additional properties of the variable, e.g., Weekly, Monthly, Annual.

The following fields are optional:

statType: By default, this is dcid:measuredValue, which is simply a raw value of an observation. If your variable is a calculated value, such as an average, a minimum or maximum, you can use minValue, maxValue, meanValue, medianValue, sumValue, varianceValue, marginOfError, stdErr.
measurementDenominator: For percentages or ratios, this refers to another statistical variable. For example, for per capita, the measurementDenominator is Count_Person.

The MCF definition of a node (StatVar in this case) can include additional properties. In the example above, we also include flowType: dcid:ODA. Custom properties must appear after standard Data Commons properties. Their values must also be defined as MCF nodes.

Note

All fields that reference another node in the graph must be prefixed by dcid:. Text values that do not reference another node must be in quotation marks.
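The required-field list above can be enforced with a small check before loading. The sketch below is a simplified, hypothetical MCF reader written for illustration. It assumes nodes are separated by blank lines and that each property sits on one line, and it is not a full MCF parser.

```python
import re

REQUIRED_STATVAR_FIELDS = ("Node", "typeOf", "name", "populationType", "measuredProperty")


def parse_mcf(text: str) -> list[dict]:
    """Parse MCF text into one dict per node (simplified: one line per property)."""
    nodes = []
    for block in re.split(r"\n\s*\n", text.strip()):
        node = {}
        for line in block.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the first colon only, so "dcid:" values stay intact.
            prop, _, value = line.partition(":")
            node[prop.strip()] = value.strip()
        if node:
            nodes.append(node)
    return nodes


def missing_statvar_fields(node: dict) -> list[str]:
    """Return the required StatVar fields that are absent or empty."""
    return [field for field in REQUIRED_STATVAR_FIELDS if not node.get(field)]


# Hypothetical nodes; the second one lacks populationType.
SAMPLE = '''
Node: ONE/EXAMPLE_1
typeOf: dcid:StatisticalVariable
name: "Example variable"
populationType: dcid:EconomicActivity
measuredProperty: dcid:value

Node: ONE/EXAMPLE_2
typeOf: dcid:StatisticalVariable
name: "Variable without a population type"
measuredProperty: dcid:value
'''

for node in parse_mcf(SAMPLE):
    print(node.get("Node"), missing_statvar_fields(node))
```

The check only confirms that the fields are present. Whether populationType names an existing class, or measuredProperty is valid for it, still has to be checked against the knowledge graph.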

Modelling bilateral variables

In most datasets, we’re accustomed to working with data about a single place—for example, GDP of France in 2020. These are usually stock or aggregated flow indicators that can be cleanly attributed to one entity. However, some indicators—especially in domains like finance, trade, or aid—involve flows between two places, such as aid from France to Togo. Modelling this bilateral data is more involved.

In Data Commons a statistical observation (StatVarObservation) is about a specific variable (variableMeasured), for a single specific entity (observationAbout) at a specific time (observationDate).

When modelling bilateral flows with a provider (e.g. "France") and a recipient (e.g. "Togo") either:

The observationAbout could be “France” (the provider), where the flow is to “Togo” (the recipient).
The observationAbout could be “Togo” (the recipient), where the flow is from “France” (the provider).

It may be that both perspectives are equivalent (harmonised), but it may also be the case that they represent different values (unharmonised). We therefore have to consider the perspective from which to structure the data (the provider or the recipient). The ‘other’ entity would become a ‘constraint’, which means a new statistical variable would be minted to identify it.

A general schema rule is that every property that describes a variable must be represented as either:

The statType, measuredProperty, or populationType
A constraint property on the statistical variable (like recipientEntity, aidType, etc.)

A general observation rule is that an observation has exactly one slot for an entity/place (observationAbout) and everything else must be represented as either:

The value and date
A facet (measurementMethod, unit, etc.). Facets can’t hold Places.

For the example above, there are two ways to model the data:

Option 1: Provider-centric

observationAbout: provider
Included in the statVar as constraint: recipientEntity (+aidType, flowType, etc.)

Option 2: Recipient-centric

observationAbout: recipient
Included in the statVar as constraint: providerEntity (+aidType, flowType, etc.)

You should model the data according to the perspective which is customarily used to query and present the data. For flow-like data, you should default to the “source” as observationAbout and the “target” as the constraint. The name you give to StatVars should clearly reflect the directionality of the flow.

In other cases, you should choose the lower-cardinality option as the observationAbout.

You should not duplicate the data by modelling both perspectives. If the perspective matters (e.g., if the provider perspective doesn’t agree with the recipient perspective), the perspective could be modelled as a measurementMethod.

Node: ONE/DAC2A_10_206-ND-USD-TGO
name: "Bilateral Official Development Assistance (ODA) [Net Disbursements to Togo]"
typeOf: dcid:StatisticalVariable
description: "Net disbursements of bilateral Official Development Assistance (ODA) to Togo"
shortDisplayName: "Bilateral ODA [to Togo]"
statType: dcid:measuredValue
measuredProperty: dcid:value
measurementQualifier: dcid:Nominal
memberOf: dcid:ONE/g/dac2a_Bilateral
searchDescription: "Bilateral aid to Togo", "Net disbursements to Togo"
populationType: dcid:EconomicActivity
recipientCountry: dcid:country/TGO
flowType: dcid:ODA
dac2aMeasure: 206
aidIndicator: dcid:dc/svpg/DAC2A_10_206

Node: ONE/DAC2A_10_206-ND-USD-NGA
name: "Bilateral Official Development Assistance (ODA) [Net Disbursements to Nigeria]"
typeOf: dcid:StatisticalVariable
description: "Net disbursements of bilateral Official Development Assistance (ODA) to Nigeria"
shortDisplayName: "Bilateral ODA [to Nigeria]"
statType: dcid:measuredValue
measuredProperty: dcid:value
measurementQualifier: dcid:Nominal
memberOf: dcid:ONE/g/dac2a_Bilateral
searchDescription: "Bilateral aid to Nigeria", "Net disbursements to Nigeria"
populationType: dcid:EconomicActivity
recipientCountry: dcid:country/NGA
flowType: dcid:ODA
dac2aMeasure: 206
aidIndicator: dcid:dc/svpg/DAC2A_10_206
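Because the provider-centric approach mints one StatVar per recipient, these nodes are usually generated rather than written by hand. The sketch below follows the pattern of the nodes above for a single indicator. The function name and the recipient table are hypothetical, it assumes the recipient's ISO3 code is used both in the DCID suffix and in the recipientCountry constraint, and it renders only a subset of the properties.

```python
def bilateral_oda_statvar(recipient_iso3: str, recipient_name: str) -> str:
    """Render one provider-centric StatVar: net ODA disbursements to a recipient."""
    return "\n".join([
        f"Node: ONE/DAC2A_10_206-ND-USD-{recipient_iso3}",
        f'name: "Bilateral Official Development Assistance (ODA) '
        f'[Net Disbursements to {recipient_name}]"',
        "typeOf: dcid:StatisticalVariable",
        f'description: "Net disbursements of bilateral Official Development '
        f'Assistance (ODA) to {recipient_name}"',
        "statType: dcid:measuredValue",
        "measuredProperty: dcid:value",
        "measurementQualifier: dcid:Nominal",
        "memberOf: dcid:ONE/g/dac2a_Bilateral",
        "populationType: dcid:EconomicActivity",
        # The recipient is a constraint on the variable, not the observed entity.
        f"recipientCountry: dcid:country/{recipient_iso3}",
        "flowType: dcid:ODA",
    ])


# Hypothetical recipient table: ISO3 code -> display name.
RECIPIENTS = {"TGO": "Togo", "NGA": "Nigeria"}

mcf_text = "\n\n".join(
    bilateral_oda_statvar(code, name) for code, name in RECIPIENTS.items()
)
print(mcf_text)
```

In the CSV, the provider then appears in the entity column (observationAbout) and the generated DCID in the variable column, so each row still describes exactly one place.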

Defining Statistical Variable Groups

StatVarGroups are a specific type of Node in the Data Commons knowledge graph. As with other Nodes, they are defined in MCF files. When you define any Node in MCF, you must explicitly assign it a DCID.

For example, the variable ONE/DAC1_10_1010-GE-USD is part of the dac1_FlowsByProvider group, which is part of the officialDevelopmentAssistance group, which itself is part of the Development group, which belongs to the ‘top’ group ONE.

Node: dcid:ONE/g/ONE
name: "ONE"
typeOf: dcid:StatVarGroup
specializationOf: dcid:dc/g/Root

Node: dcid:ONE/g/development
name: "Development"
typeOf: dcid:StatVarGroup
specializationOf: dcid:ONE/g/ONE

Node: dcid:ONE/g/officialDevelopmentAssistance
name: "Official Development Assistance"
typeOf: dcid:StatVarGroup
specializationOf: dcid:ONE/g/development

Node: dcid:ONE/g/dac1_FlowsByProvider
name: "DAC1: Flows by Provider"
typeOf: dcid:StatVarGroup
specializationOf: dcid:ONE/g/officialDevelopmentAssistance

The following fields are always required:

Node: This is the DCID of the group you are defining. It must be prefixed by g/ and may include an additional prefix before the g.
typeOf: In the case of a statistical variable group, this is always dcid:StatVarGroup.
name: This is the name of the heading that will appear in the Statistical Variable Explorer.
specializationOf: For a top-level group, this must be dcid:dc/g/Root, which is the root group in the statistical variable hierarchy in the knowledge graph. To create a subgroup, specify the DCID of another node you have already defined. For example, if you wanted to create a subgroup of WHO called Smoking, you would create a Smoking node with specializationOf: dcid:who/g/WHO.

We additionally aim to always include:

description: A description and/or definition of the group. It should include information about the constraints.
searchDescription: sentences or strings that would match what users may search for in natural language.

You can assign a Statistical Variable to as many group nodes as you like by using a comma-separated list of group DCIDs in the memberOf field.
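Since every group must eventually lead back to dc/g/Root through specializationOf, it can be useful to walk that chain and catch broken or circular links. The sketch below assumes the specializationOf values have already been collected into a plain dictionary; group_path is a hypothetical helper.

```python
def group_path(group: str, parents: dict[str, str], root: str = "dc/g/Root") -> list[str]:
    """Return the chain of groups from `group` up to the root group."""
    path = [group]
    while path[-1] != root:
        parent = parents.get(path[-1])
        if parent is None:
            raise ValueError(f"{path[-1]} has no specializationOf")
        if parent in path:
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
    return path


# specializationOf links taken from the example nodes above.
PARENTS = {
    "ONE/g/ONE": "dc/g/Root",
    "ONE/g/development": "ONE/g/ONE",
    "ONE/g/officialDevelopmentAssistance": "ONE/g/development",
    "ONE/g/dac1_FlowsByProvider": "ONE/g/officialDevelopmentAssistance",
}

print(" -> ".join(group_path("ONE/g/dac1_FlowsByProvider", PARENTS)))
```

A group that raises here would not be attached to the hierarchy, either because its parent was never defined or because two groups point at each other.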

Defining Statistical Variable Peer Groups

StatVarPeerGroups are a specific type of Node in the Data Commons knowledge graph. As with other Nodes, they are defined in MCF files. When you define any Node in MCF, you must explicitly assign it a DCID.

Node: dcid:ONE/svpg/DAC1_10_1010
name: "Total Official Development Assistance (ODA)"
description: "DAC1 data for Official Development Assistance (ODA), by measurement method"
typeOf: dcid:StatVarPeerGroup
member: dcid:ONE/DAC1_10_1010-GE-USD, dcid:ONE/DAC1_10_1010-ND-USD, dcid:ONE/DAC1_10_1010-GD-USD, dcid:ONE/DAC1_10_1010-GE-EUR_2024

The following fields are always required:

Node: This is the DCID of the peer group you are defining. It must be prefixed by svpg/ and may include an additional prefix before the svpg.
typeOf: In the case of a statistical variable peer group, this is always dcid:StatVarPeerGroup.
name: This is the name of the heading that will appear in the Statistical Variable Explorer.
member: StatVarPeerGroups contain related variables, which must be listed as member, separated by commas.

We additionally aim to always include:

description: A description and/or definition of the peer group. It should include information about the constraints.
searchDescription: sentences or strings that would match what users may search for in natural language.
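Peer groups that collect every version of one indicator can be derived from the StatVar DCIDs themselves. The sketch below assumes DCIDs of the form namespace/indicator_code-constraints, with no hyphen inside the indicator code. Both function names and the example group name are hypothetical.

```python
from collections import defaultdict


def group_by_indicator(statvar_dcids: list[str]) -> dict[str, list[str]]:
    """Map each peer group DCID to the StatVar DCIDs that share its indicator code."""
    groups = defaultdict(list)
    for dcid in statvar_dcids:
        namespace, _, rest = dcid.partition("/")
        # Assumes the indicator code is everything before the first hyphen.
        indicator_code = rest.split("-", 1)[0]
        groups[f"{namespace}/svpg/{indicator_code}"].append(dcid)
    return dict(groups)


def render_peer_group(group_dcid: str, name: str, members: list[str]) -> str:
    """Render a StatVarPeerGroup node with the required fields."""
    return "\n".join([
        f"Node: dcid:{group_dcid}",
        f'name: "{name}"',
        "typeOf: dcid:StatVarPeerGroup",
        "member: " + ", ".join(f"dcid:{m}" for m in members),
    ])


groups = group_by_indicator([
    "ONE/DAC1_10_1010-GE-USD",
    "ONE/DAC1_10_1010-ND-USD",
    "ONE/DAC1_10_1820-GD-EUR24",
])
for group_dcid, members in groups.items():
    print(render_peer_group(group_dcid, "Example peer group name", members))
    print()
```

Names and descriptions still need to be written by a person, which is why render_peer_group takes the name as an argument rather than deriving it.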

Defining Topics

Topics are a specific type of Node in the Data Commons knowledge graph. As with other Nodes, they are defined in MCF. When you define any Node in MCF, you must explicitly assign it a DCID.

The following fields are always required:

Node: This is the DCID of the topic you are defining. It must be prefixed by topic/ and may include an additional prefix before the topic.
typeOf: In the case of a topic, this is always dcid:Topic.
name: This is the name of the heading that will appear in the Statistical Variable Explorer.
relevantVariable: Topics may contain other Topics, StatVarPeerGroups or StatisticalVariables, which must be listed as relevantVariable, separated by commas.

We additionally aim to always include:

description: A description and/or definition of the topic.
searchDescription: sentences or strings that would match what users may search for in natural language.

Conventions

We follow several conventions to ensure consistency, clarity and ease of maintaining the data in the knowledge graph.

Naming conventions

We mainly curate and aggregate data from other sources: most of the data in our knowledge graph is collected from other data providers. We therefore aim to name Statistical Variables consistently with how the original sources name their indicators, which makes it easier to identify and trace the lineage of the data we’re modelling.

There are a few conventions to follow:

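StatVar DCIDs must include only ASCII letters (A-Z / a-z), digits, underscores (_) and hyphens (-), apart from the slash (/) that separates the namespace. There can be no spaces, periods, or other punctuation marks. When referenced from another node, they must be prefixed with dcid:.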
Make DCIDs and names stable—once published, never change a DCID.
For property Nodes, prefer descriptive over cryptic DCIDs.
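The character rule can be checked mechanically. The sketch below validates the bare DCID, without the dcid: reference prefix. It assumes the slash is permitted only as a separator between non-empty segments, such as the namespace and the identifier, which is our interpretation and not an official Data Commons rule.

```python
import re

# Letters, digits, underscores and hyphens, in slash-separated non-empty segments.
DCID_PATTERN = re.compile(r"[A-Za-z0-9_-]+(/[A-Za-z0-9_-]+)*")


def is_valid_dcid(dcid: str) -> bool:
    """Return True if the DCID uses only the allowed characters."""
    return DCID_PATTERN.fullmatch(dcid) is not None


for candidate in ["ONE/DAC1_10_1010-GE-USD", "ONE/DAC1 10", "ONE/DAC1.10"]:
    print(candidate, is_valid_dcid(candidate))
```

Stability cannot be checked this way. Keeping a list of published DCIDs under version control and comparing new releases against it is one way to catch accidental renames.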

Other Data Commons conventions

You should try to align to the standards and practices used by Data Commons. While there is no official documentation on the conventions used by the Data Commons team, exploring the knowledge graph is usually a good way to get a sense of their approach.

Below is some general guidance on how to think about each Node type:

StatVar

<namespace>/<indicator_code>-<constraints-separated-with-hyphen>

The namespace should generally be “ONE”. The constraints should be listed in this order: counterpart-measurementMethod-unit-observationPeriod. Not all constraints are required. Others, not listed here, may be required, and they should be appended after the ones on this list.

StatVarGroup

<namespace>/g/<lowerCamelCaseGroupName>

The namespace should generally be “ONE”. Group names should not be cryptic.

StatVarPeerGroup

<namespace>/svpg/<indicator_code>

The namespace should generally be “ONE”. We use peer groups to link Nodes that all relate to a specific indicator (hence without the constraints), and this structure reflects that. You may need to create other StatVarPeerGroups that don’t follow this pattern. In that case, give them lowerCamelCase names.

Topic

<namespace>/topic/<lowerCamelCaseTopic>

The namespace should generally be “ONE”. Topics are conceptual, so their names should always be short and human-readable. If you must use multiple words, use lowerCamelCase.

Entity

<EntityType>/<Identifier>

With rare exceptions, you should always specify the entity type. You can then use codes or names to identify the entity.