Getting started with bblocks-datacommons-tools
This page walks you through the basic steps to install and start using
bblocks-datacommons-tools to prepare and load the data for your custom Data Commons instance.
About custom Data Commons instances
Anyone can build and manage their own Data Commons instance—combining their own datasets with the base data available from datacommons.org, and taking advantage of built-in features like natural language queries, interactive visualisations, and data exploration tools. For many organisations, a custom instance is a practical way to publish data with exploration and visualisation tools without building infrastructure from scratch.
However, preparing your data for a Data Commons knowledge graph, uploading the necessary files to Google Cloud Platform (GCP), and deploying the service can be a repetitive and error-prone process. For smaller projects with limited data and infrequent updates, managing the workflow manually may be sufficient. But for larger datasets or pipelines with regular refreshes, the process quickly becomes tedious and difficult to maintain.
bblocks-datacommons-tools streamlines this workflow by allowing you to programmatically prepare and load data using
a Python-based pipeline.
Before you get started, you should have a basic understanding of how custom Data Commons instances work and what the data loading process involves. You can find the official documentation here.
At a high level, you should be familiar with:
- The config.json file: the JSON configuration file that specifies how to map and resolve data to the Data Commons knowledge graph schema.
- The data files: CSV files containing the data formatted for a specified schema, either implicit or explicit.
- Meta Content Framework (MCF) files: files that provide additional flexibility for modeling data for the knowledge graph.
- Uploading data and deploying: Files need to be loaded to GCP, and the service needs to be deployed.
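To make the first of these pieces more concrete, here is a minimal sketch of the shape a config.json might take, built as a plain Python dictionary and serialised with the standard library. The top-level keys (`inputFiles`, `variables`, `sources`) and their nesting reflect our reading of the custom Data Commons configuration format and should be treated as an assumption; check them against the official documentation. The values are borrowed from the examples used later on this page.

```python
import json

# Illustrative sketch only: the key names and nesting are assumed from the
# custom Data Commons config format and should be verified against the
# official documentation before use.
config = {
    "inputFiles": {
        "climate_finance/one_cf_provider_commitments.csv": {
            "entityType": "Country",
            "provenance": "ONE Climate Finance",
            "ignoreColumns": ["oecd_provider_code"],
            "observationProperties": {"unit": "USDollar"},
        }
    },
    "variables": {
        "climateFinanceProvidedCommitments": {
            "name": "Climate Finance Commitments (bilateral)",
            "description": "Funding for climate adaptation and mitigation projects",
            "properties": {"measurementMethod": "Commitment"},
        }
    },
    "sources": {
        "ONE Data": {
            "url": "https://data.one.org",
            "provenances": {
                "ONE Climate Finance": "https://datacommons.one.org/data/climate-finance-files"
            },
        }
    },
}

# Serialise to the JSON text that would be written to config.json.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Writing this structure by hand for every file and variable is the repetitive step that the package is meant to take over.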
Installation
The package can be installed in various ways.
Directly as
Or from the main bblocks package with an extra:
Preparing data
bblocks-datacommons-tools offers convenient functionality to prepare configuration JSON, MCF, and custom data files
without having to manually edit these files. To access this functionality, create an instance of the
CustomDataManager class.
The CustomDataManager lets you create or edit the config.json file without editing it manually.
You can register variables, sources or provenances, and data files.
```python
# Register a source and one of its provenances
manager.add_provenance(
    provenance_name="ONE Climate Finance",
    provenance_url="https://datacommons.one.org/data/climate-finance-files",
    source_name="ONE Data",
    source_url="https://data.one.org",
)

# Register a statistical variable in the config
manager.add_variable_to_config(
    statVar="climateFinanceProvidedCommitments",
    name="Climate Finance Commitments (bilateral)",
    group="ONE/Environment/Climate finance/Provider perspective/Commitments",
    description="Funding for climate adaptation and mitigation projects",
    searchDescriptions=[
        "Climate finance commitments provided",
        "Adaptation and mitigation finance provided",
    ],
    properties={"measurementMethod": "Commitment"},
)
```
You can pass pandas DataFrames to the manager, specifying what schema is being used, and the manager will handle exporting the data as CSVs in the correct format.
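As a rough illustration of what "the correct format" means for implicit schema data, the sketch below writes a small CSV using only the standard library. It assumes the implicit-schema layout described in the Data Commons documentation: an entity column first, a date column second, and then one column per statistical variable. The column names `country` and `year` are hypothetical, and the rows are dummy placeholder values, not real data. A DataFrame with the same columns is what you would hand to the manager.

```python
import csv
import io

# Assumed implicit-schema layout: entity column, then date column,
# then one column per statistical variable. The last column stands in
# for an extra field that would be listed under ignoreColumns.
fieldnames = [
    "country",
    "year",
    "climateFinanceProvidedCommitments",
    "oecd_provider_code",
]

# Placeholder rows: the numbers and codes are dummy values, not real data.
rows = [
    ["country/FRA", "2020", 1.0, "code-a"],
    ["country/FRA", "2021", 2.0, "code-a"],
    ["country/DEU", "2020", 3.0, "code-b"],
]

# Write the header and rows to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(fieldnames)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```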
```python title="Add implicit schema data"
import pandas as pd

df = pd.DataFrame(...)

# Register the DataFrame as an implicit schema data file
manager.add_implicit_schema_file(
    file_name="climate_finance/one_cf_provider_commitments.csv",
    provenance="ONE Climate Finance",
    entityType="Country",
    data=df,
    ignoreColumns=["oecd_provider_code"],
    observationProperties={"unit": "USDollar"},
)
```
Once you are finished adding and editing data and configuration, you can
validate and export all the files for your custom Data Commons instance.
```python
manager.export_all("path/to/output/folder")
```
Read more detailed documentation about preparing data with the CustomDataManager
here ↗
Loading data
You can programmatically push the data and config to a Google Cloud Storage Bucket, trigger the data load job, and redeploy your Data Commons instance.
First, specify all the configuration settings needed to add files to the storage bucket. For convenience these
can be specified in a .env file (read more about the configuration settings here).
```python
from bblocks.datacommons_tools.gcp_utilities import get_kg_settings

# Load the settings from a .env file
settings = get_kg_settings(source="env", env_file="customDC.env")
```
Now we can load data and configuration files to the storage bucket, run the data load job on GCP, and redeploy the custom Data Commons instance.