Datamart Spreadsheet Annotation

Datamart Spreadsheet Annotation enables users to annotate spreadsheets with metadata to support conversion to KGTK format compatible with Datamart canonical data format.

Annotations

A spreadsheet annotation consists of:

For example: Column A is inserted with top 7 rows being: dataset, role, type, description, name, unit and tag. Datamart Spreadsheet Example

The first column must have the labels shown in cells A1:A7. If this is not the case, then annotation validation will fail while uploading the annotated spreadsheet to Datamart.

Annotation Description: cells A1:A7

dataset

This row should provide the dataset_id of the dataset this file belongs to. The dataset_id should be created in the Datamart prior to uploading the annotated spreadsheet.

role

The role row specifies the role the columns has in the data. Any column may have a role annotation, but role annotations are not required for all columns. Columns without role annotations will be ignored unless they have a type annotation (described below), in which case an error will be shown to the user.

The permissible values for role annotation are:

type

The type row specifies semantic types for the columns to enable precise parsing of the contents of columns. Columns with role annotations must also have a type annotation, and type annotations cannot be provided for columns without a role annotation.

Valid type for each role are as follows:

name

The name row allows users to enter descriptive names for columns with variable or qualifier role annotations; names cannot be provided for columns with other roles. Names are optional, and when not provided, the header of the column is used as the name.

description

The description row allows users to provide a longer description for columns with variable or qualifier role annotations; similar to names, description annotations cannot be provided for columns with other roles, and descriptions are optional.

unit

The unit row allows users to specify units for columns with variable role annotations. When provided, the unit is a string that designates the units to be used for all values of the variable. Units are optional as some spreadsheets have units columns to designate units for each row.

tag

The tag row allows users to specify tags in the form of key-value pairs for columns with variable role annotations. The format of the tag string is as follows,

KEY1:VALUE1|KEY2:VALUE2|...

This string will be split and key-values pairs will be added in the variable metadata and will be returned in variable metadata API. The variable metadata can be searched using tags.

Additional Annotations

The header annotation marks the row that contains the headers of the spreadsheet; this annotation is useful for spreadsheets where the header is not the first row of the spreadsheet.

data

The data annotation marks the first row of the spreadsheet that contains the data; this annotation is useful for spreadsheets where the data does not immediately follow the headers.

Annotations: Fine Details

  1. By default, location, time, qualifier and unit annotations are applied to every variable. If there are multiple variables, this means each variable receives the full set of location, time, qualifier and unit annotations.
  2. If a column has no annotation, it will be skipped.
  3. The following is true for qualifier and unit annotations:
    • These annotations can use ; to indicate that the annotated column should be used with a specific variable column. For example; lets say we have a variable column with heading "fatalities" and a qualifier column with heading "location". qualifier;fatalities on the column "location" indicates that this qualifier column should be used only with column "fatalities".
    • It is possible to attach the annotation to multiple variables by separating the column names with "|" as qualifier;fatalities|injuries, or unit;price|currency.
  4. Special case for unit annotation.
    • When multiple unit annotations are present, they are concatenated together, separated by commas, from left to right.

User Workflow

User Workflow

Step 1: Annotate Spreadsheet (with dataset metadata)

First step is to annotate the spreadsheet, as described in this document. The upload Annotated Spreadsheet API will validate the annotation in terms of,

Step 2: Upload the Annotated Spreadsheet

Next step is to upload the annotated spreadsheet to Datamart using the /datasets/{dataset_id}/annotated API. An example is shown in Upload an Annotated Spreadsheet section in the jupyter notebook

Note: The Annotated Spreadsheet can be an Excel file with .xlsx extension or a csv file with .csv extension.

Step 4: Get Canonical data for the variables created

The Upload Annotated Spreadsheet API returns the metadata for all the variables created. Example:

 {
    "name": "Total affected",
    "variable_id": "total_affected",
    "description": "Total affected in TEST01",
    "corresponds_to_property": "PVARIABLE-QTEST01-019",
    "qualifier": [
      {
        "name": "stated in",
        "identifier": "P248"
      },
      {
        "name": "point in time",
        "identifier": "P585"
      }
    ]
  }

Users can use the variable_id to get the canonical data back. An example is shown in the Get time series data for a variable section in the jupyter notebook

Examples

Crude Oil Production

Crude Oil Production

Annotation Details:

Crude Oil Price

Crude Oil Price

Annotation Details:

AID Sample

AID Sample

Annotation Details:

AID Sample: Variable Specific Qualifiers

AID Sample Column Specific Qualifiers

Annotation Details: