
Schema Metadata

The Command Line Interface (CLI) does more than transfer and register dataset objects. It also lets users add metadata that describes schemas and perform basic validation of objects. As of this release, the CLI supports only tabular datasets.

Tabular Dataset

To illustrate, consider the tabular data frame APMS_embedding_MUSIC.csv. This dataset comprises 1026 columns. The first column, Internal Experiment Identifier, identifies the experiment that generated the source data, and the second column, Gene Symbol, contains the gene name of the bait protein. The remaining columns, Embedding0 through Embedding1023, form a 1024-element embedding vector. The original data frame has no headers; the headers shown here were added for clarity after consulting a domain expert, and the schema is described in terms of them.

| Internal Experiment Identifier | Gene Symbol | Embedding0 | Embedding1 | Embedding2 | ... | Embedding1023 |
|---|---|---|---|---|---|---|
| APMS_1 | RRS1 | 0.07591 | 0.161315 | -0.025731 | ... | -0.172205 |
| APMS_2 | SNRNP70 | -0.019872 | 0.083736 | 0.151332 | ... | 0.042429 |
| APMS_3 | RPL18 | 0.067353 | 0.099565 | 0.308037 | ... | 0.049538 |
| APMS_4 | JMJD6 | 0.087387 | -0.17969 | 0.036929 | ... | 0.068675 |
| APMS_5 | NCAPH2 | 0.007115 | 0.118820 | -0.059649 | ... | 0.119648 |
| APMS_6 | BSG | 0.143906 | -0.034937 | -0.141535 | ... | -0.178751 |
| APMS_7 | FAM189B | -0.107395 | 0.284882 | 0.065763 | ... | 0.044294 |
| APMS_8 | MRPS11 | -0.051772 | 0.045301 | 0.08211 | ... | 0.079971 |
| APMS_9 | TRIM28 | -0.17398 | 0.209120 | 0.021203 | ... | -0.092368 |
| APMS_10 | LAMP3 | 0.048065 | 0.087677 | 0.000867 | ... | 0.047628 |
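
Because the source file has no header row, the column names above have to be generated separately. A minimal sketch of that step, assuming the names follow the layout described in the text; the helper `build_header` is hypothetical and is not part of fairscape-cli:

```python
# Hypothetical helper (not part of fairscape-cli): rebuild the header row
# described in the text for the headerless APMS_embedding_MUSIC.csv file.
ID_COLUMNS = ["Internal Experiment Identifier", "Gene Symbol"]
EMBEDDING_LENGTH = 1024


def build_header(embedding_length: int = EMBEDDING_LENGTH) -> list[str]:
    """Return the two identifier columns followed by Embedding0..EmbeddingN-1."""
    return ID_COLUMNS + [f"Embedding{i}" for i in range(embedding_length)]


header = build_header()
print(len(header))            # total number of columns
print(header[:3], header[-1])  # first three names and the last one
```

The two identifier columns plus 1024 embedding columns account for the 1026 columns mentioned above.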

Throughout the rest of the document, we will use this tabular dataset as a guide to walk through the step-by-step process of creating, populating and validating the schema.

Create schema

To create a schema for a tabular dataset, invoke the create-tabular command with a name, a brief description, a separator character, and an optional boolean header value that indicates whether the data has column headers. The schema is written to the path given by SCHEMA_FILE.

fairscape-cli schema create-tabular [OPTIONS] SCHEMA_FILE

Options:
  --name TEXT         [required]
  --description TEXT  [required]
  --guid TEXT
  --separator TEXT    [required]
  --header BOOLEAN
  --help              Show this message and exit.

In the schema creation example below, the symbol , (comma) is used as the separator and the header is set to False. The CLI will autogenerate a value for the guid.

fairscape-cli schema create-tabular \
    --name 'APMS Embedding Schema' \
    --description 'Tabular format for APMS music embeddings from PPI networks from the music pipeline from the B2AI Cellmaps for AI project' \
    --separator ',' \
    --header False \
    ./schema_apms_music_embedding.json

Populate schema

To populate the schema for a tabular dataset, we describe its syntactic and semantic properties through a series of unique properties, each representing a single column or an array of similar columns. To add a property, we use the fairscape-cli schema add-property command.

The first step in adding a property is to choose the datatype it represents in the column or array of columns. For example, if a column represents a string datatype, we create a string property by using the fairscape-cli schema add-property string command. We can use a similar command for other datatypes as well. The CLI supports five datatypes for a tabular dataset, which are listed in the table below.

| Datatype | Description |
|---|---|
| string | Strings of text |
| number | Any numeric type |
| integer | Integral numbers |
| array | Ordered elements |
| boolean | True and False |

After choosing the datatype, we must fill in additional information about the column or array of columns it represents. The table headers below display all available options for each datatype. For a string property, this includes a unique name, an integer value for the index (where 0 represents the first column, 1 represents the second, and so on), a human-readable description, a standard vocabulary term for the value-url, and a regular expression for the data pattern in that column. While the first three options are required, the rest are optional.

| Datatype | name | index | description | value-url | pattern | items-datatype | min-items | max-items | unique-items |
|---|---|---|---|---|---|---|---|---|---|
| string | required | required | required | optional | optional | | | | |
| number | required | required | required | optional | | | | | |
| integer | required | required | required | optional | | | | | |
| array | required | required | required | optional | | required | optional | optional | optional |
| boolean | required | required | required | optional | | | | | |

To view all available options and arguments, including those for the string datatype, we can use the command fairscape-cli schema add-property string --help, which will display a complete list of options.

fairscape-cli schema add-property string [OPTIONS] SCHEMA_FILE

Options:
  --name TEXT         [required]
  --index INTEGER     [required]
  --description TEXT  [required]
  --value-url TEXT
  --pattern TEXT
  --help              Show this message and exit.

Add a String Property

The columns at index 0 and 1 hold string values, and both can be constrained with an optional regex pattern. The first column holds the experiment identifier, which we add to the schema with the following command.

fairscape-cli schema add-property string \
    --name 'Experiment Identifier' \
    --index 0 \
    --description 'Identifier for the APMS experiment responsible for generating the raw PPI used to create this embedding vector' \
    --pattern '^APMS_[0-9]*$' \
    ./schema_apms_music_embedding.json

The second column holds Gene Symbols. Here we can provide the optional flag --value-url to align these values to an ontology. Using the EDAM ontology of bioscientific data analysis and data management, we can specify that these values are Gene Symbols. This is useful for stating which database a particular gene identifier comes from, which in turn enables linking identifiers across databases. Any ontology can be used to align data.

fairscape-cli schema add-property string \
    --name 'Gene Symbol' \
    --index 1 \
    --description 'Gene Symbol for the APMS bait protien' \
    --pattern '^[A-Za-z0-9\-]*$' \
    --value-url 'http://edamontology.org/data_1026' \
    ./schema_apms_music_embedding.json
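
Before committing a pattern to the schema, it can help to try it against a few values. The sketch below uses Python's `re` module with the two patterns from the commands above; the sample values come from this document's data table and error report. Treat it as an approximation only, since the CLI's own regex engine may differ in detail.

```python
import re

# Patterns copied from the two add-property commands above.
EXPERIMENT_PATTERN = r"^APMS_[0-9]*$"
GENE_SYMBOL_PATTERN = r"^[A-Za-z0-9\-]*$"


def matches(pattern: str, value: str) -> bool:
    """Return True when the value satisfies the pattern."""
    return re.search(pattern, value) is not None


experiment_checks = {
    value: matches(EXPERIMENT_PATTERN, value)
    for value in ["APMS_1", "APMS_10", "APMS_A", "APMS_"]
}
gene_checks = {
    value: matches(GENE_SYMBOL_PATTERN, value)
    for value in ["RRS1", "SNRNP70", " -8- "]
}

print(experiment_checks)
print(gene_checks)
```

Note that the `*` quantifier also accepts `APMS_` with no digits at all. If at least one digit should be required, `+` would be the usual choice.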

Add an Array Property

Instead of registering properties for 1024 individual columns we can add a property for an array of 1024 elements. We can accomplish this with a slice expression for the index. The following slice expressions are supported.

| Slice Expression | Description |
|---|---|
| i:: | starting at index i to the final index |
| ::i | starting at index 0 to index i |
| i:j | starting at index i to index j |
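
To make the three forms concrete, here is a minimal sketch that resolves a slice expression into column indexes. The function `resolve_slice` is hypothetical and is not the CLI's implementation. It also assumes the end index is exclusive, as in Python ranges; the table does not say whether the CLI treats it as inclusive.

```python
def resolve_slice(expression: str, n_columns: int) -> range:
    """Resolve 'i::', '::i' or 'i:j' into a range of column indexes.

    Assumption: the end index is exclusive, as in Python ranges.
    """
    if expression.endswith("::"):
        # 'i::' runs from index i to the final column.
        return range(int(expression[:-2]), n_columns)
    if expression.startswith("::"):
        # '::i' runs from index 0 up to index i.
        return range(0, int(expression[2:]))
    # 'i:j' runs from index i up to index j.
    start_text, stop_text = expression.split(":")
    return range(int(start_text), int(stop_text))


embedding_columns = resolve_slice("2::", n_columns=1026)
print(len(embedding_columns))  # number of columns covered by '2::'
```

For the 1026-column example file, `2::` covers every column after the two identifier columns.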

We must then specify that the items in this array are numeric, that the items are not constrained to unique values, and that every row is expected to contain exactly 1024 elements.

fairscape-cli schema add-property array \
    --name 'MUSIC APMS Embedding' \
    --index '2::' \
    --description 'Embedding Vector values for genes determined by running node2vec on APMS PPI networks. Vector has 1024 values for each bait protien' \
    --items-datatype 'number' \
    --unique-items False \
    --min-items 1024 \
    --max-items 1024 \
    ./schema_apms_music_embedding.json
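
The array options map onto three simple checks on each row. The following sketch illustrates them on a shortened three-element row taken from the example data. The function `check_array` is hypothetical, and the real schema would use 1024 for both bounds.

```python
def check_array(values, min_items, max_items, unique_items):
    """Return a list of problems found for one array value."""
    problems = []
    if len(values) < min_items:
        problems.append(f"expected at least {min_items} items, found {len(values)}")
    if len(values) > max_items:
        problems.append(f"expected at most {max_items} items, found {len(values)}")
    if unique_items and len(set(values)) != len(values):
        problems.append("items are not unique")
    return problems


# First data row, truncated to three embedding values for illustration.
row = ["APMS_1", "RRS1", "0.07591", "0.161315", "-0.025731"]
embedding = [float(cell) for cell in row[2:]]  # items-datatype 'number'

print(check_array(embedding, min_items=3, max_items=3, unique_items=False))
print(check_array(embedding, min_items=1024, max_items=1024, unique_items=False))
```

Setting min-items and max-items to the same value is what enforces the "exactly 1024 elements" requirement.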

Generated schema

At this point the schema file should contain a JSON document equivalent to the following:

{
    "@context": {
        "@vocab": "https://schema.org/",
        "EVI": "https://w3.org/EVI#"
    },
    "@id": "ark:59852/schema-apms-music-embedding-izNjXSs",
    "@type": "EVI:Schema",
    "name": "APMS Embedding Schema",
    "description": "Tabular format for APMS music embeddings from PPI networks from the music pipeline from the B2AI Cellmaps for AI project",
    "properties": {
        "Experiment Identifier": {
            "description": "Identifier for the APMS experiment responsible for generating the raw PPI used to create this embedding vector",
            "index": 0,
            "valueURL": null,
            "type": "string",
            "pattern": "^APMS_[0-9]*$"
        },
        "Gene Symbol": {
            "description": "Gene Symbol for the APMS bait protien",
            "index": 1,
            "valueURL": "http://edamontology.org/data_1026",
            "type": "string",
            "pattern": "^[A-Za-z0-9\\-]*$"
        },
        "MUSIC APMS Embedding": {
            "description": "Embedding Vector values for genes determined by running node2vec on APMS PPI networks. Vector has 1024 values for each bait protien",
            "index": "2::",
            "valueURL": null,
            "type": "array",
            "maxItems": 1024,
            "minItems": 1024,
            "uniqueItems": false,
            "items": {
                "type": "number"
            }
        }
    },
    "type": "object",
    "additionalProperties": true,
    "required": ["Experiment Identifier", "Gene Symbol", "MUSIC APMS Embedding"],
    "seperator": ",",
    "header": false,
    "examples": []
}

Validate schema

With our schema we can run the validation rules against example data and explore how errors are reported. The GitHub repository provides example data for evaluating the schema we have just created. When every row of the data conforms to the schema, a simple success message is displayed.

fairscape-cli schema validate \
    --data ./examples/schemas/MUSIC_embedding/APMS_embedding_MUSIC.csv  \
    --schema ./examples/schemas/MUSIC_embedding/music_apms_embedding_schema.json

Validation Success

However, when the data contains issues, a table of errors is printed. To demonstrate how these errors are reported, we provide some intentionally corrupted data.

fairscape-cli schema validate \
    --data examples/schemas/MUSIC_embedding/APMS_embedding_corrupted.csv \
    --schema examples/schemas/MUSIC_embedding/music_apms_embedding_schema.json 

+-----+-----------------+----------------+-------------------------------------------------------+
| row |    error_type   | failed_keyword |                        message                        |
+-----+-----------------+----------------+-------------------------------------------------------+
|  3  |   ParsingError  |      None      | ValueError: Failed to Parse Attribute embed for Row 3 |
|  4  |   ParsingError  |      None      | ValueError: Failed to Parse Attribute embed for Row 4 |
|  0  | ValidationError |    pattern     |        'APMS_A' does not match '^APMS_[0-9]*$'        |
|  1  | ValidationError |    pattern     |          ' -8- ' does not match '^[A-Z0-9]*$'         |
|  2  | ValidationError |    pattern     |           '-`~' does not match '^[A-Z0-9]*$'          |
+-----+-----------------+----------------+-------------------------------------------------------+

Errors come from two sources. Parsing errors occur when the CLI attempts to convert a row of tabular data into the JSON structure specified by the schema. This can happen when the row does not contain the expected number of columns, or when the data in a column cannot be coerced to the datatype specified by the schema. When this occurs, the row is marked as a failure and reported as a ParsingError. Rows that report a ParsingError are not validated against the JSON Schema.

Validation errors occur when a data element violates the constraints specified by the schema. Our example shows several strings that do not match the regular expression given by the pattern attribute. Other constraints include min and max for number and integer properties, length for string properties, and so on. Future work will extend coverage to the entire JSON Schema specification.
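
The two error sources described above can be pictured as a two-stage check per row: parse first, and validate only if parsing succeeds. The sketch below is an illustration under assumptions, not the CLI's implementation. The function `check_row`, the shortened two-element embedding, and the sample rows are all hypothetical.

```python
import re

# Patterns copied from the two string properties; keys are column indexes.
PATTERNS = {0: r"^APMS_[0-9]*$", 1: r"^[A-Za-z0-9\-]*$"}
EMBEDDING_START = 2  # the array property starts at column index 2


def check_row(row_number, cells, embedding_length):
    """Return (row, error_type, failed_keyword, message) tuples for one row."""
    # Stage 1: parsing. A wrong column count or a non-numeric embedding
    # value stops processing of this row.
    expected = EMBEDDING_START + embedding_length
    try:
        if len(cells) != expected:
            raise ValueError(f"expected {expected} columns, found {len(cells)}")
        for cell in cells[EMBEDDING_START:]:
            float(cell)
    except ValueError as error:
        return [(row_number, "ParsingError", None, str(error))]

    # Stage 2: validation. Only rows that parsed are checked against patterns.
    errors = []
    for index, pattern in PATTERNS.items():
        if re.search(pattern, cells[index]) is None:
            message = f"{cells[index]!r} does not match {pattern!r}"
            errors.append((row_number, "ValidationError", "pattern", message))
    return errors


rows = [
    ["APMS_1", "RRS1", "0.07591", "0.161315"],   # conforms
    ["APMS_A", "RRS1", "0.07591", "0.161315"],   # identifier breaks the pattern
    ["APMS_3", "RPL18", "0.067353", "n/a"],      # value cannot be coerced
    ["APMS_4", "JMJD6", "0.087387"],             # one column is missing
]
report = [
    error
    for number, cells in enumerate(rows)
    for error in check_row(number, cells, embedding_length=2)
]
for entry in report:
    print(entry)
```

Returning early from the parsing stage mirrors the rule that rows with a ParsingError are not validated further.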

Using default schemas

For convenience, a collection of default schemas is provided for the Cell Maps for AI pipeline. These schemas have their own repository and will track the progress of the pipeline as new data modalities are added. The default schemas are packaged with fairscape-cli and can be used by passing the schema's identifier in place of a file path. Examples for all of the existing default schemas are provided below.

    # validate imageloader files
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/imageloader/samplescopy.csv" \
        --schema "ark:59852/schema-cm4ai-imageloader-samplescopy" 

    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/imageloader/uniquecopy.csv" \
        --schema "ark:59852/schema-cm4ai-imageloader-uniquecopy"

    # validate image embedding outputs
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/image_embedding/image_emd.tsv" \
        --schema "ark:59852/schema-cm4ai-image-embedding-image-emd"

    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/image_embedding/labels_prob.tsv" \
        --schema "ark:59852/schema-cm4ai-image-embedding-labels-prob"

    # validate apms loader input
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/apmsloader/ppi_gene_node_attributes.tsv" \
        --schema "ark:59852/schema-cm4ai-apmsloader-gene-node-attributes"

    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/apmsloader/ppi_edgelist.tsv" \
        --schema "ark:59852/schema-cm4ai-apmsloader-ppi-edgelist"

    # validate apms embedding 
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/apms_embedding/ppi_emd.tsv" \
        --schema "ark:59852/schema-cm4ai-apms-embedding"    

    # validate coembedding 
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/coembedding/coembedding_emd.tsv" \
        --schema "ark:59852/schema-cm4ai-coembedding"
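
When many files need checking, the commands above can be generated from a list of data and schema pairs. The sketch below only builds and prints the command lines, using the `--data` and `--schema` flags shown above. The helper `build_validate_command` is hypothetical, and the list holds just a subset of the pairs from this section.

```python
import shlex

# A subset of the (data file, default schema identifier) pairs listed above.
DEFAULT_SCHEMA_CHECKS = [
    ("examples/schemas/cm4ai-rocrates/imageloader/samplescopy.csv",
     "ark:59852/schema-cm4ai-imageloader-samplescopy"),
    ("examples/schemas/cm4ai-rocrates/apms_embedding/ppi_emd.tsv",
     "ark:59852/schema-cm4ai-apms-embedding"),
    ("examples/schemas/cm4ai-rocrates/coembedding/coembedding_emd.tsv",
     "ark:59852/schema-cm4ai-coembedding"),
]


def build_validate_command(data_path: str, schema_ref: str) -> list[str]:
    """Build the argument list for one 'schema validate' invocation."""
    return ["fairscape-cli", "schema", "validate",
            "--data", data_path, "--schema", schema_ref]


commands = [build_validate_command(data, schema)
            for data, schema in DEFAULT_SCHEMA_CHECKS]
for command in commands:
    # Printed for inspection only; nothing is executed here.
    print(shlex.join(command))
```

Passing each list to `subprocess.run(command, check=True)` would execute it, assuming fairscape-cli is installed and on the PATH.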