Schema metadata

The Command Line Interface (CLI) offers users more than just the ability to transfer and register dataset objects. It also enables the addition of metadata to describe schemas and perform basic validation of objects. As of this release, the CLI solely supports tabular datasets.

Tabular Dataset¶

To illustrate, let's consider the tabular data frame named APMS_embedding_MUSIC.csv. This particular dataset comprises 1026 columns. The first column, Internal Experiment Identifier, identifies the experiment that generated the source data, while the second column, Gene Symbol, contains the Gene name for the bait protien. The remaining columns, from Embedding0 to Embedding1023, are a 1024 length embedding vector. The original data frame has no headers, but after consulting with a domain expert, headers are added for clarity, and based on these headers, the schema will be described.

Internal Experiment Identifier	Gene Symbol	Embedding0	Embedding1	Embedding2	...	Embedding1023
APMS_1	RRS1	0.07591	0.161315	-0.025731	...	-0.172205
APMS_2	SNRNP70	-0.019872	0.083736	0.151332	...	0.042429
APMS_3	RPL18	0.067353	0.099565	0.308037	...	0.049538
APMS_4	JMJD6	0.087387	-0.17969	0.036929	...	0.068675
APMS_5	NCAPH2	0.007115	0.118820	-0.059649	...	0.119648
APMS_6	BSG	0.143906	-0.034937	-0.141535	...	-0.178751
APMS_7	FAM189B	-0.107395	0.284882	0.065763	...	0.044294
APMS_8	MRPS11	-0.051772	0.045301	0.08211	...	0.079971
APMS_9	TRIM28	-0.17398	0.209120	0.021203	...	-0.092368
APMS_10	LAMP3	0.048065	0.087677	0.000867	...	0.047628

Throughout the rest of the document, we will use this tabular dataset as a guide to walk through the step-by-step process of creating, populating and validating the schema.

Create schema¶

To create a schema for a tabular dataset, the create-tabular command must be invoked, requiring a name, a brief description, a separator character, and an optional boolean value for header to specify the presence of column headers. Once created, the schema will be located in the destination specified by the SCHEMA_FILE.

fairscape-cli schema create-tabular [OPTIONS] SCHEMA_FILE

Options:
  --name TEXT         [required]
  --description TEXT  [required]
  --guid TEXT
  --separator TEXT    [required]
  --header BOOLEAN
  --help              Show this message and exit.

In the schema creation example below, the symbol , (comma) is used as the separator and the header is set to False. The CLI will autogenerate a value for the guid.

fairscape-cli schema create-tabular \
    --name 'APMS Embedding Schema' \
    --description 'Tabular format for APMS music embeddings from PPI networks from the music pipeline from the B2AI Cellmaps for AI project' \
    --separator ',' \
    --header False \
    ./schema_apms_music_embedding.json

Populate schema¶

To populate the schema for a tabular dataset, we describe its syntactic and semantic properties through a series of unique properties, each representing a single column or an array of similar columns. To add a property, we use the fairscape-cli schema add-property command.

The first step in adding a property is to choose the datatype it represents in the column or array of columns. For example, if a column represents a string datatype, we create a string property by using the fairscape-cli schema add-property string command. We can use a similar command for other datatypes as well. The CLI supports five datatypes for a tabular dataset, which are listed in the table below.

Datatype	Description
`string`	Strings of text
`number`	Any numeric type
`integer`	Integral numbers
`array`	Ordered elements
`boolean`	True and False

After choosing the datatype, we must fill in additional information about the column or array of columns it represents. The table headers below display all available options for each datatype. For a string property, this includes a unique name, an integer value for the index (where 0 represents the first column, 1 represents the second, and so on), a human-readable description, a standard vocabulary term for the value-url, and a regular expression for the data pattern in that column. While the first three options are required, the rest are optional.

Datatype	name	index	description	value-url	pattern	items-datatype	min-items	max-items	unique-items
`string`	required	required	required	optional	optional
`number`	required	required	required	optional
`integer`	required	required	required	optional
`array`	required	required	required	optional		required	optional	optional	optional
`boolean`	required	required	required	optional

To view all available options and arguments, including those for the string datatype, we can use the command fairscape-cli schema add-property string --help, which will display a complete list of options.

fairscape-cli schema add-property string [OPTIONS] SCHEMA_FILE

Options:
  --name TEXT         [required]
  --index INTEGER     [required]
  --description TEXT  [required]
  --value-url TEXT
  --pattern TEXT
  --help              Show this message and exit.

Add a String Property¶

Columns index 0 and 1 have string values. Both can be constrained with an optional regex pattern. For our first column we have the experiment identifier, and add this to the schema with the following command.

fairscape-cli schema add-property string \
    --name 'Experiment Identifier' \
    --index 0 \
    --description 'Identifier for the APMS experiment responsible for generating the raw PPI used to create this embedding vector' \
    --pattern '^APMS_[0-9]*$' \
    ./schema_apms_music_embedding.json

For the second column we have Gene Symbols for values, We can choose then to provide the optional flag --value-url to align these values to an ontology. Using the (EDAM ontology of bioscientific data analysis and data management)[], we can specify that these are Gene Symbols. This can be usefull for specifying the Database of a particular Gene Identifier. Which enables linking Identifiers across databases. Any ontology can be used to align data.

fairscape-cli schema add-property string \
    --name 'Gene Symbol' \
    --index 1 \
    --description 'Gene Symbol for the APMS bait protien' \
    --pattern '^[A-Za-z0-9\-]*$' \
    --value-url 'http://edamontology.org/data_1026' \
    ./schema_apms_music_embedding.json

Add an Array Property¶

Instead of registering properties for 1024 individual columns we can add a property for an array of 1024 elements. We can accomplish this with a slice expression for the index. The following slice expressions are supported.

Slice Expression	Description
`i::`	starting at index i to the final index
`::i`	starting at index 0 to index i
`i:j`	starting at index i to index j

We then must specify that the type of the data inside this array is numeric. Items are not contstrained to unique values. And that for every row we expect there to be exactly 1024 elements.

fairscape-cli schema add-property array \
    --name 'MUSIC APMS Embedding' \
    --index '2::' \
    --description 'Embedding Vector values for genes determined by running node2vec on APMS PPI networks. Vector has 1024 values for each bait protien' \
    --items-datatype 'number' \
    --unique-items False \
    --min-items 1024 \
    --max-items 1024 \
    ./schema_apms_music_embedding.json

Generated schema¶

Looking at our schema we should have a json document equivalent to the following

{
    "@context": {
        "@vocab": "https://schema.org/",
        "EVI": "https://w3,org/EVI#"
    },
    "@id": "ark:59852/schema-apms-music-embedding-izNjXSs",
    "@type": "EVI:Schema",
    "name": "APMS Embedding Schema",    
    "description": "Tabular format for APMS music embeddings from PPI networks from the music pipeline from the B2AI Cellmaps for AI project",    
    "properties": {    
    "Experiment Identifier": {    
        "description": "Identifier for the APMS experiment responsible for generating the raw PPI used to create this embedding vector",    
        "index": 0,                                 
        "valueURL": null,    
        "type": "string",    
        "pattern": "^APMS_[0-9]*$" 
    },                                 
    "Gene Symbol": {                                             
        "description": "Gene Symbol for the APMS bait protien",    
        "index": 1,    
        "valueURL": "http://edamontology.org/data_1026",    
        "type": "string",          
        "pattern": "^[A-Za-z0-9\-]*$"    
    },                                                                          
    "MUSIC APMS Embedding": {                                                                
        "description": "Embedding Vector values for genes determined by running node2vec on APMS PPI networks. Vector has 1024 values for each bait protien",    
        "index": "2::",                                                           
        "valueURL": null,    
        "type": "array",    
        "maxItems": 1024,                               
        "minItems": 1024,                                    
        "uniqueItems": false,                                        
        "items": {    
            "type": "number"
            }
        }                                                              
    },                        
    "type": "object",                                   
    "additionalProperties": true,                                               
    "required": ["Experiment Identifier", "Gene Symbol", "MUSIC APMS Embedding"],    
    "seperator": ",",                         
    "header": false,    
    "examples": []    
}

Validate schema¶

With our schema we can execute the validation rules against some example data, and explore how errors are reported. In the github repo, example data is provided to evaluate the same schema we have just created. When validating against data where every row conforms to the schema, a simple success message is displayed.

fairscape-cli schema validate \
    --data ./examples/schemas/MUSIC_embedding/APMS_embedding_MUSIC.csv  \
    --schema ./examples/schemas/MUSIC_embedding/music_apms_embedding_schema.json

Validation Success

However when validating against data that contains issues, a table of errors is printed out. For this purpose we provide some intentionally corrupted data to demonstrate how these errors are reported.

fairscape-cli schema validate \
    --data examples/schemas/MUSIC_embedding/APMS_embedding_corrupted.csv \
    --schema examples/schemas/MUSIC_embedding/music_apms_embedding_schema.json 

+-----+-----------------+----------------+-------------------------------------------------------+
| row |    error_type   | failed_keyword |                        message                        |
+-----+-----------------+----------------+-------------------------------------------------------+
|  3  |   ParsingError  |      None      | ValueError: Failed to Parse Attribute embed for Row 3 |
|  4  |   ParsingError  |      None      | ValueError: Failed to Parse Attribute embed for Row 4 |
|  0  | ValidationError |    pattern     |        'APMS_A' does not match '^APMS_[0-9]*$'        |
|  1  | ValidationError |    pattern     |          ' -8- ' does not match '^[A-Z0-9]*$'         |
|  2  | ValidationError |    pattern     |           '-`~' does not match '^[A-Z0-9]*$'          |
+-----+-----------------+----------------+-------------------------------------------------------+

When errors are found there are two sources of these errors. Parsing errors which occur when attempting convert a row of tabular data into the specified json structure. This can happen when either the number of specified rows is incorrect, or the data for a specific column cannot be coerced to the datatype specified of the schema. When this occurs the row is marked as a failure and reported as a ParsingError. Rows that report a ParsingError are not validated against the jsonschema.

Validation Errors occur when a data element violates the contraints specified by the schema. In our example we show multiple examples of strings that defy the regex specified by the pattern attribute. Other constraints include min and max for numeric and integer properties, length for string, etc. In future work we will expand to cover the entire json schema specification.

Using default schemas¶

For conveineince a collection of default schemas are provided for the Cell Maps for AI pipeline. These schemas have their own repo, and will track the progress of the pipeline as new data modalities are added. These default schemas are packaged and provided as part of the fairscape-cli, and can be implemented using the respective identifier for the schema. Examples for all of the existing default schemas are provided below.

    # validate imageloader files
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/imageloader/samplescopy.csv" \
        --schema "ark:59852/schema-cm4ai-imageloader-samplescopy" 

    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/imageloader/uniquecopy.csv" \
        --schema "ark:59852/schema-cm4ai-imageloader-uniquecopy"

    # validate image embedding outputs
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/image_embedding/image_emd.tsv" \
        --schema "ark:59852/schema-cm4ai-image-embedding-image-emd"

    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/image_embedding/labels_prob.tsv" \
        --schema "ark:59852/schema-cm4ai-image-embedding-labels-prob"

    # validate apsm loader input
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/apmsloader/ppi_gene_node_attributes.tsv" \
        --schema "ark:59852/schema-cm4ai-apmsloader-gene-node-attributes"

    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/apmsloader/ppi_edgelist.tsv" \
        --schema "ark:59852/schema-cm4ai-apmsloader-ppi-edgelist"

    # validate apms embedding 
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/apms_embedding/ppi_emd.tsv" \
        --schema "ark:59852/schema-cm4ai-apms-embedding"    

    # validate coembedding 
    fairscape-cli schema validate \
        --data "examples/schemas/cm4ai-rocrates/coembedding/coembedding_emd.tsv" \
        --schema "ark:59852/schema-cm4ai-coembedding"