Validation Commands

This document describes the validation commands available in fairscape-cli.

Overview

The validate command group checks data files against schemas, so you can confirm that a dataset conforms to its expected structure and constraints.

fairscape-cli validate [COMMAND] [OPTIONS]

Available Commands

  • schema - Validate a dataset against a schema definition

Command Details

schema

Validate a dataset against a schema definition.

fairscape-cli validate schema [OPTIONS]

Options:

  • --schema TEXT - Path to the schema file or ARK identifier [required]
  • --data TEXT - Path to the data file to validate [required]

Example:

fairscape-cli validate schema \
    --schema ./schema_apms_music_embedding.json \
    --data ./APMS_embedding_MUSIC.csv

When validation succeeds, you'll see:

Validation Success

If validation fails, you'll see a table of errors:

+-----+-----------------+----------------+-------------------------------------------------------+
| row |    error_type   | failed_keyword |                        message                        |
+-----+-----------------+----------------+-------------------------------------------------------+
|  3  |   ParsingError  |      None      | ValueError: Failed to Parse Attribute embed for Row 3 |
|  4  |   ParsingError  |      None      | ValueError: Failed to Parse Attribute embed for Row 4 |
|  0  | ValidationError |    pattern     |        'APMS_A' does not match '^APMS_[0-9]*$'        |
+-----+-----------------+----------------+-------------------------------------------------------+
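If you want to run the check from a script, a minimal Python sketch might look like the following. It only wraps the command shown above. The helper names (build_validate_command, validation_succeeded, validate) are hypothetical, and the sketch assumes that the "Validation Success" message appears verbatim in the captured output; it does not rely on any particular exit code.

```python
import subprocess


def build_validate_command(schema, data):
    """Build the argument list for `fairscape-cli validate schema`."""
    return [
        "fairscape-cli", "validate", "schema",
        "--schema", schema,
        "--data", data,
    ]


def validation_succeeded(output):
    """Assumption: success is reported by the text 'Validation Success'."""
    return "Validation Success" in output


def validate(schema, data):
    """Run the CLI (must be installed and on PATH) and report success."""
    result = subprocess.run(
        build_validate_command(schema, data),
        capture_output=True,
        text=True,
    )
    # Check both streams, since the sketch does not assume which one is used.
    return validation_succeeded(result.stdout + result.stderr)
```

Calling validate("./schema_apms_music_embedding.json", "./APMS_embedding_MUSIC.csv") would then return True or False under the assumption stated above.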

Error Types

Errors are categorized into two main types:

  1. ParsingError: Occurs when the data cannot be parsed according to the schema structure. This often happens when:

       • The number of columns doesn't match the schema
       • A value cannot be converted to the expected datatype

  2. ValidationError: Occurs when the data can be parsed but fails validation constraints like:

       • String values not matching the specified pattern
       • Numeric values outside the min/max range
       • Array length not within specified bounds
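
The distinction between the two error types can be illustrated with a small Python sketch. This is not fairscape-cli's implementation; check_cell is a hypothetical helper that first tries to convert a raw value (a failure there corresponds to a ParsingError) and only then applies a pattern constraint (a failure there corresponds to a ValidationError).

```python
import re


def check_cell(raw, convert, pattern=None):
    """Return (error_type, message) for one value, or None if it passes.

    Hypothetical helper for illustration only.
    """
    # Step 1: parsing. The value must convert to the expected datatype.
    try:
        value = convert(raw)
    except (TypeError, ValueError) as exc:
        return ("ParsingError", f"{type(exc).__name__}: {exc}")

    # Step 2: validation. The parsed value must satisfy the constraint.
    if pattern is not None and re.search(pattern, str(value)) is None:
        return ("ValidationError", f"{value!r} does not match {pattern!r}")

    return None


print(check_cell("APMS_12", str, r"^APMS_[0-9]*$"))   # passes
print(check_cell("APMS_A", str, r"^APMS_[0-9]*$"))    # fails the pattern
print(check_cell("not-a-number", float))              # cannot be parsed
```

The second call reproduces the kind of pattern failure shown in the error table, while the third never reaches the constraint check because conversion fails first.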

Working with Different File Types

The validation command automatically detects the file type based on its extension:

  • CSV/TSV files: Tabular validation with field separators
  • Parquet files: Tabular validation with columnar storage
  • HDF5 files: Hierarchical validation with nested structures
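
Extension-based detection of this kind can be sketched in Python as follows. The mapping and the detect_mode function are hypothetical, and the exact set of extensions that fairscape-cli recognizes (for example, whether both .h5 and .hdf5 are accepted) is an assumption here.

```python
from pathlib import Path

# Hypothetical extension table; the real CLI may accept a different set.
VALIDATION_MODES = {
    ".csv": "tabular (comma-separated)",
    ".tsv": "tabular (tab-separated)",
    ".parquet": "tabular (columnar)",
    ".h5": "hierarchical",
    ".hdf5": "hierarchical",
}


def detect_mode(data_path):
    """Pick a validation mode from the file extension (case-insensitive)."""
    suffix = Path(data_path).suffix.lower()
    if suffix not in VALIDATION_MODES:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    return VALIDATION_MODES[suffix]


print(detect_mode("./APMS_embedding_MUSIC.csv"))
```

Because detection depends only on the extension, a file with a misleading or missing extension would not be handled by a scheme like this one.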

Using ARK Identifiers for Schemas

Instead of providing a file path, you can reference a schema by its ARK identifier if it's registered in a FAIRSCAPE repository:

fairscape-cli validate schema \
    --schema "ark:59852/schema-cm4ai-image-embedding-image-emd" \
    --data "examples/schemas/cm4ai-rocrates/image_embedding/image_emd.tsv"
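
One way to tell an ARK identifier apart from a file path is to look at the "ark:" prefix. The Python sketch below illustrates that idea; parse_schema_argument is a hypothetical helper, and it is an assumption that fairscape-cli distinguishes the two cases this way.

```python
def parse_schema_argument(value):
    """Classify a --schema value as an ARK identifier or a file path.

    Hypothetical helper: an ARK has the form ark:NAAN/Name, where NAAN is
    the Name Assigning Authority Number. The older ark:/NAAN/Name form is
    tolerated by stripping the leading slash.
    """
    if not value.startswith("ark:"):
        return {"kind": "path", "value": value}
    naan, _, name = value[len("ark:"):].lstrip("/").partition("/")
    return {"kind": "ark", "naan": naan, "name": name}


print(parse_schema_argument("ark:59852/schema-cm4ai-image-embedding-image-emd"))
print(parse_schema_argument("./schema_apms_music_embedding.json"))
```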