lamindb.Schema¶

Bases: SQLRecord, CanCurate, TracksRun

Schemas of a dataset such as the set of columns of a DataFrame.

Composite schemas can have multiple slots, e.g., for an AnnData, one schema for slot obs and another one for var.

Parameters:

features – list[SQLRecord] | list[tuple[Feature, dict]] | None = None Feature records, e.g., [Feature(...), Feature(...)] or Features with their config, e.g., [Feature(...).with_config(optional=True)].
index – Feature | None = None A Feature record to validate an index of a DataFrame and therefore also, e.g., AnnData obs and var indices.
slots – dict[str, Schema] | None = None A dictionary mapping slot names to Schema objects.
name – str | None = None Name of the Schema.
description – str | None = None Description of the Schema.
flexible – bool | None = None Whether to include any feature of the same itype in validation and annotation. If no Features are passed, defaults to True, otherwise to False. This means that if you explicitly pass Features, any additional Features will be disregarded during validation & annotation.
type – Schema | None = None Type of Schema to group measurements by. Define types like ln.Schema(name="ProteinPanel", is_type=True).
is_type – bool = False Whether the Schema is a Type.
itype – str | None = None The feature identifier type (e.g. Feature, Gene, …).
otype – str | None = None An object type to define the structure of a composite schema (e.g., DataFrame, AnnData).
dtype – str | None = None The simple type (e.g., “num”, “float”, “int”). Defaults to None for sets of Feature records and to "num" (e.g., for sets of Gene) otherwise.
minimal_set – bool = True Whether all passed Features are required by default. See optionals for more fine-grained control.
maximal_set – bool = False Whether additional Features are allowed.
ordered_set – bool = False Whether Features are required to be ordered.
coerce_dtype – bool = False When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.

See also

from_df(): Validate & annotate a DataFrame with a schema.
from_anndata(): Validate & annotate an AnnData with a schema.
from_mudata(): Validate & annotate a MuData with a schema.
from_spatialdata(): Validate & annotate a SpatialData with a schema.

Examples

The typical way to create a schema:

import lamindb as ln
import bionty as bt
import pandas as pd

# a schema with a single required feature
schema = ln.Schema(
    features=[
        ln.Feature(name="required_feature", dtype=str).save(),
    ],
).save()

# a schema that constrains feature identifiers to be valid Ensembl gene ids or valid feature names
schema = ln.Schema(itype=bt.Gene.ensembl_gene_id)
schema = ln.Schema(itype=ln.Feature)  # is equivalent to itype=ln.Feature.name

# a schema that requires a single feature but also validates & annotates any additional features with valid feature names
schema = ln.Schema(
    features=[
        ln.Feature(name="required_feature", dtype=str).save(),
    ],
    itype=ln.Feature,
    flexible=True,
).save()

Passing options to the Schema constructor:

# also validate the index
schema = ln.Schema(
    features=[
        ln.Feature(name="required_feature", dtype=str).save(),
    ],
    index=ln.Feature(name="sample", dtype=ln.ULabel).save(),
).save()

# mark a single feature as optional and ignore other features of the same identifier type
schema = ln.Schema(
    features=[
        ln.Feature(name="required_feature", dtype=str).save(),
        ln.Feature(name="feature2", dtype=int).save().with_config(optional=True),
    ],
).save()

Alternative constructors (from_values(), from_df()):

# parse & validate identifier values (assumes `adata` is an existing AnnData object)
schema = ln.Schema.from_values(
    adata.var["ensemble_id"],
    field=bt.Gene.ensembl_gene_id,
    organism="mouse",
).save()

# from a dataframe
df = pd.DataFrame({"feat1": [1, 2], "feat2": [3.1, 4.2], "feat3": ["cond1", "cond2"]})
schema = ln.Schema.from_df(df)

Attributes¶

property coerce_dtype: bool¶

Whether dtypes should be coerced during validation.

For example, an object-dtyped pandas column can be coerced to categorical and would then pass validation if this is True.

property flexible: bool¶

Indicates how to handle validation and annotation in case features are not defined.

Examples

Make a rigid schema flexible:

schema = ln.Schema.get(name="my_schema")
schema.flexible = True
schema.save()

During schema creation:

# if you're not passing features but just defining the itype, defaults to flexible = True
schema = ln.Schema(itype=ln.Feature).save()
assert schema.flexible

# if you're passing features, defaults to flexible = False
schema = ln.Schema(
    features=[ln.Feature(name="my_required_feature", dtype=int).save()],
)
assert not schema.flexible

# you can also validate & annotate features in addition to those that you're explicitly defining:
schema = ln.Schema(
    features=[ln.Feature(name="my_required_feature", dtype=int).save()],
    flexible=True,
)
assert schema.flexible
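
The default described above can be modeled in plain Python. This is a minimal sketch of the documented behavior, not lamindb's implementation: `resolve_flexible`, `columns_to_annotate`, and the in-memory `registry` of feature names are hypothetical names introduced for illustration only.

```python
def resolve_flexible(features, flexible=None):
    """Mirror the documented default: True without features, False with them."""
    if flexible is None:
        return not features
    return flexible


def columns_to_annotate(columns, features, registry, flexible=None):
    """Return the dataset columns that take part in validation & annotation."""
    explicit = set(features or [])
    if resolve_flexible(features, flexible):
        # flexible: any column that is a known feature of the itype is included
        return [c for c in columns if c in explicit or c in registry]
    # rigid: only the explicitly passed features are considered
    return [c for c in columns if c in explicit]


# hypothetical set of registered feature names (stands in for the Feature registry)
registry = {"sample_id", "cell_type", "donor"}
columns = ["sample_id", "cell_type", "unregistered_column"]

rigid = columns_to_annotate(columns, ["sample_id"], registry)
flexible = columns_to_annotate(columns, ["sample_id"], registry, flexible=True)
itype_only = columns_to_annotate(columns, None, registry)
```

In this model, the rigid schema only considers `sample_id`, while the flexible variants also pick up `cell_type` because it is a registered feature name.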

property index: None | Feature¶

The feature configured to act as index.

To unset it, set schema.index to None.

property members: QuerySet¶

A queryset for the individual records in the feature set underlying the schema.

Unlike schema.features, schema.genes, schema.proteins, etc., this queryset is ordered and doesn’t require knowledge of the entity.

property optionals: SchemaOptionals¶

Manage optional features.

Example

# a schema with optional "sample_name"
schema_optional_sample_name = ln.Schema(
    features=[
        ln.Feature(name="sample_id", dtype=str).save(),  # required
        ln.Feature(name="sample_name", dtype=str).save().with_config(optional=True),  # optional
    ],
).save()

# raises a ValidationError because the required `sample_id` is missing
ln.curators.DataFrameCurator(
    pd.DataFrame(
        {
            "sample_name": ["Sample 1", "Sample 2"],
        }
    ),
    schema=schema_optional_sample_name,
).validate()

# passes even though the optional `sample_name` is missing
ln.curators.DataFrameCurator(
    pd.DataFrame(
        {
            "sample_id": ["sample1", "sample2"],
        }
    ),
    schema=schema_optional_sample_name,
).validate()

property slots: dict[str, Schema]¶

Slots.

Examples

# define composite schema
anndata_schema = ln.Schema(
    name="small_dataset1_anndata_schema",
    otype="AnnData",
    slots={"obs": obs_schema, "var": var_schema},
).save()

# access slots
anndata_schema.slots
# {'obs': <Schema: obs_schema>, 'var': <Schema: var_schema>}

Simple fields¶

uid: str¶: A universal id.

name: str | None¶: A name.

description: str | None¶: A description.

n: int¶: Number of features in the schema.

is_type: bool¶: Distinguish types from instances of the type.

itype: str | None¶

A registry that stores feature identifier types used in this schema, e.g., 'Feature' or 'bionty.Gene'.

Depending on itype, .members stores, e.g., Feature or bionty.Gene records.

otype: str | None¶: Default Python object type, e.g., DataFrame, AnnData.

dtype: str | None¶

Data type, e.g., “num”, “float”, “int”. Is None for Feature.

For Feature, types are expected to be heterogeneous and defined on a per-feature level.

hash: str | None¶

A hash of the set of feature identifiers.

For a composite schema, the hash of hashes.
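
To illustrate what a hash over a set of identifiers and a "hash of hashes" can look like, here is a minimal sketch. The choice of SHA-256, the truncation to 22 characters, and the inclusion of slot names are assumptions made for this example and are not taken from lamindb's implementation; `hash_feature_set` and `hash_composite` are hypothetical helpers.

```python
import hashlib


def hash_feature_set(identifiers):
    """Hash a set of identifiers so that order and duplicates don't matter."""
    canonical = "\n".join(sorted(set(identifiers)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


def hash_composite(slot_hashes):
    """Hash of hashes: combine the per-slot hashes of a composite schema."""
    canonical = "\n".join(f"{slot}:{h}" for slot, h in sorted(slot_hashes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


obs_hash = hash_feature_set(["cell_type", "donor"])
var_hash = hash_feature_set(["ENSG00000139618", "ENSG00000198786"])
composite_hash = hash_composite({"obs": obs_hash, "var": var_hash})
```

Sorting before hashing is what makes the result depend only on set membership, so two schemas with the same identifiers in a different order produce the same hash in this sketch.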

minimal_set: bool¶

Whether all passed features are to be considered required by default (default True).

Note that features that are explicitly marked as optional via feature.with_config(optional=True) are not required even if this minimal_set is true.

ordered_set: bool¶: Whether features are required to be ordered (default False).

maximal_set: bool¶

Whether all features present in the dataset must be in the schema (default False).

If False, additional features are allowed to be present in the dataset.

If True, no additional features are allowed to be present in the dataset.
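
The interplay of minimal_set, maximal_set, ordered_set, and optional features can be sketched as plain set logic. This is an illustrative model under stated assumptions, not lamindb's validation code: `check_columns` is a hypothetical helper, and the interpretation of ordered_set (present features must appear in schema order) is an assumption.

```python
def check_columns(columns, features, optional=(), *,
                  minimal_set=True, maximal_set=False, ordered_set=False):
    """Return a list of problems; an empty list means the columns pass."""
    problems = []
    present = set(columns)
    if minimal_set:
        # every feature that is not marked optional must be present
        missing = [f for f in features if f not in present and f not in optional]
        if missing:
            problems.append(f"missing required: {missing}")
    if maximal_set:
        # no column outside the schema may be present
        extra = [c for c in columns if c not in features]
        if extra:
            problems.append(f"unexpected: {extra}")
    if ordered_set:
        # assumption: schema features that are present must keep schema order
        expected = [f for f in features if f in present]
        actual = [c for c in columns if c in features]
        if expected != actual:
            problems.append(f"wrong order: {actual}")
    return problems


features = ["sample_id", "sample_name"]
optional = {"sample_name"}

default_result = check_columns(["sample_id", "extra"], features, optional)
maximal_result = check_columns(["sample_id", "extra"], features, optional, maximal_set=True)
```

With the defaults, the additional column `extra` is tolerated; with `maximal_set=True` it is reported as a problem.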

slot: str | None¶

created_at: datetime¶: Time of creation of record.

Relational fields¶

branch: Branch¶: Whether record is on a branch or in another “special state”.

space: Space¶: The space in which the record lives.

created_by: User¶: Creator of record.

run: Run | None¶: Run that created record.

type: Schema | None¶

Type of schema.

Allows grouping schemas by type, e.g., all measurements evaluating gene expression vs. protein expression vs. multimodal.

You can define types via ln.Schema(name="ProteinPanel", is_type=True).

Here are a few more examples for type names: 'ExpressionPanel', 'ProteinPanel', 'Multimodal', 'Metadata', 'Embedding'.

components: Schema¶: Components of this schema.

features: Feature¶: The features contained in the schema.

schemas¶

composites: Schema¶

The composite schemas that contain this schema as a component.

For example, an AnnData composes multiple schemas: var[DataFrameT], obs[DataFrame], obsm[Array], uns[dict], etc.

validated_artifacts: Artifact¶: The artifacts that were validated against this schema with a Curator.

artifacts: Artifact¶: The artifacts that measure a feature set that matches this schema.

sheets¶

projects: Project¶: Linked projects.

Class methods¶

classmethod df(include=None, features=False, limit=100)¶

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use the arguments include or features to include other data.

Parameters:

include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

Include the name of the creator in the DataFrame:

>>> ln.ULabel.df(include="created_by__name")

Include display of features for Artifact:

>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations

Only include select features:

>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])

classmethod filter(*queries, **expressions)¶

Query records.

Parameters:

queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()

classmethod from_df(df, field=FieldAttr(Feature.name), name=None, mute=False, organism=None, source=None)¶

Create schema for valid columns.

Return type:

Schema | None

classmethod from_values(values, field=FieldAttr(Feature.name), type=None, name=None, mute=False, organism=None, source=None, raise_validation_error=True)¶

Create feature set for validated features.

Parameters:

values (list[str] | Series | array) – A list of values, like feature names or ids.
field (DeferredAttribute, default: FieldAttr(Feature.name)) – The field of a reference registry to map values.
type (str | None, default: None) – The simple type. Defaults to None if reference registry is Feature, defaults to "float" otherwise.
name (str | None, default: None) – A name.
organism (SQLRecord | str | None, default: None) – An organism to resolve gene mapping.
source (SQLRecord | None, default: None) – A public ontology to resolve feature identifier mapping.
raise_validation_error (bool, default: True) – Whether to raise a validation error if some values are not valid.

Raises:

ValidationError – If some values are not valid.

Return type:

Schema

Example

import lamindb as ln
import bionty as bt

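# save two features, then build a schema from their names
feature_names = ["feat11", "feat21"]
for feature_name in feature_names:
    ln.Feature(name=feature_name, dtype="str").save()
schema = ln.Schema.from_values(feature_names)

# build a schema from Ensembl gene ids
genes = ["ENSG00000139618", "ENSG00000198786"]
schema = ln.Schema.from_values(genes, field=bt.Gene.ensembl_gene_id, type="float")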

classmethod get(idlike=None, **expressions)¶

Get a single record.

Parameters:

idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.

Raises:

lamindb.errors.DoesNotExist – In case no matching record is found.

Return type:

SQLRecord

See also

Guide: Query & search registries
Django documentation: Queries

Examples

ulabel = ln.ULabel.get("FvtpPJLJ")
ulabel = ln.ULabel.get(name="my-label")

classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, from_source=True, strict_source=False)¶

Inspect if values are mappable to a field.

Being mappable means that an exact match exists.

Parameters:

values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontology's names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation includes all records in the registry, ignoring the specified source. If True, validation only includes records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against public sources.

Return type:

InspectResult

See also

validate()

Example:

import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# inspect gene symbols
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol, organism="human")
assert result.validated == ["A1CF", "A1BG"]
assert result.non_validated == ["FANCD1", "FANCD20"]

classmethod lookup(field=None, return_field=None)¶

Return an auto-complete object for a field.

Parameters:

field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
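
How an auto-complete object with a dictionary converter can work is sketched below in plain Python. This is not lamindb's implementation: `make_lookup`, the dict-based records, and the `uid` values are hypothetical, and the name normalization (lowercasing, replacing non-word characters with underscores) is an assumption inferred from the `lookup.adgb_dt` example above.

```python
import re
from collections import namedtuple


def make_lookup(records, field):
    """Build an auto-complete object from a list of dict records."""
    by_value = {rec[field]: rec for rec in records}
    # turn values into valid attribute names, e.g. "ADGB-DT" -> "adgb_dt"
    attrs = [re.sub(r"\W+", "_", value).lower() for value in by_value]
    base = namedtuple("Lookup", attrs)

    class Lookup(base):
        def dict(self):
            """Return a mapping from the original field values to records."""
            return dict(by_value)

    return Lookup(*by_value.values())


# hypothetical in-memory records standing in for a registry
records = [
    {"symbol": "ADGB-DT", "uid": "hypothetical-uid-1"},
    {"symbol": "BRCA2", "uid": "hypothetical-uid-2"},
]

lookup = make_lookup(records, field="symbol")
record = lookup.adgb_dt                    # attribute access with a normalized name
same_record = lookup.dict()["ADGB-DT"]     # dictionary access with the original value
```

The attribute form is convenient for tab completion, while the dictionary form keeps the original, un-normalized values as keys.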

classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶

Search.

Parameters:

string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of the top matching records.

See also

filter()
lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")

classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, source_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶

Maps input synonyms to standardized names.

Parameters:

values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
source_aware (bool, default: True) – Whether to standardize from public source. Defaults to True for BioRecord registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which of the duplicates to keep, following the convention of pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names

When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.

When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation includes all records in the registry, ignoring the specified source. If True, validation only includes records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against public sources.

Return type:

list[str] | dict[str, str]

Returns:

If return_mapper is False – a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.

See also

add_synonym(): Add synonyms.
remove_synonym(): Remove synonyms.

Example:

import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# standardize gene synonyms
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.standardize(gene_synonyms)
#> ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
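
The mapping from synonyms to standardized names can be sketched without a database. This is a simplified model, not the library's algorithm: the `standardize` function below and its in-memory `records` dictionary are hypothetical, it assumes synonyms are concatenated with `|`, and it ignores keep, source, and organism handling.

```python
def standardize(values, records, *, return_mapper=False, case_sensitive=False):
    """Map synonyms to standardized names; unknown values pass through."""
    def norm(text):
        return text if case_sensitive else text.lower()

    canonical = {norm(name): name for name in records}
    synonym_to_name = {}
    for name, synonyms in records.items():
        for synonym in synonyms.split("|"):
            if synonym:
                # keep the first mapping when a synonym occurs more than once
                synonym_to_name.setdefault(norm(synonym), name)

    mapper = {}
    for value in values:
        key = norm(value)
        target = canonical.get(key) or synonym_to_name.get(key)
        if target is not None and target != value:
            mapper[value] = target
    if return_mapper:
        return mapper
    return [mapper.get(value, value) for value in values]


# hypothetical registry: standardized name -> pipe-delimited synonyms
records = {"A1CF": "", "A1BG": "", "BRCA2": "FANCD1"}
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]

standardized = standardize(gene_synonyms, records)
mapper = standardize(gene_synonyms, records, return_mapper=True)
```

In this sketch, `FANCD1` resolves to `BRCA2`, while `FANCD20` has no match and is returned unchanged.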

classmethod using(instance)¶

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶

Validate values against existing values of a string field.

Note that this is strict validation: only exact matches are considered valid.

Parameters:

values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontology's names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation includes all records in the registry, ignoring the specified source. If True, validation only includes records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against public sources.

Return type:

ndarray

Returns:

A vector of booleans indicating if an element is validated.

See also

inspect()

Example:

import bionty as bt

bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.validate(gene_symbols, field=bt.Gene.symbol, organism="human")
#> array([ True,  True, False, False])
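
Exact-match validation and the corresponding split into validated and non-validated values can be modeled as follows. This is a minimal sketch in plain Python that stands in for a registry lookup; the functions and the `registered_symbols` list are hypothetical and only mirror the documented behavior of validate() and inspect().

```python
def validate(values, registered):
    """Return one boolean per value: True only for an exact match."""
    known = set(registered)
    return [value in known for value in values]


def inspect(values, registered):
    """Partition values into validated and non-validated lists."""
    flags = validate(values, registered)
    validated = [v for v, ok in zip(values, flags) if ok]
    non_validated = [v for v, ok in zip(values, flags) if not ok]
    return validated, non_validated


# stands in for the gene symbols saved in the registry
registered_symbols = ["A1CF", "A1BG", "BRCA2"]
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]

flags = validate(gene_symbols, registered_symbols)
validated, non_validated = inspect(gene_symbols, registered_symbols)
```

Because matching is exact, the synonym `FANCD1` is not validated here even though it refers to a registered gene; mapping synonyms is the job of standardize().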

Methods¶

add_synonym(synonym, force=False, save=None)¶

Add synonyms to a record.

Parameters:

synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.

See also

remove_synonym(): Remove synonyms.

Example:

import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# add a synonym
record.add_synonym("T cells")
record.synonyms
#> "T cells|T-cell|T-lymphocyte|T lymphocyte"

delete()¶

Delete.

Return type:

None

describe(return_str=False)¶

Describe schema.

Return type:

None | str

remove_synonym(synonym)¶

Remove synonyms from a record.

Parameters:

synonym (str | list[str] | Series | array) – The synonym values to remove.

See also

add_synonym(): Add synonyms.

Example:

import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# remove a synonym
record.remove_synonym("T-cell")
record.synonyms
#> "T lymphocyte|T-lymphocyte"
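
The synonyms shown in the add_synonym() and remove_synonym() examples are stored as a single pipe-delimited string. A minimal sketch of editing such a string is given below; the two helper functions are hypothetical, do not touch the database, and may order the resulting synonyms differently from the library.

```python
def add_synonym(synonyms, new):
    """Append synonyms to a pipe-delimited string, skipping duplicates."""
    current = [s for s in (synonyms or "").split("|") if s]
    additions = [new] if isinstance(new, str) else list(new)
    for synonym in additions:
        if "|" in synonym:
            raise ValueError("a synonym must not contain the '|' delimiter")
        if synonym not in current:
            current.append(synonym)
    return "|".join(current)


def remove_synonym(synonyms, old):
    """Drop synonyms from a pipe-delimited string."""
    removals = {old} if isinstance(old, str) else set(old)
    kept = [s for s in (synonyms or "").split("|") if s and s not in removals]
    return "|".join(kept)


synonyms = "T-cell|T lymphocyte|T-lymphocyte"
with_added = add_synonym(synonyms, "T cells")
with_removed = remove_synonym(synonyms, "T-cell")
```

Rejecting values that contain `|` keeps the delimiter unambiguous, which is the main constraint of this storage format.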

save(*args, **kwargs)¶

Save.

Return type:

Schema

set_abbr(value)¶

Set value for abbr field and add to synonyms.

Parameters:

value (str) – A value for an abbreviation.

See also

add_synonym()

Example:

import bionty as bt

# save an experimental factor record
scrna = bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
assert scrna.abbr is None
assert scrna.synonyms == "single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing"

# set abbreviation
scrna.set_abbr("scRNA")
assert scrna.abbr == "scRNA"
# synonyms are updated
assert scrna.synonyms == "scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq"