Annotate#

This guide shows how to define clear validation criteria and then validate and curate metadata against them, all within a few minutes.

By the end, you’ll have validated data objects backed by LaminDB registries.

Set up#

!lamin init --storage ./test-annotate --schema bionty

import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

ln.settings.verbosity = "hint"

💡 connected lamindb: testuser1/test-annotate

A DataFrame with labels#

Let’s start with a DataFrame object that we’d like to validate and curate:

df = pd.DataFrame({
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

	cell_type	assay_ontology_id	donor
0	cerebral pyramidal neuron	EFO:0008913	D0001
1	astrocyte	EFO:0008913	D0002
2	oligodendrocyte	EFO:0008913	DOOO3

Validate and curate metadata#

Define validation criteria for the columns:

fields = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

Validate the pandas DataFrame:

annotate = ln.Annotate.from_df(df, fields=fields)

✅ registered 3 features with Feature.name: ['cell_type', 'assay_ontology_id', 'donor']

validated = annotate.validate()

💡 inspecting 'cell_type' by CellType.name

❗    3 terms are not validated: 'cerebral pyramidal neuron', 'astrocyte', 'oligodendrocyte'
      → register terms via .update_registry('cell_type')

💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id

❗    1 terms is not validated: 'EFO:0008913'
      → register terms via .update_registry('assay_ontology_id')

💡 inspecting 'donor' by ULabel.name

❗    3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
      → register terms via .update_registry('donor')

validated

False
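
Conceptually, validating a column amounts to checking each distinct value against the set of values registered for the chosen field. The following is a minimal pure-Python sketch of that idea, not LaminDB's implementation; the `validate_column` helper and the in-memory `registry` set are hypothetical stand-ins for a registry field such as `bt.CellType.name`.

```python
def validate_column(values, registry):
    """Return the distinct values missing from the registry, in first-seen order."""
    seen = set()
    missing = []
    for value in values:
        if value not in registry and value not in seen:
            missing.append(value)
        seen.add(value)
    return missing


# Hypothetical registry contents; an empty instance would be an empty set.
registry = {"astrocyte", "oligodendrocyte"}
column = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte", "astrocyte"]

missing = validate_column(column, registry)
is_validated = not missing  # a column is validated only if nothing is missing
print(missing)       # ['cerebral pyramidal neuron']
print(is_validated)  # False
```

With an empty registry, as in the freshly initialized instance above, every distinct value is reported as missing.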

Validate using registries in another instance#

Sometimes you want to validate against registries that others have already created.

Here we curate against the registries of the cellxgene instance. Notice that more terms are validated than above.

This lets us register values that are missing in our instance directly from the cellxgene instance. Keeping our own registry while validating against the cellxgene instance allows us to add new registry values, and the cellxgene instance stays focused on the cellxgene schema.

annotate = ln.Annotate.from_df(
    df,
    fields=fields,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()

💡 inspecting 'cell_type' by CellType.name

❗    1 terms is not validated: 'cerebral pyramidal neuron'
      → register terms via .update_registry('cell_type')

💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id

✅    all assay_ontology_ids are validated

💡 inspecting 'donor' by ULabel.name

❗    3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
      → register terms via .update_registry('donor')

False
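
The effect of `using` can be pictured as consulting a second, larger registry when a term is missing locally. The sketch below is only an illustration with hypothetical in-memory sets (`local_registry`, `reference_registry`) and a hypothetical `classify_terms` helper; it does not call LaminDB.

```python
def classify_terms(values, local_registry, reference_registry):
    """Split distinct values into groups based on where they are known."""
    result = {"local": [], "reference_only": [], "unknown": []}
    for value in dict.fromkeys(values):  # de-duplicate, keep first-seen order
        if value in local_registry:
            result["local"].append(value)
        elif value in reference_registry:
            result["reference_only"].append(value)
        else:
            result["unknown"].append(value)
    return result


local_registry = set()  # assumption: the freshly initialized instance is empty
reference_registry = {"astrocyte", "oligodendrocyte"}  # hypothetical subset of a shared instance
cell_types = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]

groups = classify_terms(cell_types, local_registry, reference_registry)
print(groups["reference_only"])  # ['astrocyte', 'oligodendrocyte']
print(groups["unknown"])         # ['cerebral pyramidal neuron']
```

In this picture, the `reference_only` terms are the ones that can be registered from the other instance, while the `unknown` terms need a fix or an explicit registration.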

Register new metadata labels#

Follow the suggestions above to register labels that aren’t present in the current instance:

(Note that our current instance is empty. Once you have filled the registries, you will rarely need to register new labels.)

annotate.update_registry("cell_type")

❗ 1 non-validated labels are not registered with CellType.name: ['cerebral pyramidal neuron']!
      → to lookup categories, use .lookup().['cell_type']
      → to register, set validated_only=False

✅ registered 2 labels from laminlabs/cellxgene with CellType.name: ['astrocyte', 'oligodendrocyte']
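
The two messages above reflect a default of registering only terms that the reference can vouch for. A rough pure-Python sketch of that behavior follows; `update_registry_sketch` and its set-based registries are hypothetical and only mirror the `validated_only` flag named in the log.

```python
def update_registry_sketch(values, local_registry, reference_registry, validated_only=True):
    """Add terms to local_registry and return (registered, skipped) lists."""
    registered, skipped = [], []
    for value in dict.fromkeys(values):  # de-duplicate, keep first-seen order
        if value in local_registry:
            continue  # already registered, nothing to do
        if value in reference_registry or not validated_only:
            local_registry.add(value)
            registered.append(value)
        else:
            skipped.append(value)
    return registered, skipped


local_registry = set()
reference_registry = {"astrocyte", "oligodendrocyte"}  # hypothetical reference content
cell_types = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]

registered, skipped = update_registry_sketch(cell_types, local_registry, reference_registry)
print(registered)  # ['astrocyte', 'oligodendrocyte']
print(skipped)     # ['cerebral pyramidal neuron']
```

Calling the helper again with `validated_only=False` would also add the skipped term, which is why fixing the spelling first is usually the safer choice.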

Fix typo and register again:

# use a lookup object to get the correct spelling of categories
# here it draws on laminlabs/cellxgene; pass "public" to use the public reference instead
lookup = annotate.lookup()

lookup

Lookup objects from the laminlabs/cellxgene:
 ['feature']
 ['cell_type']
 ['assay_ontology_id']
 ['donor']

Example:
    → categories = validator.lookup().['cell_type']
    → categories.alveolar_type_1_fibroblast_cell

cell_types = lookup["cell_type"]

cell_types.cerebral_cortex_pyramidal_neuron

CellType(uid='2sgq6sE7', name='cerebral cortex pyramidal neuron', ontology_id='CL:4023111', description='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', updated_at=2023-11-28 22:37:06 UTC, public_source_id=48, created_by_id=1)
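
The attribute `cerebral_cortex_pyramidal_neuron` above is derived from the record name. How LaminDB builds these attribute names is not shown in this guide; the following sketch assumes a simple rule (lowercase, runs of non-alphanumeric characters replaced by underscores) and uses hypothetical `to_attribute` and `build_lookup` helpers to illustrate why tab completion on a lookup object helps find the correct spelling.

```python
import re
from types import SimpleNamespace


def to_attribute(name):
    """Turn a free-text name into an identifier-like key (assumed rule)."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Identifiers cannot start with a digit, so prefix an underscore if needed.
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(names):
    """Map attribute-style keys to the original names."""
    return SimpleNamespace(**{to_attribute(name): name for name in names})


names = ["cerebral cortex pyramidal neuron", "astrocyte", "alveolar type 1 fibroblast cell"]
lookup_sketch = build_lookup(names)
print(lookup_sketch.cerebral_cortex_pyramidal_neuron)   # cerebral cortex pyramidal neuron
print(to_attribute("alveolar type 1 fibroblast cell"))  # alveolar_type_1_fibroblast_cell
```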

# fix the typo
df["cell_type"] = df["cell_type"].replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})

annotate.update_registry("cell_type")

✅ registered 1 labels from laminlabs/cellxgene with CellType.name: ['cerebral cortex pyramidal neuron']
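
When the correct spelling is not obvious, a fuzzy match against the registered names can suggest candidates. This is a different technique from the lookup object used above: the sketch relies on `difflib.get_close_matches` from Python's standard library, and the `suggest` helper and the list of registered names are hypothetical.

```python
from difflib import get_close_matches


def suggest(term, registered_names, n=1):
    """Return up to n registered names that closely resemble the term."""
    return get_close_matches(term, registered_names, n=n)


registered_names = ["cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"]
suggestions = suggest("cerebral pyramidal neuron", registered_names)
print(suggestions)  # ['cerebral cortex pyramidal neuron']
```

A suggestion is only a candidate; it still makes sense to confirm it against the registry record before replacing values in the DataFrame.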

annotate.update_registry("donor")

✅ registered 3 labels with ULabel.name: ['D0001', 'D0002', 'DOOO3']

To register non-validated terms, pass validated_only=False:

annotate.update_registry("donor", validated_only=False)

Let’s validate it again:

validated = annotate.validate()

💡 inspecting 'cell_type' by CellType.name

✅    all cell_types are validated

💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id

✅    all assay_ontology_ids are validated

💡 inspecting 'donor' by ULabel.name

✅    all donors are validated

validated

True

Validate an AnnData object#

We offer an AnnData-specific Annotate flow that is aware of the variables in addition to the observations DataFrame.

Here we specify which var_field and obs_fields to validate against.

df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

AnnData object with n_obs × n_vars = 3 × 5
    obs: 'cell_type', 'assay_ontology_id', 'donor'

annotate = ln.Annotate.from_anndata(
    adata,
    obs_fields=fields,
    var_field=bt.Gene.symbol,  # specify the field for the var
    organism="human",
)

✅    registered 6 labels from public with Gene.symbol: ['TCF7', 'PDCD1', 'PDCD1', 'CD3E', 'CD4', 'CD8A']

annotate.validate()

💡 inspecting 'variables' by Gene.symbol

✅    all variabless are validated

💡 inspecting 'cell_type' by CellType.name

✅    all cell_types are validated

💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id

❗    1 terms is not validated: 'EFO:0008913'
      → register terms via .update_registry('assay_ontology_id')

💡 inspecting 'donor' by ULabel.name

✅    all donors are validated

False
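
An AnnData-aware validation can be thought of as two kinds of checks: variable names against one registry, and each `obs` column against its own registry. The sketch below uses plain Python containers instead of an AnnData object; `validate_anndata_like` and all registry contents are hypothetical.

```python
def validate_anndata_like(var_names, obs, var_registry, obs_registries):
    """Map 'variables' and each obs column to its list of unregistered values."""
    report = {"variables": [v for v in dict.fromkeys(var_names) if v not in var_registry]}
    for column, registry in obs_registries.items():
        report[column] = [v for v in dict.fromkeys(obs[column]) if v not in registry]
    return report


var_names = ["TCF7", "PDCD1", "CD3E", "CD4", "CD8A"]
obs = {
    "cell_type": ["cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
}
var_registry = set(var_names)  # assumption: every gene symbol is registered
obs_registries = {
    "cell_type": set(obs["cell_type"]),  # assumption: all cell types are registered
    "assay_ontology_id": set(),          # assumption: the assay is not registered yet
}

report = validate_anndata_like(var_names, obs, var_registry, obs_registries)
is_validated = not any(report.values())  # valid only if every list is empty
print(report["assay_ontology_id"])  # ['EFO:0008913']
print(is_validated)                 # False
```

A single unregistered value in any slot is enough to make the overall result `False`, which matches the outcome shown above.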

annotate.update_registry("all")

💡    registering labels for 'cell_type'

💡    registering labels for 'assay_ontology_id'

✅    registered 1 labels from public with ExperimentalFactor.ontology_id: ['EFO:0008913']

💡    registering labels for 'donor'

annotate.validate()

💡 inspecting 'variables' by Gene.symbol

✅    all variabless are validated

💡 inspecting 'cell_type' by CellType.name

✅    all cell_types are validated

💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id

✅    all assay_ontology_ids are validated

💡 inspecting 'donor' by ULabel.name

✅    all donors are validated

True

Register file#

The validated object can subsequently be registered as an Artifact in your LaminDB instance:

ln.transform.stem_uid = "WOK3vP0bNGLx"
ln.transform.version = "0"
ln.track()

💡    Assuming editor is Jupyter Lab.

💡    notebook imports: anndata==0.10.5.post1 bionty==0.42.4 lamindb==0.69.2 pandas==1.5.3

💡    saved: Transform(uid='WOK3vP0bNGLx6K79', name='Annotate', key='annotate', version='0', type=notebook, updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)

💡    saved: Run(uid='5DENnAdSck51Ir7hxH3f', transform_id=1, created_by_id=1)

💡    tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_5DENnAdSck51Ir7hxH3f.txt

artifact = annotate.register_artifact(description="test AnnData")

... storing 'assay_ontology_id' as categorical

💡    path content will be copied to default storage upon `save()` with key `None` ('.lamindb/GjZEABWHRHWt5qgKGPCE.h5ad')

✅    storing artifact 'GjZEABWHRHWt5qgKGPCE' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-annotate/.lamindb/GjZEABWHRHWt5qgKGPCE.h5ad'

💡    parsing feature names of X stored in slot 'var'

✅    5 terms (100.00%) are validated for symbol

✅    linked: FeatureSet(uid='N9VFofd47vqOluFiXmJa', n=6, type='number', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', created_by_id=1)

💡 parsing feature names of slot 'obs'

✅    3 terms (100.00%) are validated for name

✅    linked: FeatureSet(uid='SCWUyRUtrGp8dAp3R98T', n=3, registry='core.Feature', hash='P2nr4AHCJn58MrMgUtQf', created_by_id=1)

✅ saved 2 feature sets for slots: 'var','obs'

✅ linked feature 'cell_type' to registry 'bionty.CellType'

✅ linked feature 'assay_ontology_id' to registry 'bionty.ExperimentalFactor'

✅ linked feature 'donor' to registry 'core.ULabel'

✅ registered artifact in testuser1/test-annotate

View the registered artifact with metadata:

artifact.describe()

Artifact(uid='GjZEABWHRHWt5qgKGPCE', suffix='.h5ad', accessor='AnnData', description='test AnnData', size=19888, hash='KCkT4m82Mfrhtfma0vW3xg', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at=2024-03-28 10:23:58 UTC)

Provenance:
  🗃️ storage: Storage(uid='tWMa7Qeq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-annotate', type='local', updated_at=2024-03-28 10:23:37 UTC, created_by_id=1)
  💫 transform: Transform(uid='WOK3vP0bNGLx6K79', name='Annotate', key='annotate', version='0', type=notebook, updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
  👣 run: Run(uid='5DENnAdSck51Ir7hxH3f', started_at=2024-03-28 10:23:58 UTC, is_consecutive=True, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-28 10:23:37 UTC)
Features:
  var: FeatureSet(uid='N9VFofd47vqOluFiXmJa', n=6, type='number', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
    'TCF7', 'PDCD1', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
  obs: FeatureSet(uid='SCWUyRUtrGp8dAp3R98T', n=3, registry='core.Feature', hash='P2nr4AHCJn58MrMgUtQf', updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
    🔗 cell_type (3, bionty.CellType): 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
    🔗 assay_ontology_id (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 donor (3, core.ULabel): 'D0001', 'D0002', 'DOOO3'
Labels:
  🏷️ cell_types (3, bionty.CellType): 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
  🏷️ ulabels (3, core.ULabel): 'D0001', 'D0002', 'DOOO3'

Register collection#

# register a new collection
collection = annotate.register_collection(
    artifact,  # registered artifact above, can also pass a list of artifacts
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)

✅ registered collection in testuser1/test-annotate

collection.artifact

Artifact(uid='GjZEABWHRHWt5qgKGPCE', suffix='.h5ad', accessor='AnnData', description='test AnnData', size=19888, hash='KCkT4m82Mfrhtfma0vW3xg', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at=2024-03-28 10:23:58 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)

artifact.collection

Collection(uid='GjZEABWHRHWt5qgKGPCE', name='Experiment X in brain', description='10.1126/science.xxxxx', hash='KCkT4m82Mfrhtfma0vW3xg', reference='E-MTAB-xxxxx', reference_type='ArrayExpress', visibility=1, updated_at=2024-03-28 10:23:58 UTC, transform_id=1, run_id=1, artifact_id=1, created_by_id=1)