Annotate#
This guide shows how to define clear validation criteria and then validate and curate metadata within a few minutes.
By the end, you’ll have validated data objects empowered by LaminDB registries.
What does “validating a categorical variable based on registries” mean?
The records in your LaminDB instance define the validated reference values for any entity managed in your schema.
Validated categorical values are stored in a field of a registry, that is, a column of the registry table.
The default field to label an entity record is the name field.
For instance, if “Experiment 1” has been registered as the name of a ULabel record, it is a validated value for field ULabel.name.
The CanValidate methods validate(), inspect(), standardize(), and from_values() take two important parameters: values and field. The parameter values takes an iterable of input categorical values, and the parameter field takes a typed field of a registry.
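To make these semantics concrete, here is a minimal plain-Python sketch of what validating, inspecting, and standardizing values against one registry field amounts to. It does not use LaminDB: the `REGISTRY` list, its `synonyms` key, the synonym strings, and the three helper functions are hypothetical stand-ins for illustration, not the library's implementation.

```python
# Hypothetical in-memory "registry": each record is a dict, each key a field.
# The synonyms are illustrative only.
REGISTRY = [
    {"name": "astrocyte", "synonyms": ["astrocytic glia"]},
    {"name": "oligodendrocyte", "synonyms": ["oligodendroglia"]},
]


def validate(values, field="name"):
    """Return one boolean per input value: is it stored in `field`?"""
    known = {record[field] for record in REGISTRY}
    return [value in known for value in values]


def inspect(values, field="name"):
    """Split the input values into validated and non-validated ones."""
    flags = validate(values, field)
    return {
        "validated": [v for v, ok in zip(values, flags) if ok],
        "non_validated": [v for v, ok in zip(values, flags) if not ok],
    }


def standardize(values, field="name"):
    """Map known synonyms to the value stored in `field`; keep the rest."""
    mapping = {syn: rec[field] for rec in REGISTRY for syn in rec["synonyms"]}
    return [mapping.get(value, value) for value in values]


values = ["astrocyte", "oligodendroglia", "neuronn"]
flags = validate(values)
report = inspect(values)
standardized = standardize(values)
```

In this sketch, a synonym is not a validated value until it has been standardized, and a misspelling such as "neuronn" stays non-validated either way.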
Set up#
!lamin init --storage ./test-annotate --schema bionty
💡 connected lamindb: testuser1/test-annotate
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad
ln.settings.verbosity = "hint"
💡 connected lamindb: testuser1/test-annotate
A DataFrame with labels#
Let’s start with a DataFrame object that we’d like to validate and curate:
df = pd.DataFrame({
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df
| | cell_type | assay_ontology_id | donor |
|---|---|---|---|
| 0 | cerebral pyramidal neuron | EFO:0008913 | D0001 |
| 1 | astrocyte | EFO:0008913 | D0002 |
| 2 | oligodendrocyte | EFO:0008913 | DOOO3 |
Validate and curate metadata#
Define validation criteria for the columns:
fields = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
Validate the Pandas DataFrame:
annotate = ln.Annotate.from_df(df, fields=fields)
✅ registered 3 features with Feature.name: ['cell_type', 'assay_ontology_id', 'donor']
validated = annotate.validate()
💡 inspecting 'cell_type' by CellType.name
❗ 3 terms are not validated: 'cerebral pyramidal neuron', 'astrocyte', 'oligodendrocyte'
→ register terms via .update_registry('cell_type')
💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id
❗ 1 terms is not validated: 'EFO:0008913'
→ register terms via .update_registry('assay_ontology_id')
💡 inspecting 'donor' by ULabel.name
❗ 3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
→ register terms via .update_registry('donor')
validated
False
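The column-wise check above can be pictured as follows. This is a plain-Python sketch, not LaminDB code: the table is a dict of columns, and `registered` is a hypothetical mapping from column name to the set of values currently stored in the corresponding registry field (empty here, mirroring the fresh instance). The function name `validate_table` is likewise an assumption made for illustration.

```python
# Table as a dict of columns (same values as the DataFrame above).
table = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
}

# Hypothetical registry contents per column; a fresh instance has none.
registered = {"cell_type": set(), "assay_ontology_id": set(), "donor": set()}


def validate_table(table, registered):
    """Return (all_valid, non_validated), checking each column's unique values."""
    non_validated = {}
    for column, values in table.items():
        # dict.fromkeys de-duplicates while keeping first-seen order
        unique = list(dict.fromkeys(values))
        missing = [v for v in unique if v not in registered[column]]
        if missing:
            non_validated[column] = missing
    return not non_validated, non_validated


all_valid, non_validated = validate_table(table, registered)
```

Because each column is reduced to its unique values first, the repeated ontology ID is reported once, and the overall result is true only when no column has non-validated values.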
Validate using registries in another instance#
Sometimes you want to validate against existing registries others might have created.
Here we curate against the registries of the cellxgene instance. You will notice that more terms are validated than above.
Values that are missing in our instance can then be registered directly from the cellxgene instance. Keeping our own registry while validating against the cellxgene instance lets us add new registry values, while the cellxgene instance stays focused on the cellxgene schema.
annotate = ln.Annotate.from_df(
    df,
    fields=fields,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()
💡 inspecting 'cell_type' by CellType.name
❗ 1 terms is not validated: 'cerebral pyramidal neuron'
→ register terms via .update_registry('cell_type')
💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id
✅ all assay_ontology_ids are validated
💡 inspecting 'donor' by ULabel.name
❗ 3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
→ register terms via .update_registry('donor')
False
Register new metadata labels#
Follow the suggestions above to register labels that aren’t present in the current instance:
(Note that our current instance is empty. Once you have filled the registries, you will rarely need to register new labels.)
annotate.update_registry("cell_type")
❗ 1 non-validated labels are not registered with CellType.name: ['cerebral pyramidal neuron']!
→ to lookup categories, use .lookup().['cell_type']
→ to register, set validated_only=False
✅ registered 2 labels from laminlabs/cellxgene with CellType.name: ['astrocyte', 'oligodendrocyte']
Fix the typo and register again:
# use a lookup object to get the correct spelling of categories from public reference
# pass "public" to use the public reference
lookup = annotate.lookup()
lookup
Lookup objects from the laminlabs/cellxgene:
['feature']
['cell_type']
['assay_ontology_id']
['donor']
Example:
→ categories = validator.lookup().['cell_type']
→ categories.alveolar_type_1_fibroblast_cell
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(uid='2sgq6sE7', name='cerebral cortex pyramidal neuron', ontology_id='CL:4023111', description='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', updated_at=2023-11-28 22:37:06 UTC, public_source_id=48, created_by_id=1)
# fix the typo
df["cell_type"] = df["cell_type"].replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
annotate.update_registry("cell_type")
✅ registered 1 labels from laminlabs/cellxgene with CellType.name: ['cerebral cortex pyramidal neuron']
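The attribute-style access used above (`cell_types.cerebral_cortex_pyramidal_neuron`) can be approximated in plain Python. The sketch below is an assumption about the general idea, not LaminDB's implementation: the helper `make_lookup` is hypothetical and simply turns each name into a valid Python identifier so that an editor can auto-complete it. Here the attributes hold plain strings rather than registry records.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an object whose attributes are identifier-safe versions of `names`."""
    attributes = {}
    for name in names:
        # lowercase, then replace every run of non-alphanumerics with "_"
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        if key and key[0].isdigit():
            key = "_" + key  # identifiers cannot start with a digit
        attributes[key] = name
    return SimpleNamespace(**attributes)


cell_types = make_lookup(
    ["cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"]
)

# Replace the misspelled category with the spelling found via the lookup.
column = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]
fixes = {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron}
column = [fixes.get(value, value) for value in column]
```

Taking the replacement string from the lookup rather than typing it by hand avoids introducing a second typo while fixing the first.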
annotate.update_registry("donor")
✅ registered 3 labels with ULabel.name: ['D0001', 'D0002', 'DOOO3']
To register non-validated terms, pass validated_only=False:
annotate.update_registry("donor", validated_only=False)
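The effect of `validated_only` can be sketched in plain Python. The function below is a hypothetical stand-in, not LaminDB's `update_registry`: `local` and `reference` are plain sets representing the current instance's registry and a reference source (such as another instance or a public ontology). In this sketch, values found in the reference are always registered, while values unknown to the reference are registered only when `validated_only=False`.

```python
def update_registry(values, local, reference, validated_only=True):
    """Add `values` to `local`; return (newly registered, skipped) lists."""
    registered, skipped = [], []
    for value in dict.fromkeys(values):  # unique values, original order
        if value in local:
            continue  # already registered, nothing to do
        if value in reference or not validated_only:
            local.add(value)
            registered.append(value)
        else:
            skipped.append(value)  # unknown to the reference, left out
    return registered, skipped


reference = {"astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"}
local = set()
values = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]

# Default: only values known to the reference are registered.
registered, skipped = update_registry(values, local, reference)

# Explicit opt-in: register the remaining values as new, unvalidated labels.
forced, _ = update_registry(skipped, local, reference, validated_only=False)
```

Defaulting to the stricter behavior means a misspelled label is never registered by accident; registering it requires the explicit opt-in.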
Let’s validate it again:
validated = annotate.validate()
💡 inspecting 'cell_type' by CellType.name
✅ all cell_types are validated
💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id
✅ all assay_ontology_ids are validated
💡 inspecting 'donor' by ULabel.name
✅ all donors are validated
validated
True
Validate an AnnData object#
We offer an AnnData-specific annotate flow that is aware of the variables in addition to the observations DataFrame.
Here we specify which var_field and obs_fields to validate against.
df.index = ["obs1", "obs2", "obs3"]
X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 5
obs: 'cell_type', 'assay_ontology_id', 'donor'
annotate = ln.Annotate.from_anndata(
    adata,
    obs_fields=fields,
    var_field=bt.Gene.symbol,  # specify the field for the var
    organism="human",
)
✅ registered 6 labels from public with Gene.symbol: ['TCF7', 'PDCD1', 'PDCD1', 'CD3E', 'CD4', 'CD8A']
annotate.validate()
💡 inspecting 'variables' by Gene.symbol
✅ all variabless are validated
💡 inspecting 'cell_type' by CellType.name
✅ all cell_types are validated
💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id
❗ 1 terms is not validated: 'EFO:0008913'
→ register terms via .update_registry('assay_ontology_id')
💡 inspecting 'donor' by ULabel.name
✅ all donors are validated
False
annotate.update_registry("all")
💡 registering labels for 'cell_type'
💡 registering labels for 'assay_ontology_id'
✅ registered 1 labels from public with ExperimentalFactor.ontology_id: ['EFO:0008913']
💡 registering labels for 'donor'
annotate.validate()
💡 inspecting 'variables' by Gene.symbol
✅ all variabless are validated
💡 inspecting 'cell_type' by CellType.name
✅ all cell_types are validated
💡 inspecting 'assay_ontology_id' by ExperimentalFactor.ontology_id
✅ all assay_ontology_ids are validated
💡 inspecting 'donor' by ULabel.name
✅ all donors are validated
True
Register file#
The validated object can subsequently be registered as an Artifact in your LaminDB instance:
ln.transform.stem_uid = "WOK3vP0bNGLx"
ln.transform.version = "0"
ln.track()
💡 Assuming editor is Jupyter Lab.
💡 notebook imports: anndata==0.10.5.post1 bionty==0.42.4 lamindb==0.69.2 pandas==1.5.3
💡 saved: Transform(uid='WOK3vP0bNGLx6K79', name='Annotate', key='annotate', version='0', type=notebook, updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
💡 saved: Run(uid='5DENnAdSck51Ir7hxH3f', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_5DENnAdSck51Ir7hxH3f.txt
artifact = annotate.register_artifact(description="test AnnData")
... storing 'assay_ontology_id' as categorical
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/GjZEABWHRHWt5qgKGPCE.h5ad')
✅ storing artifact 'GjZEABWHRHWt5qgKGPCE' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-annotate/.lamindb/GjZEABWHRHWt5qgKGPCE.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅ 5 terms (100.00%) are validated for symbol
✅ linked: FeatureSet(uid='N9VFofd47vqOluFiXmJa', n=6, type='number', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 3 terms (100.00%) are validated for name
✅ linked: FeatureSet(uid='SCWUyRUtrGp8dAp3R98T', n=3, registry='core.Feature', hash='P2nr4AHCJn58MrMgUtQf', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked feature 'assay_ontology_id' to registry 'bionty.ExperimentalFactor'
✅ linked feature 'donor' to registry 'core.ULabel'
✅ registered artifact in testuser1/test-annotate
View the registered artifact with metadata:
artifact.describe()
Artifact(uid='GjZEABWHRHWt5qgKGPCE', suffix='.h5ad', accessor='AnnData', description='test AnnData', size=19888, hash='KCkT4m82Mfrhtfma0vW3xg', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at=2024-03-28 10:23:58 UTC)
Provenance:
🗃️ storage: Storage(uid='tWMa7Qeq', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-annotate', type='local', updated_at=2024-03-28 10:23:37 UTC, created_by_id=1)
💫 transform: Transform(uid='WOK3vP0bNGLx6K79', name='Annotate', key='annotate', version='0', type=notebook, updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
👣 run: Run(uid='5DENnAdSck51Ir7hxH3f', started_at=2024-03-28 10:23:58 UTC, is_consecutive=True, transform_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-28 10:23:37 UTC)
Features:
var: FeatureSet(uid='N9VFofd47vqOluFiXmJa', n=6, type='number', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
'TCF7', 'PDCD1', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
obs: FeatureSet(uid='SCWUyRUtrGp8dAp3R98T', n=3, registry='core.Feature', hash='P2nr4AHCJn58MrMgUtQf', updated_at=2024-03-28 10:23:58 UTC, created_by_id=1)
🔗 cell_type (3, bionty.CellType): 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
🔗 assay_ontology_id (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
🔗 donor (3, core.ULabel): 'D0001', 'D0002', 'DOOO3'
Labels:
🏷️ cell_types (3, bionty.CellType): 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
🏷️ ulabels (3, core.ULabel): 'D0001', 'D0002', 'DOOO3'
Register collection#
Register a new collection for the registered artifact:
# register a new collection
collection = annotate.register_collection(
    artifact,  # registered artifact above, can also pass a list of artifacts
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)
✅ registered collection in testuser1/test-annotate
collection.artifact
Artifact(uid='GjZEABWHRHWt5qgKGPCE', suffix='.h5ad', accessor='AnnData', description='test AnnData', size=19888, hash='KCkT4m82Mfrhtfma0vW3xg', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at=2024-03-28 10:23:58 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
artifact.collection
Collection(uid='GjZEABWHRHWt5qgKGPCE', name='Experiment X in brain', description='10.1126/science.xxxxx', hash='KCkT4m82Mfrhtfma0vW3xg', reference='E-MTAB-xxxxx', reference_type='ArrayExpress', visibility=1, updated_at=2024-03-28 10:23:58 UTC, transform_id=1, run_id=1, artifact_id=1, created_by_id=1)