Gene Ontology (GO)#

Pathways are interconnected molecular networks of signaling cascades that govern critical cellular processes. They help explain the mechanisms of cellular behavior, disease progression, and treatment response. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.

In this notebook we manage a pathway registry based on the “2023 GO Biological Process” ontology. We’ll walk you through the steps of registering pathways and linking them to genes.

In the following notebook, Standardize metadata on-the-fly, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.

Setup#

!lamin init --storage ./use-cases-registries --schema bionty

import lamindb as ln
import bionty as bt
import gseapy as gp

bt.settings.organism = "human"  # globally set organism

💡 connected lamindb: testuser1/use-cases-registries

Fetch GO pathways annotated with human genes using Enrichr#

First, we fetch the “GO_Biological_Process_2023” pathways for humans using GSEApy, which wraps GSEA and Enrichr.

go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")

2024-03-28 10:21:47,129:INFO - Downloading and generating Enrichr library gene sets...

2024-03-28 10:22:02,458:INFO - 0001 gene_sets have been filtered out when max_size=2000 and min_size=0

Number of pathways 5406

go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]

['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']
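As the lookup above shows, the fetched library can be treated as a mapping from a term key to a list of gene symbols. Before registering anything, it can help to get a rough picture of how large the gene sets are. The following is a minimal sketch on a hand-made stand-in dictionary: `toy_library` and `summarize_gene_sets` are hypothetical names, and the second entry is a placeholder rather than a real GO term.

```python
from statistics import median

# Stand-in for the fetched library. The first entry is the example shown above;
# the second is a placeholder invented for illustration, not a real GO term.
toy_library = {
    "ATF6-mediated Unfolded Protein Response (GO:0036500)": [
        "MBTPS1", "MBTPS2", "XBP1", "ATF6B", "DDIT3", "CREBZF",
    ],
    "toy term (placeholder)": ["GENE_A", "GENE_B"],
}


def summarize_gene_sets(library):
    """Return the number of terms and basic gene-set size statistics."""
    sizes = [len(genes) for genes in library.values()]
    return {
        "n_terms": len(sizes),
        "min": min(sizes),
        "median": median(sizes),
        "max": max(sizes),
    }


summary = summarize_gene_sets(toy_library)
print(summary)
```

Under the assumption that `go_bp` supports `.values()`, the same helper could be called as `summarize_gene_sets(go_bp)`.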

Parse the ontology ID out of each key and convert the library into the format {ontology_id: (name, genes)}:

def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    # The ontology ID is the last space-separated token, wrapped in parentheses
    name, _, suffix = key.rpartition(" ")
    ontology_id = suffix.strip("()")
    return (ontology_id, name)

go_bp_parsed = {}

# Re-key the library by ontology ID (avoids shadowing the built-in `id`)
for key, genes in go_bp.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)

go_bp_parsed["GO:0036500"]

('ATF6-mediated Unfolded Protein Response',
 ['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])
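The parsing step above assumes that every key ends with a GO ID in parentheses. One way to make that assumption explicit is a regular expression that only accepts keys of the form `name (GO:NNNNNNN)` and sets aside anything else for inspection. This is a sketch, not part of the original workflow: `GO_KEY_PATTERN` and `split_go_key` are hypothetical names, and the malformed key is made up for illustration.

```python
import re

# A GO ID is "GO:" followed by seven digits; the name is everything before " (GO:...)".
GO_KEY_PATTERN = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d{7})\)$")


def split_go_key(key):
    """Return (ontology_id, name) for a well-formed key, or None otherwise."""
    match = GO_KEY_PATTERN.match(key)
    if match is None:
        return None
    return match.group("ontology_id"), match.group("name")


keys = [
    "ATF6-mediated Unfolded Protein Response (GO:0036500)",
    "Malformed key without an ID",  # made-up example of an unexpected key
]

parsed = {}
unparsed = []
for key in keys:
    result = split_go_key(key)
    if result is None:
        unparsed.append(key)  # keep for manual inspection instead of guessing
    else:
        ontology_id, name = result
        parsed[ontology_id] = name

print(parsed)
print(unparsed)
```

Collecting the unmatched keys in a list, rather than raising on the first one, makes it easy to see at a glance whether the library contains any surprises.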

Register pathway ontology in LaminDB#

bionty = bt.Pathway.public()

bionty

PublicOntology
Entity: Pathway
Organism: all
Source: go, 2023-05-10
#terms: 47514

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object

Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.

Register pathway terms#

To register the pathways, we use .from_values to parse the annotated GO pathway ontology IDs directly into LaminDB records.

pathway_records = bt.Pathway.from_values(go_bp_parsed.keys(), bt.Pathway.ontology_id)

ln.save(pathway_records, parents=False)  # not recursing through parents

Register gene symbols#

Similarly, we use .from_values to register all pathway-associated genes with LaminDB.

all_genes = {g for genes in go_bp.values() for g in genes}

gene_records = bt.Gene.from_values(all_genes, bt.Gene.symbol)

gene_records[:3]

[Gene(uid='4VdaTgLQAxmb', symbol='BCAS2', ensembl_gene_id='ENSG00000116752', ncbi_gene_ids='10286', biotype='protein_coding', description='BCAS2 pre-mRNA processing factor ', synonyms='DAM1|SNT309|SPF27', organism_id=1, public_source_id=9, created_by_id=1),
 Gene(uid='q2aA2JDqYUNh', symbol='CCNE2', ensembl_gene_id='ENSG00000175305', ncbi_gene_ids='9134', biotype='protein_coding', description='cyclin E2 ', synonyms='CYCE2', organism_id=1, public_source_id=9, created_by_id=1),
 Gene(uid='61dA4iednKP7', symbol='NAT8', ensembl_gene_id='ENSG00000144035', ncbi_gene_ids='9027', biotype='protein_coding', description='N-acetyltransferase 8 (putative) ', synonyms='TSC501|GLA|HCML1|ATASE2', organism_id=1, public_source_id=9, created_by_id=1)]

ln.save(gene_records);
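The set comprehension used to build `all_genes` collects each symbol once, even when a gene belongs to several pathways, so every gene is registered a single time. A small sketch on made-up data illustrates the difference between unique symbols and total gene mentions; `toy_go_bp` and its term names are placeholders, not real GO terms.

```python
# Two toy gene sets that share two symbols (XBP1 and DDIT3).
toy_go_bp = {
    "toy term A": ["XBP1", "ATF6B", "DDIT3"],
    "toy term B": ["XBP1", "DDIT3", "CREBZF"],
}

# Same pattern as in the notebook: flatten all gene lists into one set.
unique_symbols = {g for genes in toy_go_bp.values() for g in genes}

# Total number of (pathway, gene) pairs, counting shared genes repeatedly.
total_mentions = sum(len(genes) for genes in toy_go_bp.values())

print(len(unique_symbols), total_mentions)
```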

Link pathway to genes#

Now that we are tracking all pathway and gene records, we can link them to make the pathways even more queryable.

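# Map each gene symbol to its record for fast lookup
gene_records_by_symbol = {record.symbol: record for record in gene_records}

for pathway_record in pathway_records:
    pathway_genes = go_bp_parsed[pathway_record.ontology_id][1]
    # Skip symbols without a matching gene record so that no None is passed on
    pathway_genes_records = [
        gene_records_by_symbol[gene] for gene in pathway_genes if gene in gene_records_by_symbol
    ]
    pathway_record.genes.set(pathway_genes_records)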

Genes are now linked to pathways. For example, the genes of the last pathway processed in the loop:

pathway_record.genes.list("symbol")

['CARD18', 'CAST', 'CARD8', 'XIAP', 'CST7']
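The linking loop above looks up each gene symbol in a dictionary of records. If a symbol has no record, a plain dictionary lookup with `.get` yields `None`, so it can be useful to collect such symbols separately. The sketch below uses plain Python stand-ins instead of LaminDB records: `ToyGene` and `resolve_genes` are hypothetical, and `NOT_A_GENE` is a made-up symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGene:
    """Stand-in for a gene record; only the symbol matters here."""

    symbol: str


def resolve_genes(symbols, records_by_symbol):
    """Split symbols into found records and symbols that have no record."""
    found, missing = [], []
    for symbol in symbols:
        record = records_by_symbol.get(symbol)
        if record is None:
            missing.append(symbol)
        else:
            found.append(record)
    return found, missing


records_by_symbol = {g.symbol: g for g in [ToyGene("XIAP"), ToyGene("CAST")]}
found, missing = resolve_genes(["XIAP", "CAST", "NOT_A_GENE"], records_by_symbol)

print([g.symbol for g in found])
print(missing)
```

Reporting the missing symbols keeps the linking step transparent: it shows which pathway members could not be linked instead of silently dropping them.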

Move on to the next analysis: Standardize metadata on-the-fly