reactome
Reactome Objects
class Reactome()
__init__
def __init__()
constructor for the Reactome class. It takes no parameters; on construction it obtains the latest information from the api
search
@api_call
def search(query,
species=None,
compartments=None,
keywords=None,
types=None,
start=0,
num_rows=1000,
cluster=True,
force_filters=True)
general query that returns the reactome ids found for the different result types
Arguments:
query: a string to be searched
species: a species name, see self.show_fields["species"]
compartments: compartment name, see self.show_fields["compartment"]
keywords: see self.show_fields["keyword"]
types: see self.show_fields["type"]
start: where to start the search, default is 0
num_rows: number of rows to return, default is 1000 (should be more than enough)
cluster: whether to cluster the results by the different types, default True
force_filters: if True and nothing is found, an empty dict is returned; otherwise the search is tried again without any filters
Returns:
response dict or an error
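The force_filters behaviour described above can be sketched in isolation. This is a minimal sketch, not the library's implementation: search_with_fallback, run_query and fake_backend are hypothetical names, and the backend is a local stand-in rather than the Reactome service.

```python
from typing import Callable, Dict, List, Optional

Results = Dict[str, List[str]]


def search_with_fallback(
    run_query: Callable[[str, Dict[str, str]], Results],
    query: str,
    filters: Optional[Dict[str, str]] = None,
    force_filters: bool = True,
) -> Results:
    """Run a filtered query, optionally retrying once without filters."""
    filters = filters or {}
    results = run_query(query, filters)
    if results:
        return results
    # Nothing found: either respect the filters strictly or drop them and retry.
    if force_filters or not filters:
        return {}
    return run_query(query, {})


def fake_backend(query: str, filters: Dict[str, str]) -> Results:
    """Hypothetical stand-in for the remote service: only unfiltered queries match."""
    return {} if filters else {"Pathway": ["placeholder-id"]}


species_filter = {"species": "Homo sapiens"}
strict = search_with_fallback(fake_backend, "kinase", species_filter, force_filters=True)
relaxed = search_with_fallback(fake_backend, "kinase", species_filter, force_filters=False)
print(strict)   # {}
print(relaxed)  # {'Pathway': ['placeholder-id']}
```

With force_filters=True an empty result is final; with force_filters=False the same query is repeated once with the filters removed.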
get_details
@api_call
def get_details(id)
get detailed information about a reactome entry, identified by its reactome id
Arguments:
id: reactome id
Returns:
response dict
show_values
def show_values(field)
show available values for a given field
Arguments:
field: name of the field, see show_fields
Returns:
a list
show_fields
def show_fields()
show available fields for filtering
Returns:
a list
uniprot
UniProt Objects
class UniProt()
__init__
def __init__()
constructor for the UniProt class, which is used to gather data from the UniProt API and process it in a readable format.
search
def search(query, page_size=500)
free text query for the UniProt API
Arguments:
query: text query, anything that can be searched on the UniProt website
page_size: number of items per request; this is not the total number of results, since results are fetched until there are no more pages
Returns:
a dataframe of name, UniProt ID, gene name, organism and a brief description
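The relationship between page_size and the total number of results can be illustrated with a small paging loop. This is a sketch under assumptions: iter_all_results and make_fake_fetch are hypothetical helpers, the data is local, and the real service may hand out cursors rather than page indices.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (items on this page, index of the next page or None when there is no next page)
Page = Tuple[List[str], Optional[int]]


def iter_all_results(
    fetch_page: Callable[[int, int], Page], page_size: int = 500
) -> Iterator[str]:
    """Yield every item, requesting page after page until none remain."""
    page: Optional[int] = 0
    while page is not None:
        items, page = fetch_page(page, page_size)
        yield from items


def make_fake_fetch(records: List[str]) -> Callable[[int, int], Page]:
    """Hypothetical stand-in for the remote API, serving slices of a local list."""

    def fetch_page(page: int, page_size: int) -> Page:
        start = page * page_size
        chunk = records[start:start + page_size]
        has_more = start + page_size < len(records)
        return chunk, (page + 1 if has_more else None)

    return fetch_page


records = [f"entry-{i}" for i in range(7)]
collected = list(iter_all_results(make_fake_fetch(records), page_size=3))
print(len(collected))  # 7
```

A smaller page_size only changes how many requests are made, not how many results come back.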
get_info
@api_call
def get_info(uniprot_id,
consolidate_refs=True,
get_variations=True,
get_interactions=True,
get_mutagenesis=True,
get_isoforms=True)
gather all the information about a specific entry described by the UniProt ID
Arguments:
uniprot_id: UniProt accession
consolidate_refs: whether to consolidate all the references from the different sections into a single list
get_variations: whether to call the variations api
get_interactions: whether to call the interactions api
get_mutagenesis: whether to call the mutagenesis api
get_isoforms: whether to call the isoforms api
get_features
def get_features(results, feature_types=None)
filter already extracted features by type, this just filters the features from the json response
Arguments:
feature_types: type of the feature to filter by
Returns:
the features
get_comments
def get_comments(results, types=None)
get already extracted comments from the json response
Arguments:
types: comment types to filter by
Returns:
comments
Interactions Objects
class Interactions()
__init__
def __init__(uniprot)
query the UniProt API for interaction data
Arguments:
uniprot: UniProt class
Isoforms Objects
class Isoforms()
__init__
def __init__(uniprot)
query the UniProt API for isoform data. Not all proteins have isoforms, and there will be warnings if none are found
Arguments:
uniprot: UniProt class
Mutagenesis Objects
class Mutagenesis()
__init__
def __init__(uniprot)
query the UniProt API for mutagenesis data. This is different from variations: these are not variations seen in the wild but ones that come from experimental data
Arguments:
uniprot: UniProt class
others
BioGrid Objects
class BioGrid()
__init__
def __init__(access_key)
Initialize the BioGrid class with the provided access key.
Arguments:
access_key: you can get one from https://webservice.thebiogrid.org/
interactions
@api_call
def interactions(gene_list, evidence_types=None, organism=None)
Get the interactions for the given gene list.
Arguments:
gene_list: list of genes
id_types: the type of the identifier, e.g. "entrez", "uniprot", "ensembl"
evidence_types: see self.evidence_types
Returns:
a pandas dataframe with the interactions and kinds of evidences that support them
IntAct Objects
class IntAct()
intact_search
@api_call
def intact_search(ebi_id, page=0)
search intact database
Arguments:
ebi_id: EBI identifier to search for
page: which page to start from, default 0. This is more of a precaution for very large searches: if you lose the connection you can resume from the last page you got data from
Returns:
a dataframe of all interactions found
AlphaGenome Objects
class AlphaGenome()
__init__
def __init__(access_key)
Create an AlphaGenome object, which is used to query the alphagenome api. Unlike the other api wrappers, its methods do not return an api_call dataclass instance; depending on the method, a variant, a genomic_range or a dataframe is returned
Arguments:
access_key: your alphagenome api key, you can get one from their website.
predict_variant
def predict_variant(variants,
interval_size="SEQUENCE_LENGTH_2KB",
organism="human")
predict the consequences of a given list of variants. This does not mean you can pass a whole vcf file to it, but a few dozen variants at a time is no problem.
Arguments:
variants: list of variant objects, they do not need to have annotations
interval_size: which interval to consider, default 2KB
organism: which organism to consider, default human; the only other option is mouse
Returns:
benchmate.Variant.SequenceVariant instances, the same ones passed to the function but with annotations
predict_sequence
def predict_sequence(sequences,
ontology_terms,
interval_size="SEQUENCE_LENGTH_2KB",
output_types=None,
organism="human")
predict features of a list of sequences; if you have only one, pass it as [sequence]
Arguments:
sequences: list of benchmate.sequences.Sequence objects
ontology_terms: which ontology terms to use; if you do not specify any, all of them are used
interval_size: interval size to consider, default 2KB, but it needs to be longer than your sequence
output_types: see self.ouput_types, or get them all (if None)
organism: which organism to consider, default human; the only other option is mouse
Returns:
a list of benchmate.sequences.Sequence objects, the same ones with the features property filled in
predict_interval
def predict_interval(granges,
ontology_terms,
interval_size="SEQUENCE_LENGTH_2KB",
output_types=None,
organism="human")
predict annotations for one or more genomic intervals
Arguments:
granges: a list of granges or a granges list object; if you have only one grange, pass it as a list [grange]
ontology_terms: which ontology terms to use
interval_size: interval size to consider, default 2KB, it needs to be longer than len(grange)
output_types: see above
organism: see above
Returns:
a list of granges, with annotations
mutagenesis
def mutagenesis(granges,
scorers,
mutagenesis_region=None,
interval_size="SEQUENCE_LENGTH_2KB",
output_types=None,
organism="human")
Perform in-silico mutagenesis for all the sequences in the range you provided
Arguments:
granges: list of granges
scorers: list of scorers, see self.scorers
interval_size: which interval size to consider, default 2KB, it needs to be longer than len(grange)
mutagenesis_region: which region of the sequence to mutate extensively; this needs to be shorter than your interval size. The method picks the center of the range and takes mutagenesis_region/2 on each side
Returns:
a dataframe of scores, or a list of dataframes of scores if you picked more than one scorer. If you get greedy and ask for everything at once, the server might kick you out.
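The way mutagenesis_region is placed around the center of a range can be worked through with plain integers. This is a sketch only: mutagenesis_window and interval_size_bp are hypothetical names, the coordinate convention (integer start and end, floor division for the center) is an assumption, and 2048 is used purely as an illustrative interval length.

```python
from typing import Tuple


def mutagenesis_window(
    start: int, end: int, mutagenesis_region: int, interval_size_bp: int
) -> Tuple[int, int]:
    """Return (window_start, window_end) centered on the middle of [start, end)."""
    if mutagenesis_region <= 0:
        raise ValueError("mutagenesis_region must be positive")
    if mutagenesis_region >= interval_size_bp:
        raise ValueError("mutagenesis_region must be shorter than the interval size")
    center = (start + end) // 2
    half = mutagenesis_region // 2
    # An odd mutagenesis_region is rounded down by one base because of floor division.
    return center - half, center + half


window = mutagenesis_window(start=1000, end=1200, mutagenesis_region=20, interval_size_bp=2048)
print(window)  # (1090, 1110)
```

Keeping the window small limits how many positions are mutated, which in turn limits the size of the request.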
rnacentral
RnaCentral Objects
class RnaCentral()
get_information
@api_call
def get_information(id: str,
get_xrefs: bool = True,
get_publications: bool = True)
Get information about a specific RNAcentral entry.
Arguments:
id: rnacentral identifier
get_xrefs: whether to get cross-references from other databases
get_publications: whether to get publications related to the entry, these are returned as pubmed ids
Returns:
a dictionary containing information about the RNAcentral entry
stringdb
StringDb Objects
class StringDb()
__init__
def __init__()
constructor for StringDb class
Arguments:
name: some sort of identifier for the protein; UniProt ids, gene names and gene name synonyms are supported
species: species id for the protein, default is human; you can get the taxonomy id from NCBI
network_depth: how deep to go in the network, default is 1. If more than 1, every result is searched again for the next depth; this increases the time it takes to get the network, and the number of queries grows exponentially
gather
@api_call
def gather(species, name, get_network=False, network_depth=1)
gather all the information about a specific entry
Arguments:
species: which species; this is to disambiguate, since homologs can have the same name across species
name: name of the query
get_network: whether to get the interactors of interactors
network_depth: depth of the network, this makes the queries grow exponentially
Returns:
a dictionary of results, if the network depth is greater than one, under the “network” key you will see other entries
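Why the number of queries grows so quickly with network_depth can be seen in a small breadth-first sketch. Everything here is hypothetical: gather_network and fake_interactors are illustrative names, and the fake backend simply gives every protein three made-up interactors instead of calling STRING.

```python
from typing import Callable, Dict, List


def gather_network(
    name: str, get_interactors: Callable[[str], List[str]], network_depth: int = 1
) -> Dict[str, List[str]]:
    """Query a protein, then its interactors, and so on for network_depth levels."""
    network: Dict[str, List[str]] = {}
    frontier = [name]
    for _ in range(network_depth):
        next_frontier: List[str] = []
        for protein in frontier:
            if protein in network:
                continue  # already queried, do not spend another request
            partners = get_interactors(protein)  # one request per protein
            network[protein] = partners
            next_frontier.extend(p for p in partners if p not in network)
        frontier = next_frontier
    return network


calls: List[str] = []


def fake_interactors(protein: str) -> List[str]:
    """Hypothetical backend: every protein has exactly three interactors."""
    calls.append(protein)
    return [f"{protein}.{i}" for i in range(3)]


queries_per_depth = []
for depth in (1, 2, 3):
    calls.clear()
    gather_network("P", fake_interactors, network_depth=depth)
    queries_per_depth.append(len(calls))
print(queries_per_depth)  # [1, 4, 13]
```

With three interactors per protein the request count is 1, 4, 13 for depths 1 to 3; real networks are far denser, so a depth above 2 can become slow very quickly.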
ncbi
Ncbi Objects
class Ncbi()
__init__
def __init__(access_key=None, email=None, collect_info=False)
Arguments:
access_key: NCBI API key, you can get one from https://www.ncbi.nlm.nih.gov/account/settings/
email: you can also use your email address; if neither is provided the searches will be limited and there will be stricter rate limits
search
@api_call
def search(db, query, retmax=100)
thin wrapper around the NCBI Entrez esearch
Arguments:
db: the database to search, use show_databases to see available databases
query: the query string, this can be anything that can be typed into the NCBI search bar
retmax: maximum number of results to return, 10000 is the api max
Returns:
a list of ncbi ids matching the query from that database. The ids are only unique within a database, so another database can contain a different item with the same id
summary
@api_call
def summary(db, id)
thin wrapper around the NCBI Entrez esummary
Arguments:
db: db name
id: id to get summary for, you can get the ids from the search function
Returns:
list of summary records
fetch
@api_call
def fetch(db, id)
thin wrapper around the NCBI Entrez efetch
Arguments:
db: database name
id: id to fetch
Returns:
list parsed from the xml
show_databases
def show_databases()
show available databases
Returns:
a list of strings of database names, these strings can be used in other functions
get_db_info
def get_db_info(db)
get database info
Arguments:
db: name of the database from show_databases
Returns:
list of parameters and how they can be searched
ensembl
Ensembl Objects
class Ensembl()
Ensembl API wrapper for the Ensembl REST API.
__init__
def __init__()
Initialize the Ensembl API wrapper. A few basic variables are set here and there is nothing for the user to configure: the base url is the Ensembl REST API url, the dataset is the one used for the queries, and the headers are those sent with the queries.
variation
@api_call
def variation(id,
method=None,
species="human",
pubtype=None,
add_annotations=False)
Get variation information from the Ensembl REST API.
Arguments:
id: variant id
method: search method; the default None returns the variant information, otherwise you can search for publications (pmid and pmcid) or use translation, which converts the notation to other notations
species: species to search for, default is human
pubtype:
Returns:
returns a detailed dict with the variation information depending on the parameters described above
vep
@api_call
def vep(species, variant, tools, check_existing=True)
Get variant effect prediction from the Ensembl REST API.
Arguments:
species: species to search for
variant: variant to search for, must be a Variant object
tools: tools to use for the prediction, default is None which means we will just return basic information
check_existing: check population frequencies from gnomad and 1kg
Returns:
a detailed dict with the variant effect prediction; not all tools are compatible with all variants or with each other
phenotype
@api_call
def phenotype(grange, species="human")
Get phenotype information from the Ensembl REST API that is associated with the genomic range.
Arguments:
grange: a GenomicRange object
species: species to search for, default is human
Returns:
a dictionary with the phenotype information
sequence
@api_call
def sequence(id,
trim_end=None,
trim_start=None,
expand_3=None,
expand_5=None,
sequence_type="genomic")
Get sequence information from the Ensembl REST API for a given Ensembl id
Arguments:
id: Ensembl id, because the ids also specify the species you do not need to specify the species
trim_end: trim this many nucleotides from the end
trim_start: trim this many nucleotides from the start
expand_3: expand this many nucleotides from the 3’ end, not compatible with trim_end
expand_5: expand this many nucleotides from the 5’ end, not compatible with trim_start
sequence_type: genomic, cds, protein or cdna
Returns:
sequence of the thing that is requested, depending on the type this can be genomic sequence, cds sequence, protein sequence or cdna sequence, multiple sequences are returned as a dataframe
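The incompatibilities listed for the trimming and expansion arguments can be checked before any request is sent. This is a hedged sketch: check_sequence_options is a hypothetical helper that only mirrors the wrapper's argument names, it is not the wrapper's own validation and does not map the options to REST parameters.

```python
from typing import Dict, Optional

ALLOWED_TYPES = {"genomic", "cds", "cdna", "protein"}


def check_sequence_options(
    trim_end: Optional[int] = None,
    trim_start: Optional[int] = None,
    expand_3: Optional[int] = None,
    expand_5: Optional[int] = None,
    sequence_type: str = "genomic",
) -> Dict[str, object]:
    """Validate the option combination and return only the options that were set."""
    if sequence_type not in ALLOWED_TYPES:
        raise ValueError(f"sequence_type must be one of {sorted(ALLOWED_TYPES)}")
    if expand_3 is not None and trim_end is not None:
        raise ValueError("expand_3 is not compatible with trim_end")
    if expand_5 is not None and trim_start is not None:
        raise ValueError("expand_5 is not compatible with trim_start")
    options = {
        "trim_end": trim_end,
        "trim_start": trim_start,
        "expand_3": expand_3,
        "expand_5": expand_5,
        "sequence_type": sequence_type,
    }
    return {key: value for key, value in options.items() if value is not None}


valid = check_sequence_options(trim_start=10, expand_3=50)
print(valid)  # {'trim_start': 10, 'expand_3': 50, 'sequence_type': 'genomic'}
```

Trimming one end while expanding the other is accepted; trimming and expanding the same end is rejected.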
xrefs
@api_call
def xrefs(id, species="human", external=False)
Get cross references from the Ensembl REST API for a given Ensembl id
Arguments:
id: Ensembl id, because the ids also specify the species you do not need to specify the species
Returns:
a dict of cross references; these can be used to look up the corresponding ids in other databases through other apis
mapping
@api_call
def mapping(id, start, end, type="cDNA")
Get mapping information from the Ensembl REST API for a given Ensembl id, convert between cDNA, CDS and protein
Arguments:
id: Ensembl id, because the ids also specify the species you do not need to specify the species
start: start position of the range
end: end position of the range
type: type of mapping, cDNA, CDS or protein
Returns:
dict of mapping information; this is not really compatible with genomicranges, which is why the inputs are different
overlap
@api_call
def overlap(grange, features=None, species="human")
Get overlap information from the Ensembl REST API for a given genomic range. This can be used to get the features that lie within a region of interest. The features can be specified as a list of strings; if no features are specified, all features will be returned.
Arguments:
grange: a GenomicRange object
features: features to get, default is None which means all features will be returned
species: species to search for, default is human
Returns:
a dict of overlap information, this is a dict of dicts where the keys are the features and the values are the genomic features
homology
@api_call
def homology(id,
type="orthologues",
target_species=None,
source_species="human")
Get homology information from the Ensembl REST API for a given ensembl id, this can be used to get orthologues and paralogues
Arguments:
id: ensembl id, because the ids also specify the species you do not need to specify the species
type: type of homology, orthologues or paralogues
target_species: target species to get the homology for, if None all species will be returned
source_species: source species to get the homology for, default is human
Returns:
a dict of homology information
info
def info()
Get information from the Ensembl REST API, this returns general information about the api,
used to get an idea of what is available in the api.
Returns:
divisions, species and consequences that are available in the api
utils
api_call
def api_call(func)
add metadata to an api call and return the apicall dataclass instance instead of just a dict
Arguments:
func: function to be decorated
Returns:
a wrapper function, this will return an ApiCall instance with all information about the api call
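The general pattern of a decorator that wraps a plain result in a metadata-carrying dataclass can be sketched as follows. This is not the library's api_call or ApiCall: api_call_sketch, ApiCallSketch and every field name are hypothetical, chosen only to illustrate the idea.

```python
import functools
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Tuple


@dataclass
class ApiCallSketch:
    """Hypothetical record of one call: what was called, with what, and the result."""

    function: str
    args: Tuple[Any, ...]
    kwargs: Dict[str, Any]
    results: Any
    called_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def api_call_sketch(func: Callable[..., Any]) -> Callable[..., ApiCallSketch]:
    """Decorate func so it returns an ApiCallSketch instead of the bare result."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> ApiCallSketch:
        results = func(*args, **kwargs)
        return ApiCallSketch(
            function=func.__name__, args=args, kwargs=kwargs, results=results
        )

    return wrapper


@api_call_sketch
def get_details(id: str) -> Dict[str, str]:
    # Stand-in for a real request; returns a fixed placeholder payload.
    return {"id": id, "name": "placeholder"}


call = get_details("placeholder-id")
print(call.function, call.args, call.results)
```

Because the arguments are stored next to the results, the same call can later be replayed, which is the idea behind a rerun method.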
ApiCall Objects
@dataclass
class ApiCall()
Stores metadata and results of an API call. This is to make it easier to track api calls for knowledge base construction.
rerun
def rerun(access_key=None, email=None)
rerun the api call with the same parameters, useful if the api call failed or if you want to update the results
Arguments:
access_key: if the api requires an access key like alphagenome or biogrid
email: if the api requires an email like NCBI
Returns:
an updated ApiCall instance
chunks
@cached_property
def chunks(path="root", max_chunk_chars: int = 1000)
chunks an api response; the chunks will be used for semantic search
Arguments:
max_chunk_chars: maximum number of characters per chunk, relevant for larger values that contain text
Returns:
list of chunks with path of the dict starting with root
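One way to produce chunks that carry their path from root is a recursive walk over the response. This is a sketch under assumptions: chunk_response is a hypothetical function, and the dot and bracket path syntax and the fixed-size splitting of long text are illustrative choices, not necessarily what the chunks property does.

```python
from typing import Any, Dict, List


def chunk_response(
    data: Any, path: str = "root", max_chunk_chars: int = 1000
) -> List[Dict[str, str]]:
    """Walk a nested response and emit {'path', 'text'} chunks for every leaf."""
    if isinstance(data, dict):
        chunks: List[Dict[str, str]] = []
        for key, value in data.items():
            chunks.extend(chunk_response(value, f"{path}.{key}", max_chunk_chars))
        return chunks
    if isinstance(data, list):
        chunks = []
        for index, value in enumerate(data):
            chunks.extend(chunk_response(value, f"{path}[{index}]", max_chunk_chars))
        return chunks
    text = str(data)
    # Long leaf values are cut into pieces of at most max_chunk_chars characters.
    pieces = [text[i:i + max_chunk_chars] for i in range(0, len(text), max_chunk_chars)]
    return [{"path": path, "text": piece} for piece in pieces or [""]]


response = {"gene": "ABC", "comments": ["aaaaabbbbbcc"]}
chunks = chunk_response(response, max_chunk_chars=5)
for chunk in chunks:
    print(chunk["path"], chunk["text"])
```

Each chunk keeps the path to the value it came from, so a search hit can be traced back to its place in the original response.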
flat
@cached_property
def flat()
Flatten JSON response into a single summary string. This will be used for tsvector in full text search
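Flattening a nested response into one summary string can be sketched as below. flatten_response is a hypothetical helper, and the "key: value" format with "; " separators is an assumption made for illustration, not the format the flat property actually produces.

```python
from typing import Any, List


def flatten_response(data: Any) -> str:
    """Collapse a nested JSON-like structure into a single 'key: value; ...' string."""
    parts: List[str] = []

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else str(key))
        elif isinstance(value, list):
            for child in value:
                walk(child, path)  # list items share the path of their parent key
        elif value is not None:
            parts.append(f"{path}: {value}" if path else str(value))

    walk(data, "")
    return "; ".join(parts)


response = {"gene": "ABC", "xrefs": [{"db": "X", "id": 1}, {"db": "Y", "id": None}]}
summary = flatten_response(response)
print(summary)  # gene: ABC; xrefs.db: X; xrefs.id: 1; xrefs.db: Y
```

A single string like this can be handed to a full text index as is; missing values are skipped so they do not add noise to the indexed text.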