Extract all homonym homologs (HHs) from a UniProt accession cluster file for a given list of source accessions.
By default, the HH mapping result is grouped by NCBI Tax ID and printed to STDOUT. Clusters can be read from a file or from STDIN.
Clusters file format:
taccession,taxID \t taccession,taxID \t ...
taccession,taxID ...
...
Each line represents one HH cluster where taccession (str) is a possible UniProt accession to map to and taxID (int) is the NCBI Tax ID for that protein.
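As an illustration of the cluster line format above, the following minimal sketch splits one line into (accession, taxID) pairs. The helper name parse_cluster_line and the accessions in the example are hypothetical and not part of hhmap.

```python
from typing import List, Tuple


def parse_cluster_line(line: str) -> List[Tuple[str, int]]:
    """Split one tab-separated cluster line into (accession, taxID) pairs.

    Hypothetical helper for illustration; not part of the hhmap API.
    """
    pairs = []
    for field in line.rstrip("\n").split("\t"):
        if not field:
            continue  # tolerate empty fields, e.g. from a trailing tab
        # Split on the last comma: the tax ID is always the final element.
        acc, tax = field.rsplit(",", 1)
        pairs.append((acc.strip(), int(tax)))
    return pairs


# Made-up accessions, used only to show the shape of the data.
cluster = parse_cluster_line("P12345,9606\tQ9XYZ1,10090\tP54321,9606\n")
```

The tax ID is converted to int here because the format description declares it as an integer.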
Source accessions file format:
saccession
saccession
...
Each line represents one source protein; accessions do not have to be from the same organism (taxID). If a source accession is found in a cluster, all other accessions in that cluster are treated as possible targets. If the search is limited to one or more taxIDs, only accessions with those taxIDs are retrieved. If the exclude flag is used (see hhmap.mapper.extract()), accessions in a cluster with the same taxID as the source are skipped.
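The selection rules described above can be sketched roughly as follows. This is not the hhmap implementation; the function name find_targets, its signature, and the example accessions are assumptions made for illustration, and clusters are assumed to be already parsed into (accession, taxID) pairs.

```python
from collections import defaultdict


def find_targets(clusters, sources, taxa=None, exclude=False):
    """Sketch of the mapping rules; hypothetical, not hhmap's own code.

    clusters: iterable of iterables of (accession, taxID) pairs
    sources:  iterable of source accessions
    taxa:     optional taxIDs to restrict targets to (None or empty = all)
    exclude:  skip targets that share the source's taxID
    """
    sources = set(sources)
    wanted = set(taxa) if taxa else None
    result = {}
    for cluster in clusters:
        members = list(cluster)
        for src_acc, src_tax in members:
            if src_acc not in sources:
                continue
            by_tax = result.setdefault(src_acc, defaultdict(list))
            for acc, tax in members:
                if acc == src_acc:
                    continue  # a source is not its own target
                if wanted is not None and tax not in wanted:
                    continue  # outside the requested taxa
                if exclude and tax == src_tax:
                    continue  # same organism as the source
                by_tax[tax].append(acc)
    return {src: dict(targets) for src, targets in result.items()}


# Made-up clusters for illustration.
clusters = [
    [("A1", 9606), ("B1", 10090), ("C1", 9606)],
    [("D1", 7227), ("E1", 10090)],
]
all_targets = find_targets(clusters, ["A1"])
mouse_only = find_targets(clusters, ["A1", "D1"], taxa=[10090])
other_species = find_targets(clusters, ["A1"], exclude=True)
```

With these inputs, all_targets maps A1 to both B1 and C1, mouse_only keeps only the targets with taxID 10090, and other_species drops C1 because it shares taxID 9606 with the source.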
Output is printed (to STDOUT) as:
"source" \t taxID \t taxID \t ...
saccession \t taccession,taccession,... \t taccession,... \t ...
saccession \t taccession,...
...
If ungrouped output is requested (see the ungroup flag for hhmap.mapper.extract()), the output has only two columns and no header:
saccession \t taccession,taccession,taccession,...
saccession \t ...
...
Note
Output is also ungrouped if the taxa argument of hhmap.mapper.extract() is None.
Provides the main entry function biocreative.hhmap.extract().
Parses clusters by searching for all mappings for the sources (limited to a set of taxa and/or by excluding the taxon of the source accession) and prints the found mappings per source on one line.
The found (target) accessions are grouped by taxID or can be printed ungrouped (all together in one comma-separated list). Maps to all found accessions if taxa is an empty list or None, and prints ungrouped if taxa is None.
The clusters should be an iterator over each cluster, returning another iterator for the taccession,taxID strings.
Parameters: |
|
---|
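One way to build the nested iterator that the clusters argument expects is a small generator over a file-like object. This is a sketch under the assumption that each line is tab-separated as described in the clusters file format; the name iter_clusters is hypothetical.

```python
import io


def iter_clusters(handle):
    """Yield one iterator of 'taccession,taxID' strings per cluster line.

    Hypothetical helper; hhmap may construct this iterator differently.
    """
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            yield iter(line.split("\t"))


# A StringIO stands in for an open clusters file or sys.stdin.
handle = io.StringIO("A1,9606\tB1,10090\nD1,7227\n")
clusters = [list(cluster) for cluster in iter_clusters(handle)]
```

Because the generator is lazy, large cluster files do not have to be held in memory at once.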
Provides a collections.namedtuple() for Accession and a dict to hold the Mapping.
Bases: tuple
A namedtuple with the attributes acc (str) and tax (int).
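A namedtuple with these two attributes could be declared as in the sketch below. The field names acc and tax come from the description above; the example accession is made up.

```python
from collections import namedtuple

# Same field names as documented: acc (str) and tax (int).
Accession = namedtuple("Accession", ["acc", "tax"])

target = Accession(acc="P12345", tax=9606)

# Being a tuple, an Accession can also be unpacked positionally.
acc, tax = target
```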
Bases: dict
Mapping data structure:
{ src_acc: { tax_id: [ trgt_acc, trgt_acc, ... ], ... }, ... }
Add a new target Accession for the source currently set via set_source().
Parameter: | target (Accession) – to add |
---|
Set the source accession for the add_target() method.
Parameter: | source (str) – source accession |
---|
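The interplay of set_source() and add_target() with the nested dictionary layout can be sketched as follows. The class MappingSketch is a hypothetical stand-in written for illustration, not the actual hhmap.data.Mapping; only the two method names and the data layout are taken from the descriptions above.

```python
from collections import namedtuple

Accession = namedtuple("Accession", ["acc", "tax"])


class MappingSketch(dict):
    """Hypothetical stand-in: {src_acc: {tax_id: [trgt_acc, ...]}}."""

    def __init__(self):
        super().__init__()
        self._current = None  # target dict of the active source

    def set_source(self, source):
        # Reuse the existing target dict if the source was seen before.
        self._current = self.setdefault(source, {})

    def add_target(self, target):
        if self._current is None:
            raise RuntimeError("call set_source() before add_target()")
        self._current.setdefault(target.tax, []).append(target.acc)


mapping = MappingSketch()
mapping.set_source("A1")
mapping.add_target(Accession("B1", 10090))
mapping.add_target(Accession("C1", 10090))
mapping.add_target(Accession("D1", 9606))
```

Raising an error when no source has been set is a design choice of this sketch; the real class may behave differently.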
Get a list of all source accessions.
Returns: | source accessions (str) |
---|---|
Return type: | list |
Get a list of all target dictionaries.
Returns: | targets (as { int(tax_id): [ str(acc), ... ], ... }) |
---|---|
Return type: | list |
Get a (unique) list of all target taxIDs.
Returns: | all tax IDs (int) to which a mapping has been made |
---|---|
Return type: | list |
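Given the nested dictionary layout, the three lists described above could be derived roughly like this. A plain dict is used in place of the Mapping class, the accessions are made up, and sorting the tax IDs is an assumption of this sketch (the source only states that the list is unique).

```python
# Plain dict standing in for a Mapping instance.
mapping = {
    "A1": {10090: ["B1"], 9606: ["C1"]},
    "D1": {10090: ["E1"]},
}

# All source accessions.
sources = list(mapping)

# All target dictionaries, one per source.
targets = list(mapping.values())

# Unique tax IDs across all sources; sorted here only for stable output.
tax_ids = sorted({tax for by_tax in mapping.values() for tax in by_tax})
```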
Functions that hhmap.extract() uses to create HH mappings: first find_mappings(), then print_mappings().
Extract mappings (limited by taxa) from clusters for the given source accessions.
If taxa is None, all possible mappings are retrieved. If exclude is True, mappings to accessions with the same taxID as the source are skipped.
Parameters: |
|
---|---|
Returns: | the found mappings |
Return type: | hhmap.data.Mapping |
Print found mappings for the list of taxa as tab-separated values.
The first column contains the source accession, all following columns contain mapped accessions for a specific tax ID per column. If header is False, the header line with the tax IDs for the target taxa columns is omitted. If taxa is None or an empty list, mappings are printed ungrouped and the header is not printed.
Parameters: |
|
---|
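The grouped and ungrouped output layouts can be sketched as below. This is an illustration only: the function name print_mappings_sketch, its signature, and the out parameter are assumptions, and the header label is printed without quotes, which may differ from the real output.

```python
import io
import sys


def print_mappings_sketch(mapping, taxa=None, header=True, out=None):
    """Hypothetical sketch of the tab-separated output; not hhmap's code."""
    out = sys.stdout if out is None else out
    if not taxa:
        # Ungrouped: two columns, no header.
        for src, by_tax in mapping.items():
            accs = [acc for targets in by_tax.values() for acc in targets]
            print(src, ",".join(accs), sep="\t", file=out)
        return
    if header:
        print("source", *taxa, sep="\t", file=out)
    for src, by_tax in mapping.items():
        # One column per requested tax ID; empty if nothing was mapped.
        cols = [",".join(by_tax.get(tax, [])) for tax in taxa]
        print(src, *cols, sep="\t", file=out)


mapping = {"A1": {10090: ["B1"], 9606: ["C1", "F1"]}}

buf = io.StringIO()
print_mappings_sketch(mapping, taxa=[9606, 10090], out=buf)
grouped = buf.getvalue()

buf = io.StringIO()
print_mappings_sketch(mapping, out=buf)
ungrouped = buf.getvalue()
```

Writing to a StringIO in the example keeps the result inspectable; omitting out prints to STDOUT as the documentation describes.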