Homonym Homolog Mapper Documentation

Extract all homonym homologs (HHs) from a UniProt accession cluster file for a given list of source accessions.

By default, the HH mapping result is grouped by NCBI Tax ID and printed to STDOUT. Clusters can be read from a file or from STDIN.

  • Created by: Florian Leitner on 2010-04-09.
  • Copyright: 2010, Florian Leitner. All rights reserved.
  • License: GNU Public License, latest version.

File Formats

Input File Format

Clusters file format:

taccession,taxID \t taccession,taxID \t ...
taccession,taxID ...
...

Each line represents one HH cluster where taccession (str) is a possible UniProt accession to map to and taxID (int) is the NCBI Tax ID for that protein.
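As an illustration of the clusters format, the following minimal sketch parses one cluster line into (accession, taxID) pairs. The function name parse_cluster_line and the accessions ACC_A, ACC_B and ACC_C are hypothetical placeholders and not part of hhmap; splitting each item on its last comma is an assumption about how the fields are delimited.

```python
from typing import List, Tuple


def parse_cluster_line(line: str) -> List[Tuple[str, int]]:
    """Split one tab-separated cluster line into (accession, taxID) pairs.

    Hypothetical helper for illustration; not part of the hhmap package.
    """
    pairs = []
    for item in line.rstrip("\n").split("\t"):
        item = item.strip()
        if not item:
            continue  # tolerate stray tabs or trailing whitespace
        # Assumption: the taxID is whatever follows the last comma.
        acc, tax = item.rsplit(",", 1)
        pairs.append((acc.strip(), int(tax)))
    return pairs


# Placeholder accessions; 9606 and 10090 are the NCBI Tax IDs of human and mouse.
example = "ACC_A,9606\tACC_B,10090\tACC_C,9606\n"
print(parse_cluster_line(example))
# [('ACC_A', 9606), ('ACC_B', 10090), ('ACC_C', 9606)]
```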

Source accessions file format:

saccession
saccession
...

Each line represents one source protein; accessions do not have to be from the same organism (taxID). If a source accession is found in a cluster, all other accessions in that cluster are treated as possible targets. If the search is limited to one or more taxIDs, only accessions with those IDs are retrieved. If the excluding flag is set (see hhmap.extract()), accessions in a cluster with the same taxID as the source are skipped.
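The selection rules for a single cluster can be sketched as follows. This is an illustration only: select_targets and the SRC/TGT accessions are hypothetical names, the cluster is assumed to be already parsed into (accession, taxID) pairs, and treating an empty taxa collection like None (no limit) is an assumption.

```python
def select_targets(source, cluster, taxa=None, exclude=False):
    """Return the target accessions for `source` within one parsed cluster.

    Hypothetical helper; `cluster` is a list of (accession, taxID) pairs.
    """
    src_tax = dict(cluster).get(source)
    if src_tax is None:
        return []  # the source is not a member of this cluster
    targets = []
    for acc, tax in cluster:
        if acc == source:
            continue  # only the *other* accessions are targets
        if taxa and tax not in taxa:
            continue  # search limited to the given taxIDs
        if exclude and tax == src_tax:
            continue  # skip targets from the source's own taxon
        targets.append(acc)
    return targets


# 9606, 10090 and 10116 are the NCBI Tax IDs of human, mouse and rat.
cluster = [("SRC1", 9606), ("TGT1", 9606), ("TGT2", 10090), ("TGT3", 10116)]

print(select_targets("SRC1", cluster))
# ['TGT1', 'TGT2', 'TGT3']
print(select_targets("SRC1", cluster, taxa={9606, 10090}))
# ['TGT1', 'TGT2']
print(select_targets("SRC1", cluster, taxa={9606, 10090}, exclude=True))
# ['TGT2']
```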

Output Format

Output is printed (to STDOUT) as:

"source" \t taxID \t taxID \t ...
saccession \t taccession,taccession,... \t taccession,... \t ...
saccession \t taccession,...
...

If ungrouped output is requested (see the ungrouped flag for hhmap.extract()), the output has only two columns and no header:

saccession \t taccession,taccession,taccession,...
saccession \t ...
...

Note

Output is also ungrouped if taxa for hhmap.extract() is None.
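To make the two layouts concrete, here is a minimal sketch that renders a nested mapping as grouped or ungrouped lines. It is not the package's print_mappings(); format_mappings is a hypothetical name. Three details are assumptions: the first header cell is the literal word source without quotes, a taxon with no targets yields an empty cell, and ungrouped targets are listed in ascending taxID order.

```python
from typing import Dict, List, Optional


def format_mappings(
    mapping: Dict[str, Dict[int, List[str]]],
    taxa: Optional[List[int]] = None,
) -> List[str]:
    """Render {source: {taxID: [targets]}} as tab-separated output lines."""
    lines = []
    if taxa is None:
        # Ungrouped: two columns (source, all targets as one CSV list), no header.
        for src, by_tax in mapping.items():
            targets = [acc for tax in sorted(by_tax) for acc in by_tax[tax]]
            lines.append(src + "\t" + ",".join(targets))
        return lines
    # Grouped: a header row, then one column of targets per requested taxID.
    lines.append("\t".join(["source"] + [str(t) for t in taxa]))
    for src, by_tax in mapping.items():
        cells = [",".join(by_tax.get(t, [])) for t in taxa]
        lines.append("\t".join([src] + cells))
    return lines


# Placeholder accessions for illustration.
mapping = {"SRC1": {9606: ["TGT1", "TGT2"], 10090: ["TGT3"]}}
print("\n".join(format_mappings(mapping, taxa=[9606, 10090])))
print("\n".join(format_mappings(mapping)))
```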

biocreative.hhmap

Provides the main entry function biocreative.hhmap.extract().

extract()

biocreative.hhmap.extract(sources, clusters, taxa=None, excluding=True, ungrouped=False)

Parses clusters by searching for all mappings for the sources (limited to a set of taxa and/or by excluding the taxon of the source accession) and prints the found mappings per source on one line.

The found (target) accessions are grouped by taxID or can be printed ungrouped (all together in one CSV list). Maps to all found accessions if taxa is an empty list or None, and prints ungrouped if taxa is None.

The clusters argument should be an iterator over clusters, where each cluster is itself an iterator over taccession,taxID strings.

Parameters:
  • sources (list) – lookup accessions (str) for mappings
  • clusters (iter()) – a cluster iterator
  • taxa (list or None) – taxIDs (int) to map to
  • excluding (bool) – exclude mappings to the same taxon as the source
  • ungrouped (bool) – do not split target accessions by taxa in output
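One way to prepare the sources and clusters arguments from the file formats described earlier is sketched below. The helper names iter_clusters and read_sources are hypothetical and not part of the package, and skipping blank lines is an assumption.

```python
import io
from typing import Iterable, Iterator, List


def iter_clusters(handle: Iterable[str]) -> Iterator[List[str]]:
    """Yield each cluster as a list of 'taccession,taxID' strings."""
    for line in handle:
        items = [item.strip() for item in line.split("\t")]
        items = [item for item in items if item]
        if items:  # assumption: blank lines carry no cluster
            yield items


def read_sources(handle: Iterable[str]) -> List[str]:
    """Collect one source accession per non-empty line."""
    return [line.strip() for line in handle if line.strip()]


# In-memory stand-ins for the two input files (placeholder accessions).
clusters_file = io.StringIO("SRC1,9606\tTGT1,10090\nSRC2,10090\tTGT2,9606\n")
sources_file = io.StringIO("SRC1\nSRC2\n")

sources = read_sources(sources_file)
clusters = iter_clusters(clusters_file)
print(sources)
print(list(clusters))
```

The same iter_clusters() helper would accept sys.stdin in place of a file handle, and the resulting values could then be passed as the sources and clusters arguments of extract().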

biocreative.hhmap.data

Provides a collections.namedtuple() for Accession and a dict to hold the Mapping.

Accession

class biocreative.hhmap.data.Accession

Bases: tuple

A namedtuple with the attributes acc (str) and tax (int).
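A namedtuple with these two attributes can be sketched as below. This is a stand-in built from the description above, not the package's own definition, and the accession ACC_A is a placeholder.

```python
from collections import namedtuple

# Stand-in with the documented attribute names: acc (str) and tax (int).
Accession = namedtuple("Accession", ["acc", "tax"])

a = Accession("ACC_A", 9606)

# Fields are available by name or by tuple unpacking.
print(a.acc, a.tax)
acc, tax = a

# As a tuple subclass it is hashable, so duplicates collapse in a set.
unique = {a, Accession("ACC_A", 9606)}
print(len(unique))
```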

Mapping

class biocreative.hhmap.data.Mapping

Bases: dict

Mapping data structure:

{ src_acc: { tax_id: [ trgt_acc, trgt_acc, ... ], ... }, ... }

add_target(target)

Add a new target Accession for the source currently set with set_source().

Parameter: target (Accession) – to add

set_source(source)

Set the source accession for the add_target() method.

Parameter: source (str) – source accession

sources()

Get a list of all source accessions.

Returns: source accessions (str)
Return type: list

targets()

Get a list of all target dictionaries.

Returns: targets (as { int(tax_id): [ str(acc), ... ], ... })
Return type: list

taxa()

Get a (unique) list of all target taxIDs.

Returns: all tax IDs (int) to which a mapping has been made
Return type: list
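The documented interface could be realized roughly as in the following sketch. MappingSketch is a hypothetical stand-in written from the method descriptions above, not the package's actual class; raising an error when no source has been set and returning taxa() in sorted order are assumptions.

```python
from collections import namedtuple

Accession = namedtuple("Accession", ["acc", "tax"])


class MappingSketch(dict):
    """{ src_acc: { tax_id: [ trgt_acc, ... ], ... }, ... }"""

    def __init__(self):
        super().__init__()
        self._current = None  # target dict of the currently set source

    def set_source(self, source):
        # Reuse the existing entry if the source was seen before.
        self._current = self.setdefault(source, {})

    def add_target(self, target):
        if self._current is None:
            raise RuntimeError("call set_source() before add_target()")
        self._current.setdefault(target.tax, []).append(target.acc)

    def sources(self):
        return list(self.keys())

    def targets(self):
        return list(self.values())

    def taxa(self):
        seen = set()
        for by_tax in self.values():
            seen.update(by_tax)  # collects the taxID keys
        return sorted(seen)


m = MappingSketch()
m.set_source("SRC1")
m.add_target(Accession("TGT1", 9606))
m.add_target(Accession("TGT2", 9606))
m.add_target(Accession("TGT3", 10090))
print(m)
print(m.sources(), m.taxa())
```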

biocreative.hhmap.mapper

Functions hhmap.extract() uses to create HH mappings: first find_mappings(), then print_mappings().

find_mappings()

biocreative.hhmap.mapper.find_mappings(accessions, clusters, taxa=None, exclude=False)

Extract mappings (limited by taxa) from clusters for the given source accessions.

If taxa is None, all possible mappings are retrieved. If exclude is True, mappings to accessions with the same taxID as the source are skipped.

Parameters:
  • accessions (list) – lookup (source) accessions (str) for mappings
  • clusters (iter()) – a cluster iterator - see hhmap.extract()
  • taxa (set or None) – taxIDs (int) to map to (all if None)
  • exclude (bool) – exclude mappings to the same taxon as the source
Returns: the found mappings
Return type: hhmap.data.Mapping
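The lookup described for find_mappings() can be approximated by the sketch below, which walks all clusters and accumulates targets per source and taxID. It is not the package's implementation: find_mappings_sketch is a hypothetical name, a plain nested dict stands in for hhmap.data.Mapping, and treating an empty taxa set like None is an assumption.

```python
def find_mappings_sketch(accessions, clusters, taxa=None, exclude=False):
    """Return {source: {taxID: [target, ...]}} for the given sources."""
    wanted = set(accessions)
    result = {}
    for cluster in clusters:
        # Parse the 'taccession,taxID' strings of this cluster once.
        members = []
        for item in cluster:
            acc, tax = item.rsplit(",", 1)
            members.append((acc.strip(), int(tax)))
        for src, src_tax in members:
            if src not in wanted:
                continue
            by_tax = result.setdefault(src, {})
            for acc, tax in members:
                if acc == src:
                    continue  # a source is never its own target
                if taxa and tax not in taxa:
                    continue  # limited by taxa
                if exclude and tax == src_tax:
                    continue  # same taxon as the source
                by_tax.setdefault(tax, []).append(acc)
    return result


# Placeholder accessions; each inner list is one cluster.
clusters = [
    ["SRC1,9606", "TGT1,9606", "TGT2,10090"],
    ["SRC2,10090", "TGT3,9606"],
]
print(find_mappings_sketch(["SRC1", "SRC2"], clusters))
print(find_mappings_sketch(["SRC1", "SRC2"], clusters, exclude=True))
```

In this sketch a source that is found in a cluster but has no surviving targets still appears in the result with an empty dictionary.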