Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative II.5

Evaluation library (Resources) [2009-12-17]

This is the current version of the BioCreative evaluation library, including a command line tool to use it. The current, official version is 3.2; to see the version of the script you have installed, use the command line option --version: bc-evaluate --version. If you have reason to believe that there is a bug in the tool or the library, or if you have any other questions related to it, please contact the author, Florian Leitner.

To update any former version you have installed, you only need to download the file, unpack it, and install it from the command line by typing sudo python install in the directory where you unpacked the download. To check that the update worked, type bc-evaluate --version; it should show the correct version (see above).

Updates for major version 3

The following changes have been applied to the initial library (3.0):
  • 3.1 to 3.2: Minor change: A bug that prevented ACT evaluations from working properly was fixed, thanks to Simon Hafner, who found the bug and contributed a fix.
  • 3.0 to 3.1: A new, combined score between F-measure and Average Precision, the FAP-score, has been added for the INT, IMT, and IPT evaluations. The FAP-score is implemented as the harmonic mean between the F-measure and the Average Precision, just as the F-measure is the harmonic mean between precision and recall. This combined score makes it possible to order annotation results with respect to classification performance (F-measure) and ranking performance (AP) at the same time, while it requires no cutoff or threshold. Minor change: All failures when running the script now produce a new warning that should help understand input (format) errors more readily.
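
The FAP-score described above can be sketched in a few lines. This is an illustration of the harmonic-mean definition only, not the library's own code; the function names are hypothetical, and returning 0.0 when both inputs are zero is an assumption made here to avoid a division by zero.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores.

    Assumption: returns 0.0 if both scores are zero.
    """
    if a + b == 0:
        return 0.0
    return 2.0 * a * b / (a + b)


def f_measure(precision, recall):
    # Balanced F-measure: harmonic mean of precision and recall.
    return harmonic_mean(precision, recall)


def fap_score(f, average_precision):
    # FAP-score: harmonic mean of the F-measure and Average Precision.
    return harmonic_mean(f, average_precision)


f = f_measure(0.5, 0.5)
print(fap_score(f, 0.5))
```

Because the harmonic mean is dominated by the smaller value, a result with a good F-measure but poor ranking (or the reverse) gets a low FAP-score.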

New major version 3 (November 2011)

The interpolated (ranking) evaluations have been removed, as interpolation contributed little to the overall score and proved too controversial (use the old version 2 library to run interpolated ranking evaluations). For the article classification task (ACT), the rank evaluation is now based on AUC PR (i.e., without interpolation). For all other tasks, the precision and recall across all articles are calculated at each rank, and the ranking score reported is the Average Precision (AP). AP is a geometric approximation of AUC PR using ∑ p ∗ ∆r, calculated from all articles with deltas taken at each rank, while the AUC PR is now calculated exactly, using:

A(PR) := ∑ [p(i) + p(i-1)] ∗ [r(i) - r(i-1)] ÷ 2 ∧ p(0) = 1, r(0) = 0

Here, A(PR) is the AUC PR, p is precision, and r is recall. Initial precision is set to one, initial recall to zero. Precision and recall are calculated at each rank i. In summary, the old AUC PR implementation is now termed Average Precision (AP), without any interpolation, and is only used for INT, IMT, and IPT. The new AUC PR implementation uses the exact formula to measure the area and is used for ACT. The plot outputs (option "--plot") have been adapted accordingly.
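
To make the difference between the two ranking scores concrete, here is a minimal sketch that computes both from one ranked list. It is not the library's implementation: the function names are hypothetical, and the input (a single list of hit/miss flags ordered by rank, plus the number of gold-standard positives) is a simplifying assumption, whereas the library aggregates over all articles at each rank.

```python
def pr_points(ranked_hits, total_positives):
    """Return a list of (precision, recall) pairs, one per rank i = 1..n."""
    points = []
    true_positives = 0
    rank = 0
    for hit in ranked_hits:
        rank += 1
        if hit:
            true_positives += 1
        precision = true_positives / float(rank)
        recall = true_positives / float(total_positives)
        points.append((precision, recall))
    return points


def average_precision(points):
    # AP: sum of p(i) * [r(i) - r(i-1)], with r(0) = 0.
    ap, prev_r = 0.0, 0.0
    for p, r in points:
        ap += p * (r - prev_r)
        prev_r = r
    return ap


def auc_pr(points):
    # A(PR): sum of [p(i) + p(i-1)] * [r(i) - r(i-1)] / 2,
    # with p(0) = 1 and r(0) = 0 (trapezoids between successive ranks).
    area, prev_p, prev_r = 0.0, 1.0, 0.0
    for p, r in points:
        area += (p + prev_p) * (r - prev_r) / 2.0
        prev_p, prev_r = p, r
    return area


points = pr_points([True, False, True], total_positives=2)
print(average_precision(points))  # 5/6, about 0.833
print(auc_pr(points))             # 19/24, about 0.792
```

In this example the two scores differ because AP weights each recall step by the precision at the current rank only, while the trapezoid formula averages the precision at the current and previous rank.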

For the protein normalization (INT), interaction detection (IPT), and method extraction (IMT) task evaluations, the defaults (no additional options) no longer exclude articles without annotations. Overall, this means the default evaluation measures are now stricter for all tasks. To run an evaluation using the default behavior from before version 3, use the flag "-e" (exclude documents with no annotations). As interpolation has been removed, there is no longer any way to (re-)produce AUC iP/R results. If you require this functionality, use the last pre-3 release (version 2.3.1).

Tabulated output (option "-t") now prints a header row, and the standard deviations for each macro-averaged result have been added to the tabulated results. The short-hand flags for the evaluation types ("-a" for ACT, "-n" for INT, "-p" for IPT, "-m" for IMT) have been removed as they created more confusion than utility. Please use "--ACT", "--IPT" and "--IMT" now (INT is the default task and does not need to be indicated).

Updates for major version 2

The following changes have been applied to the initial library (2.0):

  • 2.0 to 2.1: adding a command line option for the method detection task (IMT, for BioCreative III) - although this is the same evaluation as the normalization task (INT).
  • 2.1 to 2.2: changes the strict column behavior: formerly, you were only allowed to supply the exact number of columns (document ID, annotation, rank, confidence) as used in BC II.5, while now the result file may contain any number of additional columns, e.g., the evidence string used in BioCreative III. To get the old (strict) behavior back, use the new command line option -s/--strict, in which case additional columns will be shown as error messages.
  • 2.2 to 2.3: bugfix in the plotting function (thanks to Sérgio Matos for reporting it!), more informative output when an input file is missing.
  • 2.3 to 2.3.1: [minor, not required updates only] local installation (user-only) now explained twice in the README as people were missing it; verbose (default) output for ACT evaluation now prints the filename, too (same as verbose IMT/INT/IPT output).


This library is used to produce results using the official BioCreative evaluation functions. The rank evaluation score is calculated from the AUC (area under curve) of the precision/recall (P/R) curve for article classification, and from the Average Precision (AP) for the other tasks. The classification evaluation score is calculated from the Matthews correlation coefficient (MCC) for ACT and from the F-measure for all other tasks. In addition, a new score, the FAP-score, is calculated as the harmonic mean of AP and F-measure, to evaluate the overall performance with respect to both ranking and classification. The library provides various additional performance calculations: specificity, sensitivity, and accuracy for ACT; precision and recall, plus the standard deviations of the macro-averaged versions of each measure, for all other tasks. If you wish to use the library code (license: GPL, latest version), please consult the inline API documentation in the source code.
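
As an illustration of the ACT classification measures named above, the following sketch computes MCC, sensitivity, specificity, and accuracy from the four cells of a confusion matrix using their standard definitions. The function names are hypothetical, this is not the library's API, and returning 0.0 whenever a denominator is zero is an assumption made for this example.

```python
from math import sqrt


def _ratio(numerator, denominator):
    # Assumption for this sketch: an undefined ratio is reported as 0.0.
    if denominator == 0:
        return 0.0
    return numerator / float(denominator)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return _ratio(tp * tn - fp * fn, denominator)


def sensitivity(tp, fn):
    return _ratio(tp, tp + fn)   # true positive rate (recall)


def specificity(tn, fp):
    return _ratio(tn, tn + fp)   # true negative rate


def accuracy(tp, fp, tn, fn):
    return _ratio(tp + tn, tp + fp + tn + fn)


# 4 true positives, 1 false positive, 4 true negatives, 1 false negative
print(mcc(4, 1, 4, 1))
print(sensitivity(4, 1), specificity(4, 1), accuracy(4, 1, 4, 1))
```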


You will need to have a working version of Python 2.5 (or 2.6, 2.7) installed to use this package. It relies only on the standard library modules that are part of any Python base distribution, unless you want to use the plotting functionality; in that case, you need to install matplotlib, too.

To run the evaluation after installing the library (see the included README.txt file), you can call it from the command line:

bc-evaluate -h

The -h (or --help) flag explains the parameters and options; in-depth explanations can be found by using -d (or --documentation). The tool can evaluate the results for all tasks (ACT, INT, IMT, and IPT) by using the corresponding option flag: --ACT, --INT, --IMT, or --IPT. The default is --INT, which does not need to be specified if a protein/gene normalization evaluation is wanted.


The tool allows you to explore your results in more detail than just the official evaluation function. The main arguments when using the library with the command line tool (bc-evaluate) are:

  1. one or more result files as tab-separated plain-text files (see the --documentation option of the tool itself for detailed explanations about the format of result files), and
  2. the corresponding task's gold standard annotations (i.e., either the training or test set annotations from a BioCreative corpus).

You can download and install the Python source packages of this script for all operating systems. Please have a look at the README file for instructions on how to install this library if you have never used a Python package.
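
If you want to inspect a result file yourself before evaluating it, a rough sketch of a reader is shown below. It assumes the four-column, tab-separated layout mentioned in the version 2.2 notes (document ID, annotation, rank, confidence) and ignores any further columns; the authoritative format description is the one printed by the --documentation option. The function name and the sample identifiers are hypothetical.

```python
def read_results(lines):
    """Group (rank, confidence, annotation) tuples by document ID.

    Assumed columns: document ID, annotation, rank, confidence.
    Additional columns (e.g., an evidence string) are ignored.
    """
    results = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # this sketch silently skips blank or short lines
        doc_id, annotation, rank, confidence = fields[:4]
        hits = results.setdefault(doc_id, [])
        hits.append((int(rank), float(confidence), annotation))
    for hits in results.values():
        hits.sort()  # order each document's annotations by rank
    return results


# Hypothetical sample data; a real file would be passed as open(path).
sample = [
    "doc1\tP12345\t2\t0.4\n",
    "doc1\tQ67890\t1\t0.9\textra evidence column\n",
    "doc2\tP12345\t1\t0.7\n",
]
for doc_id, hits in sorted(read_results(sample).items()):
    print(doc_id, hits)
```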

Homonym ortholog mapping

For the homonym ortholog mapping, please use the files provided in the gold standard, the BioCreative II.5 Elsevier corpus. The homonym ortholog mapping files for this corpus are found in the directories biocreative_II.5_elsevier_corpus/training_set/vocabulary and biocreative_II.5_elsevier_corpus/test_set/vocabulary, respectively, called "uniprot_15.0_homonym_orthologs.tsv" in both cases.

If you would like to use another gold standard with this library (which is no problem as long as you keep to the file formats) and want to do homonym ortholog mapping for that gold standard, you can download a set of homonym identity clusters (50% protein sequence identity) for UniProt r15.0 from the BC II.5 Elsevier corpus page. Then use this script to extract mappings, in the format the evaluation library uses, for a list of source proteins (i.e., a list of the UniProt accessions used in your gold standard).

Organism Filtering

The organism filtering data is in the download "UniProt 15.0 acc2tax mapping + hom. orth. clusters" on the same page, in the file accession2tax_id.tsv. Note that you do not need the file homonym_identity_clusters.tsv found there unless you want to build your own H.O. mapping files (analogous to the "uniprot_15.0_homonym_orthologs.tsv" files explained above) for another gold standard.

Example Filtering and Mapping Usage

For example, if you have your own protein normalization result file for a BC corpus that has the default format (with ranks and confidence scores) - let's say you called it "normalization_result.tsv" - you could evaluate it with the following command:

bc-evaluate --INT \
--ho test_set/vocabulary/uniprot_15.0_homonym_orthologs.tsv \
--of uniprot_15.0/accession2tax_id.tsv \
normalization_result.tsv test_set/annotations/normalizations.tsv

Or, in more compact notation (without the full file paths and names): "bc-evaluate --INT --ho=ho_file --of=acc2tax_file result_file gs_file". In general, please consult the help (option --help) and documentation (option --documentation) provided by the bc-evaluate executable for all possible options and details about the result file formats.