Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative II

FAQ and guidelines [2006-03-31]

1 How are the articles chosen by the interaction databases?
In general, two main article selection strategies are used by MINT and IntAct. One is based on exhaustive full curation of all articles from a predefined collection of peer-reviewed journals. The other is topic-based, for example according to pathways, protein types, or species. For this competition we consider the first type of article selection.
2 How many protein mentions of interacting proteins cannot be mapped from the articles to a protein identifier by the database curators?
Practically all proteins can be mapped to database identifiers, although the difficulty or time required for the manual mapping can vary a lot. In less than 5 percent of cases this is not possible; those cases are not entered in the database. In some cases, if the UniProt ID is not available for the given organism, we infer the identifier from another organism; a comment then reports this "abuse".
3 How do curators deal with organism source ambiguity of a given protein mention?
They use all kinds of information provided in the article to unambiguously identify the organism source of the proteins. The curators sometimes have to use the cell lines described in the article as a clue for organism source disambiguation (e.g. through the CABRI database).
4 Are figures considered by the database curators to derive their annotations in the case of MINT and IntAct?
Yes, they are used, as they often provide experimental evidence information. Both figures and figure legends may be used for annotation purposes. In the case of the BioCreative contest test set, interactions that were only apparent from a table or figure were not used.
5 Are tables considered by the database curators to derive their annotations in the case of MINT and IntAct?
Yes. For some large-scale interaction experiments (and depending on the interaction detection method), tables are often used to extract annotations.
6 What kind of article document is used by the curators to read and detect the interaction annotations?
For regular database annotation, mainly HTML and PDF files are used, in both electronic and printed form.
7 How are the protein interaction evidence sentences extracted?
After carefully reading the whole article, including legends and additional material, the curators mainly cut and paste the best evidence sentence for a given protein interaction.
8 Is the extracted evidence sentence for a given protein interaction pair the overall best?
This depends, of course, on the curator's interpretation, and there may be cases where several sentences are equally good evidence passages. For some interaction pairs, several sentences expressing the protein interaction have been extracted.
9 Are there cases where, in a given phrase or sentence, evidence is provided for more than one protein interaction pair?
Yes, there are cases where a given text passage contains interaction evidence for several protein interaction pairs.
10 Is the additional material section considered for regular annotation?
Yes, the curators use everything provided for a given publication to extract their annotations confidently. The curators sometimes take the additional material section into consideration; in those cases this is flagged.
11 Is it possible that in a given article multiple methods for detecting protein interaction are used?
Yes, this can certainly happen. Note that not all proteins in a given article are necessarily studied with all the mentioned protein interaction detection methods. For instance, proteins A, B, C and D could be studied with detection method X, but only A and B subsequently studied with detection method Y.
12 Is the annotation of protein interactions organism-dependent in the case of these two databases?
In principle, no. These databases curate interactions for any organism and are not restricted to a single model organism or to human proteins.
13 Are there cases where the protein interaction is between two proteins from different organisms (e.g. protein A from mouse and protein B from human) ?
Yes, there are such cases, although they are not very common.
14 Is there a size limit of the evidence sentence for protein interactions?
Most of the evidence sentences extracted by the annotators have less than 250 characters.
15 Which character encoding will be used for mapping the predicted evidence sentences to the curated evidence sentences?
In principle we expect to use Unicode character encoding.
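As an illustration, predicted evidence sentences could be compared against curated ones after Unicode normalization. The sketch below is a minimal example in Python; NFC normalization and whitespace collapsing are our own assumptions, not stated organizer policy.

```python
import unicodedata

def normalize(text: str) -> str:
    # Assumption: NFC normalization plus whitespace collapsing is a
    # reasonable canonical form; the organizers only state that a
    # Unicode encoding will be used.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def matches(predicted: str, curated: str) -> bool:
    # Exact match after normalization; the official scoring may be
    # more lenient than this.
    return normalize(predicted) == normalize(curated)

# A non-breaking space (U+00A0) no longer breaks the comparison:
print(matches("GAL4\u00a0interacts with GAL80.", "GAL4 interacts with GAL80."))
```

This catches only encoding-level mismatches; fuzzier matching (e.g. punctuation stripping) would be a further design choice.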
16 Are there cases of large scale protein interaction experiments on the test set articles?
No; most of the test set articles have fewer than 30 interactions.
17 Should I consider the very large scale experiment articles in the training set?
We recommend NOT using them: the test set contains no large-scale experiment papers, so including them could bias your system. As a cut-off on the total number of interactions per article (for the training set), we recommend using only articles with fewer than 21 interactions.
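The recommended cut-off amounts to a one-line filter over the training collection. In the sketch below, the article identifiers and interaction counts are made up purely for illustration.

```python
# Hypothetical counts of curated interactions per training article
# (article identifiers and numbers are invented for illustration).
interactions_per_article = {
    "article_A": 12,
    "article_B": 45,   # looks like a large-scale experiment
    "article_C": 3,
    "article_D": 21,   # exactly at the cut-off, so excluded
}

CUTOFF = 21  # keep only articles with fewer than 21 interactions

training_subset = {
    article: count
    for article, count in interactions_per_article.items()
    if count < CUTOFF
}

print(sorted(training_subset))  # article_B and article_D are dropped
```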
18 How should I deal with the mapping between splice variants and the master entry of UniProt (normalization step) ?
You should not worry about splice variants and the mapping to UniProt master entries. This is not a very common problem (less than approximately 5 percent of cases), and for the test set it will be handled by the evaluation group.
19 Did the two interaction databases, MINT and IntAct, perform an inter-curator agreement study?
Yes, they performed a comparative annotation study to ensure that both databases were following the same curation standards and data model. This study was done on 5 full-text articles related to yeast proteins.
20 Are there cases where the article authors actually use wrongly the terms (incorrect terminology usage)?
Yes, but only in a few cases. We call this incorrect (confused) term usage 'jargon term usage by authors'. We estimate that less than 2 percent of cases are affected. An example would be the use of 'pull down' instead of 'co-immunoprecipitation' when referring to an experiment. This sometimes happens due to incorrect terminology usage in sub-domains such as virology. These experiments are mapped by the curators to the correct controlled vocabulary term based on the experiment description in the article and the citation reference of the method used. In the test set there are no such cases.
21 Could there be a term overlap (the same term used for different concepts within the controlled vocabulary hierarchy)?
There can be an overlap between the synonyms of some concepts of the controlled vocabulary (but this is very rare).
22 Which spelling of the controlled vocabulary terms is used (e.g. US spelling or UK spelling)?
In the case of the Gene Ontology, US spelling is used. For the PSI-MI vocabulary we are not completely sure.
23 Do the curators sometimes take into account the references provided in an article for the interaction detection experiment?
Yes, there are cases where the reference of the experimental method used to detect the protein interaction is taken into account (back reference). Note that for concepts in MI an external reference (PMID) is provided, corresponding to the article describing this method.
24 Can I use also additional resources despite the provided training data?
Yes, sure. You can use any additional data resource available. You should nevertheless specify them in the system description paper for the evaluation workshop.
25 What is the level of expertise of the database curators of MINT and IntAct?
They have a Ph.D. or at least a Master's degree in Molecular Biology or related disciplines and are highly trained and experienced curators.
26 How long does it take for a curator to annotate an article?
This varies a lot depending on the database, the journals, and the articles. On average, throughput ranges from 1 to 4 papers per curator per day.
27 Which is the format used by MINT and IntAct for their annotation entries?
They use a standard called the PSI-MI format. You should review this standard format for protein interaction annotation. Refer to Hermjakob et al. (2004), PMID:14755292, and the latest version of the standard, described at: http://psidev.sourceforge.net/mi/rel25
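For orientation, PSI-MI 2.5 is an XML format. The fragment below is a deliberately over-simplified stand-in (element names and nesting here are illustrative assumptions; the real schema uses namespaces, identifiers, and controlled-vocabulary references, so consult the specification linked above), parsed with Python's standard library.

```python
import xml.etree.ElementTree as ET

# Deliberately simplified stand-in for a PSI-MI 2.5 record; the element
# names below are illustrative assumptions, not the real schema.
DOC = """
<entrySet>
  <entry>
    <interactionList>
      <interaction>
        <participant><shortLabel>GAL4</shortLabel></participant>
        <participant><shortLabel>GAL80</shortLabel></participant>
      </interaction>
    </interactionList>
  </entry>
</entrySet>
"""

root = ET.fromstring(DOC)
pairs = []
for interaction in root.iter("interaction"):
    # Collect the short labels of all participants in this interaction.
    partners = [p.findtext("shortLabel") for p in interaction.iter("participant")]
    pairs.append(partners)

print(pairs)
```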
28 Do the curators extract interactions between a protein and a protein family?
No, the extracted interactions are based on individual proteins which can be mapped to database entries.
29 What are common naming ambiguities/difficulties encountered for the interaction partner proteins?
In addition to the difficulty of linking a protein name to the corresponding organism source, other aspects that complicate the linking process are: protein name and protein family name ambiguity, and the fact that authors often refer to nucleic acid regions using the same name as for the protein.
30 What is the frequency of update of the data contained in the interaction databases?
The IntAct database is updated weekly. However, each entry is probably only updated twice per year, normally maintenance updates of the syntax rather than of the content of the entry.
31 How do the curators deal with cases where the authors refer to a protein by the name of a homologous protein?
There are cases where the authors do not use the official or common name of a given protein, or where the corresponding database entry is not complete enough to cover the protein name mentioned by the authors. In these cases the curators sometimes use a bioinformatics approach, based on protein sequence similarity searches against the homologous protein that does carry the name used in the article. Example: the authors mention 'murine protein ZZZ', but no protein ZZZ is found for mouse in the protein database, while a human protein ZZZ does exist. Using sequence similarity searches, the curators retrieve a mouse protein that shares significant similarity with the human ZZZ protein. Based on the sequence similarity, the database record of this protein, and its description in the article, the expert curator can decide whether they are the same protein.
32 What kind of protein-protein interactions are curated in MINT and IntAct?
The interaction type is given as an attribute of the interaction. According to PSI-MI 2.5, MINT and IntAct curate colocalizations and physical interactions (and all their children). Generally, physical interactions with experimental evidence shown in the paper are curated.
33 Are symmetric or asymmetric relations considered in the case of the protein interactions?
Both are considered; the experimental role of the proteins can be asymmetric.
34 Are all the interaction types annotated?
The interaction type is given as an attribute of the interaction. Generally, physical interactions with experimental evidence shown in the paper are curated. You should be careful with genetic interactions. In some cases the genetic interactions mentioned in articles are not curated because they are not trustworthy and the interaction is not direct (e.g. one protein activates another protein, but through a signaling cascade with intermediate proteins in between). Genetic interactions are not curated on a regular basis.
35 Will the test set collection follow the annotation standards used by IntAct /MINT databases?
Yes, they will follow their annotation standards.
36 Can I also use additional resources other than those provided by the BioCreative organizers to develop/construct my system?
Participating teams are not restricted to using only the provided training sets to develop their systems for the Protein-Protein Interaction (PPI) task, so this is not a 'closed' task restricted to a particular training collection. Nevertheless, we will ask participants who submit results for the test set predictions to provide a short system description, including a mention of the additional resources they used, in order to allow comparative evaluation and to see which approaches are successful.