BioCreative - Track 3: Genetic Phenotype Extraction and Normalization from Dysmorphology Physical Examination Entries (genetic conditions in pediatric patients)

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VIII

Track 3: Genetic Phenotype Extraction and Normalization from Dysmorphology Physical Examination Entries (genetic conditions in pediatric patients) [2023-01-22]

A. Task Motivation

The dysmorphology physical examination is a critical component of the diagnostic evaluation in clinical genetics. This process catalogues often minor morphological differences of the patient's facial structure or body, but it may also identify more general medical signs such as neurologic dysfunction. The findings enable correlation of the patient with known rare genetic diseases. They therefore directly influence clinical diagnosis, the selection of genetic testing, and the interpretation of results, particularly when testing reveals variants of uncertain clinical significance. Beyond the clinic, such information is also useful to researchers attempting to delineate undescribed genetic conditions or to further our understanding of existing ones.

Although these medical findings are key information, they are nearly always captured within the electronic health record (EHR) as unstructured free text, making them unavailable for downstream computational analysis. Advanced Natural Language Processing methods are therefore required to retrieve the information from the records.

B. Task Definition: Automatic extraction and normalization of genetic conditions in dysmorphology physical examination reports

For the BioCreative VIII shared task, we call for automated systems to extract and normalize the key findings in observations written during dysmorphology physical examinations.

Dysmorphology physical examinations are frequently documented in the EHR as a series of organ system observations. For example:

PHYSICAL EXAMINATION
FACE: slightly inverted triangular face shape
EYES: long palpebral fissures with slight downslant. Sparse lateral eyebrows.
EARS: Thin inferior helices, low-set
NOSE: Short, wide nasal bridge. Anteverted nares.
MOUTH: thin upper lip; palate intact
CHEST: supernumerary nipple inferior to left nipple
HANDS FEET: Long fingers, normal toes
NEUROLOGIC: Resting tremor. Wide-based, unsteady gate.

Similar to clinical workflows, we will standardize the description of dysmorphic findings using the Human Phenotype Ontology, an ontology specially designed for human genetics.

A successful system should extract the spans of text referring to the key positive findings and normalize them to term IDs in the HPO ontology. The system should ignore the normal findings. For example, given the organ system observation "EYES: long palpebral fissures with slight downslant. Normal eyebrows.", a system should extract the spans of the two key findings, [long palpebral fissures] and [palpebral fissures with slight downslant], and normalize them to the term IDs HP:0000637 and HP:0000494, respectively. The system should ignore the normal finding [Normal eyebrows].

During the competition, participants will be able to perform one of the following subtasks:
- Subtask 3a: Given an observation, participants of this subtask will be required to submit the HPO term IDs of all key findings mentioned in the observation.
- Subtask 3b: Given an observation, participants will be required to submit the spans of the key findings and their corresponding HPO term IDs.

Training set: 1716 de-identified observations with key and normal findings manually annotated and normalized with their corresponding HPO terms
Validation/Development set: 454 de-identified observations with key and normal findings manually annotated and normalized with their corresponding HPO terms
Test set: 966 de-identified observations with key and normal findings manually annotated and normalized with their corresponding HPO terms + 2427 decoy observations
Evaluation metric:

Subtask 3a: Standard Precision, Recall and F1 score
Subtask 3b: Strict and Overlapping Precision, Recall and F1 score
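
The two scoring schemes can be sketched as follows. This is only an illustration under stated assumptions, not the official evaluation script: it assumes that a Subtask 3a prediction is an (observation ID, HPO term ID) pair, that a "strict" Subtask 3b match requires the same term ID and identical spans, and that an "overlapping" match requires the same term ID and at least one shared character. The function names `prf`, `score_3a`, and `score_3b` are hypothetical.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[int, int]                        # (start, end), end exclusive
Finding = Tuple[str, str, Tuple[Segment, ...]]   # (observation ID, HPO ID, segments)


def prf(true_positives: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_3a(gold: Iterable[Tuple[str, str]], pred: Iterable[Tuple[str, str]]):
    """Subtask 3a (assumed): compare sets of (observation ID, HPO ID) pairs."""
    gold_set, pred_set = set(gold), set(pred)
    return prf(len(gold_set & pred_set), len(pred_set), len(gold_set))


def segments_overlap(a: Iterable[Segment], b: Iterable[Segment]) -> bool:
    """True if any segment of a shares at least one character with one of b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def score_3b(gold: Iterable[Finding], pred: Iterable[Finding], strict: bool = True):
    """Subtask 3b (assumed): each gold finding can be matched at most once."""
    gold_list, pred_list = list(gold), list(pred)
    unmatched: List[Finding] = list(gold_list)
    true_positives = 0
    for obs_id, hpo_id, segments in pred_list:
        for candidate in unmatched:
            if candidate[0] != obs_id or candidate[1] != hpo_id:
                continue
            if strict:
                spans_ok = tuple(candidate[2]) == tuple(segments)
            else:
                spans_ok = segments_overlap(candidate[2], segments)
            if spans_ok:
                unmatched.remove(candidate)  # safe: we leave the loop right away
                true_positives += 1
                break
    return prf(true_positives, len(pred_list), len(gold_list))
```

Under these assumptions, a prediction with the right term ID but a truncated span counts as an error in strict mode and as a hit in overlapping mode, which is the distinction the two Subtask 3b scores are meant to expose.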

Baseline systems: Multiple systems are available to the participants and can be adapted or extended to address our task, e.g. doc2HPO, NeuralCR, PhenoTagger, PhenoBERT, and txt2HPO
Codalab: link
Registration link (Note: after registration, we communicate and release the data through dedicated Google Groups; please make sure to check your spam folder)
Participating teams are required to submit a paper describing the system(s) they ran on the test data. Sample system descriptions can be found in previous years' proceedings, e.g. here. Participating teams are also required to review at least one system description paper from another participating team.
Contact information: Davy Weissenbacher (davy.weissenbacher@cshs.org)

Important dates: (tentative)

Training data available: April 11, 2023
Test data available: September 15, 2023
System predictions for test data due: September 18, 2023
Short technical systems description paper due: October 10, 2023
Paper acceptance notification: October 20, 2023
Camera ready: October 27, 2023

C. Challenges

Both steps, extraction and normalization, are particularly difficult on dysmorphology physical examinations given the current state of the art in natural language processing.

Extraction. This step is challenging due to the descriptive style of the examinations and their polarity. The observations are short reports where, for conciseness, the span of a finding can be disjoint or can overlap with the span of another finding. The EYES observation in the task definition is an example of overlapping findings, with the span palpebral fissures contributing to both the HP:0000637 and HP:0000494 terms. For disjoint findings, i.e. findings defined by non-consecutive segments of text, consider the term Short nasal bridge - HP:0003194 in the observation NOSE: Short, wide nasal bridge. Anteverted nares. Extractors should go beyond the standard sequence labeling approach, which is designed to extract contiguous and mutually exclusive named entities and therefore fails to capture disjoint and overlapping terms. As an additional challenge, the extractor should also resolve the polarity of the findings, that is, automatically detect and ignore normal findings and return only the key positive findings.
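
The disjoint and overlapping cases can be made concrete with character offsets. The sketch below is illustrative only: it represents each finding as a list of (start, end) segments with an exclusive end, and it treats "wide nasal bridge" as a second finding in the same observation, which is an assumption made for the example. The helper names are hypothetical.

```python
from collections import Counter

text = "NOSE: Short, wide nasal bridge. Anteverted nares."


def segment(source: str, phrase: str):
    """(start, end) offsets of the first occurrence of phrase, end exclusive."""
    start = source.index(phrase)
    return (start, start + len(phrase))


# One finding = one list of segments; more than one segment means disjoint.
findings = {
    "Short nasal bridge": [segment(text, "Short"), segment(text, "nasal bridge")],
    "wide nasal bridge (assumed second finding)": [segment(text, "wide nasal bridge")],
}


def is_disjoint(segments) -> bool:
    """A finding is disjoint when it needs several non-consecutive segments."""
    return len(segments) > 1


def shared_characters(all_findings) -> list:
    """Character positions that belong to more than one finding."""
    counts = Counter(
        i for segments in all_findings.values() for s, e in segments for i in range(s, e)
    )
    return sorted(i for i, n in counts.items() if n > 1)


shared = shared_characters(findings)
print(is_disjoint(findings["Short nasal bridge"]))
print(repr(text[shared[0]:shared[-1] + 1]))
```

A tagger that assigns exactly one label per token cannot mark the shared characters as part of both findings, nor link "Short" to the later segment, which is why a richer output representation is needed.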

Normalization. This step is also challenging, due to both the large scale of the HPO ontology and its incompleteness. Standard strategies for multi-label classification are designed to assign small sets of classes to input instances. To be successful in our task, however, a normalizer should adapt these strategies to assign one term from among the 17,000 terms in the HPO to each finding detected in an observation. This must frequently occur without supervision, since our training set does not provide examples of use for all terms in the HPO. Furthermore, while specifically designed for human genetics and constantly improving, the HPO does not have standardized levels of term detail. As a consequence, a key finding may need to be matched with a close ancestor in the hierarchy of the ontology, which makes strict string matching ineffective because the label of the ancestor in the HPO differs from the text of the key finding in the observation. For example, there exist both Nevus flammeus of the eyelid - HP:0010733 and Nevus flammeus of the forehead - HP:0007413, but no term for the nose, leaving only the generic Nevus flammeus - HP:0001052 to normalize this abnormality of the nose when it is mentioned in an observation.
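
To illustrate why exact string matching falls short, here is a deliberately naive sketch of backing off to a more generic term. It is not one of the baseline systems and not a recommended method: it assumes a toy dictionary holding only the three labels quoted above (spelling harmonized to "Nevus"), and it picks the longest label whose words all occur in the mention. The names `HPO_LABELS`, `tokens`, and `normalize` are hypothetical.

```python
import re
from typing import Dict, Optional, Set

# Toy slice of the ontology: only the three terms quoted in the text,
# with the spelling harmonized to "Nevus" for the purpose of this example.
HPO_LABELS: Dict[str, str] = {
    "HP:0010733": "Nevus flammeus of the eyelid",
    "HP:0007413": "Nevus flammeus of the forehead",
    "HP:0001052": "Nevus flammeus",
}


def tokens(phrase: str) -> Set[str]:
    """Lower-cased alphabetic words of a phrase."""
    return set(re.findall(r"[a-z]+", phrase.lower()))


def normalize(mention: str, labels: Dict[str, str] = HPO_LABELS) -> Optional[str]:
    """Return the term whose label is fully contained in the mention.

    Among all candidate labels, the one with the most words wins, so a
    specific term is preferred and a generic one is the fallback.
    """
    mention_tokens = tokens(mention)
    best_id, best_size = None, 0
    for term_id, label in labels.items():
        label_tokens = tokens(label)
        if label_tokens <= mention_tokens and len(label_tokens) > best_size:
            best_id, best_size = term_id, len(label_tokens)
    return best_id


print(normalize("nevus flammeus of the nose"))
print(normalize("Nevus flammeus of the eyelid"))
```

A heuristic like this handles only lexical containment; findings worded differently from every HPO label, which are common in free-text observations, would still need a learned or embedding-based normalizer.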

D. Data description

Our dataset consists of 3,136 organ system observations extracted from dysmorphology physical examinations of 1,652 pediatric patients evaluated at the Children's Hospital of Philadelphia. Four physicians and one medical student annotated all mentions of key positive findings as well as normal findings in the observations. They assigned each finding to its most detailed and unambiguous term in the HPO ontology. To preserve patient privacy, we automatically de-identified the text using NLM Scrubber and manually reviewed its outputs during our annotation process. We double-annotated a subset of 890 observations in our corpus and found 76% complete agreement between the annotators.

For each observation, the publicly available dataset contains:
i. the observation ID, which uniquely identifies the observation in our corpus;
ii. the text of the observation, which always starts with an organ system followed by a free-text description of the findings;
iii. a term ID from the HPO ontology associated with a finding mentioned in the observation;
iv. the start and end positions of the spans of text denoting the finding;
v. the polarity of the finding: if the cell is empty, the finding is a key finding; if the cell contains an X, the finding is normal.

Observation ID	Text	HPO Term	Spans	Polarity
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000664	14-23
D433F04E6AD5E56	EYES: partial synophrys, long lashes, horizontal slant	HP:0000527	25-36
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000218	28-39
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000159	7-18	X
8A1EEF66A345576	MOUTH: normal lips, tongue, high palate	HP:0000157	7-13, 20-26	X
879246677902DE5	NEUROLOGIC: very active	NA	NA	NA

Note 1: If an observation mentions 2 or more findings, the observation is repeated 2 or more times, one finding per repetition, as shown in the table above (e.g. observation 8A1EEF66A345576). The test set will contain only the observation ID and the text of the observation.
Note 2: Mentions of findings can span multiple and discontinuous segments of text. When they do, we report the start and end positions of each segment in order of occurrence in the text, all segments separated by commas.
Note 3: The HPO ontology does not have terms to denote normal findings. To work around this limitation, whenever possible we normalized normal findings with the most generic terms of their corresponding key findings in the ontology and negated the findings; when this was not possible, i.e. no generic terms were available to normalize the findings, we normalized the finding with Not Available (NA).
Note 4: Some terms in the HPO are not observable during a dysmorphology physical examination and can be ignored by a normalizer. The list of irrelevant terms for the task will be shared with the registered participants.
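
As a reading aid for the table and notes above, the following sketch parses rows in this layout and rebuilds each mention from its spans. It rests on assumptions: the file is taken to be tab-separated with the header shown in the table, and the offsets are taken to be 0-based with an exclusive end, which is inferred from the sample rows (14-23 selects "synophrys") rather than stated by the organizers. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

# Three rows copied from the sample table; the tab-separated layout is assumed.
SAMPLE = (
    "Observation ID\tText\tHPO Term\tSpans\tPolarity\n"
    "8A1EEF66A345576\tMOUTH: normal lips, tongue, high palate\tHP:0000218\t28-39\t\n"
    "8A1EEF66A345576\tMOUTH: normal lips, tongue, high palate\tHP:0000157\t7-13, 20-26\tX\n"
    "879246677902DE5\tNEUROLOGIC: very active\tNA\tNA\tNA\n"
)


def parse_spans(field: str) -> List[Tuple[int, int]]:
    """Turn '7-13, 20-26' into [(7, 13), (20, 26)]; 'NA' or empty gives []."""
    field = (field or "").strip()
    if field in ("", "NA"):
        return []
    segments = []
    for part in field.split(","):
        start, end = part.strip().split("-")
        segments.append((int(start), int(end)))
    return segments


def mention_text(text: str, segments: List[Tuple[int, int]]) -> str:
    """Join the segments in order of occurrence, separated by a space."""
    return " ".join(text[start:end] for start, end in segments)


def read_findings(tsv: str) -> List[Dict[str, object]]:
    """One dict per annotated finding; observations without findings are skipped."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    findings = []
    for row in reader:
        segments = parse_spans(row["Spans"])
        if not segments:
            continue
        findings.append({
            "observation_id": row["Observation ID"],
            "hpo_id": row["HPO Term"],
            "mention": mention_text(row["Text"], segments),
            # Empty polarity cell = key finding, "X" = normal finding.
            "is_key": (row["Polarity"] or "").strip() != "X",
        })
    return findings


for finding in read_findings(SAMPLE):
    print(finding)
```

Keeping the segments as a list, rather than collapsing them into one start and end, preserves the discontinuous mentions described in Note 2.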

E. Task organizers

Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center, USA
Ian M. Campbell, Children's Hospital of Philadelphia, USA
Davy Weissenbacher, Cedars-Sinai Medical Center, USA
Xinwei Zhao, Children's Hospital of Philadelphia, USA