24.04.2014
Extracting trait information from digitized floras
Robert Hoehndorf (Aberystwyth University), Quentin Groom (Botanic Garden Meise), George Gosline (Royal Botanic Gardens Kew), Claus Weiland (Biodiversity and Climate Research Centre / Senckenberg), Thomas Hamann (Naturalis Biodiversity Center)
The aim of the Traits task group at the recent pro-iBiosphere Biodiversity Data Enrichment Hackathon was to extract plant trait data from digitized Floras (a Flora is a book that describes the plant life occurring in a particular region or time). We wanted to demonstrate the feasibility of using an ontology-based approach for extracting and integrating trait information from digitized Floras, even when the Floras are available in different languages. To this end, we addressed two main questions: (1) Can we automatically extract trait and phenotype information from Flora descriptions written in multiple languages (English and French)? (2) Can we represent and integrate the extracted trait and phenotype information semantically using an ontology-based approach?
Extracting structured information about traits and phenotypes from natural language descriptions is a common problem in mobilizing legacy biodiversity data. One tool that has been developed for this purpose is CharaParser [1], which is applied in the Phenoscape project [2] and integrated in the Phenex tool [3]. As the Flora descriptions in our use cases were written in both English and French, and CharaParser primarily supports English-language descriptions, we chose not to use CharaParser during the Hackathon. Instead, we followed a simple text-matching approach that is applicable to multiple languages. In particular, we identified mentions of plant anatomical entities (taken from the Plant Ontology [4]) and mentions of trait or phenotype terms (from the PATO ontology [5]) in the Flora descriptions. We used a dictionary to translate between French and English terms referring to plant anatomy or plant traits. In the future, we plan to use more complex approaches such as CharaParser to provide a more complete and accurate mark-up of anatomy and phenotype terms in Flora descriptions.
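The dictionary-based matching described above can be illustrated with a minimal sketch. It is not the code written at the Hackathon: the lexicon entries, the function name `find_mentions` and the naive plural handling are all assumptions made for illustration, and labels stand in for real Plant Ontology and PATO identifiers.

```python
import re

# Hypothetical bilingual lexicon: surface form -> (ontology, English label).
# A real lexicon would be derived from the Plant Ontology, PATO and a
# French-English dictionary; these few entries are illustrative only.
LEXICON = {
    "en": {
        "flower": ("PO", "flower"),
        "leaf": ("PO", "leaf"),
        "red": ("PATO", "red"),
        "green": ("PATO", "green"),
    },
    "fr": {
        "fleur": ("PO", "flower"),
        "feuille": ("PO", "leaf"),
        "rouge": ("PATO", "red"),
        "vert": ("PATO", "green"),
        "verte": ("PATO", "green"),
    },
}


def find_mentions(text, lang):
    """Return the ontology terms mentioned in text, in order of appearance."""
    lexicon = LEXICON[lang]
    mentions = []
    # Split the lower-cased text into runs of letters (accents included).
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        # Try the token itself, then a naive singular form without a final "s".
        candidates = [token]
        if token.endswith("s"):
            candidates.append(token[:-1])
        for candidate in candidates:
            if candidate in lexicon:
                mentions.append(lexicon[candidate])
                break
    return mentions


english = find_mentions("Flowers red; leaf green.", "en")
french = find_mentions("Fleurs rouges; feuille verte.", "fr")
```

Under these assumptions both sentences yield the same sequence of terms, which is what makes the descriptions comparable across languages. Simple matching of this kind ignores negation, ranges and multi-word terms, which is one reason a more complete parser remains desirable.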
To semantically describe traits, we follow the Entity-Quality (EQ) approach [6], which has been widely applied to semantically characterize model organism [7] and disease phenotypes [8]. In the EQ model, a trait is characterized by an entity (E) in which the trait is observed and a quality (Q) that characterizes the trait. The characterized entity can be an anatomical entity (from the Plant Ontology) or a biological process or function (from the Gene Ontology). The Phenotypic Attribute and Trait Ontology (PATO) contains a rich classification of widely applicable traits. A phenotype is described in a similar way using the EQ pattern, but the quality has a specific value and is a subclass of the trait. For example, the trait "flower color" is described using the entity "flower" (from the Plant Ontology) and the trait "color" (from PATO). The phenotype "flower red" is described using the entity "flower" (from the Plant Ontology) and the quality "red" (from PATO), where "red" is a subclass of "color" in PATO.
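The relationship between a trait and a phenotype in the EQ pattern can be made concrete with a small sketch. The class name `EQ`, the helper functions and the toy quality hierarchy are assumptions for illustration; the hierarchy uses labels only and flattens the real PATO structure to a direct parent link.

```python
from dataclasses import dataclass

# Toy fragment of a quality hierarchy (child -> parent), labels only.
# This is a simplification of PATO, not a copy of it.
QUALITY_PARENT = {"red": "color", "yellow": "color", "color": None}


@dataclass(frozen=True)
class EQ:
    """An Entity-Quality statement, e.g. EQ("flower", "red")."""
    entity: str
    quality: str


def is_subclass_of(quality, ancestor):
    """Walk up the toy hierarchy and report whether ancestor is reached."""
    current = quality
    while current is not None:
        if current == ancestor:
            return True
        current = QUALITY_PARENT.get(current)
    return False


def is_phenotype_of(phenotype, trait):
    """A phenotype shares the trait's entity and has a more specific quality."""
    return (
        phenotype.entity == trait.entity
        and phenotype.quality != trait.quality
        and is_subclass_of(phenotype.quality, trait.quality)
    )


flower_color = EQ("flower", "color")  # the trait "flower color"
flower_red = EQ("flower", "red")      # the phenotype "flower red"
```

Here `is_phenotype_of(flower_red, flower_color)` holds because the two statements share the entity "flower" and "red" is below "color" in the toy hierarchy, whereas a statement about a different entity, such as `EQ("leaf", "red")`, is not a phenotype of the trait "flower color".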
We then used a data-driven approach to build a Flora Phenotype Ontology (FLOPO) from the EQ statements we identified in the Flora descriptions. FLOPO is an ontology of over 25,000 trait and phenotype terms, all of which have at least one taxon annotation in one of the Floras we processed. The draft of FLOPO is available in BioPortal (http://bioportal.bioontology.org/ontologies/FLOPO), and the source code we produced and the data we used are available from http://wiki.pro-ibiosphere.e/wiki/Traits_Task_Group.
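As a rough illustration of what "data-driven" means here, the sketch below collects one term per distinct EQ statement and records the taxa annotated with it, so that every term has at least one taxon annotation by construction. The function name `build_terms`, the input tuples and the placeholder taxon names are hypothetical; the actual FLOPO build produced an ontology rather than a Python dictionary.

```python
from collections import defaultdict


def build_terms(annotations):
    """Group (taxon, entity, quality) triples into terms with their taxa.

    Each distinct (entity, quality) pair becomes one term; the value is the
    set of taxa in whose descriptions the pair was found.
    """
    terms = defaultdict(set)
    for taxon, entity, quality in annotations:
        terms[(entity, quality)].add(taxon)
    return dict(terms)


# Placeholder input: EQ statements found in the descriptions of two taxa.
annotations = [
    ("Taxon A", "flower", "red"),
    ("Taxon B", "flower", "red"),
    ("Taxon B", "leaf", "green"),
]
terms = build_terms(annotations)
```

Because terms are only created when a statement is observed, no term can exist without a taxon annotation, which matches the property stated for FLOPO above.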
We have also started to generate further resources that we plan to use in the future. In particular, we have started to add environmental terms to the Environment Ontology [9], which will allow us to extract parts of the environmental conditions in which taxa are found, and we have collected vocabulary and glossary terms that need to be incorporated into FLOPO. We have also experimented with an RDF store that contains FLOPO and its taxon annotations.
[1] Cui, H. (2012) CharaParser for fine-grained semantic annotation of organism morphological descriptions. Journal of the American Society for Information Science and Technology, 63(4). DOI: 10.1002/asi.22618 http://onlinelibrary.wiley.com/doi/10.1002/asi.22618/pdf
[2] http://phenoscape.org/
[3] http://phenoscape.org/wiki/Phenex
[4] http://www.plantontology.org
[5] http://obofoundry.org/wiki/index.php/PATO:Main_Page
[6] Gkoutos, G. V., Green, E. C., Mallon, A.-M. M., Hancock, J. M., and Davidson, D. (2005) Using ontologies to describe mouse phenotypes. Genome Biology, 6(1).
[7] Mungall, C., Gkoutos, G., Smith, C., Haendel, M., Lewis, S., and Ashburner, M. (2010) Integrating phenotype ontologies across multiple species. Genome Biology, 11(1), R2+.
[8] Robinson, P. N. et al. (2008) The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. American Journal of Human Genetics, 83(5), 610–615.
[9] http://www.jbiomedsem.com/content/4/1/43
For more information, please contact Robert Hoehndorf: [email protected]
pro-iBiosphere wiki platform