Merging, extracting, and annotating biodiversity data with rich semantics using NeXML
Bachir Balech (Institute of Biomembranes and Bioenergetics - Italian National Research Center), Christian Brenninkmeijer (University of Manchester), Hannes Hettling (Naturalis Biodiversity Center), Rutger Vos (Naturalis Biodiversity Center)
Analysis workflows in biodiversity phylogenetics usually chain several software tools in series and depend on different sources and types of data. The proliferation of mutually incompatible and poorly defined data syntax standards poses significant challenges for both software developers and end users of such workflows. Recent years have seen the development and adoption of a new, expressive, and easy-to-process data standard intended to remedy this issue: NeXML.
NeXML is an XML standard that can represent, among other things, taxa, character-state matrices, and phylogenetic trees, together with semantic annotations (using RDFa), within a single document. It is therefore specifically tailored to ease the interplay of different tools in evolutionary comparative and biodiversity analysis.
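As a rough illustration of how taxa and their annotations sit side by side in one document, the following minimal Python sketch reads a small, hand-written NeXML-style fragment with the standard library. The fragment, the identifiers `t1` and `t2`, the `dc:source` annotation, and the helper `list_otus` are assumptions made for this example; the fragment is simplified and has not been validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"  # NeXML namespace

# Simplified, hand-written fragment (illustrative only, not schema-validated).
DOC = """<nex:nexml xmlns="http://www.nexml.org/2009"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    version="0.9">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens">
      <meta xsi:type="nex:LiteralMeta" property="dc:source" content="example"/>
    </otu>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
</nex:nexml>"""


def list_otus(xml_text):
    """Return (id, label, annotations) for every otu element in the document."""
    root = ET.fromstring(xml_text)
    result = []
    for otu in root.iter(f"{{{NEX}}}otu"):
        # Collect the property/content pairs of the meta children of this taxon.
        annotations = {
            meta.get("property"): meta.get("content")
            for meta in otu.findall(f"{{{NEX}}}meta")
        }
        result.append((otu.get("id"), otu.get("label"), annotations))
    return result


for otu_id, label, annotations in list_otus(DOC):
    print(otu_id, label, annotations)
```

Because the annotations are attached directly to the element they describe, a consumer needs only one pass over the document to recover both the taxa and their metadata.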
Since XML documents are generally intended to be handled by software rather than by users directly, a software tool for easily manipulating NeXML files is desirable. To this end, participants of the biodiversity data enrichment hackathon (Leiden, the Netherlands, 17 – 21 March 2014) developed web services that can (i) construct NeXML documents from data encoded in commonly used phylogenetic file formats or add metadata to an existing NeXML document, and (ii) extract information identified by the user from a given NeXML file and represent it in a variety of output formats.
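The two operations described above, adding metadata to an existing document and extracting selected information in another format, can be sketched locally in Python. This is not the implementation of the hackathon web services; the function names `add_literal_meta` and `extract_table`, the tab-separated output, and the sample document are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"                   # NeXML namespace
XSI = "http://www.w3.org/2001/XMLSchema-instance"   # for the xsi:type attribute

# Simplified, hand-written document (illustrative only, not schema-validated).
DOC = """<nex:nexml xmlns="http://www.nexml.org/2009"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    version="0.9">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
</nex:nexml>"""


def add_literal_meta(root, otu_id, prop, content):
    """Merge step (hypothetical helper): attach a literal annotation to one taxon."""
    for otu in root.iter(f"{{{NEX}}}otu"):
        if otu.get("id") == otu_id:
            meta = ET.SubElement(otu, f"{{{NEX}}}meta")
            meta.set(f"{{{XSI}}}type", "nex:LiteralMeta")
            meta.set("property", prop)
            meta.set("content", content)
            return meta
    raise KeyError(f"no otu with id {otu_id!r}")


def extract_table(root, prop):
    """Extract step (hypothetical helper): one 'label<TAB>value' row per taxon."""
    rows = []
    for otu in root.iter(f"{{{NEX}}}otu"):
        value = ""
        for meta in otu.findall(f"{{{NEX}}}meta"):
            if meta.get("property") == prop:
                value = meta.get("content", "")
        rows.append(f"{otu.get('label')}\t{value}")
    return "\n".join(rows)


root = ET.fromstring(DOC)
add_literal_meta(root, "t1", "dc:source", "example-dataset")
table = extract_table(root, "dc:source")
print(table)
```

The sketch keeps the document in memory and does not write it back out; a real merger would also have to preserve the namespace declarations that prefixed property names such as `dc:source` rely on.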
To make the NeXML merger and extractor tools easily accessible to the biodiversity research community and to enable their integration into existing workflows, they are implemented as RESTful web services, to be hosted by Naturalis Biodiversity Center and made available in the BiodiversityCatalogue. Clients that use these services can be implemented in a variety of ways; proofs-of-concept demonstrate that this is straightforward with the popular workflow management tool Taverna, so that the merger and extractor facilities are available to the users of, inter alia, BioVeL workflows. Preliminary tests of the NeXML merger and extractor have been conducted using the data inputs and outputs of the BioVeL phylogenetic service set (https://www.biodiversitycatalogue.org/services/31/service_endpoint). In addition, the NeXML extractor output has been tested by visualizing a phylogenetic tree together with the metadata associated with its taxa, using a wrapper for the iTOL tool (http://itol.embl.de/) implemented within a Taverna workflow.