Semantic enhancement of biodiversity literature: the Biodiversity Heritage Library contribution to pro-iBiosphere
Iliyana Kuzmova

In the field of biodiversity, the traditional workflow for producing core taxonomic information, such as Floras and Faunas, has not changed much over the years. The process is time-consuming and is usually performed by individual specialists. Legacy (i.e. existing) literature is still the foundation and starting point for taxonomic studies. This literature contains all relevant data for a certain taxon, such as morphological characters, geographic distribution and taxonomic status, but also the information needed to locate the physical specimens that were used to describe the taxon in the past. However, access to legacy literature is uneven, and this is a major drag on the pace of biodiversity research.

Since 2007, ten major biodiversity libraries have collaborated in digitising a large body of biodiversity literature (with a focus on English-language literature) in an open access milieu via the Biodiversity Heritage Library (BHL) project. Since 2009, the eContentplus project Biodiversity Heritage Library for Europe (BHL-Europe) has achieved substantial progress in coordinating the digitisation of biodiversity literature in the EU. In recent years, BHL has become a truly global initiative: the Global Biodiversity Heritage Library (gBHL) is now a cooperative network of autonomous, decentralised members operating programs and projects to make biodiversity literature more widely available. The current gBHL partner projects are: BHL, BHL-Europe, BHL-China, BHL-Australia, the BHL-SciELO Network (Brazil) and the BHL Arabic node organised by the Bibliotheca Alexandrina (Egypt). At the time of writing (December 2012), more than 39 million pages from more than 109,000 books and journals are accessible online, in an open access Creative Commons framework, to a wide spectrum of end-users. This significantly facilitates the process of discovering and accessing literature that is relevant for taxonomic studies.

Currently, BHL content is available through various online portals. However, most of the information relevant to taxonomic studies is still extracted manually from this corpus of digitised literature. Semantic enhancement of the digitised literature could make the information it contains even more accessible to researchers, as well as amenable to searches and sophisticated queries. Taxonomic literature is ideal for (semi-)automatic enhancement, as the description of the world’s biodiversity is a highly standardized process with a distinct language and format. Taxonomic treatments, for example, are a key structural element in taxonomic publications, where the various taxa (families, genera, species, etc.) are described.

Binomial species names are another key element that is highly standardized in taxonomic literature. Improving the ability of users to search electronically for species names that appear in the literature has been a major focus for data enhancement in the various gBHL projects in recent years. The application of a name-finding algorithm to the OCRed (OCR = Optical Character Recognition) page images of the digitised literature facilitates the search for binomial species names. A search term expansion for common names and synonyms of scientific names further facilitates the search for animals and plants described in the literature. However, these are just first steps. Large-scale data mining of taxonomic literature is still very difficult, but further improvements in the structure of digitised taxonomic literature that enable increasingly sophisticated searches and queries are on the horizon. The pro-iBiosphere project will help identify current gaps in the process and recommend priorities for further development of tools and services to optimize the semantic mark-up of taxonomic literature. The project will also help to identify the steps necessary to improve the integration of biodiversity literature into a biodiversity knowledge management system (i-Biosphere). Ultimately, more effective data mining of taxonomic literature will significantly enhance taxonomic research and streamline the discovery and description of new species.
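
As an illustration of the two enhancements mentioned above (name finding in OCR text and search term expansion), the following is a minimal sketch in Python. It is not the algorithm used in the gBHL projects: the regular expression, the stop list, the function names and the small synonym table are all assumptions made for this example. A production name finder would typically also check candidates against a curated list of known names and tolerate OCR errors.

```python
import re

# Naive heuristic (an assumption for this sketch): a binomial is a
# capitalised genus word followed by a lowercase species epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stop list of ordinary capitalised words that often start
# a sentence and would otherwise be mistaken for a genus.
STOP_GENUS = {"The", "This", "These", "However", "Since", "Currently"}


def find_candidate_binomials(ocr_text):
    """Return candidate binomial names in the order they appear."""
    candidates = []
    for match in BINOMIAL.finditer(ocr_text):
        genus, epithet = match.groups()
        if genus in STOP_GENUS:
            continue
        candidates.append(f"{genus} {epithet}")
    return candidates


# Illustrative lookup table (an assumption): maps a common name to a
# scientific name and one of its synonyms.
EXPANSIONS = {
    "puma": ["Puma concolor", "Felis concolor"],
}


def expand_query(term, table=EXPANSIONS):
    """Return the search term followed by any known expansions."""
    return [term] + table.get(term.lower(), [])


if __name__ == "__main__":
    page = "The specimen of Quercus robur was collected near Bellis perennis."
    print(find_candidate_binomials(page))
    print(expand_query("puma"))
```

The heuristic still produces false positives (for example, any capitalised sentence-initial word that is not in the stop list), which is one reason why large-scale mining of OCR text remains difficult.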

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No 312848.