NEWSLETTER

21.12.2012
Global Names Project
Iliyana Kuzmova

Wouldn't it be lovely if, when you are writing a paper or compiling data, the computer could tell you whether you have spelled a name correctly, or that the name has been superseded because of some taxonomic activity? Wouldn't it be nice if, as you read documents and databases, older names of organisms were automatically replaced with current ones? This is the kind of functionality that is being pursued through the Global Names project.

The project capitalizes on the almost universal use of the Linnaean system of Latin binomials to annotate most of our meaningful observations of life made over the past 250 years. Those names offer us a way of indexing and linking information about species, whether in the estimated 500 million or so pages of literature, the billions of specimens located in museums and herbaria, or the tens of thousands of web sites and databases. Unfortunately, we don't have a unique name for each species, because species may be split or moved from one genus to another. Nor is every name unique, because the codes of nomenclature allow the same name to be used for a plant and an animal. Because of this, we have to build an infrastructure that is taxonomically intelligent.

The potential of delivering names as part of the mechanism for integrating data distributed across the internet led GBIF and the Encyclopedia of Life to set up a series of Nomina workshops to conceive of what a future names-based cyberinfrastructure might be like. Last year, the USA’s National Science Foundation funded a two-year 'Global Names' project (globalnames.org). Our goal is to provide openly available infrastructural tools that will assist in the management of information about biodiversity. We have been building databases of names and developing various names discovery and names matching tools (such as the Chrome Names Spotter plug-in or the name discovery tool at http://gnrd.globalnames.org/). New names parsing algorithms help to convert name strings into canonical forms, or to offer the mechanisms for searches and browsing based on dates, authors, or genera. Our code is openly available on GitHub.

Numerically, the most significant challenge lies with incorrectly spelled names, which create variant 'name-strings' that prevent information on the same species in different databases from being joined together. The solution is 'reconciliation', a process that maps all alternative names for the same species to each other so that a search initiated with one name can lead to an action that calls on all names. We hold over 22 million name strings, and there are many more to come, especially as older texts are digitized and OCR'd. We have extended the Rees / Giddens fuzzy matching tools to map variant spellings against each other.
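To make the ideas of a canonical form and of reconciliation more concrete, here is a minimal sketch in Python. It is not the Global Names parser, nor the Rees / Giddens matching tools: the functions `canonical_form` and `reconcile`, the 0.85 similarity threshold, and the misspelled input string are all assumptions made for illustration. The generic similarity measure from Python's standard `difflib` module stands in for a purpose-built taxonomic matcher.

```python
import re
from difflib import SequenceMatcher


def canonical_form(name_string: str) -> str:
    """Keep the genus and the lowercase epithets that follow it.

    Simplifying assumption: authorship starts at the first token that is
    not a plain lowercase word (a capital letter, parenthesis, or digit).
    Lowercase author particles such as "de" or "ex" would fool this rule.
    """
    tokens = name_string.replace(",", " ").split()
    if not tokens:
        return ""
    parts = [tokens[0].capitalize()]
    for token in tokens[1:]:
        if re.fullmatch(r"[a-z-]+", token):
            parts.append(token)
        else:
            break
    return " ".join(parts)


def similar(a: str, b: str, threshold: float) -> bool:
    """Generic string similarity; a stand-in for a real fuzzy name matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def reconcile(name_strings, threshold=0.85):
    """Group name strings whose canonical forms are near-identical."""
    groups = []
    for name_string in name_strings:
        canon = canonical_form(name_string)
        for group in groups:
            if similar(canon, group["canonical"], threshold):
                group["name_strings"].append(name_string)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append({"canonical": canon, "name_strings": [name_string]})
    return groups


if __name__ == "__main__":
    names = [
        "Homo sapiens Linnaeus, 1758",
        "Homo sapiens",
        "Homo sapeins L.",  # deliberately misspelled, made-up variant
        "Puma concolor (Linnaeus, 1771)",
    ]
    for group in reconcile(names):
        print(group["canonical"], "<-", group["name_strings"])
```

In this sketch a search that starts from any one string in a group can be expanded to every other string in that group. Note that this only handles spelling variants; linking true synonyms, where a species has been renamed or moved to another genus, needs curated taxonomic data rather than string comparison.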

Working with the Biodiversity Heritage Library, we have built a new indexing tool that includes names recognition, names discovery, names parsing, and a validation service. In the future, as the necessary internal databases become more fully populated, we can offer more validation services. These will, for example, help biologists who are compiling research data or writing papers to ensure that their names are spelled correctly, that they carry the correct authority information, and that each name is the most current one for its taxon.

Our vision is of an open and very flexible cyberinfrastructure that all projects can contribute to and draw from, so that we do not have to build multiple copies of databases. The result will be a more flexible and relevant suite of databases and services that will make it increasingly easy to discover and interconnect data. There remain many challenges, such as the capture of 250 years' worth of synonymy information, full integration of vernacular names, and integration of the 'surrogates' for names that are increasingly flooding out of environmental surveys that rely solely on molecular techniques.

Reading: Patterson, D. J., Cooper, J., Kirk, P. M., Pyle, R. L. and Remsen, D. P. 2010. Names are key to the big new biology. TREE 25: 686-691, doi:10.1016/j.tree.2010.09.004

David Patterson
[email protected]



This project has received funding from the European Union’s Seventh Programme for research, technological development and demonstration under grant agreement No 312848