1st Workshop: Paris 29 June 06

Notes of Workpackage 6 EDIT meeting

EDIT kick-off Thursday 29 June 2006

Muséum National d'Histoire Naturelle, Paris, France

Attendance1:

Name Affiliation E-mail address
Bill Baker RBG Kew w.baker_AT_kew.org.uk
Marc Brugman ETI BioInformatics, Amsterdam marc_AT_eti.uva.nl
Isabel Calabuig SNM, Copenhagen icalabuig_AT_snm.ku.dk
Ben Clark Imperial College, London benjamin.clark_AT_ic.ac.uk
Gerhard Falkner   falkner_AT_malaco.de
András Gubányi NHM, Budapest gubanyi_AT_nhmus.hu
Thomas Haevermans MNHN, Paris haever_AT_mnhn.fr
Anna Haigh RBG Kew a.haigh_AT_kew.org.uk
Pamela Harling NHM, London p.harling_AT_nhm.ac.uk
Norbert Kilian Berlin Botanic Garden n.kilian_AT_bgbm.org
Niels P Kristensen SNM, Copenhagen npkristensen_AT_snm.ku.dk
Judita Lihová Slovak Academy of Sciences, Bratislava judita.lihova_AT_savba.sk
Simon Mayo RBG Kew s.mayo_AT_kew.org.uk
Thomas Pape SNM, Copenhagen tpape_AT_snm.ku.dk
László Peregovits NHM, Budapest perego_AT_zoo.zoo.nhmus.hu
Peter Phillipson MNHN, Paris pbp_AT_mnhn.fr
Dave Roberts NHM, London d.roberts_AT_nhm.ac.uk
Yuri Roskov Species 2000, Univ of Reading y.roskov_AT_reading.ac.uk
Malcolm Scoble NHM, London m.scoble_AT_nhm.ac.uk
Jan van Tol Naturalis, Leiden tol_AT_naturalis.nl
Richard White Species 2000, Cardiff University r.j.white_AT_cs.cardiff.ac.uk

1. The scope of EDIT WP6: objectives and development

Dave Roberts chaired the meeting and welcomed those present. He expressed his concern that not everyone who might have wished to attend, or who had been allocated funds to do so, had heard about the meeting. It appears that not all team leaders are passing on e-mails. Good communication is vital to the success of the WP.

Objectives of the workpackage are:

  • to define and develop the means to provide access to the currently highly fragmented taxonomic information source;

  • to formalise distributed, taxon-specific committees to become expert networks, especially for the demonstrator taxa;

  • to define what is needed from the cyber-environment for taxonomic Web content to be delivered;

and

  • to assess barriers to implementation.

WP6 is to provide output material and will be working closely with WP5, which will be building the tools to make this material available. One of the most difficult tasks will be to resolve issues around IPR, copyright and multiple authors.

Peter Phillipson asked about money for basic research. During the contract negotiations money was earmarked for three taxonomists, who will be employed by the workpackage to prepare data for web revisions. For example, the NHM London has employed a Dipterist who will act as a facilitator, will be migrating current taxonomy to the Web and will work to establish interface standards with other reliable sites such as the Sphingid revision being prepared under CATE (see below). Work will be balanced between taxonomy and IT. The aim of the WP is to reduce fragmentation in taxonomic resources rather than to generate new work. Dave Roberts reminded everyone that EC Networks of Excellence are not about carrying out original research (although that may be done) but about forging durable links in the European research community.

2. The user community

The initial aim of EDIT is to provide resources for taxonomists, although EDIT outputs have potential to be used by the environmental/conservation community. Different users have different requirements: for instance, common names are often important to policy makers. Identifying user requirements is a task under WP4, but the non-taxonomists form what is known as a sparse user network, which presents considerable difficulties recognised within the IT community.

The WP will be working closely with GBIF, especially on specimen access and digitisation. Dave Roberts will be meeting Jim Edwards, the Director of GBIF, shortly and Dave Remsen (GBIF ECAT) when he is in post. Links to distributed species information pages are being handled by Donald Hobern (GBIF).

Peter Phillipson asked how users can distinguish between reviewed and non-reviewed data. Reliable data are important for taxonomy. Isabel Calabuig said that GBIF too are working on detailed labelling of specimen data to filter out non-reviewed records. The meeting agreed that a method of assigning quality had to be developed, although it was noted that peer-review was not always an appropriate benchmark because few books and monographs are subject to such review.

It is clear that with the time and budget available we cannot deliver results across all fields and it is essential that we deliver to a high standard in the core area of taxonomist support. As the project progresses, we will monitor what can be delivered to other user groups. Dave Roberts suggested that the wider community has a low priority at this stage in dictating needs and Jan van Tol cautioned against focussing too early on user needs.

3. The publication platform

Malcolm Scoble outlined the "Creating a taxonomic e-science" (CATE) project currently being funded by the UK Natural Environment Research Council and involving RBG Kew (working with Missouri Botanic Garden), NHM London and Imperial College London. The goal of CATE is to test the feasibility of creating a web-based, consensus taxonomy using two model groups, one from the plant and the other from the animal kingdom. The wider aim is to explore practically the idea of "unitary" taxonomy and promote web-based revisions as a source of authoritative information about groups of organisms for specialist and non-specialist users. The project web site allows online revision with referees and an Editorial Board (www.cate-project.org). The experience from the project will feed into this WP.

Dave Roberts said that EDIT had a broader scope than CATE: its immediate goal is to provide content without consensus, i.e. to allow conflict within the knowledge base. Further, contributions were not required to meet any targets of being taxonomically or geographically comprehensive. Initially we plan to take publications and enter these into a taxonomic knowledge base, then make this available for data mining. Documents need to be tagged but the content left unchanged. It is easiest to do this from manuscripts with copyright, i.e. at the pre-print stage. Authors and partner institutions are requested to retain copyright on their published output, giving to the publishers instead an exclusive licence to publish. This approach has not caused any publisher to refuse to take a manuscript in our experience, but it is very important to understand that the marked-up copy will not be downloadable and cannot be re-constructed through multiple queries to the knowledge base, so is not being re-published in any sense. Copyright permission can be granted to reproduce illustrations, for instance. Published work is preferred because it allows standard bibliographic citation.

Bill Baker suggested a portal to existing web-based sites with some quality control. For palms there already exists a roughly agreed checklist: now there is a need for the community to buy into this and expand the information.

The need for machine readable text was stressed. New work from exemplar groups will be digitised from the start. WP5 (Internet platform for cybertaxonomy) will hopefully develop tools for handling text and extracting the data needed for revision work. The TDWG literature standards group (Chris Lyal at the NHM London is a convenor) is working on this too and collaboration will be sought.

After assisting with mark up for data mining, documents can be returned to author institutions for placing on secure servers. A portal could then be set up to access these servers. Richard White suggested a central location for documents and Bill Baker asked about what happens at the end of the project. The NHM London does have some commitment to maintaining a server for a time after the end of the project, but not for ever. Ideally, in the future the system will be self sustaining.

Large nomenclators perhaps have to reside in larger institutions that can provide the necessary infrastructure. Do they also need a dedicated person to oversee them? Is this part of the remit of the larger taxonomic institutions? The danger is that they may be seen as "taking over" the process. Ways round this include offering mirror sites or setting up structures (a charity like ILDIS or a limited company like Species 2000) to maintain databases.

Local data providers can maintain control of their own data in a distributed system (e.g. GBIF). The danger here is of web sites disappearing. Richard White gave a useful link to www.archive.org (run by the Internet Archive) that automatically archives a copy of all web sites from time to time. Yuri Roskov stressed the importance of version control on the Web. Dave Roberts suggested that using pdfs could assist here because they are relatively difficult to change and it is easier to produce an updated version, thus creating a version library.

4. Integration with WP5

One rôle of WP6 is to provide WP5 with specifications for tools to assist the taxonomic revision process. For example, Simon Mayo described how difficult tracing sources of information was during the initial phase of taxonomic work. WP4 (co-ordinating research) will also be able to highlight bottlenecks in the taxonomic process and suggest online resources that might overcome or minimise these. Jan van Tol asked what the system should deliver. To a certain extent WP6 has to anticipate the tools that WP5 might come up with. WP4 will also feed to WP5 what users need, for how long and how often. Dave Roberts will be meeting on a regular basis with Walter Berendsohn, the leader of WP5.

One tool that zoologists would find valuable, for example, is a means for a community to create a shared standard bibliography, much in the same way that the Flickr (http://flickr.com/) community database works for images. The botanical community are, again, already ahead in this field and have the established TL2 (http://tl2.idcpublishers.info/) which is a standardised bibliography. The NHML is currently constructing a funding bid for this project and the Board of Directors are considering whether this should be adopted as an EDIT initiative.

5. The role of the exemplar groups

Dave Roberts asked "what do we expect a revision to be?" What is the mechanism for doing this? The three exemplar groups were selected because of the limited resources for answering these questions, because they were likely to be successful within the available resources and because we expect them to demonstrate best practice in how to carry out e-taxonomy. The chosen taxa are Monocots (palms: RBG Kew), the Compositae (Lactuceae: BGBM) and the insects (Lepidoptera, Nepticulidae and Diptera, Syrphidae: NHML).

There was some concern that this process had not been transparent or inclusive enough. New groups were invited to come forward, see item 6.

6. Building new groups

The WP is keen to recruit new groups: there is no intention to be exclusive and closed.

Gerhard Falkner offered to nucleate an expert network for Mollusca.

Expert groups generally already know their peer group. Expert groups themselves need to decide how to move forward in their area of expertise. Can WP6 identify taxonomic groups where a network could come into being? Malcolm Scoble stressed the need to work closely with WP2 (Integrating and reshaping the expert and expertise basis). Norbert Kilian reminded everyone, though, that in some areas not only is there no expert community, but there is also no motivation to do e-taxonomy. Species 2000 has found that network building can be moved on by working with database custodians, who can get a higher profile for their work and funding on the back of collaboration. Species 2000 is careful about copyright, attribution and credit and signs an access licence with all data providers.

Malcolm Scoble suggested the Planetary Biodiversity Inventories as a mechanism for progress (see e.g. http://www.actionbioscience.org/biodiversity/page.html).

Simon Mayo asked how people could interact with WP6 if they were not yet ready to give material. The objective would be to form a network with a work plan and contact the WP leader with requests for support in terms of tools or taxonomic resources (but not money directly).

Peter Phillipson suggested that thought needed to be given to accommodating everything from highly polished information systems through to the pulling together of scattered data. In the Madagascar project he is working on, much of the taxonomy is insufficient, so presenting an evaluation of material is important, as is trying to accelerate the presentation of data.

Dave Roberts stressed that WP6 should work through interactive co-operation. Once the IT person is in place in London (Nov/Dec) they will set up a wiki for ideas and documents. A mailing list can operate through the EDIT web site. There is a need to develop the use of the Web for verified data, to build a sustainable network and to build consensus at the European Community level (reporting is important here). There is still room for new exemplar groups.

Five workshops are scheduled for this WP, in which anyone can participate. Indeed, the wider community will be welcome if appropriate people can be targeted, including those with experience from outside the exemplar groups:

  • Milestone 6.2 Workshop for demonstrator taxa to explore commonalities in content structure (month 8 October 2006) London

  • Milestone 6.3 Workshops for taxonomic expert networks to elect committees and task forces (months 10 and 12 October 2006 and February 2007) RBG Kew

  • Milestone 6.4 Workshop to discuss content structure and Web-interface (month 9 November 2006) Berlin

  • Milestone 6.6 Workshop on web publication issues (month 12 February 2007)

7. Development of data standards

What do taxonomists need? Suggested as a bare minimum: names, authorities, descriptions, images, source (reference citation), type, distribution, bibliography. It was stressed that any individual data source for the knowledge base could omit any of these fields. Yuri Roskov added that references are also needed, authority citation and status references.
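As an illustration only, the bare-minimum list above could be sketched as a record in which every field is optional, so that a data source may omit any of them. The class name, field names and the placeholder taxon "Aus bus" are assumptions made for this sketch and are not an agreed EDIT standard.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class TaxonRecord:
    """One entry from a single data source; every field may be omitted."""
    name: Optional[str] = None
    authority: Optional[str] = None
    description: Optional[str] = None
    images: List[str] = field(default_factory=list)
    source: Optional[str] = None        # reference citation
    type_info: Optional[str] = None     # information about the type
    distribution: Optional[str] = None
    bibliography: List[str] = field(default_factory=list)


def missing_fields(record: TaxonRecord) -> List[str]:
    """Return the names of the fields this source did not supply."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical record that supplies only a name and an authority.
record = TaxonRecord(name="Aus bus", authority="Author, 1900")
print(missing_fields(record))
```

Reporting the omitted fields, rather than rejecting the record, is one way to accept partial sources while keeping the gaps visible.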

WP6 must aim to collect the best models and best practice considering the distinct needs of the Botanical and Zoological codes of nomenclature. What do we really need? What would be nice? How can ancillary information be captured and tagged? Perhaps at this stage only nomenclators can be compiled, reflecting the state of knowledge of each group. The merging of effort is more important than being too prescriptive early on. WP6 will be building on the taxonomic concept schema plus spatial data standards plus a bit more.

Using the Cryptomonads as an exemplar group might be useful, as it is small and covered by both the Zoological and Botanical Codes and the data are already mostly available in digital form.

The need to concentrate on valid names and strict interpretation of the nomenclatural codes was stressed by Gerhard Falkner. Dave Roberts stressed that this was an output standard appropriate in some circumstances, but not in others such as a taxonomist working on a revision who needed access to all available information.

Norbert Kilian stressed the importance of identification keys, but needs are different for different groups. Keys are time consuming to produce and cannot easily be generated automatically unless a character/state matrix is available. Gerhard Falkner disagreed that keys were important and felt that they could even hamper new taxonomic work.

The importance of authority was discussed several times. All levels of data can be taken into the system, but users need to be clear what level they are seeing at any one time. This mechanism needs to be built into the system from the start.
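One possible way to build such labelling in from the start is sketched below: every item carries an explicit level and a view shows only items at or above a chosen level. The level names and their ordering are purely hypothetical; the meeting did not define them.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class AuthorityLevel(IntEnum):
    """Hypothetical levels, ordered from least to most vetted."""
    UNREVIEWED = 0
    EXPERT_CHECKED = 1
    PUBLISHED_REVISION = 2


@dataclass(frozen=True)
class LabelledItem:
    content: str
    level: AuthorityLevel


def at_or_above(items: Iterable[LabelledItem],
                minimum: AuthorityLevel) -> List[LabelledItem]:
    """Keep only items whose level is at least `minimum`."""
    return [item for item in items if item.level >= minimum]


# Hypothetical data: all levels are stored, the view is filtered.
items = [
    LabelledItem("field note", AuthorityLevel.UNREVIEWED),
    LabelledItem("checked record", AuthorityLevel.EXPERT_CHECKED),
    LabelledItem("monograph treatment", AuthorityLevel.PUBLISHED_REVISION),
]
for item in at_or_above(items, AuthorityLevel.EXPERT_CHECKED):
    print(item.level.name, item.content)
```

Because the label travels with each item, nothing has to be excluded from the system; only what is displayed changes.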

Revision is about bringing scattered knowledge into a coherent whole. WP6 is providing data for revisions, data for other scientists to use and data for other user groups, but must deliver to taxonomists in order to reach the wider community.

8. Mechanisms to encourage the formation of networks

Yuri Roskov put forward the experience of the Legume database ILDIS, which for 25 years has used a world wide network of experts at a regional level to produce and revise the database. Thomas Pape talked about the Dipterist community and explained how the Diptera checklist at some point became THE recognised resource for Diptera taxonomists, so now they buy into it and help maintain it. In both cases one person had acted as the focal point to move co-operation forward.

The wish to co-operate is there, but taxonomists need to be confident that their work will be recognised and credited. Building community participation is the key to success. A low maintenance data model assists this process. Receiving credit for work appearing on the web is beginning to become the norm. Dave Roberts explained how in the UK research success is still judged on publications in key journals. How could a project like EDIT assist with changing this metric? In some scientific communities (e.g. high energy physics) this has already happened.

9. Selection and adoption of a data storage model or the development of data interchange mechanisms

What resources are required to support the taxonomic process? Possible resources are a database (structured), published materials (unstructured but tagged) or web resources. Reviewed publishable material for data mining is an obvious target, but equally non-reviewed material (books and monographs) is important. It is not necessary for there to be a single, agreed taxonomic framework.

Phase one of the work needs to provide data standards for interoperability between data sets. We need to accommodate alternative hierarchies so we need to know rank and parent of each species. It was suggested that the WP leader explores possibilities at this stage.
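A minimal sketch of how alternative hierarchies might sit side by side, assuming each classification simply records a rank and a parent for every taxon. The scheme names and the placeholder taxa ("Aus bus", "Aidae", "Bidae") are invented for illustration, and the sketch assumes the parent links contain no cycles.

```python
from typing import Dict, List, NamedTuple, Optional


class Placement(NamedTuple):
    rank: str
    parent: Optional[str]  # None marks the top of a hierarchy


# classification name -> taxon name -> placement in that classification
Classifications = Dict[str, Dict[str, Placement]]


def lineage(classifications: Classifications, scheme: str,
            taxon: str) -> List[str]:
    """Walk the parent links upwards within one classification."""
    tree = classifications[scheme]
    path: List[str] = []
    current: Optional[str] = taxon
    while current is not None:
        path.append(f"{tree[current].rank}: {current}")
        current = tree[current].parent
    return path


# Two hypothetical schemes that disagree about the family of genus Aus.
example: Classifications = {
    "scheme_A": {
        "Aus bus": Placement("species", "Aus"),
        "Aus": Placement("genus", "Aidae"),
        "Aidae": Placement("family", None),
    },
    "scheme_B": {
        "Aus bus": Placement("species", "Aus"),
        "Aus": Placement("genus", "Bidae"),
        "Bidae": Placement("family", None),
    },
}

print(lineage(example, "scheme_A", "Aus bus"))
print(lineage(example, "scheme_B", "Aus bus"))
```

Keeping each scheme in its own table means conflicting placements can coexist without a single agreed framework.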

Yuri Roskov suggested that wider examples from difficult groups are needed and asked if the Berlin model is robust enough. Jan van Tol proposed using additional groups (to the exemplars) that will help define problems and features. At the moment the exemplar groups will enable the WP to get content on to the Web and this can then be extended outwards.

Reminder - Milestones and Deliverables for first 18 months

Deliverables

D6.1 Content structure provided for each demonstrator taxon and commonalities assessed (Month 12 : Feb 2007).

D6.2 Preliminary websites set up for demonstrators and guidelines for data creators to interface (Month 18 : Aug 2007).

D6.3 Exemplar networks and committees created and their roles defined (Month 18 : Aug 2007).

D6.4 Set agenda for report on Web-publication and accreditation and IPR issues (Month 18 : Aug 2007).

Milestones and expected result

M6.1 Identify leaders and core taxonomists for demonstrator taxa (Month 6 : Aug 2006).

M6.2 Workshop for demonstrator taxa to explore commonalities in content structure (Month 8 : Oct 2006).

M6.3 Workshops for taxonomic expert networks to elect committees and task forces (Month 8 : Oct 2006 & 10 : Dec 2006).

M6.4 Workshop to discuss content structure and Web-interface (Month 9 : Nov 2006).

M6.5 Interim statements from networks for agreed demonstrator taxa (Month 12 : Feb 2007).

M6.6 Workshop on web publication issues and IPR (Month 12 : Feb 2007).

1 Also to go on distribution list: Ole Seberg, SNM, Denmark <oles_AT_snm.ku.dk>; Gitte Petersen, SNM, Denmark <gittep_AT_bi.ku.dk>

EDIT kick-off WP6 workshop 29 June 2006 PJH 10/7/06
