Contact: Julius Welby (julius.welby_AT_gmail_DOT_com)
This report is the outcome of a 12 week contract (ending 8 June 2007) at the Natural History Museum to evaluate the GoldenGATE software package (GG), looking at its potential use as a means of translating existing taxonomic literature into one or more XML formats. The goal is to assess GG's suitability for converting documents into a marked-up format for addition to the XML repository which is part of the planned infrastructure of Work Package 6 of the EDIT project.
A number of people have provided assistance and input for this report, particularly Dave Roberts, Vince Smith, Simon Rycroft and Chris Lyal at the NHM, Guido Sautter and Terry Catapano of the GoldenGATE/TaxonX team, and Ben Clark of CATE. Many thanks to them and the other contributors.
Version evaluated: GoldenGATE 2007.04.03.14.45.
Please note that the development of GoldenGATE is ongoing and new versions are released regularly. Technical issues identified in this report will no doubt be fixed or changed significantly in later versions.
- GoldenGATE is a promising application under active development.
- GoldenGATE is a good general XML mark-up tool, allowing rapid manual and semi-automatic annotation of text. It also offers a number of functions specific to the handling of taxonomic documents.
- The application is of modular design and can support a number of import and output formats.
- User-defined functions are easy to set up and can be chained to run sequentially as a pipeline (macro), making GoldenGATE highly customisable.
- The main technical limitations of the version evaluated are:
- Text files larger than approx 200kB refresh very slowly in the editor. This appears to be caused by the poor performance of the Java Swing component used for the editor window. The current workaround is to split the input file into smaller pieces prior to marking-up.
- Characters outside of the Western European character set are not handled properly. Support for a set of Unicode formats is in the process of being added and should appear in the next version.
- A number of areas of direct interest to WP6, such as the encoding of taxonomic relationships and bibliographic information, are not directly supported in the evaluated version of TaxonX, nor are the relevant tools available within GG. The addition of new external schemas is currently being implemented to address the mark-up related aspects of many of these issues. In parallel, the future development of GG will presumably provide a user interface to add this mark-up.
- Some complex functions, such as the automatic recognition and annotation of taxon names, shift the user input away from the manual addition of annotations and instead require the user to check and, where necessary, correct the automatically added mark-up. The UI supporting this checking and correcting phase could be improved, and this would increase the usefulness of these functions.
- GoldenGATE supports the automatic retrieval of taxon IDs from the Hymenoptera Name Server. This feature would be more generally useful if extended to cover a broader range of taxa.
- The documentation, in the form of a user manual and help information, is of good quality, but the rapid pace of development inevitably means that there are some areas in which the documentation lags behind the state of the tool.
- Documentation of the workflow suitable for the export of TaxonX to Level 3 would be required if the output is to meet the requirements of the WP6 data warehouse. (At the time of this evaluation, only the Level 1 workflow was available.)
- In addition, a set of specifications of the intermediate formats created prior to export as TaxonX for specific levels of mark-up would be required for WP6 use.
- Like GoldenGATE, TaxonX is evolving rapidly.
- The schema takes a relatively minimalist approach when compared with more atomised formats such as taXMLit.
The combination of a schema and a customised XML editing tool is powerful, and has great potential. While GoldenGATE and TaxonX are in rapid development, I would suggest that EDIT staff continue to liaise with the developers to remain aware of changes and provide feedback where appropriate.
TaxonX schema documentation
Metadata Object Description Schema (MODS)
Bibliographic/reference data extraction - ParaCite.
I will be concentrating on the technical evaluation, but comments on usability will also be incorporated.
- The technical evaluation will concern itself with the following:
- GG's ability to handle input files, formats and content of the various relevant types
- GG's proficiency at automatically adding mark-up of suitable granularity and accuracy to give a useful and satisfactory output. This includes:
- The options available for users to view and amend marked-up content as necessary
- Potential suggestions for pre- or post-processing, or any helpful GG plug-ins which might be required or desired
- The target output format is currently assumed to be TaxonX. I understand that the choice of schema is still open, and will be decided during the period of the evaluation, but it is sensible to have a candidate output schema to work with, even if this changes during the evaluation period. (TaxonX itself is an evolving schema).
- Processing tools for the identification and annotation of certain types of text (e.g. character and state data in descriptions) are being developed as part of WP5. For this reason I will assume that the target level of TaxonX output from GG will be Level 3, with certain other tags also being marked-up as required where GG provides the only or best means of detecting and tagging any specific data type. I will compile a list of elements which will be added to the data as part of the initial GG processing as opposed to being added by a tool used after GG.
- An attractive element of the TaxonX schema is its hierarchical nature, where gross objects, such as a description, are first identified and then detailed elements such as characters and states are added as objects within the description. This is equivalent to the use of cached fields in the Berlin model and it means that well developed and structured taxonomic treatments can be managed under the same schema as less developed taxonomic groups.
The scope of this evaluation and the capabilities of GoldenGATE
Most of the testing carried out for this evaluation has followed the current recommendations regarding input file preparation, and has then adhered as far as possible to the available documented TaxonX Level 1 workflow.
It should be kept in mind that GoldenGATE, whilst providing tools which make the creation of TaxonX files more efficient, can also be used to create output files which adhere to a different schema, or to no formal published schema at all - i.e. documents can be marked up on an ad hoc basis for specific purposes. GoldenGATE has a number of functions which add mark-up in an efficient way, for example, by simply selecting text and wrapping it in a chosen XML element. Optionally, any other identical instances of that text in the document can be annotated in the same way at the same time. Different types of data within an annotated section can then be marked-up in their turn, again with a simple select/annotate action by the user. Documents which are highly structured (e.g. checklists) may be particularly amenable to this type of treatment, which may use relatively few of the more highly automated functions available within GoldenGATE.
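The select/annotate action described above can be illustrated outside GoldenGATE with a minimal Python sketch. This is not GoldenGATE's own implementation: the function name `annotate` and the element name `taxonName` are hypothetical, chosen only for illustration. The sketch wraps either the first occurrence or every identical occurrence of a selected phrase in a chosen XML element, escaping the surrounding text so that the result remains well-formed.

```python
from xml.sax.saxutils import escape


def annotate(text: str, phrase: str, tag: str, all_instances: bool = True) -> str:
    """Wrap `phrase` in <tag>...</tag> within plain text.

    Hypothetical helper for illustration only; it assumes plain-text
    input containing no existing mark-up.
    """
    if not phrase:
        raise ValueError("phrase must not be empty")
    wrapped = f"<{tag}>{escape(phrase)}</{tag}>"
    # Split on the phrase: everywhere, or only at the first occurrence.
    parts = text.split(phrase, -1 if all_instances else 1)
    # Escape the untouched text so characters such as & stay valid in XML.
    return wrapped.join(escape(part) for part in parts)


sample = "Apis mellifera is common. Apis mellifera & others."
print(annotate(sample, "Apis mellifera", "taxonName"))
print(annotate(sample, "Apis mellifera", "taxonName", all_instances=False))
```

A sketch like this matches exact strings only; it does no automatic recognition of taxon names or other data types.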
The close and undoubtedly valuable coupling between GoldenGATE and TaxonX is only part of the picture. GoldenGATE is a more versatile tool than might be supposed.
- The range of supported input file types (Word documents, PDF, RTF etc.) needs to be defined both for this evaluation and the broader project.
- Initial testing: Example documents included on the GG website.
- Further testing: NHM zoology examples - 3 papers:
- 2 on cryptomonad flagellates, one zoological and one botanical. Concepts expressed in these papers involve more indirection (the author says that Smith says that this family ...) which could stretch the schema. Further, this ambiregnal group will test the data warehouse's ability to manage multiple names for the same object (that are not synonyms) and the ability to handle the codes of Zoological and Botanical nomenclature.
- The final outcome would be an assessment of the extent to which such marking-up can be automated.
- Extended testing: Time may allow the addition of other example documents, e.g. some palm related content.
- Certain specific content types will also be included in testing. For example, bibliographies currently held as lists in Word files will probably be handled by batch processing outside the GG environment.
Intermediate file format documentation
The conversion process is:
Document --> Intermediate format (IF) --> Final output format (FOF)
The user has a number of pre-existing tools within GG designed to facilitate the creation of an IF file. Once this has been created, the user calls a function which runs a predefined transformation to give the final marked-up output. The main priority for the user is therefore the efficient and accurate creation of the IF file. Accordingly:
- the specification of the IF file needs to be known in detail by the user
- documentation of the specification is needed
- a documented workflow giving instructions on how to apply the mark-up would also be desirable (as exists currently for Level 1 mark-up)
The IF specification is dependent on the FOF specification, in that it is required to support it. If more than one FOF is required, or if the FOF changes, the IF will need to change or be multiplied to reflect this.
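The dependency between the IF and the FOF can be sketched as a lookup of one predefined transformation per output format. The following Python example is an illustration under stated assumptions, not the transformation GG actually runs: the format name `example_fof` and all element names are hypothetical, and a real mapping would be derived from the IF and FOF specifications.

```python
import xml.etree.ElementTree as ET

# Hypothetical IF-to-FOF element mappings, one entry per supported FOF.
# Supporting a further FOF means adding a further mapping here.
IF_TO_FOF = {
    "example_fof": {
        "document": "publication",
        "section": "treatment",
        "name": "taxonName",
    },
}


def transform(if_xml: str, fof: str) -> str:
    """Rename every IF element to its FOF equivalent.

    Raises ValueError for an IF element the chosen FOF cannot represent,
    which is why the IF specification has to be known in detail.
    """
    mapping = IF_TO_FOF[fof]
    root = ET.fromstring(if_xml)
    for element in root.iter():
        if element.tag not in mapping:
            raise ValueError(f"IF element <{element.tag}> has no mapping for {fof}")
        element.tag = mapping[element.tag]
    return ET.tostring(root, encoding="unicode")


if_file = "<document><section><name>Apis</name> text</section></document>"
print(transform(if_file, "example_fof"))
```

In this sketch a change to the FOF means editing the mapping, and an IF element with no FOF counterpart is rejected rather than silently dropped.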
Output format options and documentation
The identified elements tagged by this process will need to be integrated into a central resource, the data warehouse. Part of this will logically be the recognition and resolution of equivalent objects, e.g. rationalisation of bibliographic lists. Such work is outside the scope of the present evaluation.
Document import into the WP6 system
The main focus of the input tools provided by WP6 will be to allow direct entry of taxonomic content into the system. However, as an adjunct to this, it is desirable that taxonomic data already in document form be available to the system.
In the short to medium term, it is envisaged that resources will be available within the WP6 team to undertake conversions of specific existing documents. This would be on a limited scale, and would create a small library of example XML files for later addition to the XML Primary Repository.
This evaluation is concerned with documents already in machine readable formats, such as Word documents or PDFs. A few example files are in a form which must go through an OCR process. These will be included in the evaluation, but not as a main priority.
In the longer term, it would be desirable for the import of any existing files to be done either by the authors themselves or by experts with XML mark-up skills based near the authors, allowing easy feedback and corrections prior to submission.
The tools, techniques, data files and recommendations generated as part of this software evaluation will form part of the background information used in this later part of the WP6 project.