Initial testing with GoldenGATE examples - 2nd tutorial

The second tutorial file: TutorialExampleOCR-2.xml

 

Back to Part 1

Footnotes

The tutorial instructs the user to delete footnotes, which seems unwise to say the least: a footnote might be the most important part of a document. I assume this is done to simplify this particular tutorial, but I would be concerned if it were recommended or necessary for the general processing of similar documents. As a result, the tutorials include no guidance on footnote mark-up.

Markup Converter UI bug

In the process of creating a Markup Converter, when a new mapping is added below an existing one, the mapping target in the row above it reverts to the original element name. Until this is fixed, it would be best to create all the required mappings first, and then add the target values.

Pagination information

The tutorial tells the user to remove all page boundaries/titles, which loses all pagination information. This is fine if we have no interest in this metadata, but it strikes me that preserving it may well be useful in some cases where original journal content is being marked up. If page numbers are preserved, the marked-up file could be used to create a reference in the form journal vol. pg.

I know that this is a small part of a much bigger issue, and original pagination as published will probably be irrelevant for most documents, but I think it is worth mentioning, particularly in reference to the requirements of the final output format.

Freezes

On a couple of occasions I have experienced freezes in GG. For example, I initiated a spell check, cancelled it, and then re-initiated it. The Analyser running... window grabbed focus and refused to allow me to click on the spell-checking widget, effectively freezing the application. Killing the application resulted in lost work. (The spell check worked fine for the same file on my Mac.)

Save often.

Spell checking UI

Moving the file to my Mac allowed me to run the spell check. The UI is awkward: it is modal (you are either in 'spellcheck-mode' or not), so you cannot edit the file manually or scroll it to check on something while the spell check is running. There is also no highlighting of the text being corrected, which makes it difficult to be sure exactly which instance of a word is being corrected at any particular time, or even to find it at all.

I gave up on GG's spellchecker very quickly. Until this aspect of the UI is improved, it is probably worth carrying out this function in an external text editor and re-importing the corrected file.

Handling special characters

In the case of OCRed documents, if the author has used special characters (e.g. the male or female symbols, or other similar non-ASCII symbols) to signify a type of material examined, these characters may either be lost or mistranslated into spurious text. For this reason, the user will need to see the original text during the correction phase of the mark-up process.

Another question is how these signifiers should be handled. Should they be preserved as they were written or replaced with text? Unicode will support these characters, but there is still the issue of how and when their meaning should be encoded.
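One way to find the characters that need checking against the original is to list every non-ASCII character in the corrected text together with its Unicode name. The following is a minimal Python sketch of that idea, not part of GoldenGATE; the function name and the sample line are illustrative assumptions.

```python
import unicodedata

def list_non_ascii(text):
    """Return (offset, character, Unicode name) for each non-ASCII character."""
    found = []
    for offset, char in enumerate(text):
        if ord(char) > 127:
            # Fall back to a placeholder for characters that have no name.
            name = unicodedata.name(char, "UNNAMED")
            found.append((offset, char, name))
    return found

# Hypothetical sample line; U+2642 and U+2640 are the male and female signs.
sample = "Material examined: 3 \u2642, 2 \u2640"
for offset, char, name in list_non_ascii(sample):
    print(offset, char, name)
```

A list like this only shows which special characters survived OCR. Characters that were lost or turned into ASCII text would still have to be found by comparing against the original page.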

Incomplete conversion if following tutorial notes

As they currently stand, the tutorial instructions for processing the second file leave the document in an unfinished and partially incorrect state. The treatments appear to be well handled, but the start of the document is not marked up as it should be:

<taxonx:taxonxHeader>
<mods:mods>
<mods:titleInfo>
<mods:title>
AMERICAN MUSEUM NOVITATES
</mods:title>
</mods:titleInfo>
Number 45 September 7, 1922
59.57,96 (729.8)
THE ANTS OF TRINIDAD 1
BY WILLIAM MORTON WHEELER
Since the publication of my paper (Bull. Mus. Comp. Zool., 40, 1916, pp. 323 - 330, 1 fig.) on the ants collected in Trinidad by Prof. Roland Thaxter I have seen considerable additional material from the same locality. Dr. F. E. Lutz has recently sent me for study a series of specimens taken by Mr. P. B. Whelpley and contributed to The American Museum of Natural History, and Mr. F. W. Urich has sent me several interesting forms, among them a singular cave-ant which proves to belong to an undescribed genus. I have also found some species hitherto unrecorded from the island in a vial of miscellaneous sweepings received from Prof. Thaxter. During July 1920, while on my way to British Guiana, I was able, through the courtesy of Mr. W. G. Freeman, Director of Agriculture, Department of Trinidad and Tobago, to collect a number of species in the Botanical Garden near Port of Spain and at Caroni and Diego Martin. After studying this additional material it seems advisable to list the Formicidae known to occur in the island. I have therefore included all the older records of species taken by Mr. Urich and Prof. Forel, who collected at Port of Spain while on his voyage to Colombia in 1896. The nearly 150 different forms taken to date furnish additional proof, if it were needed, that the ant fauna of Trinidad, unlike that of the various Windward Islands and Tobago, is in great part identical with and probably quite as rich as that of the adjacent Venezuelan coast.
FORMICIDAE
Dorylinae
</mods:mods>
</taxonx:taxonxHeader>

Clearly this section needs quite a bit more work: some content is mistagged and other text is missed, e.g. the document title and taxon names are not caught.

Mistagged above:

  • AMERICAN MUSEUM NOVITATES is the journal name, so a <mods:relatedItem type="host"> element should be wrapped around the existing <mods:titleInfo><mods:title> tags
  • The document title (THE ANTS OF TRINIDAD) should be wrapped in <mods:titleInfo><mods:title> tags
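The two corrections listed above can be sketched outside GG with Python's standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the input is a simplified stand-in for the tutorial output with the loose text omitted, the helper names q and fix_header are hypothetical, and the trailing "1" after the title is left out on the assumption that it is a footnote marker.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def q(tag):
    """Qualify a local tag name with the MODS namespace."""
    return "{%s}%s" % (MODS_NS, tag)

def fix_header(mods_root, article_title):
    """Wrap the journal titleInfo in a host relatedItem and add the article title."""
    journal_info = mods_root.find(q("titleInfo"))
    position = list(mods_root).index(journal_info)

    # Move the existing titleInfo inside a new relatedItem of type "host".
    host = ET.Element(q("relatedItem"), {"type": "host"})
    mods_root.remove(journal_info)
    host.append(journal_info)
    mods_root.insert(position, host)

    # Add the article's own title as a sibling, ahead of the host item.
    article_info = ET.Element(q("titleInfo"))
    ET.SubElement(article_info, q("title")).text = article_title
    mods_root.insert(position, article_info)
    return mods_root

# Simplified stand-in for the tutorial output (the loose text is omitted).
fragment = (
    '<mods:mods xmlns:mods="%s">'
    "<mods:titleInfo><mods:title>AMERICAN MUSEUM NOVITATES</mods:title></mods:titleInfo>"
    "</mods:mods>" % MODS_NS
)
root = fix_header(ET.fromstring(fragment), "THE ANTS OF TRINIDAD")
print(ET.tostring(root, encoding="unicode"))
```

The sketch does not deal with the untagged text that the tutorial output leaves inside the header; that would still need to be moved or marked up separately.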

The level of atomisation of MODS mark-up, if used, needs to be determined. (TaxonX uses MODS (Metadata Object Description Schema) for metadata mark-up).

We also need to check how to handle content which is identified by tags in MODS, but is also legitimate text content in the document, e.g. Number 45 is in the text, and also in mods tags/attributes. Leaving the text in may cause the data to appear more than once later on; removing it may result in it not being seen if text node content is used. This is a tension between XML as data and XML as document.
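The duplication problem can be made concrete with a small Python sketch. The doc and p wrapper elements are hypothetical and used only for illustration, and holding the issue number in mods:part/mods:detail/mods:number is one possible encoding rather than what GG or TaxonX necessarily produces.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Structured issue number, in one form that MODS allows.
MODS_PART = (
    '<mods:mods xmlns:mods="%s"><mods:part><mods:detail type="issue">'
    "<mods:number>45</mods:number></mods:detail></mods:part></mods:mods>" % MODS_NS
)

# "doc" and "p" are hypothetical wrapper elements used only for this illustration.
TEXT_KEPT = "<doc>%s<p>Number 45 September 7, 1922</p></doc>" % MODS_PART
TEXT_REMOVED = "<doc>%s<p>September 7, 1922</p></doc>" % MODS_PART

def text_content(xml_string):
    """Return all text node content, as a plain-text reading of the file would see it."""
    root = ET.fromstring(xml_string)
    return " ".join(" ".join(root.itertext()).split())

# Keeping the text: the issue number shows up twice in the text content.
print(text_content(TEXT_KEPT).count("45"))
# Removing the text: the phrase as printed is no longer present.
print("Number 45" in text_content(TEXT_REMOVED))
```

Neither variant is right in itself; which one is acceptable depends on whether the final output is consumed as data or read as a document.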

UI: It is unclear to me how to wrap a tag around existing tags in GG during a correction phase, as the default behaviour appears to be to add the tags adjacent to the contained text.

Always check the result of a process. I would suggest that it is better to check the intermediate format or stages leading up to it than to have to correct an exported TaxonX file.

Validation is important, but valid mark-up does not mean correct mark-up.
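A minimal Python sketch of this point, using an abbreviated copy of the header shown earlier. It checks only well-formedness, which is a weaker stand-in for schema validation, and then applies one content check; both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def is_well_formed(xml_string):
    """True if the string parses as XML (a weaker check than schema validation)."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

def tagged_titles(xml_string):
    """Return the text of every mods:title element in the document."""
    root = ET.fromstring(xml_string)
    return [(t.text or "").strip() for t in root.iter("{%s}title" % MODS_NS)]

# Abbreviated version of the tutorial output: the article title is loose text.
header = (
    '<mods:mods xmlns:mods="%s">'
    "<mods:titleInfo><mods:title>AMERICAN MUSEUM NOVITATES</mods:title></mods:titleInfo>"
    "THE ANTS OF TRINIDAD 1"
    "</mods:mods>" % MODS_NS
)

print(is_well_formed(header))
print("THE ANTS OF TRINIDAD" in tagged_titles(header))
```

The header passes the structural check even though the only tagged title is the journal name, so a check of the content itself is still needed.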

