Botany & Mycology 2009 - Abstract Search

Abstract Detail


Systematics Section

Lafferty, Daryl [1], Landrum, Leslie [2].

SALIX, a semiautomatic label information extraction system using OCR.

The use of Optical Character Recognition (OCR) software to read label data coupled with software to transfer those data into a database has been a goal in recent years. If the process of extracting data from specimens could become automated or partially automated, then the process of databasing numerous specimens held in herbaria could be greatly accelerated. We believe that full automation may be impossible because labels are so variable in format and quality that there will always be a need to do some checking for accuracy at some stage.
One of us (Lafferty) developed a program, SALIX, that includes an open-source OCR program (Tesseract) and facilitates moving data from label image files to a database file. It can also work with external OCR programs (e.g., ABBYY 5.0) available with some scanners. We have used this system and have improved it to the point where it has become a practical alternative to typing into a database. When OCR results are good and the labels are information-rich (e.g., with numerous associated species, or extensive habitat descriptions), we believe SALIX is faster than typing. Using SALIX has the advantage of including a photographic record of each label. Furthermore, all the processing is in the control of the operator, and the necessary equipment and software are relatively inexpensive. A moderately priced digital camera and SALIX are all that are needed. By doing the whole job in-house, one can proofread the database as one goes along. SALIX parses the OCR output to a database semi-automatically while the user watches and facilitates the process, so mistakes can be corrected immediately. As SALIX is used, it records that certain words are more likely to belong to particular fields, and the parsing improves.
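
The idea that parsing improves as the program records which words tend to belong to which fields can be illustrated with a minimal sketch. This is not SALIX's actual code or algorithm, which the abstract does not describe; the class FieldGuesser, its methods, the field names, and the sample label text are all hypothetical and assume a simple word-count vote.

```python
from collections import Counter, defaultdict


class FieldGuesser:
    """Hypothetical helper: learns which label words tend to land in which field."""

    def __init__(self):
        # word -> Counter of field names the operator confirmed for that word
        self.counts = defaultdict(Counter)

    def record(self, text, field):
        """Remember that the operator assigned `text` to `field`."""
        for word in text.lower().split():
            self.counts[word][field] += 1

    def guess(self, text):
        """Return the field with the most votes for `text`, or None if no word is known."""
        votes = Counter()
        for word in text.lower().split():
            # .get avoids creating empty entries for unseen words
            votes.update(self.counts.get(word, {}))
        if not votes:
            return None
        return votes.most_common(1)[0][0]


# Made-up label fragments standing in for operator-confirmed assignments.
guesser = FieldGuesser()
guesser.record("rocky slope with Quercus", "habitat")
guesser.record("sandy wash near road", "habitat")
guesser.record("Maricopa County", "locality")

suggestion = guesser.guess("steep rocky slope")  # words seen before vote for a field
unknown = guesser.guess("xyz")                   # no known words, so no suggestion
```

In a sketch like this, each correction the operator makes feeds back into the counts, so later suggestions for similar text are more likely to be right, and the operator still confirms every assignment.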



1 - Arizona State University, School of Life Sciences, P.O. Box 874601, Tempe, Arizona, 85287, USA
2 - Arizona State University, School of Life Sciences, P.O. Box 874501, Tempe, Arizona, 85287-4501, USA

Keywords:
herbarium
Database
OCR.

Presentation Type: Oral Paper: Papers for BSA Sections
Session: 67
Location: Wasatch B/Cliff Lodge - Level C
Date: Wednesday, July 29th, 2009
Time: 1:30 PM
Number: 67003
Abstract ID: 130


Copyright 2000-2008, Botanical Society of America. All rights