Lafferty, Daryl; Landrum, Leslie.
SALIX, a semiautomatic label information extraction system using OCR.
Using Optical Character Recognition (OCR) software to read label data, coupled with software to transfer those data into a database, has been a goal in recent years. If the process of extracting data from specimen labels could be automated or partially automated, the databasing of the numerous specimens held in herbaria could be greatly accelerated. We believe that full automation may be impossible: labels vary so much in format and quality that some checking for accuracy will always be needed at some stage.
One of us (Lafferty) developed a program, SALIX, that includes an open-source OCR program (Tesseract) and facilitates moving data from label image files to a database file. It can also work with external OCR programs (e.g., ABBYY 5.0) available with some scanners. We have used this system and improved it to the point where it has become a practical alternative to typing into a database. When OCR results are good and the labels are information-rich (e.g., with numerous associated species or extensive habitat descriptions), we believe SALIX is faster than typing. SALIX also has the advantage of keeping a photographic record of each label. Furthermore, all the processing is under the control of the operator, and the necessary equipment and software are relatively inexpensive: a moderately priced digital camera and SALIX are all that are needed. By doing the whole job in-house, one can proofread the database as one goes along. Because SALIX parses the OCR output into database fields semi-automatically while the user watches and guides the process, mistakes can be corrected immediately. As SALIX is used, it records which words are more likely to belong to particular fields, and its parsing improves.
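The idea of recording which words tend to belong to which fields can be illustrated with a small sketch. This is not SALIX's actual code or algorithm, which the abstract does not describe. The class name `FieldGuesser`, its methods, and the field names `habitat` and `associated_species` are all hypothetical, and simple word-frequency voting is assumed only for illustration.

```python
import re
from collections import Counter, defaultdict


class FieldGuesser:
    """Hypothetical word-to-field frequency model (an assumption, not SALIX's code)."""

    def __init__(self):
        # counts[word][field] = how often the operator confirmed this word in this field
        self.counts = defaultdict(Counter)

    @staticmethod
    def _words(text):
        # Lowercase alphabetic tokens only; digits and punctuation are ignored.
        return re.findall(r"[a-z]+", text.lower())

    def record(self, text, field):
        """Store an operator-confirmed assignment of label text to a field."""
        for word in self._words(text):
            self.counts[word][field] += 1

    def guess(self, line):
        """Return the field with the most word votes, or None if no word is known."""
        votes = Counter()
        for word in self._words(line):
            votes.update(self.counts.get(word, {}))
        if not votes:
            return None
        return votes.most_common(1)[0][0]


guesser = FieldGuesser()
guesser.record("Rocky slope with desert scrub", "habitat")
guesser.record("Sandy wash, desert scrub", "habitat")
guesser.record("Growing with Larrea tridentata", "associated_species")

print(guesser.guess("Open desert scrub on rocky hillside"))  # prints "habitat"
```

In this sketch every confirmed assignment adds to the counts, so later suggestions draw on all earlier corrections. The operator would still review each suggestion, in keeping with the semiautomatic approach described above.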
1 - Arizona State University, School of Life Sciences, P.O. Box 874601, Tempe, Arizona, 85287, USA
2 - Arizona State University, School of Life Sciences, P.O. Box 874501, Tempe, Arizona, 85287-4501, USA
Presentation Type: Oral Paper: Papers for BSA Sections
Location: Wasatch B/Cliff Lodge - Level C
Date: Wednesday, July 29th, 2009
Time: 1:30 PM