Jules Verne Forum

<jvf@Gilead.org.il>


Re: OCR scans

From: Ralf Tauchmann <ralf.tauchmann~at~t-online.de>
Date: Thu, 10 Feb 2000 17:20:01 +0100
To: jvf~at~math.technion.ac.il


James D. Keeline wrote:
> Norm, I have some questions about the scanning/OCR process:
>
> Which programs have the best performance to cost ratio for the
> OCR end of things?
Dear James,
Dear fellow Vernians,

If you're interested, here are some general hints and comments from my own
experience. I have some practice with TextBridge (Pro98), but I've never tried
any other program. (To be clear: I have no commercial interest in
TextBridge!)

TextBridge is relatively expensive because of a number of additional features:
image zoning, restoring the original layout, learning functions, proofing
tools... I always skip these extra functions, use TextBridge for pure OCR only,
and rely on proofing with MS-Word (2000).

But TextBridge has several important features, which I find very helpful for
extensive scan/OCR projects:

1) MORE THAN ONE PAGE - it OCRs a lot of pages at one time (either directly from
the scanner or by reading in graphic files - BMP, TIF...).

2) DEFERRED OCR or BACKGROUND OCR (saves time)

3) OUTPUT FORMATS. The resultant text can be directly stored in one single file
(ASCII, RTF [MS-Word], WPD [WordPerfect]...).

4) FLOW TEXT. TextBridge recognises and renders paragraphs as such (exception:
paragraphs extending from one page to the next one, but this is true for OCR in
general).

>
> Do the programs prefer grayscale or bitmap images for scanning?

Black & white is what matters. The latest TextBridge version claims grayscale
and colour recognition, but chiefly for restoring colours. The standard
requirement is 300 dpi in black & white (sometimes called LineArt).

> Is it better to adjust the images for a very white page and very
> black text?

Good contrast is always helpful, but it can also emphasize paper flaws, which
are then wrongly read as letters or numbers.
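
To illustrate that trade-off, here is a minimal Python sketch on a toy row of
grey values. It has nothing to do with how TextBridge works internally; the
function name binarize, the pixel values and the two thresholds are all
invented for the example.

```python
def binarize(gray_rows, threshold):
    """Map grey values (0 = black, 255 = white) to 1 (ink) or 0 (paper)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]

# Hypothetical scan line: white paper (230), one text pixel (30),
# and one faint paper flaw (150).
page = [[230, 30, 230, 150, 230]]

moderate = binarize(page, 128)  # the flaw stays white
harsh = binarize(page, 180)     # the flaw turns black, like ink

print(moderate)
print(harsh)
```

With the harsher threshold the faint speck becomes indistinguishable from ink,
which is the kind of flaw an OCR program may then read as a character.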

> Are there programs which can take the filenames of a particular
> structure and process them automatically? If so, we might want
> to have names which facilitate this (for example, a file name
> could be Mysterious_Island_001.tif). I am a Mac person but I am
> trying to consider the special needs of the Windows environment.

Input names are not so important for TextBridge (except for keeping the pages
in the proper order), because it can produce one output file from a large
number of pages. (I've forgotten the recommended maximum, but it's a lot, and
the program will warn you.) I would suggest names of 8 characters or fewer.
Proposal: 001mi.tif (or 001myst.tif). Short names are easier to read in the
generally narrow file-selection boxes.
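
A naming scheme of this kind could be generated with a few lines of Python.
This is only a sketch: the helper page_filename and its defaults are
hypothetical, and the only things taken from the proposal above are the
zero-padded page number, the short tag and the 8-character limit.

```python
def page_filename(page: int, tag: str = "mi", ext: str = "tif") -> str:
    """Build a short name such as 001mi.tif (tag and ext are illustrative)."""
    name = f"{page:03d}{tag}"  # zero padding keeps alphabetical order = page order
    if len(name) > 8:
        raise ValueError(f"{name!r} is longer than 8 characters")
    return f"{name}.{ext}"

# Pages scanned out of order still sort correctly by name.
names = [page_filename(n) for n in (10, 2, 1)]
print(sorted(names))
```

The zero padding is the important part: without it, a plain alphabetical sort
would put page 10 before page 2.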

Personally, I skip the TextBridge proofing tools (too hard a job) and rely on
the proofing tools of the word processors (especially Word 2000: good
search-and-replace functions [also for formats], spelling errors underlined in
red, grammar errors underlined in green...).

On this basis, I would estimate the share of the total work as follows:

        Scan: 20-25%
        OCR: 5-10% (as background or deferred OCR)
        1st proofing: 50%
        2nd/3rd... proofing: 20%

Some remarks about proofing with word processors (especially MS-Word): It is
wise to "portion" books like Mysterious Island (approximately 10 chapters at a
time), because the on-line spelling module can cope only with a limited
number of spelling errors (and there are usually a lot of OCR errors).
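
If the chapters are available as separate pieces of text, the portioning can be
sketched in Python as below. The function name portion and the 25 placeholder
chapters are invented for the example; only the portion size of about 10 comes
from the advice above.

```python
def portion(chapters, size=10):
    """Split a list of chapters into consecutive portions of at most `size`."""
    return [chapters[i:i + size] for i in range(0, len(chapters), size)]

# Placeholder chapter labels; a real run would hold the chapter texts instead.
chapters = [f"Chapter {n}" for n in range(1, 26)]

for number, part in enumerate(portion(chapters), start=1):
    print(number, part[0], "to", part[-1])
```

Each portion can then be saved as its own document and proofed separately.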

There are some typical and unavoidable OCR errors. Some of them cannot be
recognized by automatic proofing tools, because the misread word is itself a
valid word:

        cl read as d (so "clear" can easily give "dear")
        rn read as m ("torn" gives "tom")
        letter 'l' (EL) and number '1' (ONE), for some fonts
        capital 'I' and small 'l' (EL), for other fonts
        etc.
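
Since a spell checker accepts both "dear" and "clear", one possible aid is to
list every word that turns into another valid word when a confusable pair is
swapped. The Python sketch below is my own assumption of how that could look:
the names CONFUSIONS, alternatives and suspicious are hypothetical, the tiny
vocabulary stands in for a real word list, and lower-case input is assumed.

```python
# Confusable letter groups taken from the examples above.
CONFUSIONS = [("cl", "d"), ("rn", "m")]

def alternatives(word):
    """Return the words obtained by swapping one confusable group once."""
    found = set()
    for a, b in CONFUSIONS:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                found.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    found.discard(word)
    return found

def suspicious(words, vocabulary):
    """Map each word to the valid words it may have been misread from."""
    report = {}
    for word in words:
        hits = sorted(alternatives(word) & vocabulary)
        if hits:
            report[word] = hits
    return report

vocabulary = {"clear", "dear", "torn", "tom", "sea", "the"}  # stand-in word list
print(suspicious(["the", "dear", "sea", "tom"], vocabulary))
```

The report only says where to look; whether "dear" or "clear" is right still
has to be decided against the original page.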

For sure, proofing is the biggest part. As soon as you rely on automatic means,
you run the risk of undetected errors. The best way would be to compare the
scanned text word by word with the original. And even that would not catch
every error: in "Robur le Conquérant", I could read the name "Albatros", but
the second "letter" was the figure ONE and not EL. That could not be seen with
the naked eye, and automatic proofing with MS-Word skips figures.
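
A case like this one can be caught mechanically by searching for tokens that
mix letters and figures. The following Python sketch is an assumption on my
part, not a feature of Word or TextBridge; the function name mixed_tokens is
hypothetical.

```python
import re

# A token that contains at least one letter and at least one digit.
# [^\W\d_] matches any letter, including accented ones such as "é".
MIXED = re.compile(r"\b(?=\w*[^\W\d_])(?=\w*\d)\w+")

def mixed_tokens(text):
    """Return every token that mixes letters and figures, in order."""
    return MIXED.findall(text)

sample = "the A1batros rose in 1886"
print(mixed_tokens(sample))  # ['A1batros']
```

Pure numbers such as years are left alone, but legitimate mixed tokens like
"2nd" will also be reported and have to be checked by hand.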

So best wishes for the PG work.

Ralf Tauchmann
(Radebeul, Germany)
Received on Thu 10 Feb 2000 - 18:21:13 IST

Copyright © Zvi Har’El