Here are my comments on your items.
----------
> From: James D. Keeline <keeline~at~adnc.com>
> To: Jules Verne Forum <jvf~at~math.technion.ac.il>
> Subject: Re: A Modest Proposal
> Date: Wednesday, February 09, 2000 6:51 PM
>
>
> Norm, I have some questions about the scanning/OCR process:
>
> Which programs have the best performance to cost ratio for the
> OCR end of things?
Ans. Everyone has their favorite; some do one thing better than others. I
use Omnipage, as that is what I started with. It does have problems, such as
turning "rn" into "m", that are hard to repair. Judith Boss
likes Presto:
http://www.newsoftinc.com
All handle Latin-1 and 8-bit characters.
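
One way to hunt for the "rn"/"m" confusion after OCR is sketched below. It is only an illustration, not a feature of Omnipage or Presto: the function names and the tiny word list are made up for the example, and a real run would need a full dictionary. It looks at words the dictionary does not know and checks whether swapping one "m" for "rn" (or the reverse) yields a known word.

```python
def rn_m_candidates(word, dictionary):
    """Return known words reachable by swapping one 'm' for 'rn' or one 'rn' for 'm'."""
    candidates = set()
    for wrong, right in (("m", "rn"), ("rn", "m")):
        start = word.find(wrong)
        while start != -1:
            variant = word[:start] + right + word[start + len(wrong):]
            if variant in dictionary:
                candidates.add(variant)
            start = word.find(wrong, start + 1)
    return sorted(candidates)


def flag_suspects(words, dictionary):
    """Map each word missing from the dictionary to its plausible rn/m corrections."""
    suspects = {}
    for word in words:
        if word not in dictionary:
            fixes = rn_m_candidates(word, dictionary)
            if fixes:
                suspects[word] = fixes
    return suspects


if __name__ == "__main__":
    # Toy dictionary and OCR output, both hypothetical.
    dictionary = {"modern", "burn", "corner", "the", "lamp"}
    ocr_words = ["the", "modem", "bum", "comer", "lamp"]
    print(flag_suspects(ocr_words, dictionary))
```

The sketch also shows why these errors are hard to repair: misreadings such as "modem" for "modern" are real words, so a normal spell checker with a full dictionary passes them and a person still has to read the context.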
>
> Do the programs prefer grayscale or bitmap images for scanning?
Ans. Omnipage has a mode to scan in grayscale. It supports TIFF files at
200, 300, or 400 dpi, b/w or grayscale. A 300 dpi grayscale image is
several megabytes per page, so grayscale is used mainly internally. Better
recognition is obtained with grayscale. I performed some tests on a double
page of a heavily foxed book. The results were 20 errors at 300 dpi b/w;
10 errors at 400 dpi b/w; 10 errors at 300 dpi grayscale; 4 errors at 400
dpi grayscale. At 400 dpi grayscale there were fewer noise and
punctuation errors, which are harder to correct than spelling errors, which
are fixed with a spell checker. The size of grayscale files (5 and 8 megs for
300/400 dpi) precludes their use except at a scanning station. But
comparable results are obtained at 400 dpi b/w, which might be recommended
for heavily foxed books.
When Omnipage saves its scanned images, they are black and white, 300 dpi.
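
The file sizes above follow from simple arithmetic, sketched here for uncompressed images. The 6 by 9 inch scan area is an assumption made for the example, not a measurement of the book that was tested, and the function name is made up.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size of an uncompressed scan: pixel count times bits per pixel, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8


if __name__ == "__main__":
    # Assumed scan area of 6 x 9 inches (hypothetical).
    for dpi in (300, 400):
        gray = uncompressed_bytes(6, 9, dpi, 8)  # 8-bit grayscale
        bw = uncompressed_bytes(6, 9, dpi, 1)    # 1-bit black and white
        print(dpi, "dpi:", gray, "bytes grayscale,", bw, "bytes b/w")
```

Under that assumed page size the grayscale images come to about 4.9 and 8.6 million bytes at 300 and 400 dpi, the same order as the sizes quoted above, while the b/w images are one eighth of that before any compression.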
>
> Is it better to adjust the images for a very white page and very
> black text?
Ans. You have to be careful about specks, etc., becoming false characters.
Sometimes it is better to scan with a scanner program and clean up the image
before going to OCR. The scanner programs offer clean, deskew, blank, and
other nice features.
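
To show what a "clean" step does to specks, here is a minimal sketch that erases very small black blobs from a b/w bitmap. It is not the algorithm of any particular scanner program; the bitmap is a plain list of rows with 1 for black, and the function name and the `min_pixels` threshold are assumptions made for the example.

```python
def despeckle(bitmap, min_pixels=3):
    """Return a copy of bitmap with black blobs smaller than min_pixels erased."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Collect the blob of touching black pixels (diagonals included).
            blob = []
            stack = [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if bitmap[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            # Blobs below the threshold are treated as specks and whitened.
            if len(blob) < min_pixels:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],  # a lone speck
        [0, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],  # part of a real stroke
    ]
    for row in despeckle(page):
        print(row)
```

The threshold needs care: set too high, it will also erase periods, commas, and the dots on i and j, which is the punctuation that is hardest to put back afterwards.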
>
> Are there programs which can take the filenames of a particular
> structure and process them automatically? If so, we might want
> to have names which facilitate this (for example, a file name
> could be Mysterious_Island_001.tif). I am a Mac person but I am
> trying to consider the special needs of the Windows environment.
Ans. Most programs can import multiple files, and also have re-ordering
ability. With TIFF4 Multipage, supported by Omnipage at least, you can
store a whole book as one file, if you want to. The pages are numbered
by the user before saving in TIFF4 format. When imported into Omnipage, all
pages appear sequentially.
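
If single-page files are used with names like Mysterious_Island_001.tif, putting them in page order before import is easy to script. The sketch below assumes a Title_NNN.tif pattern as in the question; the pattern and the function name are illustrative, not something any OCR program requires.

```python
import re

# Assumed naming scheme: <title>_<page number>.tif or .tiff
PAGE_NAME = re.compile(r"^(?P<title>.+)_(?P<page>\d+)\.tiff?$", re.IGNORECASE)


def order_pages(filenames):
    """Group file names by title and sort each group by numeric page number."""
    books = {}
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        page = int(match.group("page"))
        books.setdefault(match.group("title"), []).append((page, name))
    return {title: [name for _, name in sorted(pages)]
            for title, pages in books.items()}


if __name__ == "__main__":
    names = [
        "Mysterious_Island_010.tif",
        "Mysterious_Island_002.tif",
        "notes.txt",
        "Mysterious_Island_001.tif",
    ]
    print(order_pages(names))
```

Zero-padded numbers (001, 002, ...) have the added benefit that a plain alphabetical listing is already in page order; the numeric sort here also copes with names that are not padded.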
ZIP Files
TIFF4 multipage format is already compressed. It is the format used by fax
machines upgraded to higher dpi. Putting it in a Zip file gives no further
compression, may actually enlarge the file, and confuses things. The
main reason for using Zip would be to collect a number of small documents
together as a package to mail on the net, or if for some reason TIFF4
multipage could not be used.
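
The point about compressing twice can be checked with a small experiment. This sketch uses Python's zlib (the deflate method commonly used inside Zip files) as a stand-in for both steps; it does not use the fax-style compression of TIFF4 itself, and the sample text and function names are made up for the example.

```python
import random
import zlib


def make_sample_text(n_words=20000, seed=0):
    """Build repeatable filler text from a small, hypothetical vocabulary."""
    rng = random.Random(seed)
    vocabulary = ["island", "captain", "balloon", "engineer", "sea",
                  "the", "of", "and", "mysterious", "voyage"]
    return " ".join(rng.choice(vocabulary) for _ in range(n_words)).encode("ascii")


def compress_twice(data):
    """Return the sizes of the raw data, one compression pass, and two passes."""
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)
    return len(data), len(once), len(twice)


if __name__ == "__main__":
    raw, once, twice = compress_twice(make_sample_text())
    print("raw bytes:       ", raw)
    print("compressed once: ", once)
    print("compressed twice:", twice)
```

The first pass shrinks the text a great deal; the second pass has almost nothing left to remove and only adds its own overhead.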
File names
A technical problem with file names is that, as I understand it, on Unix
systems all file names are limited to 8 characters. This puts a limit on the
inventiveness of identifiable filenames which are stored on a server.
Servers
The best way to move things about from a central site is ftp. Downloading
from a web page, I believe, has more problems. Also, for speed, a US site
would be faster, at least in the US; I have noticed considerable delays
contacting Israel on occasion. I believe setting up an ftp site is more
difficult with an ISP than with a university, where passwords are managed
locally.
Hope this answers your questions.
Received on Mon 14 Feb 2000 - 21:35:56 IST