Help me digitize Color Computer Magazines

Joined
Feb 24, 2007
Messages
25
Location
Pacific Northwest, USA
With the withdrawal of the Rainbow on Disc project, I've determined we
need a more distributed process for digitizing magazines.

So I built a community digitization project web site:

http://cocomag.dyndns.org/UnderColor.shtml

Note, when I say digitizing I do not mean just scanning. I mean
computer automated text conversion with the help of human volunteers.

The site is running off a server attached to my home DSL. It has low
bandwidth, so image loading is slow. I hope not too slow, but we'll
find out.

I welcome any and all feedback.
 
Tim,

I read your post with interest.

Did you know about reCAPTCHA? It was also described on the BBC's Digital Planet, and here is another summary.

I see your effort wishes to use the community for recognition.

Personally, I think that unless you have a very fast DSL upload link, the bottleneck will be people like me downloading images to be rescanned.
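To put a rough number on that bottleneck, here is a minimal back-of-the-envelope sketch. The image size and upload rate are assumptions picked for illustration, not measurements from Tim's server, and the calculation ignores protocol overhead and concurrent volunteers.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring protocol overhead."""
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / link_bits_per_second


# Hypothetical figures, not taken from the site:
PAGE_IMAGE_BYTES = 500_000   # assumed size of one compressed page image
DSL_UPLOAD_BPS = 384_000     # assumed home DSL upload rate in bits per second

seconds_per_page = transfer_seconds(PAGE_IMAGE_BYTES, DSL_UPLOAD_BPS)
print(f"{seconds_per_page:.1f} s per page image")  # prints "10.4 s per page image"
```

Under those assumptions every page image takes about ten seconds to reach a single volunteer, and the time grows with each additional person downloading at once.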

Also, I was somewhat overwhelmed by the number of potential tasks a volunteer might be asked to perform!

For me, the Luddite (sorry, Occam's razor) methodology would be:
a) post magazines to somebody
b) use a very high quality scanner to scan in as image
c) use a high quality OCR program (Adobe et al.) to perform detailed OCR
d) live with the imperfections
e) optionally post magazines back to you

I just finished scanning over 20,000 pages in colour using this mechanism. How many pages have you got?
 
d) live with the imperfections

There's the rub, isn't it? I have access to Adobe Acrobat's text capture. I should try it with the complex magazines I've got.

How many pages have you got?

A lot. Fortunately, the scanning part is being accomplished by others. My goal is accurate digitization of the content. If I've underestimated Adobe's software, then I may have to chalk the website up to experience.
 
Any OCR software is only as good as your experience with it.

The longer you use a piece of software, the more quirks you'll find work-arounds for and the more tricks and techniques you'll discover.

It's like anything else: the more time you spend doing it, the better you'll become at it.
 
Tim,

Let me expand on living with imperfections ....

I mean that initially you perform an image scan. Let's say 300 dpi or 600 dpi colour. Personally I think 300 dpi is sufficient.
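For a sense of what the two resolutions cost in storage, here is a rough sketch of the arithmetic. It assumes a US letter page (8.5 x 11 inches) and 24-bit colour, which may not match the actual magazine trim size, and it reports uncompressed sizes only; saved JPEG or PDF files will be much smaller.

```python
def scan_size(width_in: float, height_in: float, dpi: int, bytes_per_pixel: int = 3):
    """Return (width_px, height_px, raw_bytes) for an uncompressed scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Assumed page size: US letter, 8.5 x 11 inches; 3 bytes per pixel = 24-bit colour.
for dpi in (300, 600):
    w, h, raw = scan_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, {raw / 1e6:.1f} MB uncompressed")
# prints:
# 300 dpi: 2550 x 3300 px, 25.2 MB uncompressed
# 600 dpi: 5100 x 6600 px, 101.0 MB uncompressed
```

Doubling the resolution quadruples the pixel count, which is one reason 300 dpi is attractive when many thousands of pages are involved.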

Next, in Adobe Acrobat, you OCR these images. The worst that can happen is that the recognized text is not perfect, so cut and paste is not 100% accurate, but the page still looks good to a human.

So that's a good compromise to me. You need the OCR so Google can potentially index it, and that means that people can find it.

And you need it indexed by Google (et al.), otherwise you'd not really be sharing the information, would you?

Regards marcus
 
Cool project. If your image uploading speed is an issue, I highly recommend just getting a GoDaddy hosted server. It's so cheap, like $5/mo, and they are great. My server with them maxes out my 10 Mbps download cable modem!
 