
How Best to Scan Documentation

cjs

I've got an iCODIS Megascan Pro X9 scanner, which is essentially a 21 megapixel camera on a stand. It comes with software called Scanline that will capture images, split them into left-right pages and crop and decurl each page, optionally OCR the pages (in both Japanese and English), and bundle all this together into a final output PDF. I do have access to the original images it takes as well (they can be found in certain directories, as described at that link), but unfortunately only in JPEG format. I've been uploading a ZIP of those images along with the PDF when I upload to archive.org (see, e.g., TK-85 Training Book), but my current understanding is that this isn't nearly as useful as uploading original TIFF images from the camera because the JPEG encoding interferes with the ability to do post-processing (such as decurling, despeckling, etc.).

The "scanner" does show up as a standard USB camera, so it would be easy enough for me to collect TIFF images from it under Linux at least. But I'm not sure what I could do with them after that point to get the appropriate post-processing that would eventually produce a useful PDF.

What should I be doing to get the best scans I can out of this, both for immediate use as a PDF of reasonable quality, and for long-term archival purposes?
 
Scanning is a bit of an art, and depends on your source and what you want to get from it. Bundled scanner tools are there to make things easy, not good.

I start by collecting raw TIF or PNG files, then using ImageMagick to crop to a consistent size, deskew, adjust contrast, reduce color depth, and such, and save the output as PNG.

For my needs, I try to get the page background true white, and the center of text mostly true black, while having enough gray around the edge so the text does not look jagged on the screen. The idea being I can print the resulting image to a printer and not have dithering all over the place. The processed PNG files become part of my larger archive.
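The tone mapping described above (paper to true white, stroke centers to true black, a gray ramp in between) can be sketched in a few lines of pure Python. This is only an illustration: the black and white points of 60 and 200 are made-up example values, not settings from anyone's actual scripts. ImageMagick's `-level` operator performs a comparable linear stretch.

```python
def stretch_levels(pixels, black_point=60, white_point=200):
    """Map 8-bit gray values so ink clips to 0 and paper clips to 255.

    Values between the two points are scaled linearly, which preserves
    the gray ramp around glyph edges so text does not look jagged.
    The default points are illustrative assumptions only.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    out = []
    for value in pixels:
        if value <= black_point:
            out.append(0)            # center of a stroke: true black
        elif value >= white_point:
            out.append(255)          # page background: true white
        else:
            out.append((value - black_point) * 255 // span)
    return out


# One scan line crossing a stroke: paper, edge, ink, ink, edge, paper.
row = [230, 215, 130, 40, 35, 130, 220]
print(stretch_levels(row))  # [255, 255, 127, 0, 0, 127, 255]
```

The edge pixels stay mid-gray while everything else snaps to pure black or white, which is what keeps a printer from dithering the background.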

From there it is a matter of feeding the pages into an OCR/PDF tool. I do recommend using something that can create PDFs that are compatible with older and less common PDF readers.
 
Scanning is a bit of an art, and depends on your source and what you want to get from it.
I'm looking for two things:
  1. Getting "raw" images (in the sense of original photos, not "pro camera raw") in the best form possible for later processing by people with better software and more skill than me. Basically, the best unprocessed archival output my 21 megapixel USB camera can produce. My understanding, as I mentioned, is that avoiding lossy compression for this stage is important. This comes from the bitsavers.org top page, which mentions that, "Lossy compression formats, such as JPEG, should NEVER be used to save pages of text, since the compression format destroys edge resolution and contrast."
  2. A PDF (or DJVU would be fine, too) that has the pages split (because I'm scanning bound volumes two-up), decurled, cropped to the page edge, fingers holding pages removed, etc. etc., and is fairly compact (which I guess may mean that the images are converted to JPEG, and possibly resolution lowered).
  3. OCR isn't really necessary; I let archive.org do that (and that also seems to take care of reducing the size of the PDF significantly for the OCR'd version).
I start by collecting raw TIF or PNG files, then using ImageMagick to crop to a consistent size, deskew, adjust contrast, reduce color depth, and such, and save the output as PNG.
ImageMagick is certainly appealing to me, since both that and GraphicsMagick are in the Debian package library (and probably that of pretty much any Linux). But do you automate this stuff somehow, or do you have to edit every page by hand (which I don't consider practical for the many hundreds of pages, probably several thousand really, that I want to scan)?

And how do you deal with decurling and finger removal? (This is the really handy thing, that the otherwise horrible Windows software that came with the scanner does.) Or are you just doing flatbed scans?

I suppose that I could live with a really bad job of getting the pages out of the source images, given that I'm keeping and uploading the source images anyway and can always hope for someone to go do a better job with those.
 
Scanning color prints and flyers requires low compression and a high-DPI scanner that is immune to shadowing or pre-processing. Those are things people occasionally like to reproduce, so compression and resolution matter. For documents with grayscale illustrations, using grayscale is fine, and for line art and solid text there is nothing wrong with using black-and-white mode unless the paper is especially damaged. There used to be massive wars on the cctalk mailing list about how to do it "correctly." Honestly, there are so many documents we are scanning precisely because we DON'T want to print them out again, preferring files a few (or tens of) megabytes in size, that the resolution does not matter as long as it's both legible by eye and your average OCR engine has no issues.

I'm looking at you, Sams.

"Lossy compression formats, such as JPEG, should NEVER be used to save pages of text, since the compression format destroys edge resolution and contrast."
Again, unless you are saving items such as marketing literature, posters or detailed diagrams, there is nothing wrong with the JPEG format. The next person who is going to read this PDF very likely does not care if there's a bit of compression noise. They do not care about the grain of the paper.

For the record, I use a Fujitsu FI-6770, a combination flatbed scanner and duplex ADF. It sits at the high end of the scanner market, but bought used, you cannot beat it. Output images are tweaked in Gimp 2.6 and then everything is dropped into and sorted with Adobe Acrobat which also de-skews and ADF's the document if I wish.
 
ImageMagick is a command-line tool. It can perform the same operation on multiple files in one command. Multiple commands can be placed in one batch file. I normally keep one batch file with the common commands I use, copy it into the folder with a copy of my scanned images, tweak any parameters specific to the job (such as if it needs more contrast adjustment) and let it rip. I may have to separate out any images that need to be processed differently first, such as the cover or pages with detailed color illustrations.
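For anyone who prefers a script to a batch file, the same "one set of commands, tweak a parameter per job" idea might look like the sketch below. It only builds the ImageMagick argument lists; it does not run them. The operator choices (`-deskew 40%`, `-colorspace Gray`, `-level`) and the default contrast values are assumptions for illustration, not the poster's actual batch file, and the program name is `convert` on ImageMagick 6 but `magick` on ImageMagick 7.

```python
from pathlib import Path


def build_commands(src_dir, dst_dir, level="25%,80%", program="convert"):
    """Return one ImageMagick argument list per TIFF found in src_dir.

    `level` is the per-job contrast tweak; everything else stays fixed.
    All values here are illustrative assumptions.
    """
    commands = []
    for src in sorted(Path(src_dir).glob("*.tif")):
        dst = Path(dst_dir) / (src.stem + ".png")
        commands.append([
            program, str(src),
            "-deskew", "40%",       # straighten slightly rotated pages
            "-colorspace", "Gray",  # drop color information
            "-level", level,        # contrast stretch, tuned per job
            str(dst),               # PNG output, chosen by extension
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review; run each with subprocess.run(cmd, check=True).
    for cmd in build_commands("scans", "processed", level="20%,85%"):
        print(" ".join(cmd))
```

Covers or color plates that need different treatment can simply live in a separate folder and be run with different parameters.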

I use a flat bed scanner and a document scanner. Bound manuals and fold-outs are scanned in the flat bed scanner, and loose-page three-ring/spiral binders or cut-up bound manuals are fed through the document scanner.

So they usually come out mostly flat, and no fingers in the way.

On the flatbed, I usually only scan one page at a time. Mine is just not large enough for two pages of a typical-sized manual.

If one were consistent about placing the manual so the pages split in the same place, then an ImageMagick batch file could theoretically split out pages.
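As a sketch of that fixed-split idea: if the gutter always lands at the same x position, the two crop rectangles are simple arithmetic. The function below produces geometry strings in ImageMagick's `WIDTHxHEIGHT+X+Y` form, which could be passed to `-crop` (followed by `+repage`). The frame size and gutter position in the example are hypothetical numbers.

```python
def split_geometries(width, height, gutter_x):
    """Return (left, right) crop geometries for a two-page spread.

    Assumes the book is placed so the gutter sits at the same x position
    in every image; that placement is the hard part in practice.
    """
    if not 0 < gutter_x < width:
        raise ValueError("gutter_x must lie inside the image")
    left = f"{gutter_x}x{height}+0+0"
    right = f"{width - gutter_x}x{height}+{gutter_x}+0"
    return left, right


# Hypothetical 5000x3500 capture with the gutter 2480 pixels from the left.
print(split_geometries(5000, 3500, 2480))
# ('2480x3500+0+0', '2520x3500+2480+0')
```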

I'm afraid I don't know anything about decurling software. I would, however, be wary of software that tries to automatically remove anything from an image - they can get things wrong and remove content.
 
...

The "scanner" does show up as a standard USB camera, so it would be easy enough for me to collect TIFF images from it under Linux at least. But I'm not sure what I could do with them after that point to get the appropriate post-processing that would eventually produce a useful PDF.

What should I be doing to get the best scans I can out of this, both for immediate use as a PDF of reasonable quality, and for long-term archival purposes?
At $dayjob, we have done a great deal of study on some specialized scanning, that of scanning astronomical photographic plates. Our gold standard "scanner" is a 7,000 pound hunk of granite and steel that started life as a Perkin-Elmer PDS 2020 microdensitometer (you can see a smaller version at https://www.pdsmicrod.com/about1 in the photo).

Part of this study has involved investigating in great detail how the modern flatbed scanner actually works, down to the pixel layouts in the scan bar itself. For our purposes, knowing precisely where each pixel is located is important, since uncertainty in the scan bar pixel layouts renders the resulting scan useless for astrometry, which precisely measures the position of stars, as well as for spectroscopy on objective prism plates, since the positional integrity of the spectral lines is distorted.

We have had better results with super flat field lenses on high quality DSLR cameras that are precisely mounted than we ever got with any flatbed scanner, and the pixel pattern is determinable, but even then, there is quantization noise, which shows up as moire in the spectra.

I've said all that to back up what I say next: for document scanning of things other than text, the interaction between the scan bar or CCD/CMOS sensor pixel pattern and the specific screen used when halftoning the colors or the grayscale can produce significant moire and other quantization noise artifacts.

You're thinking along the right lines: get as raw an image as possible. If JPG format is all it will do, then see how high you can make the JPG quality. The native format it exports depends on the sensor. Scan at as high a resolution as possible, then de-screen the scan before downsampling to PDF.
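A toy one-dimensional example shows why the order "de-screen, then downsample" matters. The halftone here is an idealized 50% gray screen with a period of 4 pixels, and the "de-screen" step is just a moving average over one screen period. Real de-screen filters are considerably more sophisticated, so treat this purely as an illustration of the principle.

```python
def point_sample(row, step):
    """Naive downsampling: keep every `step`-th pixel."""
    return row[::step]


def box_blur(row, width):
    """Moving average over `width` pixels (only fully covered positions)."""
    return [sum(row[i:i + width]) / width for i in range(len(row) - width + 1)]


SCREEN_PERIOD = 4
halftone = [255, 255, 0, 0] * 10   # 40 pixels of an idealized 50% gray screen

naive = point_sample(halftone, 5)
descreened = point_sample(box_blur(halftone, SCREEN_PERIOD), 5)

print(naive)       # [255, 255, 0, 0, 255, 255, 0, 0]
print(descreened)  # [127.5, 127.5, 127.5, 127.5, 127.5, 127.5, 127.5, 127.5]
```

Sampling every fifth pixel of a four-pixel screen yields a coarse false pattern (moire), while averaging over the screen period first recovers the flat gray the printer intended. That only works if the scan resolves the screen in the first place, hence the advice to oversample.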

There seem to be at least a couple of de-screen filter plugins for GIMP available; one such can be found referenced in https://www.gimp-forum.net/Thread-Removing-newsprint-halftone-artifacts. Now, it's been a long time since I last used the older de-screen filter, but when I did it worked really well to "unhalftone" scans, as long as the scan is sufficiently oversampled.

OCR can be very useful, as long as it's well proofread, and there are some good Linux tools for doing that. Again, you're barking up the right tree by taking a TIFF as the source material and doing all your processing at the highest possible resolution before downsampling to PDF.

Ira at www.trs-80.com has done some serious scan touchup work before, and, even though he uses Windows, most of the techniques are portable to Linux. I don't recall if he's on here or not, but you could reach out to him and ask him about his scanning pipeline.

Hope that's helpful; the more people properly scanning and archiving documentation the better!
 
I think the greater issue might not be the software or the camera - but your approach to the scanning. People always seem to scan books on a flat surface which leads to visible reflective differences in the paper, or cut the pages / spine from the book before scanning. It always seems to be a choice between destroying the book and getting the best scan, or keeping the book and taking whatever you can get.

But you can also build a jig to help if your camera supports scanning in a different direction from "directly overhead" and you can add lighting well to the sides in all directions so reflective surfaces are more evenly illuminated, and you can use things like sheets of plastic or glass to hold pages flat.

Here's a cool jig someone made up on Youtube that prevents books from having to be opened more than 90 degrees. Two cameras would let you scan two pages at once.

 
...the resolution does not matter as long as it's both legible by eye and your average OCR engine has no issues.
So just don't worry about it? I think that archivists might have issues with that. Keep in mind that, while I agree with you on the PDFs I'm producing for my own use, I'm aiming to produce the best original images possible with my equipment (consistent with scanning with reasonable speed and convenience) for others to process if they're interested in getting better copies.

Again, unless you are saving items such as marketing literature, posters or detailed diagrams...
There are definitely detailed diagrams in some of the work I'm scanning. Particularly schematics. You've mentioned SAMS; I've seen plenty of those and similar where it's been very frustrating trying to figure out what's on a schematic.

Oh, and it's probably worth mentioning that a fair amount of the material is Japanese, which simply seems to need higher resolution than English, on average, to get decent OCR. (This is probably not surprising, since it has to tell the difference between characters such as 諸 and 諳.)
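To put rough numbers on "resolution" for a fixed camera: the effective dpi depends on how much paper the frame has to cover. The sketch below assumes a 5312-pixel-wide frame, which is one plausible layout for a sensor of about 21 megapixels; that figure is an assumption and should be replaced with the camera's real frame size.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels_across, subject_mm):
    """Pixels per inch when `pixels_across` pixels span `subject_mm` of paper."""
    return pixels_across / (subject_mm / MM_PER_INCH)


FRAME_WIDTH_PX = 5312  # assumed frame width, not a measured value

# Two A4 pages side by side fill an A3 width of 420 mm.
print(round(effective_dpi(FRAME_WIDTH_PX, 420)))  # 321
# A single A4 page across the frame (210 mm) doubles the sampling density.
print(round(effective_dpi(FRAME_WIDTH_PX, 210)))  # 642
```

Under these assumptions, shooting one page per frame instead of a two-up spread is the cheapest way to gain resolution for small or dense characters.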

...there is nothing wrong with the JPEG format. The next person who is going to read this PDF very likely does not care if there's a bit of compression noise. They do not care about the grain of the paper.
Well, again, see my first paragraph above. And also, you didn't mention the effects on processing, where it seems that @Al Kossow at least disagrees with you.

Output images are tweaked in Gimp 2.6...
Ouch! I definitely don't want to be touching Gimp (unless it's via an automated script). Remember, there are two things I'm trying to achieve here:
  1. The best possible quality of source image for someone else to tweak in Gimp or whatever they wish.
  2. "Good enough" quality PDFs for my own reference until someone uses those source images to make something better.
....sorted with Adobe Acrobat which also de-skews and ADF's the document if I wish.
What is ADF?

ImageMagick is a command-line tool. It can perform the same operation on multiple files in one command. Multiple commands can be placed in one batch file.
Yup. I'm familiar with ImageMagick in general, and use it for some basic operations (such as format conversion) daily. I'm not familiar with how sophisticated one can get with the scripts, though.

I normally keep one batch file with the common commands I use....
I use a flat bed scanner and a document scanner.
Yeah, so it sounds as if you don't have some of the processing requirements I do: in particular, identifying the locations of the pages in the image (which vary from image to image, because it's camera), decurling, and finger removal.

If one were consistent about placing the manual so the pages split in the same place, then an ImageMagick batch file could theoretically split out pages.
Yeah, I don't think I can get that consistent with my scanner. It's going to require something that can identify where the page is.

I would, however, be wary of software that tries to automatically remove anything from an image - they can get things wrong and remove content.
It's not a big deal here; I have and upload all the original images with the processed PDF, so if something is removed in making the PDF, it's still there in the original. And if it's serious, I don't mind reprocessing a page or two by hand here or there to fix it; I just don't have the time or inclination to do hundreds (or even thousands) of pages by hand.

I've said all that to back up what I say next: for document scanning of things other than text, the interaction between the scan bar or CCD/CMOS sensor pixel pattern and the specific screen used when halftoning the colors or the grayscale can produce significant moire and other quantization noise artifacts.
I don't really understand that, but....
Scan at as high a resolution as possible, then de-screen the scan before downsampling to PDF.
Yeah, probably not an issue for me. So long as I get the original image as good as I can, and the PDF I create is readable, I can leave it to others to create better PDFs as and when they find the inclination.

But you can also build a jig to help if your camera supports scanning in a different direction from "directly overhead" and you can add lighting well to the sides in all directions so reflective surfaces are more evenly illuminated, and you can use things like sheets of plastic or glass to hold pages flat.
Most of what I'm scanning is books on matte paper, so reflections haven't been much of an issue, though I do have a diffuser I often enough use for the covers.

But unfortunately a scanning jig, such as the one in the video, is out of the question: something that size would involve selling off a half dozen vintage computers to make the space to store it. (I live in Tokyo, in 25 m².)

But actually the glass idea might be worth trying; it would at least get my fingers off of the page edges if I can manage to handle the reflection issues. (Unfortunately it still doesn't help with the need for decurling, since I have to lay the book flat.)
 
I don't really understand [my statements about halftoning and moire], but....
For schematics moire shouldn't be an issue. Dealing with the printed halftoning screen will be an issue for color or grayscale images, for instance if there are photos of parts or adjustments, or photos showing parts locations.

I know I've run across scanned manuals where the illustrations were rendered useless because the halftoning screen wasn't dealt with in the scan.

Yeah, probably not an issue for me. So long as I get the original image as good as I can, and the PDF I create is readable, I can leave it to others to create better PDFs as and when they find the inclination.
Of course, it's your project, and any scan is better than no scan at all; I appreciate all such efforts. But once it's downsampled to PDF it may not be possible to correct some artifacts. Text and line drawings will survive; printed images that have a halftone screen may not. Much detail can be lost to moire.

There's a much better explanation than mine at https://www.scantips.com/basics06.html. I apologize for my relatively poor explanation.
 
But actually the glass idea might be worth trying; it would at least get my fingers off of the page edges if I can manage to handle the reflection issues. (Unfortunately it still doesn't help with the need for decurling, since I have to lay the book flat.)

I forgot how small your living/collecting space was. I can see why anything you want to reuse has to take up as little space as possible.

The main parts of that jig might be reproduced in a folding form, though, and then it shouldn't take up any more room than a book or magazine when stored... Well, an oversized magazine. And you wouldn't need to change your camera. You could just photograph the flat part with the other side vertical.

A couple of pieces of glass with a hinged frame so it could open to 90 degrees and the same with a base which could be made from wood. Then you could put the book between them and lay one side flat, photograph it, then lay the other side flat and do it too. It would take a little work to make a good jig though to allow it to fold, yet lock to 90 degrees.

As for reflections, if the light is far enough away and the angle is small enough, then plenty of light will hit the page and be scattered, but there will be no reflective path from the light to the camera lens to cause reflections.
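The geometry behind that claim can be checked with the law of reflection. The sketch below models the glass as a flat mirror at height zero, the camera directly above the center of the page, and the light as a single point; all three are simplifying assumptions, and curled pages or diffuse room light are not covered. Mirroring the light below the glass and drawing a straight line to the camera gives the spot where glare would appear.

```python
def glare_x(light_x, light_h, cam_h):
    """Horizontal position on the glass where a point light mirrors into the camera.

    Camera sits at (0, cam_h), light at (light_x, light_h), glass at height 0.
    All distances use the same unit.
    """
    return light_x * cam_h / (cam_h + light_h)


def glare_in_frame(light_x, light_h, cam_h, half_width):
    """True if the mirror image of the light falls inside the photographed area."""
    return abs(glare_x(light_x, light_h, cam_h)) <= half_width


# Camera 40 cm above the glass, photographing an area 42 cm wide.
print(glare_in_frame(light_x=30, light_h=50, cam_h=40, half_width=21))   # True
print(glare_in_frame(light_x=100, light_h=20, cam_h=40, half_width=21))  # False
```

In this model a light that is far out and low puts its reflection well outside the frame, while a light that is close and high lands on the page.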

It may be overkill depending on what you want to do, but the inability to open old books up to 180 degrees was always the biggest problem I had with scanning things, even in an amateur way and I just wanted to get clean scans without lines looking bent all the time. I've had books I didn't want to scan because the spines were failing too, and limiting how far they open seemed the best solution - though I didn't know of this way to do it at the time and I damaged them.

You could also make permanent 90 degree frames, and just store them in a corner, which is still space effective, but far less convenient as it means they will always be at the back of something.
 
...
As for reflections, if the light is far enough away and the angle is small enough, then plenty of light will hit the page and be scattered, but there will be no reflective path from the light to the camera lens to cause reflections.

Sometimes polarizing filters help when using cameras. Not always, but sometimes.

...
You could also make permanent 90 degree frames, and just store them in a corner, which is still space effective, but far less convenient as it means they will always be at the back of something.
I really like your 90 degree frame jig idea.
 
For schematics moire shouldn't be an issue. Dealing with the printed halftoning screen will be an issue for color or grayscale images, for instance if there are photos of parts or adjustments, or photos showing parts locations.
Ah, wait, now it's coming back to me. I now remember half-tone screens from way back in the day, back when I used to do desktop publishing by putting a blank galley on a desk, cutting up the columns of text that came out of the photo-typesetter, and sticking them to the galley with glue. Yes, I totally see where the moire is coming from now. (And the explanation you linked helped, thanks.)

It's not something I need to solve myself, but is obviously something I want to leave solvable in the original image. I'll have a close-up examination of one when I hit something with photographs (those tend to be rare in the documents I'm scanning) and try to see how bad it is at that point. But I guess all I can really do is try to get a lossless format out of the webcam (apparently, from looking into it just now, a lot of them encode to JPEG right in the camera, ugh) and of course use the highest resolution available.

But once it's downsampled to PDF it may not be possible to correct some artifacts.
Yeah. Once again, to be clear, I will be maintaining and uploading the original images; the PDF is a derived document that just has to be "good enough" to hold me over until (and if) someone with better resources than me can do a better job of producing another PDF.

And you wouldn't need to change your camera. You could just photograph the flat part with the other side vertical.

A couple of pieces of glass with a hinged frame so it could open to 90 degrees and the same with a base which could be made from wood. Then you could put the book between them and lay one side flat, photograph it, then lay the other side flat and do it too. It would take a little work to make a good jig though to allow it to fold, yet lock to 90 degrees.
Hm. Now you've got me thinking. Yes, I wonder if I might be able to come up with something reasonably small that gets the job done. I guess I see a trip to Tokyu Hands in my future!

As for reflections, if the light is far enough away and the angle is small enough, then plenty of light will hit the page and be scattered, but there will be no reflective path from the light to the camera lens to cause reflections.
Yeah, it's not quite as simple as that, unfortunately, as I've discovered. My workbench is mostly (completely, when I turn one light off) illuminated with indirect lighting, bouncing a couple of photography lights off the ceiling and walls, but that doesn't eliminate all reflections. That's why I ended up buying a medium-size (about 100 by 60 cm) diffuser panel, which can help a lot, but there's definitely a bunch of work and playing about with positions of everything to get rid of all the reflections.

It may be overkill depending on what you want to do, but the inability to open old books up to 180 degrees was always the biggest problem I had with scanning things, even in an amateur way and I just wanted to get clean scans without lines looking bent all the time.
Yeah, actually, if this removes the need for decurling and thumb removal, that actually solves a major problem. Then I just need to deal with edge recognition.
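For the remaining edge-recognition step, a very crude starting point is to find the bounding box of the bright page against a dark background. The sketch below works on a plain list of rows of 8-bit gray values and assumes the book sits on a dark mat; the threshold and minimum-fraction values are arbitrary illustrative choices. It is not real edge detection, and it will not cope with skew or a light-colored desk.

```python
def page_bounds(image, threshold=128, min_fraction=0.2):
    """Return (left, top, right, bottom) of the bright region, or None.

    `image` is a list of rows of gray values. A row or column counts as part
    of the page when at least `min_fraction` of its pixels exceed `threshold`,
    which lets isolated bright specks be ignored. Right/bottom are exclusive.
    """
    height, width = len(image), len(image[0])

    def bright_fraction(values):
        return sum(v > threshold for v in values) / len(values)

    rows = [y for y in range(height) if bright_fraction(image[y]) >= min_fraction]
    cols = [
        x for x in range(width)
        if bright_fraction([image[y][x] for y in range(height)]) >= min_fraction
    ]
    if not rows or not cols:
        return None
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


# 8x6 dark frame with a bright "page" and one stray bright speck at (0, 0).
frame = [[20] * 8 for _ in range(6)]
for y in range(1, 5):
    for x in range(2, 7):
        frame[y][x] = 240
frame[0][0] = 250
print(page_bounds(frame))  # (2, 1, 7, 5)
```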
 
(And the explanation you linked helped, thanks.)
Glad it helped; I felt like I wasn't doing a very good job with my description.

But I guess all I can really do is try to get a lossless format out of the webcam (apparently, from looking into it just now, a lot of them encode to JPEG right in the camera, ugh) and of course use the highest resolution available.
Hardware JPEG (I keep wanting to leave out the E for some reason...) encoding at the camera is pretty common, but maybe you can find an endpoint that yields something better. That in and of itself would be a good find. Or a control to set the compression level in the JPEG output at least.
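On Linux, one way to look for such an endpoint is `v4l2-ctl --list-formats-ext`, which lists the pixel formats and frame sizes a camera offers. The sketch below parses text in the layout that recent versions of that tool print and picks the largest frame size offered in a format that is not on a short list of compressed ones. The sample listing is invented for illustration and is not output from the scanner discussed here; the output layout may also differ between versions.

```python
import re

COMPRESSED = {"MJPG", "JPEG", "H264"}  # fourcc codes treated as lossy here


def parse_formats(listing):
    """Map each fourcc code in the listing to its list of (width, height)."""
    formats = {}
    current = None
    for line in listing.splitlines():
        fmt = re.search(r"'([A-Za-z0-9 ]{4})'", line)
        if fmt:
            current = fmt.group(1)
            formats.setdefault(current, [])
            continue
        size = re.search(r"Size: Discrete (\d+)x(\d+)", line)
        if size and current is not None:
            formats[current].append((int(size.group(1)), int(size.group(2))))
    return formats


def best_uncompressed(formats):
    """Return (fourcc, (width, height)) with the most pixels, or None."""
    candidates = [
        (w * h, fourcc, (w, h))
        for fourcc, sizes in formats.items()
        if fourcc not in COMPRESSED
        for (w, h) in sizes
    ]
    if not candidates:
        return None
    _, fourcc, size = max(candidates)
    return fourcc, size


# Invented sample listing, for illustration only.
SAMPLE = """
    [0]: 'MJPG' (Motion-JPEG, compressed)
        Size: Discrete 5312x3984
        Size: Discrete 1920x1080
    [1]: 'YUYV' (YUYV 4:2:2)
        Size: Discrete 1920x1080
        Size: Discrete 640x480
"""
print(best_uncompressed(parse_formats(SAMPLE)))  # ('YUYV', (1920, 1080))
```

Note that YUYV is uncompressed but still chroma-subsampled, and in this invented example it tops out well below the full-resolution JPEG mode. Many cameras appear to force a similar trade-off.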

Yeah. Once again, to be clear, I will be maintaining and uploading the original images; the PDF is a derived document that just has to be "good enough" to hold me over until (and if) someone with better resources than me can do a better job of producing another PDF.
It sounds like you're getting off to a solid start, at least in my opinion. I for one would be interested in hearing how it goes.
 
What is ADF?
Automatic Document Feeder. The nicer document scanners let you drop a stack of papers into a feed hopper and will scan them all in one batch. The PC-DOS manual takes about three minutes to do the entire binder at 300 dpi, color.
 
Yeah, it's not quite as simple as that, unfortunately, as I've discovered. My workbench is mostly (completely, when I turn one light off) illuminated with indirect lighting, bouncing a couple of photography lights off the ceiling and walls, but that doesn't eliminate all reflections. That's why I ended up buying a medium-size (about 100 by 60 cm) diffuser panel, which can help a lot, but there's definitely a bunch of work and playing about with positions of everything to get rid of all the reflections.

I wonder if you could do something like they do in astronomy and within photographic lenses by shrouding the light within the first angle of incidence entirely to remove diffused reflections? E.g., maybe a cloth "bucket" of dark black material that shrouds the top of the camera and extends outwards or downwards far enough that it provides a curtain against light in your workspace?

If you draw a line down to the edge of the "book" area from the camera, then the incident reflected line back up to wherever the curtain would be, that's the area you need to cover. Once you do that, you'll end up with a local "dark" zone as far as reflections go and won't get diffused reflected light from the rest of your room (e.g., walls, ceiling, etc.). If it's soft material, you could easily remove and insert your hands, and then you can control the light, much as audio studios do with sound.

Then the only diffused reflections will come from the camera head itself, and the glass should appear transparent even when there's intense light from the side.
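The "draw a line to the edge and reflect it back up" construction can be turned into a quick size estimate. Assuming a flat glass at height zero and the camera directly above the center of the page (both simplifications), the ray from the camera to the page edge bounces upward and, at a given height, has moved outward by a proportional amount. Anything inside that radius at that height could mirror into the lens, so that is how far a shroud at that height would need to reach. The dimensions in the example are hypothetical.

```python
def shroud_radius(half_width, cam_h, shroud_h):
    """Horizontal reach needed for a shroud at height `shroud_h`.

    The ray from the camera (height cam_h, above the page center) to the
    page edge at distance `half_width` reflects off the glass and continues
    outward at the same angle, so it gains half_width / cam_h of horizontal
    distance per unit of height.
    """
    return half_width * (1 + shroud_h / cam_h)


# Camera 40 cm up, page edge 21 cm from center.
print(shroud_radius(21, 40, 40))  # 42.0  (shroud at camera height)
print(shroud_radius(21, 40, 20))  # 31.5  (shroud halfway down)
```

In this simple model, a curtain hanging lower needs less reach than a hood at camera height.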

You could use a bucket also, but a cloth curtain with a spring hoop would be very very foldable. A modified umbrella would also work, but would be a lot larger.
Yeah, actually, if this removes the need for decurling and thumb removal, that actually solves a major problem. Then I just need to deal with edge recognition.

On the topic of reflections: don't forget you'll get reflections of the page causing a "double image" effect near the periphery of the glass, due to double internal reflection, but it should be marginal, and thin glass (or a material that doesn't have this effect so much, like plexiglass / perspex) should address any issues.

Polycarbonate is also great for this kind of work: it is very tough and easy to work with, though it scratches easily. Any solution you come up with for polycarbonate also means you can make the clear window part disposable. Also, clear polycarbonate sheet can be creased and even hinged around itself (e.g., flattened, folded, etc.) many times without breaking.

You have me really curious now about the quality of scans you might achieve.
 