An Introduction to Document Scanning: Understanding Your Requirements
DESCRIPTION
Learn about the basic decisions required for business document scanning: indexing, file formats, document resolution, color space, and more. Learn about estimating volumes and automated capture technology such as barcode recognition, OCR, batch document processing, and more.
TRANSCRIPT
An Introduction to Document Scanning
Business Document Scanning 101: From the Data Capture Perspective
So you have a lot of this?
And you’ve decided this is the answer.
So you need a crash course in scanning
Lessons:
Lesson 1: Simplex or Duplex
Lesson 2: Resolution
Lesson 3: Color Depth
Lesson 4: File Formats
Lesson 5: Indexing
Lesson 6: Document Prep and Estimating Volumes
Homework: Learn More About Data Capture and Document Management
Lesson 1: Simplex or Duplex
Are the documents single- or double-sided? This may
seem obvious but…
You may not want documents such as purchase
invoices scanned in duplex where the back of the
document only contains terms and conditions.
On the other hand, if the documents have high legal
importance you may want every conceivable item of
information captured such as small signatures or
notes on the back.
Duplex scanning requires
more scanning
time/processing and
results in larger files.
And you don’t have to be a genius to know that
is more costly.
Lesson 2: Resolution
So what is resolution and why does it matter?
Resolution is expressed as the number of dots per inch
(dpi) or, less frequently, pixels per inch (ppi). A pixel is a
“picture element”; the resolution is the number of these
elements per inch at which the image was sampled.
What is Resolution?
Implications of Resolution
This graphic contains two
images, a “0” as a grayscale
image and an “x” as black
and white.
Implications of Resolution
• If we halved the size of the grid squares horizontally and
vertically (doubled the resolution), the pixels would appear
smoother and produce a better quality image. The inverse
would be true if we doubled the size of the squares.
• If we kept the squares the same size but reduced the size of
the characters significantly, the resolution would be insufficient.
Implications of Resolution
• The higher the resolution, the better the image quality.
• For small characters, increase the resolution to capture
them effectively
So:
And, the higher the resolution, the slower the scan and the larger the file.
Which means higher scanning and file storage costs, Einstein.
Typical Scanning Resolutions
• Web graphic – 96 dpi
• Standard archive document – 200 dpi
• Document required for optical character recognition (OCR)
– 300 dpi
• Plans/drawings for vectorization – 400 dpi
• Documents required for historical archiving – 600 dpi
Resolution is generally determined by intended use.
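To make the cost of resolution concrete, here is a minimal sketch that converts the typical resolutions listed above into pixel counts for a single page. The 8.5 x 11 inch page size and the function name are assumptions chosen for illustration.

```python
def scanned_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: 8.5 x 11 inches. Resolutions are the typical ones listed above.
for dpi in (96, 200, 300, 400, 600):
    width, height = scanned_pixels(8.5, 11, dpi)
    print(f"{dpi:>3} dpi: {width} x {height} = {width * height:,} pixels")
```

Doubling the resolution doubles both the width and the height in pixels, so the pixel count (and the uncompressed file size) grows by a factor of four.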
Lesson 3: Color Depth
Documents scanned in black and white are always
scanned as grayscale within the scanner. The scanner
then applies a process known as thresholding to the
image to produce the black and white image.
Thresholding simply determines when a pixel should
be black or white.
Understanding Black and White
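As a rough illustration of thresholding, the sketch below turns a small grid of 8-bit grayscale values into black and white pixels. It assumes a single fixed cutoff of 128 and made-up pixel values; real scanners may use more sophisticated, adaptive methods.

```python
def threshold(gray_rows, cutoff=128):
    """Convert 8-bit grayscale rows (0 = black, 255 = white) to 1-bit rows.

    Pixels darker than the cutoff become 1 (black); all others become 0 (white).
    The fixed cutoff is an assumption made for this illustration.
    """
    return [[1 if value < cutoff else 0 for value in row] for row in gray_rows]


# A tiny 3 x 4 grayscale patch with made-up values.
patch = [
    [250, 240, 30, 245],
    [235, 20, 25, 230],
    [128, 127, 10, 255],
]

for row in threshold(patch):
    print("".join("#" if bit else "." for bit in row))
```

Moving the cutoff up or down changes which mid-gray pixels end up black, which is why faint text can disappear or background speckle can appear in a black and white scan.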
Grayscale is used when the image contains color or
grayscale data and the tone of the image needs to be
retained, e.g. photographs or shaded graphics.
Understanding Grayscale
Color is used when the image contains color data that
must be retained. Some users wish to keep important color
information, for example land boundaries or graphical
data, but not letterhead logos, highlighter marks, etc.
Understanding Color
File Storage Requirements
Bits per pixel: color 24, grayscale 8, black and white 1
So the storage requirement for a grayscale image is 8
times that of a black and white image, and the color
requirement is 24 times that of black and white.
And remember, Einstein, larger files equal higher costs.
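The bits-per-pixel figures can be turned into a quick size estimate. This is a minimal sketch for uncompressed images only; the 8.5 x 11 inch page, the 300 dpi setting, and the names used are assumptions, and compressed files (Group 4, JPEG) will normally be much smaller.

```python
BITS_PER_PIXEL = {"black and white": 1, "grayscale": 8, "color": 24}


def uncompressed_bytes(width_in, height_in, dpi, mode):
    """Uncompressed image size in bytes for one page side."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


# Assumed example: one side of an 8.5 x 11 inch page at 300 dpi.
for mode in BITS_PER_PIXEL:
    size = uncompressed_bytes(8.5, 11, 300, mode)
    print(f"{mode:>15}: {size:,} bytes")
```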
Lesson 4: File Formats
TIFF
JPEG
For an in-depth look visit: PDF v. TIFF
• Well established format
• Most often used for black and white documents
• Supports multiple pages
• Interpreted correctly by most applications with a caution
on certain color implementations
• “Group 4” refers to the compression method used on
black and white images. It is a “lossless” compression:
no original data is lost in
compression/decompression.
Understanding TIFF*
*Tagged Image File Format
• Well established format by Adobe
• Supports color, grayscale, and black and white
• Supports multiple pages
• Generally stored using Group 4 and JPEG compression,
although it supports other compression methods too.
• Used when more advanced features are needed within the
file such as embedded Optical Character Recognition
(OCR), hyperlinking, digital signing and other security
features.
Understanding PDF*
*Portable Document Format
Searchable PDF:
Understanding PDF Variations
Many scanning applications can create searchable PDF files.
Here, the application applies OCR technology to make the file
text searchable. Your application may label this as “make
searchable”, “apply OCR”, “text-under-image” or
“searchable PDF.” If selected, your file will be text
searchable or text selectable within the Acrobat viewer and
many other programs that search PDF files.
PDF/A:
Understanding PDF Variations
PDF/A is an ISO standard for digital preservation or
archiving of electronic documents.
It differs from standard PDF by omitting features not
suited to long-term archiving, such as font linking.
Its use is growing in international government and industry
segments, including legal systems, libraries, newspapers, and
regulated industries.
Understanding JPEG*
*Joint Photographic Experts Group
• Well established format
• Most often used for photographs and graphics
• Supports single page only
• A “lossy” compression format, that is, some of the data is
lost during compression. However, it provides good
compression ratios for grayscale and color images.
Compression and File Size
*Comparison courtesy of Wikipedia
OMG, right?
The bottom line: experiment with your images and file size. A middle quality scan may meet your needs and save
tremendous file space.
Lesson 5: Indexing
For an in-depth look visit: What is Document Indexing?
What is Indexing?
Document indexing (sometimes referred to as metadata)
enables users to quickly and efficiently locate their
documents, either through a folder structure, database, or
electronic document management system.
Avoid a disaster
Great care should be taken to design an efficient indexing
scheme. If the design is not devised correctly at the outset, trying
to rectify it later can be both difficult and costly.
Sometimes it makes sense to replicate the current manual method
for document location to create a familiar, but faster system.
Don’t worry, there is automation
Technologies such as
• Barcode recognition
• OCR
• Batch processing
• Data Mining, Text Mining
can save time and money by automating indexing and more.
Using Barcodes for Indexing
Intelligent data capture
software can extract
data from barcodes to
create and send index
information to a
document management
system.
For an in-depth look at barcodes in data capture visit: What Can Barcodes Do For Me?
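As a rough sketch of what happens after a barcode has been read, the example below splits a barcode value into index fields. The payload layout (document type, account number, and date separated by "|"), the field names, and the function name are all hypothetical; real layouts depend on the application that printed the barcode.

```python
def index_from_barcode(payload):
    """Split a barcode payload into index fields.

    Assumes a made-up layout: doc_type|account|date.
    """
    fields = ("doc_type", "account", "date")
    parts = payload.strip().split("|")
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
    return dict(zip(fields, parts))


# Hypothetical value decoded from a barcode on a cover sheet.
print(index_from_barcode("INVOICE|100234|2014-01-15"))
```

The resulting dictionary stands in for the index record that capture software would pass to a folder structure, database, or document management system.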
With OCR, make your image-based file fully text searchable or extract data from a zone for indexing.
Using OCR for Indexing
With zonal OCR, document areas
are identified for automatic OCR
capture. Additionally, drag-and-
drop OCR allows an operator to
highlight document text which is
automatically OCR'd and dropped
into index fields.
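A zone is simply a rectangle on the page. The sketch below shows one way a zone measured in inches could be converted to pixel coordinates at the scan resolution before being handed to an OCR engine. The zone position, the function name, and the (left, top, right, bottom) box convention are assumptions; the OCR step itself is not shown.

```python
def zone_to_pixels(zone_in, dpi=300):
    """Convert a zone (left, top, width, height) in inches to a
    pixel box (left, top, right, bottom) at the given resolution."""
    left, top, width, height = zone_in
    return (
        round(left * dpi),
        round(top * dpi),
        round((left + width) * dpi),
        round((top + height) * dpi),
    )


# Hypothetical zone: a 2 x 0.5 inch box, 6 inches from the left and 1 inch from the top.
invoice_number_zone = (6.0, 1.0, 2.0, 0.5)
print(zone_to_pixels(invoice_number_zone))
print(zone_to_pixels(invoice_number_zone, dpi=200))
```

Defining zones in inches rather than pixels keeps them valid if the scan resolution changes.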
TIPS for OCR
• Scan at 300 dpi for greater accuracy and to ensure
that small text is captured.
• Limit the use of color on documents.
• Pre-process the image with image enhancement
software (available in many data capture
products).
Intelligent data capture solutions often use batch processing that lets you process
a whole folder of documents at a time. Some products can “watch folders,” and
process files as they are scanned into the folder.
What is Batch Processing?
For an in-depth look visit: What is Batch Document Processing?
Processing can include indexing, file routing, file splitting, and cleaning/enhancing the scans. Learn more.
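The “watch folder” idea can be sketched as a function that reports files it has not seen before. This is a simplified illustration, not how any particular product works; the function name and the list of extensions are assumptions.

```python
from pathlib import Path


def find_new_scans(folder, seen, extensions=(".tif", ".tiff", ".pdf", ".jpg")):
    """Return scan files in the folder that have not been processed yet.

    `seen` is a set of file names already handled; it is updated in place.
    """
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        is_scan = path.is_file() and path.suffix.lower() in extensions
        if is_scan and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files
```

A watcher would call this function every few seconds and pass each new file on to indexing, routing, splitting, or cleanup. Processing a whole folder at once is the same call made a single time with an empty `seen` set.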
Lesson 6: Document Prep and Estimating Volumes
Preparation, quality control, and indexing are the most time-consuming elements of any scanning job and usually the most costly.
Typically a good operator can prepare 750-1000 documents per hour; however, a number of factors may drop throughput to 300 or 500.
Factors that Influence Document Prep
Odd Size Document Type: sales receipts, photos, plans/drawings
Bindings: three ring, spiral, glue, folder
Fasteners: staples, paper clips, binder clips, rubber bands
Attachments: Post-its, tabs
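The throughput figures quoted above translate directly into labor hours. The sketch below is a simple division; the job size of 20,000 documents is a made-up example.

```python
def prep_hours(documents, docs_per_hour):
    """Estimated operator hours needed to prepare a job."""
    return documents / docs_per_hour


job_size = 20_000  # hypothetical number of documents
for rate in (1000, 750, 500, 300):
    print(f"{rate:>4} documents/hour: {prep_hours(job_size, rate):.1f} hours")
```

The gap between the best and worst rates shows why bindings, fasteners, and attachments matter when pricing a job.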
Estimating Volumes and Storage
Type | Simplex (avg #s) | Duplex (avg #s)
Paper Folders | 30 to 100 | 60 to 200
Ring Binder | 200 | 400
Lever arch folder | 500 | 1000
Transfer Cases | 500 | 1000
Bankers Boxes | 500 | 1000
Archive Boxes | 2500 | 5000
Filing Cabinets | 3000/drawer | 6000/drawer
Learn more about estimating volumes
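A rough volume estimate can be built by multiplying container counts by the averages above. This sketch uses the simplex averages as read from the table and doubles them for duplex; paper folders are left out because their average is a range. The inventory figures and names are hypothetical.

```python
# Average simplex counts per container, as read from the table above.
AVG_PER_CONTAINER = {
    "ring binder": 200,
    "lever arch folder": 500,
    "transfer case": 500,
    "bankers box": 500,
    "archive box": 2500,
    "filing cabinet drawer": 3000,
}


def estimate_images(inventory, duplex=False):
    """Estimate the number of images to scan from a container inventory."""
    total = sum(AVG_PER_CONTAINER[kind] * count for kind, count in inventory.items())
    return total * 2 if duplex else total


# Hypothetical inventory for a small office.
inventory = {"ring binder": 40, "archive box": 10, "filing cabinet drawer": 4}
print("Simplex:", estimate_images(inventory))
print("Duplex: ", estimate_images(inventory, duplex=True))
```

These are averages only; sampling a few real containers and counting will give a more reliable estimate.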
Homework: Learn More About Data Capture and Document Management
Document Management
Determine whether you require a full document
management system or just a simple
search and retrieval system.
Can I use it as a stepping stone while I evaluate
my document management system?
Call us for information on:
• How to digitize medical or dental records
• The best way to scan medical or dental records
• Scanning paper records
• Document scanning for medical or dental records
• Going paperless at the medical or dental office
• How to capture medical or dental records efficiently
• Scanning medical or dental records with Fujitsu ScanSnap
• Touchscreen scanning of medical or dental records
• How to improve your medical or dental workflow with document scanning
• Scanning to EMR or scanning to EDR
• How to maximize your Fujitsu ScanSnap
• Using your ScanSnap for a basic document management system
• Using barcodes and the Fujitsu ScanSnap
• Scanning with the Fujitsu ScanSnap
• Automating workflow with the Fujitsu ScanSnap
• Automating document management capture
• Scanning into Dentrix
• Indexing into Dentrix
• Understanding basic Document Scanning
• Things your teacher never told you about Document Scanning
• An introduction to Document Scanning
• Scanning Fundamentals for the average Joe
By DocuFi
Makers of ImageRamp Data Capture Solutions
30 years’ Experience in the Document Imaging Market
Proven Fujitsu ISV Partner
Find out more at ImageRamp and
www.docufi.com
Image Credits
All images are owned or licensed by DocuFi with acknowledgement given to:
• Pjohnkeane, Requirements, requirements, requirements, http://bit.ly/1fcULDf
• Doug Waldron, “Files (85)”, http://bit.ly/1bfciII
• UBC Learning Commons, “Scanner_icon-1024x671”, http://bit.ly/1eewI4P
• Knile Lucy, you have some sorting to do! http://bit.ly/19bSgjF
• Michael 1952, SJSA Fifth Grade - I Fell in Love With The Teacher, http://bit.ly/1eevu9A
• Ton Haex, “Einstein show....”, http://bit.ly/LVqeBi
• Loco Steve, “Sunrise under scrutiny”, http://bit.ly/1eevSVv
• Tax Credits, “Coins”, http://bit.ly/1mtQj5j
• j_baer, “Ubuntu Color Wheel”, http://bit.ly/1jARikx
• Marcin Wichary, Alphabetical, http://bit.ly/1aILOku
• David Erickson e-strategyblog.com, “Hindenburg Disaster”, http://bit.ly/1jASeFF
• William Warby w warby, “Gears”, http://bit.ly/1dwtU1S
• Alan Cleaver, “watching”, http://bit.ly/1h1k9k7
• Zoetnet, “overflowing,” http://bit.ly/KHW9Em
• Seattle Municipal Archives, “Comptroller's Office employees, 1960”, http://bit.ly/1eBvLGE
• Seattle Municipal Archives, “City Light worker with office machine, 1954”, http://bit.ly/1eBw3NM
• Patrick Hoesly, “Thank you” http://bit.ly/17xKErE