![Page 1: Practices and Open Problems of Document Digitization For Million Book Project Xiaohui Zheng Tsinghua Univ. Library](https://reader036.vdocuments.site/reader036/viewer/2022062518/5697bf751a28abf838c7ffd2/html5/thumbnails/1.jpg)
Practices and Open Problems of Document Digitization For
Million Book Project
Xiaohui Zheng
Tsinghua Univ. Library
Background
THU joined the CADAL Project at the end of 2002 and finished 50,000 e-books and e-dissertations in July 2006.
The Digitization Center was founded in March 2003. It is affiliated with the Digital Library Research Division of THU.
Experiences
In house or out source
Planning and Source Material Selection
Digitization Process
Facility and Staff Management
In house or out source: In House
Pro:
1. Full control over all procedures, the handling of materials, and the quality of products.
2. No worry about working with a vendor who turns out to be incompetent.
3. Provides a foundation of experience that helps with policy making, cost analysis, standards development, and data transfer.
4. Keeping the production line in house lets other digitization projects move forward smoothly within a flexible organization.
Con:
1. Less experience in staffing and workflow management
2. Low productivity
3. Small Scale
In house or out source: Out Source
Pro:
1. Professional staff and developed workflow
2. High productivity: large output in a short time.
3. Large Scale
Our Choice
In house operation
10 staff are enough to finish 50,000 e-books in 3 years
Enough time to train staff and improve efficiency
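As a rough check of the staffing claim above, the average workload per person can be worked out directly. This is a minimal sketch: the figure of 250 working days per year is an assumption, not a number from the slides, and the function name is hypothetical.

```python
def books_per_staff_per_day(total_books, years, staff, working_days_per_year=250):
    """Average number of books each staff member must finish per working day.

    working_days_per_year=250 is an assumed value, not taken from the slides.
    """
    return total_books / (years * staff * working_days_per_year)


# Figures from the slide: 50,000 e-books, 3 years, 10 staff.
rate = books_per_staff_per_day(50_000, 3, 10)
print(f"{rate:.2f} books per staff member per working day")
```

Under that assumption the plan requires between six and seven finished books per person per working day.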
Source Material Selection
Copyright was the place to start
Easy to handle
Good quality of materials (not fragile)
Quick action in submitting the title list for deduplication
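Deduplicating a title list against titles already claimed by other partners could look roughly like the sketch below. The normalization rules and the function names are assumptions made for illustration; the slides do not describe how the project actually matched titles.

```python
import re
import unicodedata


def normalize_title(title):
    """Build a comparison key: NFKC form, lowercase, no punctuation, single spaces."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


def find_new_titles(candidates, already_listed):
    """Return candidates whose key is neither in the shared list nor earlier in candidates."""
    seen = {normalize_title(t) for t in already_listed}
    fresh = []
    for title in candidates:
        key = normalize_title(title)
        if key not in seen:
            seen.add(key)
            fresh.append(title)
    return fresh


# Hypothetical sample data.
shared = ["Dream of the Red Chamber"]
ours = ["Dream of the Red Chamber.", "dream  of the red chamber", "Journey to the West"]
print(find_new_titles(ours, shared))
```

Exact-key matching only catches trivial variants. Real catalogs would likely also need to compare authors, editions, and publication years.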
Digitization Process
Preparation (Selection, Identifier assignment)
Scanning
Image processing
Metadata creation and packaging
Quality control
Data storage and backup
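The steps above can be sketched as an ordered list of stages, with identifier assignment done during preparation. The identifier scheme (a `THU` prefix and a six-digit counter) and the function names are hypothetical; the slides do not state the format that was actually used.

```python
from itertools import count

# Stage names follow the slide, in order.
STAGES = [
    "preparation",
    "scanning",
    "image processing",
    "metadata creation and packaging",
    "quality control",
    "data storage and backup",
]


def assign_identifiers(titles, prefix="THU", start=1, width=6):
    """Map sequential identifiers to titles (prefix and width are assumed, not from the slides)."""
    counter = count(start)
    return {f"{prefix}{next(counter):0{width}d}": title for title in titles}


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    index = STAGES.index(stage)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


ids = assign_identifiers(["Book A", "Book B"])
print(ids)
print(next_stage("scanning"))
```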
Ancient book Scanning and Image processing (Double page upside down scanning)
De-speckling and Centering
CADAL production tool: image processing
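To illustrate what de-speckling and centering do, here is a minimal sketch on a binary page held as a list of rows (1 = ink, 0 = background). It is not the CADAL tool's algorithm, which the slides do not describe. Removing ink pixels that have no ink neighbours is just one simple rule chosen for the example.

```python
def despeckle(image, min_neighbors=1):
    """Clear ink pixels that have fewer than `min_neighbors` ink pixels around them."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] == 0:
                continue
            neighbors = sum(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned


def center(image):
    """Shift the ink so that its bounding box sits in the middle of the page."""
    height, width = len(image), len(image[0])
    ink = [(y, x) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    if not ink:
        return [row[:] for row in image]
    top, bottom = min(y for y, _ in ink), max(y for y, _ in ink)
    left, right = min(x for _, x in ink), max(x for _, x in ink)
    dy = (height - (bottom - top + 1)) // 2 - top
    dx = (width - (right - left + 1)) // 2 - left
    shifted = [[0] * width for _ in range(height)]
    for y, x in ink:
        shifted[y + dy][x + dx] = 1
    return shifted


page = [
    [1, 0, 0, 0, 0],  # isolated speck in the corner
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
result = center(despeckle(page))
for row in result:
    print(row)
```

De-speckling runs first so that stray specks do not distort the bounding box used for centering.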
Splitting into two pages (Batch processing)
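Splitting a double-page scan can be sketched as cutting every row at a gutter column. The example assumes the gutter is at the horizontal midpoint unless told otherwise. A production tool would probably detect the gutter, and the slides do not say how theirs works.

```python
def split_spread(image, gutter=None):
    """Split a double-page image (list of pixel rows) into left and right pages."""
    width = len(image[0])
    if gutter is None:
        gutter = width // 2  # assumption: the binding sits at the midpoint
    left = [row[:gutter] for row in image]
    right = [row[gutter:] for row in image]
    return left, right


def split_batch(spreads):
    """Apply the same split to every spread in a batch."""
    return [split_spread(spread) for spread in spreads]


spread = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
left, right = split_spread(spread)
print(left, right)
```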
Rotating (Batch processing)
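Batch rotation by multiples of 90 degrees needs no resampling, only a reordering of pixels. The sketch below shows this on an image stored as a list of rows. The function names are hypothetical.

```python
def rotate_180(image):
    """Turn an upside-down page the right way up."""
    return [row[::-1] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate a page a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def rotate_batch(images, rotation):
    """Apply one rotation function to every image in a batch."""
    return [rotation(image) for image in images]


page = [
    [1, 2],
    [3, 4],
]
print(rotate_180(page))
print(rotate_90_clockwise(page))
```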
![Page 15: Practices and Open Problems of Document Digitization For Million Book Project Xiaohui Zheng Tsinghua Univ. Library](https://reader036.vdocuments.site/reader036/viewer/2022062518/5697bf751a28abf838c7ffd2/html5/thumbnails/15.jpg)
De-skewing (batch processing)
TPI
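One common way to estimate skew is the projection-profile method: try a range of candidate angles and keep the one that concentrates the ink into the fewest rows. The sketch below uses a vertical shear as an approximation of rotation, which is reasonable only for small angles. It is an illustration of the general technique, not a description of the tool shown on the slide.

```python
import math


def profile_score(image, angle_deg):
    """Sum of squared row totals after shearing columns vertically by the candidate angle."""
    height = len(image)
    slope = math.tan(math.radians(angle_deg))
    rows = [0] * height
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value:
                target = y + round(x * slope)
                if 0 <= target < height:
                    rows[target] += 1
    return sum(total * total for total in rows)


def estimate_correction(image, max_angle=5.0, step=0.5):
    """Return the candidate shear angle (degrees) that best aligns the text lines."""
    steps = int(max_angle / step)
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_score(image, angle))


# Synthetic page: two text lines that drift downward to the right by 2 degrees.
height, width = 40, 100
page = [[0] * width for _ in range(height)]
drift = math.tan(math.radians(2.0))
for baseline in (10, 20):
    for x in range(width):
        page[baseline + round(x * drift)][x] = 1

print(estimate_correction(page))
```

The returned value is the correction to apply, so a page whose lines drift downward by 2 degrees yields a negative angle.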
Format conversion (Batch processing)
Metadata creation and packaging
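A metadata record for one digitized book might be built as in the sketch below. Dublin Core element names are used purely as an assumption. The slides do not show the schema produced by the CADAL cataloging tools, and the `record` and `pageCount` elements and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_record(identifier, title, creator, date, page_count):
    """Build a small XML record; the element choice is illustrative only."""
    record = ET.Element("record")
    fields = [
        ("identifier", identifier),
        ("title", title),
        ("creator", creator),
        ("date", date),
    ]
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    ET.SubElement(record, "pageCount").text = str(page_count)
    return record


# Hypothetical sample values.
record = build_record("THU000001", "Sample Title", "Sample Author", "1936", 212)
print(ET.tostring(record, encoding="unicode"))
```

Generating records from one function, rather than typing them by hand, is one way to keep field names and formats consistent across a large collection.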
Facility and Staff Management
Facility:
Three flatbed AVA3 AVISION scanners
Two FB6000E AVISION flatbed scanners
Minolta PS 7000
High-speed AVISION AV3800
Staff:
1 manager, 1 technical supervisor, 11 temporary staff
Capacity: 5,000,000 pages/year
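The stated capacity can be related to the 50,000-book target with simple arithmetic. This is a sketch under two assumptions that are not in the slides: an average of 300 pages per book and 250 working days per year.

```python
PAGES_PER_YEAR = 5_000_000  # capacity stated on the slide


def years_needed(books, avg_pages_per_book, pages_per_year=PAGES_PER_YEAR):
    """Years required to scan `books` volumes at the given yearly page capacity."""
    return books * avg_pages_per_book / pages_per_year


def pages_per_working_day(pages_per_year=PAGES_PER_YEAR, working_days=250):
    """Daily page output implied by the yearly capacity (working_days is assumed)."""
    return pages_per_year / working_days


print(years_needed(50_000, 300))   # 300 pages per book is an assumed average
print(pages_per_working_day())
```

With those assumptions, 50,000 books come to 15,000,000 pages, which matches three years of work at the stated capacity.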
Network topology and data storage system
(Diagram) Components shown: WAN, gateway, LAN, Gigabit Ethernet switch, NAS backup system, DAS Dell system, 4 flatbed scanners, a high-speed scanner, a face-up scanner, 9 manual processing PCs, 6 automatic processing PCs.
Related Software
Scanning: QuickScan…
Image processing: Bookshop, ACDSee, XnView, UltraEdit, Scanfix, DjVuerPro,…
Cataloging and Packaging: CADAL Cataloging Tool, OEBEditor, CMDL Cataloging Toolkit,…
Data transfer: DResManages
Open Problems And Considerations
Content Discovery
Metadata description is rough and inconsistent
Resource Selection
The coverage of the million books is neither clear nor systematic.
OCR Processing
OCR processing has not yet started. OCR technology for ancient books is underdeveloped.
Copyright Problem
Almost 400,000 dissertations and modern books in the CADAL collection lack a clear copyright disclaimer.
Organization Structure
My suggestion: more source collection providers, fewer digitization centers.
Thank you for your attention!