8/8/2019 How to Bookmark PDF Files
By
Kedarnath Jonnalagadda, Vaidika Gramam, Hyderabad 2011
A book without bookmarks is the same as bookmarks without a book.
1 Introduction
This document illustrates methods that were used to easily bookmark thousands of PDF pages.
A bookmark is a reference to the position of a desired detail in reading material.
Reading material comprises traditional "books" and "documents" printed on paper, or electronic files with
extensions such as ".txt", ".doc", ".xls", ".pdf" and web pages on the Internet.
Bookmarks help to "home in" on items of specific interest in a book or even in other books, documents, or
web pages.
Conceptually, a bookmark has two parts (1) a "pointer", "link" or "hyperlink" in an electronic document
and (2) the "pointed" area.
To be of any use, "pointers" and the "pointed" need to be well structured and accurate. Familiar examples
of these are the "Contents", "Index" and "List of References" pages in books and documents. The "Contents"
page displays each topic and the page number where the topic starts. The "Index" page displays details of
topics and where they can be located. The "List of References" holds "pointers" to other books, documents or
web sites.
Compiling a list of bookmarks in a book or document is time and labor intensive. Such compilations can
become another book. For example, in Sanskrit literature, [anukramAnika] are huge and detailed indexes
of the [mantra], in the [veda] and [sutra] of [pANiNI].
Old documents that have heritage value are Heritage Documents. These are being made increasingly
available in PDF and other electronic forms. Among these, PDF files have greater appeal to readers
because they are true to life pictures of antique original documents. A PDF can be made with (1) pictures of
the original text only, or (2) pictures of the original text with additional processable text of the same.
Processable text from pictures of text is useful for search and scholarly analysis. And most importantly, to
assist preparation of Bookmarks! Processable text is obtained by manually keying in the text seen in pictures,
or by using optical character recognition (OCR) software.
In principle, using software to generate text from pictures is like employing a robot "typist" to see
the pictures of text, recognize the characters and type them in using a keyboard. Practically, this is fraught with
difficulties. Robot "typists" are notorious for the "mistakes" they make. And troubles are compounded by the
huge quantity of errors generated with great speed. Often you could spend more time correcting OCR
mistakes than the time you may take to manually type in the text.
Despite all the above, processable text that is reasonably readable is of very great use to prepare a list of
bookmarks in the document. And such bookmarks can be a great help to correct OCR mistakes too.
2 Assumptions and Materials
How To Bookmark PDF Files Of Heritage Documents
2.1. A PDF of a Heritage Document is available or you can make it.
2.2. The PDF file has
(a) pictures of text
(b) analyzable text (OCR or manually input) true to the pictures
2.3. Software / facility to extract OCR text from a PDF file.
Some PDF readers allow a PDF file to be printed "as text", if there is processable text at all in the file.
A-PDF Text Extractor has simple but very useful additional features. (a) The serial numbers of PDF pages
in the file are prominently printed along with the extracted text. (b) Coordinates of each item of text on PDF
pages are optionally printed along with the extracted text. http://www.a-pdf.com/
2.4. Text editors such as Microsoft Notepad, supplied with the Windows operating system, and Notepad++,
available from http://www.notepad-plus-plus.org
2.5. A combination of PDF software and tools
2.5.1 Primarily, PDFILL Editor and PDFILL PDF Tools available from http://www.pdfill.com
2.5.2 And support and testing with
eXPert PDF Reader from http://www.visagesoft.com/
Foxit PDF Reader or Adobe Reader
2.6. On Screen Virtual Keyboard such as Click-N-Type Virtual from http://www.cnt.lakefolks.com
2.7. Screen Magnifier such as Meazure from http://www.cthing.com
The above 2 are necessary for ergonomic and/or accessibility reasons. Microsoft Windows provides such
tools and they can be accessed via the menu route Start - All Programs - Accessories - Accessibility
- On-Screen Keyboard and Magnifier.
2.8. Spreadsheet software such as Calc, a part of OpenOffice.org from http://www.Openoffice.org
3 Definitions of Resources
3.1. PDF technology offers resources for easy casual reading or scholarly study of books and documents.
The main resources for these in PDF Readers are
3.1.1 Bookmarks displaying contents of a book or document. These are similar to "Contents" pages in books
with "topics" and the "page number" where the topic starts. When a topic of interest in the Bookmarks area
is selected, the software displays the page or area of page where the topic of interest is located.
Bookmarks can be viewed as a User Designed Contents Area with a list of "hyperlinks" to different pages
and sections of pages.
3.1.2 Comments are user made highlights, notes and drawing markups in a PDF file. PDF programs make
a list of hyperlinks to these and display them in a separate area, Comments. This can be viewed as a
facility for additional Bookmarks that can be modified, sorted and grouped.
3.1.3 Pages is a software generated list of hyperlinks to individual pages in a file.
3.1.4 Attachments is a facility to include other files in a PDF file.
Generally Bookmarks are designed to reference items as they are in the original book or document.
These do not leave any visible marks on the PDF pages.
Bookmarks and Comments are probably two imperatives for proper study of Heritage Documents. The
number of pages in such books can run into thousands. PDF file sizes can be 90 or 100MB and more.
4 Needs
4.1. Sizing things down - Large size PDF files are unwieldy. Working with them can be frustratingly slow on
the fastest of computers. The easiest workaround is to split a large PDF file into convenient small sizes.
One way is to split them Chapter-wise, same as in the original. Names given to the files ought to be
self explanatory to the extent possible, and numbered such that the list can be sorted to display the same
order as the original single document.
For example,
01-of-48-sv01-a-01-OF-5-akAratRutva-MW 4938173 bytes
02-of-48-sv01-a-02-OF-5-anu-garjita.-MW 4911093 bytes
03-of-48-sv01-a-03-OF-5-abhi-grah-MW 4979304 bytes
4.2. Structured Placement of Files in Structured Folders
All the split files ought to be in one suitably named folder.
A record of lists of all files is easily made with the Microsoft Windows Command Prompt resource, accessed via
the menu route Start - All Programs - Accessories - Command Prompt. Executing the command
DIR > .txt generates a text file with a list of all files in a folder. A file name such as
Files_in_Hertage_3_folder.txt will be self explanatory and easily remembered and identified.
Powertoys is a set of additional Microsoft tools for XP computers. This includes the useful
Command Prompt Here tool that enables an easy switch to the DOS directory from a Windows folder location.
These tools are available from the http://www.microsoft.com downloads section.
5. Reasons And Choice Of PDF Software
5.1 Splitting PDF files is imperative for Heritage Documents because PDFs of original scanned files
can be unmanageably huge on most computers. Concomitant with splitting is the need to Merge PDF files
for a variety of reasons. These could be due to mistakes while splitting, or to including a page or pages as an
enhancement for study. For example, a list of Abbreviations in every split file enhances facility for better
study of any file.
PDFILL PDF Tools is a very comprehensive collection of tools to process PDF files and Images
in them.
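The naming and listing conventions of section 4 can also be sketched in a few lines of Python. This is only an illustrative sketch, not part of the method described here: the function names split_file_name and write_listing are hypothetical, the label is whatever descriptive text you choose, and the listing imitates the output of DIR redirected to a text file rather than reproducing its exact layout.

```python
from pathlib import Path


def split_file_name(index: int, total: int, label: str) -> str:
    """Build a sortable name such as '01-of-48-akAratRutva-MW.pdf'."""
    # Zero-pad the index so that plain alphabetical sorting keeps book order.
    width = len(str(total))
    return f"{index:0{width}d}-of-{total}-{label}.pdf"


def write_listing(folder: Path, listing_name: str) -> Path:
    """Write the name and size of every PDF in `folder` to a text file."""
    listing = folder / listing_name
    lines = [
        f"{pdf.name} {pdf.stat().st_size} bytes"
        for pdf in sorted(folder.glob("*.pdf"))
    ]
    listing.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return listing
```

For instance, split_file_name(3, 48, "abhi-grah-MW") gives "03-of-48-abhi-grah-MW.pdf". Because of the padding, file 10 sorts after file 02 instead of before it.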
5.2 Accessing Split PDF files is most conveniently done by Bookmarks that can link and open the split
PDF files of a large book. Such a facility is provided in Foxit Reader but the choice of software is
PDFILL PDF Editor because of the highly innovative and easy method it provides for Export and Import
of Bookmarks as an xml file.
When you have 20 or 30 split files of a book, each of these must have a menu to access the other split files.
Bookmarks to access the different split files can be created once and exported as xml. This can then be
imported easily into any number of files of the same book or even another book. For example, references to
the set of split files of an entire Sanskrit dictionary could be imported into every split file of a Sanskrit
Grammar book.
5.3 Creating Bookmarks
Bookmarks enable access to the page with a selected item. Additionally they can enable display of particular
areas of interest in a page with a desired zoom factor.
eXPert PDF Reader from http://www.visagesoft.com/
Foxit PDF Reader or Adobe Reader
enable quick manual creation of bookmarks with display of the desired area and zoom of PDF pages.
5.3.1 Create a few bookmarks using the above and save the PDF.
5.3.2 Use PDFILL PDF Editor to open the above saved file and create a template by export of the
Bookmarks as xml.
5.3.3 Use the template to create a full set of Bookmarks as xml for the document.
5.3.4 Import the completed set into PDFILL PDF Editor and create a complete PDF file with the full
set of Bookmarks.
6. Creating Bookmarks as xml - step 1
6.1 The first requirement to create Bookmarks as xml is to have processable text of the PDF hidden in it.
Adobe Reader enables output of PDF files as text, if the file has hidden or OCR text.
6.2 Notepad++ enables a quick look at the processable text. It is a powerful text editor that can read and
write xml files too. Bookmarks exported as xml with PDFILL PDF Editor are also read into Notepad++ to
use as a template for a complete set of Bookmarks. Bookmarks exported as xml have this structure.
CHAPTER III.
3708. That which is
called an affix, has an acute accent on its
The areas of the code of functional interest to create our complete set of Bookmarks are
1. Page            The page number where the items of our interest are located
2. XYZ ... 1.5     This is the section of the page, and the last 1.5 means 150% zoom
3. Color ...       This is the font style and color parameters of the bookmark
4. > CHAPTER III   This is the text appearing in the Bookmark as well as in the book
Our current interest in the processable text of the PDF is to locate the page number of items of interest.
7. Creating Bookmarks as xml - step 2
7.1 The entire processable text from PDF read first in Notepad++ is copied and pasted into a spreadsheet
7.3 A column is inserted between the above two for the text that appears in the Bookmark. This is item 4
7.4 Next is locating page numbers in the book. Page numbers follow a pattern in books. They could be at
the top or the bottom of the page. If at the bottom, they are usually sole occupants of a single line. If at the
top, they may be preceded by text on a right hand page. On a left hand page, text may follow the number.
Our objective is to recognize this pattern and extract these into a separate column having only numbers.
So, depending on the pattern, we need to insert one or two columns adjacent left to the column.
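The three page number patterns described in 7.4 can also be expressed as regular expressions. The following Python sketch is an illustration outside the spreadsheet workflow described here; the function name page_number and the exact patterns are assumptions, and real books will need the patterns adjusted.

```python
import re
from typing import Optional

# A line holding only digits: typical of a page number at the bottom of a page.
ONLY_NUMBER = re.compile(r"^\s*(\d+)\s*$")
# Header of a left hand page: the number comes first and text follows.
NUMBER_THEN_TEXT = re.compile(r"^\s*(\d+)\s+\D.*$")
# Header of a right hand page: text comes first and the number follows.
TEXT_THEN_NUMBER = re.compile(r"^.*\D\s+(\d+)\s*$")


def page_number(line: str) -> Optional[int]:
    """Return the page number found in a header or footer line, else None."""
    for pattern in (ONLY_NUMBER, NUMBER_THEN_TEXT, TEXT_THEN_NUMBER):
        match = pattern.match(line)
        if match:
            return int(match.group(1))
    return None
```

This is a heuristic only. A body line that happens to end in a number would be mistaken for a right hand page header, so the extracted column still needs a visual check, just as in the spreadsheet.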
7.5 The full range of cells including headers is named in OpenOffice.org via the menu route
Data - Define Range. The spreadsheet software used could be Microsoft Excel or Gnumeric, where
this procedure might be slightly different.
8. Extracting Items For Bookmarks xml
menu route is Data - Filter
8.2 Standard Filter is selected on the column Text_fromPDF. This gives a variety of options such as
Contains, Does Not Contain, Begins With and Ends With. Our objective is to set a filter for page
numbers or numerals.
dialog box in OpenOffice.org. The displayed rows will have digits 0 thru 9. Entering [0-9][0-9] will display
double digit items and [0-9][0-9][0-9] will display three digit items.
page identifying text, "=Page 1=", "=Page 2=" ... etc. When text extracted with this
program is used, the Standard Filter can be a simple Contains "=Page"
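The two kinds of filter in 8.2, a regular expression on digits and a plain Contains test, behave roughly like the following Python sketch. It is offered only as an illustration: the function names are hypothetical, the sample lines are made up, and Python's re.search is assumed to stand in for the spreadsheet's filter, which may treat whole cells differently depending on its options.

```python
import re
from typing import List, Tuple


def rows_matching(lines: List[str], pattern: str) -> List[Tuple[int, str]]:
    """Return (row number, text) pairs where the pattern occurs anywhere in the text."""
    regex = re.compile(pattern)
    return [(n, text) for n, text in enumerate(lines, start=1) if regex.search(text)]


def rows_containing(lines: List[str], needle: str) -> List[Tuple[int, str]]:
    """Return (row number, text) pairs whose text contains `needle`."""
    return [(n, text) for n, text in enumerate(lines, start=1) if needle in text]


# Made-up sample lines, for illustration only.
sample = ["CHAPTER I.", "=Page 1=", "These two are A'dhikara", "Rule 37", "=Page 12="]
two_digit_rows = rows_matching(sample, r"[0-9][0-9]")
page_marker_rows = rows_containing(sample, "=Page")
```

With these sample lines, the digit pattern also picks up "Rule 37", while the "=Page" test picks up only the two page marker lines. This is why text with explicit page markers is simpler to filter.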
8.3 After an appropriate filter has been set for lines with page numbers, text from the PDF file in the displayed lines
needs to be copied into a blank column inserted adjacent left to Text_fromPDF and named
appropriately, Identified_page_Nos. When a filter is set in OpenOffice.org, contents of cells cannot be
simply copied and pasted. A formula giving reference to the text can, however, be entered in the first cell of the
blank column and copy pasted into other cells of the filtered range. For example, in the blank cell F47 the formula
And then entering a formula to do the job for us. To enter the correct formula we need to understand
what we want and state that explicitly. In this example we can see that a number of blank cells
precede the identified page number. In other words, all lines above an identified line with a page
number belong to that page. So, the logic in our formula would be: if the cell in Identified_page_Nos
has a number, then the cell in Identified_lines_and_pgs. should have that number; otherwise it should
have the same number as the cell below it.
Translated into a spreadsheet formula: IF(F143<>"";F143;E144)

Row  C            D              E                          F                    G
7    Control_SNo  Text_Bookmark  Identified_lines_and_pgs.  Identified_page_Nos  Text_fromPDF
10   1                                                                           ON RULES OF, GENDERS
11   2                                                                           CHAPTER I.
12   3                                                                           FEMININE GENDER.
13   4                                                                           q i 'fag*' u
14   5                                                                           1. The Gender.
15   6                                                                           Note: There are three
16   7                                                                           * i '13ft' i wfasmnri sh ii
17   8                                                                           2. The Feminine (Gender).
18   9                                                                           These two are A'dhikara
47                                                          G47                  = Page 1 =
143                              IF(F143<>"";F143;E144)     G143                 = Page 2 =
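The fill logic of the formula, where each row takes its own identified page number if it has one and otherwise the number of the row below it, can be checked outside the spreadsheet with a short Python sketch. The function name fill_page_numbers is hypothetical and None is assumed to stand for a blank cell.

```python
from typing import List, Optional


def fill_page_numbers(identified: List[Optional[int]]) -> List[Optional[int]]:
    """Give every row the page number of the nearest identified row at or below it."""
    filled: List[Optional[int]] = [None] * len(identified)
    below: Optional[int] = None  # value carried up from the row underneath
    # Walk from the last row upwards, because each cell depends on the cell below it.
    for row in range(len(identified) - 1, -1, -1):
        if identified[row] is not None:
            below = identified[row]
        filled[row] = below
    return filled
```

For example, fill_page_numbers([None, None, 1, None, 2, None]) gives [1, 1, 1, 2, 2, None]. The trailing None shows that lines after the last identified page number stay unassigned and need attention by hand.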
8.5 Having got the page nos for all lines with the formulas (each blank cell taking the same number as that
below it), the results are copied and pasted special as numbers only.
8.6 Items such as chapters, sections and any others that occur with regularity, or even specially needed
items such as references to perhaps a particular author, can be identified by setting an appropriate filter.
Such identified items are tagged or marked in a separate newly inserted column. The procedure is the same as
was followed first for the occurrence of page numbers.
8.7 Text_Bookmarks is the text displayed as Bookmark. When a filter is set for the different marked items
described above, the related page numbers and the actual text in the PDF are displayed. This can be used
as the text that is displayed in Bookmarks. There could be OCR and other errors but these can be corrected
later in the spreadsheet or xml file and re-imported into PDFILL PDF Editor. After all work is done a final
filter is set on this column to select non blank items. Now, the items are ready for copy and paste special
as text.
9 Getting the xml file ready in spreadsheet
9.1 A separate sheet renamed xml_prefinal is used for this.
Bookmarks exported as xml, described in section 6, Creating Bookmarks as xml - step 1,
are copied into the spreadsheet. This is a template that we may like to use for our Bookmarks.
This has 7 columns.
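What goes into the xml_prefinal sheet is, in essence, a list of page number and bookmark text pairs for the tagged items. The Python sketch below illustrates the tagging idea for chapter headings only. The function name bookmark_rows and the CHAPTER pattern are assumptions for illustration, and the sample data in the note after the code is made up.

```python
import re
from typing import List, Tuple

# Assumed pattern: the word CHAPTER followed by a Roman numeral, e.g. "CHAPTER III."
CHAPTER = re.compile(r"^CHAPTER\s+[IVXLC]+\.?$")


def bookmark_rows(lines: List[str], pages: List[int]) -> List[Tuple[int, str]]:
    """Pair each chapter heading with the page number already assigned to its line."""
    rows = []
    for text, page in zip(lines, pages):
        cleaned = text.strip()
        if CHAPTER.match(cleaned):
            rows.append((page, cleaned))
    return rows
```

Given the lines "ON RULES OF, GENDERS", "CHAPTER I.", "FEMININE GENDER." and "CHAPTER II." on pages 1, 1, 1 and 7, the result is [(1, "CHAPTER I."), (7, "CHAPTER II.")]. Other regular items, such as numbered rules, would need their own patterns.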
document prepared in spreadsheet.
All items in red except col 4 are constants in Bookmarks exported / imported as xml.
All