how to bookmark s

Upload: kedar-jonnalagadda

Post on 09-Apr-2018


  • 8/8/2019 How to Bookmark PDF Files


    By

    Kedarnath Jonnalagadda, Vaidika Gramam, Hyderabad 2011


    A book without bookmarks is the same as bookmarks without a book.

    1 Introduction

    This document illustrates methods that were used to easily bookmark thousands of PDF pages.

    A bookmark is a reference to the position of a desired detail in reading material.

    Reading material comprises traditional "books" and "documents" printed on paper, electronic files with

    extensions such as ".txt", ".doc", ".xls" and ".pdf", and web pages on the Internet.

    Bookmarks help to "home in" to items of specific interest in a book or even in other books, documents, or

    web pages.

    Conceptually, a bookmark has two parts (1) a "pointer", "link" or "hyperlink" in an electronic document

    and (2) the "pointed" area.

    To be of any use, "pointers" and the "pointed" need to be well structured and accurate. Familiar examples

    of these are the "Contents", "Index" and "List of References" pages in books and documents. The "Contents"

    page displays each topic and the page number where the topic starts. The "Index" page displays details of

    topics and where they can be located. The "List of References" entries are "pointers" to other books, documents or

    web sites.

    Compiling a list of bookmarks in a book or document is time and labor intensive. Such compilations can

    become another book. For example, in Sanskrit literature, [anukramAnika] are huge and detailed indexes

    of the [mantra] in the [veda] and of the [sutra] of [pANiNI].

    Old documents that have heritage value are Heritage Documents. These are increasingly being made

    available in PDF and other electronic forms. Among these, PDF files have greater appeal to readers

    because they are true to life pictures of antique original documents. A PDF can be made with (1) pictures of

    the original text only, or (2) pictures of the original text with additional processable text of the same.

    Processable text from pictures of text is useful for search and scholarly analysis and, most importantly, to

    assist preparation of Bookmarks! Processable text is obtained by manually keying in the text seen in pictures,

    or by using optical character recognition (OCR) software.

    In principle, using software to generate text from pictures is like employing a robot "typist" to see

    the pictures of text, recognize the characters and type it in using a keyboard. Practically, this is fraught with

    difficulties. Robot "typists" are notorious for the "mistakes" made. And troubles are compounded by the

    huge quantity of errors generated with great speed. Often you could spend more time correcting OCR

    mistakes than the time you may take to manually type in the text.

    Despite all the above, processable text that is reasonably readable is of very great use in preparing a list of

    bookmarks in the document. And such bookmarks can be a great help in correcting OCR mistakes too.

    2 Assumptions and Materials

    How To Bookmark PDF Files Of Heritage Documents


    2.1. A PDF of a Heritage Document is available, or you can make it.

    2.2. The PDF file has

    (a) pictures of text

    (b) analyzable text (OCR or manually input) true to the pictures

    2.3. Software / facility to extract OCR text from a PDF file.

    Some PDF readers allow a PDF file to be printed "as text", if there is processable text at all in the file.

    A-PDF Text Extractor has simple but very useful additional features. (a) The serial numbers of PDF pages in the file are prominently printed along with the extracted text. (b) Coordinates of each item of text on PDF pages are optionally printed along with the extracted text. http://www.a-pdf.com/

    2.4. Text editors such as Microsoft Notepad, supplied with the Windows operating system, and Notepad ++, available from http://www.notepad-plus-plus-org

    2.5. A combination of PDF software and tools

    2.5.1 Primarily, PDFILL Editor and PDFILL PDF Tools available from http://www.pdfill.com

    2.5.2 And support and testing with

    Foxit PDF Reader or Adobe Reader

    eXPert PDF Reader from http://www.visagesoft.com/

    2.6. On Screen Virtual Keyboard such as Click-N-Type Virtual from http://www.cnt.lakefolks.com

    2.7. Screen Magnifier such as Meazure from http://www.cthing.com

    The above 2 are necessary for ergonomic and/or accessibility reasons. Microsoft Windows provides such tools, which can be accessed via the menu route Start - All Programs - Accessories - Accessibility - On-Screen Keyboard and Magnifier.

    2.8. Spreadsheet software such as Calc, a part of Openoffice.org from http://www.Openoffice.org

    3 Definitions of Resources

    3.1. PDF technology offers resources for easy casual reading or scholarly study of books and documents. The main resources for these in PDF Readers are

    3.1.1 Bookmarks displaying contents of a book or document. These are similar to "Contents" pages in books with "topics" and the "page number" where the topic starts. When a topic of interest in the Bookmarks area is selected, the software displays the page or area of page where the topic of interest is located. Bookmarks can be viewed as a User Designed Contents Area with a list of "hyperlinks" to different pages and sections of pages.

    3.1.2 Comments are user made highlights, notes and drawing markups in a PDF file. PDF programs make a list of hyperlinks to these and display them in a separate area, Comments. This can be viewed as a facility for additional Bookmarks that can be modified, sorted and grouped.

    Generally Bookmarks are designed to reference items as they are in the original book or document. These do not leave any visible marks on the PDF pages.

    3.1.3 Pages is a software generated list of hyperlinks to individual pages in the file.

    3.1.4 Attachments is a facility to include other files in a PDF file.

    Bookmarks and Comments are probably two imperatives for proper study of Heritage Documents. The number of pages in such books can run into thousands. PDF file sizes can be 90 or 100MB and more.

    4 Needs

    4.1. Sizing things down - Large PDF files are unwieldy. Working with them can be frustratingly slow even on the fastest of computers. The easiest workaround is to split a large PDF file into files of convenient small size. One way is to split them Chapter-wise, same as in the original. Names given to the files ought to be self explanatory to the extent possible, and numbered such that the list can be sorted to display the same order as the original single document.

    For example,

    01-of-48-sv01-a-01-OF-5-akAratRutva-MW 4938173 bytes

    02-of-48-sv01-a-02-OF-5-anu-garjita.-MW 4911093 bytes

    03-of-48-sv01-a-03-OF-5-abhi-grah-MW 4979304 bytes

    4.2. Structured Placement of Files in Structured Folders

    All the split files ought to be in one suitably named folder.

    A record of lists of all files is easily made with the Microsoft Windows Command Prompt, accessed via the menu route Start - All Programs - Accessories - Command Prompt. Executing the command DIR > .txt generates a text file with a list of all files in a folder. A file name such as Files_in_Hertage_3_folder .txt will be self explanatory and easily remembered and identified.

    Powertoys is a set of additional Microsoft tools for XP computers. This includes the useful Command Prompt Here tool that enables an easy switch to the DOS directory from a Windows folder location. These tools are available from the http//www.microsoft.com downloads section.

    5. Reasons And Choice Of PDF Software

    5.1 Splitting PDF files is imperative for Heritage Documents because PDFs of original scanned files can be unmanageably huge on most computers. Concomitant with splitting is the need to Merge PDF files for a variety of reasons. These could be due to mistakes while splitting, or to include a page or pages as an enhancement for study. For example, a list of Abbreviations in every split file enhances the facility for better study of any file.

    PDFILL PDF Tools is a very comprehensive collection of tools to process PDF files and Images in them.

    5.2 Accessing split PDF files is most conveniently done with Bookmarks that can link to and open the split PDF files of a large book. Such a facility is provided in Foxit Reader, but the software chosen is PDFILL PDF Editor because of the highly innovative and easy method it provides for Export and Import of Bookmarks as an xml file.
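    Because Bookmarks link the split files by name, the naming and listing conventions of sections 4.1 and 4.2 matter here. The following is only a minimal Python sketch of those two ideas, not the method used for this document: the function names, the simplified "NN-of-TT-label.pdf" naming scheme and the listing file name are assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def split_name(seq: int, total: int, label: str) -> str:
    """Build a sortable name such as '01-of-48-akAratRutva.pdf'.

    The sequence number is zero-padded to the width of `total`, so a plain
    alphabetical sort reproduces the order of the original document.
    (Hypothetical, simplified version of the scheme shown in section 4.1.)
    """
    width = len(str(total))
    return f"{seq:0{width}d}-of-{total}-{label}.pdf"


def write_listing(folder: Path, listing_name: str) -> list[str]:
    """Write the sorted names of all PDF files in `folder` to a text file.

    Roughly what redirecting DIR output to a text file achieves, minus sizes and dates.
    """
    names = sorted(p.name for p in folder.iterdir() if p.suffix.lower() == ".pdf")
    (folder / listing_name).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names


if __name__ == "__main__":
    # Demonstration in a temporary folder with three empty stand-in files.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for seq, label in [(3, "abhi-grah"), (1, "akAratRutva"), (2, "anu-garjita")]:
            (folder / split_name(seq, 48, label)).touch()
        for name in write_listing(folder, "Files_in_folder.txt"):
            print(name)
```

    Zero padding is the design choice that matters: without it, file 10 would sort before file 2.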

    When you have 20 or 30 split files of a book, each of these must have a menu to access the other split files. Bookmarks to access the different split files can be created once and exported as xml. This can then be imported easily into any number of files of the same book or even another book. For example, references to the set of split files of an entire Sanskrit dictionary could be imported into every split file of a Sanskrit Grammar book.

    5.3 Creating Bookmarks

    Bookmarks enable access to the page with a selected item. Additionally they can enable display of particular areas of interest in a page with a desired zoom factor.

    Foxit PDF Reader or Adobe Reader, and eXPert PDF Reader from http://www.visagesoft.com/, enable quick manual creation of bookmarks with display of the desired area and zoom of PDF pages.

    5.3.1 Create a few bookmarks using the above and save the PDF.

    5.3.2 Use PDFILL PDF Editor to open the saved file and create a template by export of the Bookmarks as xml.

    5.3.3 Use the template to create a full set of Bookmarks as xml for the document.

    5.3.4 Import the completed set into PDFILL PDF Editor and create a complete PDF file with the full set of Bookmarks.

    6. Creating Bookmarks as xml - step 1

    6.1 The first requirement to create Bookmarks as xml is to have processable text of the PDF hidden in it. Adobe Reader enables output of PDF files as text, if the file has hidden or OCR text. Our current interest in the processable text of the PDF is to locate the page number of items of interest.

    6.2 Notepad++ enables a quick look at the processable text. It is a powerful text editor that can read and write xml files too. Bookmarks exported as xml with PDFILL PDF Editor are also read into Notepad++ to use as a template for a complete set of Bookmarks. Bookmarks exported as xml have this structure.

    CHAPTER III.

    3708. That which is

    called an affix, has an acute accent on its

    The areas of the code of functional interest to create our complete set of Bookmarks are

    1. Page The page no. where the items of our interest are located

    2. XYZ ... 1.5 This is the section of the page, and the last 1.5 means 150% zoom

    3. Color... This is the font style and color parameters of the bookmark

    4. > CHAPTER III This is the text appearing in the Bookmark as well as in the book

    7. Creating Bookmarks as xml - step 2


    7.1 The entire processable text from the PDF, read first in Notepad++, is copied and pasted into a spreadsheet.

    7.3 A column is inserted between the above two for the text that appears in the Bookmark. This is item 4 in the list of section 6.2.

    7.4 Next is locating page numbers in the book. Page numbers follow a pattern in books. They could be at

    the top or the bottom of the page. If at the bottom, they are usually the sole occupants of a single line. If at the

    top, they may be preceded by text on a right hand page. On a left hand page, text may follow the number.

    Our objective is to recognize this pattern and extract the numbers into a separate column having only numbers.

    So depending on the pattern we need to insert one or two columns adjacent left to the text column.
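    The three page number layouts described in 7.4 can be sketched outside the spreadsheet as well. The Python below is an illustration under stated assumptions, not the procedure used in this document: the pattern names, the function name and the limit of four digits are all hypothetical, and real running headers will need patterns tuned to the book in hand.

```python
import re

# Assumed layouts of a line that carries a printed page number.
SOLE_NUMBER = re.compile(r"^\s*(\d{1,4})\s*$")           # bottom of page: number alone on a line
NUMBER_THEN_TEXT = re.compile(r"^\s*(\d{1,4})\s+\D.*$")  # left hand page: number, then header text
TEXT_THEN_NUMBER = re.compile(r"^.*\D\s+(\d{1,4})\s*$")  # right hand page: header text, then number


def page_number(line: str):
    """Return the page number found in `line`, or None if no layout matches."""
    for pattern in (SOLE_NUMBER, NUMBER_THEN_TEXT, TEXT_THEN_NUMBER):
        match = pattern.match(line)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    for line in ["  212 ", "44 ON RULES OF GENDERS", "FEMININE GENDER. 45", "1. The Gender."]:
        print(repr(line), page_number(line))
```

    Numbered rules such as "1. The Gender." are skipped because the digit is followed by a full stop, not a space. Ordinary text that merely begins or ends with a number would still be picked up, so the results need the same visual check as a spreadsheet filter.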

    7.5 The full range of cells including headers is named in Openoffice.org via the menu route

    Data-Define Range. The spreadsheet software used could be Microsoft Excel or Gnumeric where

    this procedure might be slightly different.
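    As a companion to steps 7.1 to 7.5, the sketch below prepares the extracted text as a CSV file that Calc, Excel or Gnumeric can open. It is an assumption-laden illustration: the column names Control_SNo and Text_fromPDF are taken from the spreadsheet shown in section 8, but the idea that Control_SNo is a running serial number used to restore the original line order is a guess, and the function names are hypothetical.

```python
import csv
import io


def rows_with_serial(text: str):
    """Yield (Control_SNo, Text_fromPDF) pairs, one per non-empty line of text."""
    serial = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            serial += 1
            yield serial, stripped


def to_csv(text: str) -> str:
    """Return the extracted text as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Control_SNo", "Text_fromPDF"])
    writer.writerows(rows_with_serial(text))
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "ON RULES OF, GENDERS\nCHAPTER I.\n\nFEMININE GENDER.\n"
    print(to_csv(sample), end="")
```

    A serial number column of this kind lets the rows be sorted back into document order after any amount of filtering and sorting.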

    8. Extracting Items For Bookmarks xml

    menu route is Data - Filter

    8.2 Standard Filter is selected on the column Text_fromPDF. This gives a variety of options such as

    Contains, Does Not Contain, Begins With and Ends With. Our objective is to set a filter for page

    numbers or numerals.

    dialog box in Openoffice.org. The displayed rows will have digits 0 thru 9. Entering [0-9][0-9] will display

    double digit items and [0-9][0-9][0-9] will display three digit items.

    page identifying text, "=Page 1=", "=Page 2=" ... etc. When text extracted with this

    program is used, the Standard Filter can be a simple Contains "=Page"
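    The two filters of 8.2, a regular expression such as [0-9][0-9] and a plain Contains "=Page", can be tried out on a few lines of text before applying them to thousands of spreadsheet rows. This is a small Python sketch with hypothetical function names and sample lines. It assumes the filter shows a row when the pattern occurs anywhere in the cell, which may differ from the spreadsheet depending on its whole-cell matching setting.

```python
import re


def filter_contains(lines, needle):
    """Keep lines that contain `needle` as plain text (like the Contains option)."""
    return [line for line in lines if needle in line]


def filter_regex(lines, pattern):
    """Keep lines in which the regular expression occurs anywhere."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]


if __name__ == "__main__":
    sample = ["CHAPTER I.", "7", "44 ON RULES", "=Page 1=", "3708. That which is"]
    print(filter_regex(sample, "[0-9][0-9]"))   # lines with at least two consecutive digits
    print(filter_contains(sample, "=Page"))     # lines carrying the page identifying text
```

    When the extracted text carries an explicit page marker, the plain Contains test is the more reliable of the two, because a digit pattern also catches rule numbers and dates.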

    8.3 After an appropriate filter has been set for lines with page numbers, the text from the PDF file in the displayed lines

    needs to be copied into a blank column inserted adjacent left to Text_fromPDF and named

    appropriately, e.g. Identified_page_Nos. When a filter is set in OpenOffice.org, contents of cells cannot be

    simply copied and pasted. However, a formula giving reference to the text can be entered in the first cell of the blank column

    and copy pasted into the other cells of the filtered range. For example, the blank cell F47 takes a formula that refers to G47.

    8.4 The next step is to give every line its page number, by entering a formula to do the job for us. To enter the correct formula we need to understand what we want and state that explicitly. In this example we can see that a number of blank cells precede the identified page number. In other words, all lines above an identified line with a page number belong to that page. So, the logic in our formula would be: if the cell in Identified_page_Nos has a number then the cell in Identified_lines_and_pgs. should have that number, otherwise it should have the same number as that below it.

    Translated into a spreadsheet formula: IF(F143<>"";F143;E144)

    Row  C            D              E                          F                    G
         Control_SNo  Text_Bookmark  Identified_lines_and_pgs.  Identified_page_Nos  Text_fromPDF
    10   1                                                                           ON RULES OF, GENDERS
    11   2                                                                           CHAPTER I.
    12   3                                                                           FEMININE GENDER.
    13   4                                                                           q i 'fag*' u
    14   5                                                                           1. The Gender.
    15   6                                                                           Note: There are three
    16   7                                                                           * i '13ft' i wfasmnri sh ii
    17   8                                                                           2. The Feminine (Gender).
    18   9                                                                           These two are A'dhikara
    47                                                          G47                  = Page 1 =
    143                              IF(F143<>"";F143;E144)     G143                 = Page 2 =

    8.5 Having got the page nos. for all lines with the formulas, the results are copied and pasted special as numbers only.

    8.6 Items such as chapters, sections and any others that occur with regularity, or even specially needed items such as references to a particular author, can be identified by setting an appropriate filter. Such identified items are tagged or marked in a separate newly inserted column. The procedure is the same as was followed first for the occurrence of page numbers.

    8.7 Text_Bookmarks is the text displayed as Bookmark. When a filter is set for the different marked items described above, the related page numbers and the actual text in the PDF are displayed. This can be used as the text that is displayed in Bookmarks. There could be OCR and other errors but these can be corrected later in the spreadsheet or xml file and re-imported into PDFILL PDF Editor. After all work is done a final filter is set on this column to select non blank items. Now, the items are ready for copy and paste special as text.

    9 Getting the xml file ready in spreadsheet

    Bookmarks exported as xml, described in section 6, Creating Bookmarks as xml - step 1, are copied into the spreadsheet. This is a template that we may like to use for our Bookmarks.

    9.1 A separate sheet renamed xml_prefinal is used for this. This has 7 columns.
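    The fill rule of section 8.4 (a line takes its own identified page number if it has one, otherwise the number of the line below it) and the tagging of section 8.6 can be sketched in a few lines of Python. This is an illustration under assumptions, not a replacement for the spreadsheet procedure: the function names are hypothetical, blank cells are represented by None, and the sample lines are made up in the style of the example sheet.

```python
def fill_page_numbers(identified_page_nos):
    """Back-fill page numbers: each blank takes the nearest number below it.

    Walking upward from the last row mirrors the spreadsheet rule
    "own number if present, otherwise the value of the cell below".
    """
    filled = [None] * len(identified_page_nos)
    below = None
    for i in range(len(identified_page_nos) - 1, -1, -1):
        if identified_page_nos[i] is not None:
            below = identified_page_nos[i]
        filled[i] = below
    return filled


def bookmark_candidates(lines, pages, keyword):
    """Return (page, text) pairs for the lines that contain `keyword`."""
    return [(page, text) for page, text in zip(pages, lines) if keyword in text]


if __name__ == "__main__":
    lines = ["CHAPTER I.", "1. The Gender.", "= Page 1 =", "CHAPTER II.", "= Page 2 ="]
    identified = [None, None, 1, None, 2]
    pages = fill_page_numbers(identified)
    print(pages)
    print(bookmark_candidates(lines, pages, "CHAPTER"))
```

    Lines that follow the last page marker stay blank, which flags text whose page could not be identified.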


    document prepared in spreadsheet.

    All items in red except col 4 are constants in Bookmarks exported / imported as xml.

    All