ISSAC-2004 1

Extracting Math from PostScript Documents

Michael Yang, Univ. Calif., Irvine

Richard Fateman, Univ. Calif., Berkeley

Why Extract Math from Documents?

• The current and recent past publications of scholarly journals in mathematics are not adequately indexed.

• Imagine a query: “Find papers that involve this differential equation:”

$x^2 y'' + x y' + (x^2 - m^2) y = 0$

• Or “Is there a common name for this equation?” [Ans: yes, Bessel’s]

Why Extract Math from Documents?

• Find papers that may be relevant to a formula or a proof of a related theorem.

• Find out if a discovery is actually novel or a rediscovery of a previous result.

• Even: Is this formula true?

How can we search, anyway?

• Search in integral tables using hashing, flexible pattern matching.
– Example: TILU (Fateman, Einwohner)

• The general problem looks like a huge challenge of unification with simplifications of analytic functions. Is $a = f(b)$ the same as $f^{-1}(a) = b$?

These are obviously hard questions

• But we are much better off if we can start with a few decades of the most recent math papers and their formulas to search.

• Prerequisite: encoding of formulas with semantic markup, the point of this paper.

Why start with PostScript or PDF?

• We have many papers, including math journals, online, some of them free, with essentially all markup removed, stored for printing as PS or PDF.

• Automation of inserting the markup, even if only partly successful, can help enable further work to make it possible to index and search for math.

Is this easier or harder than OCR?

• It should be easier, because all the characters are known as error-free glyphs.

• OCR tends to make erroneous symbol identifications if there is inadequate word-based context.

• For example o0O°º, 1lI|!i , Illinois (!), -_=

• Well-known sources of PS provide stereotypes for the font/glyph/location mapping.

• But it could be harder if the PostScript is truly obscure (PS is Turing equivalent, after all)

An Example

From a paper by Cyril Banderier et al., “Random Maps, Coalescing Saddles, Singularity Analysis, and Airy Phenomena,” Random Structures and Algorithms, 19(3-4), 194–246 (2001), only slightly edited by inserting newlines. [explain origin]

....0.002 0.0025 200 400 600 800 1000 k Figure 3. Left: The standard Airy distribution. Right: Observed frequencies of core sizes k 2 [20; 1000] in 50,000 random maps of size 2,000, showing the bimodal character of the distribution. variety of integral or power series representations including (see [1, 45]) 1) Ai(z) 1 2 Z 1 1 e i(zt t 3 =3) dt = 1 3 2=3 1 X n=0 3 1=3 z n ( n 1) 3) n sin 2(n 1) 3 :Equipped with this de nition, we present the main characterof the paper, a probability distribution closely related to the Airy function. De nition 1. The standard ....

What is this really?

In this particular case, extraction of the document image shows two formulas in the middle of the citation:

How could we encode this image?

Recognize the characters on the page as equivalent to a TeX expression, for example:

$$\mbox{Ai}(z) = {1\over{2 \pi}}\int _{-\infty}^{+\infty} e^{i(zt+t^3/3)}dt$$
$$~~= {1 \over {\pi 3^{2/3}}}\sum_{n=0}^\infty (3^{1/3}z)^n {{\Gamma((n+1)/3)} \over {n!}} \sin {{2(n+1)\pi}\over 3}.$$

or some alternative in MathML or OpenMath.

What are the barriers to getting to this point?

Detecting Math in the first place

• Look for changes in font, italics, font size changes, altered baselines.

• Consider the density of text (formulas are low density).

• Notice the presence of special characters unusual in text: = is common in math, but not in text (Also +, -, parens).
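
As a rough illustration of the special-character cue, the sketch below scores a line of extracted text by the share of its characters that are typical of formulas. The character set, the function names and the 0.15 threshold are assumptions chosen for illustration; they are not values from this work.

```python
# Hypothetical sketch: flag lines whose share of math-like characters is high.
MATH_CHARS = set("=+-()^_/<>|")  # assumed set of characters that are rare in prose


def math_char_ratio(line: str) -> float:
    """Return the fraction of non-space characters that are math-like."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in MATH_CHARS for c in chars) / len(chars)


def looks_like_math(line: str, threshold: float = 0.15) -> bool:
    """Guess whether a line is a formula; the threshold is an assumed value."""
    return math_char_ratio(line) >= threshold
```

Hyphenated or heavily parenthesized prose also raises this score, so a usable detector would presumably combine it with the font, size and baseline cues listed above.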

Implementation

• Run PostScript through a modified Ghostscript (PS interpreter) to output text file information suitable for geometric/math processing.

• Run this file through previously developed OCR-based technology (in Lisp) for using bounding-boxes, contents, positions,… to create a geometric 2-D “relative position” tree. Process further to identify semantic relationships if possible and output a hierarchical tree-representation of math formulas.

• Convert this to TeX (could be MathML equally well).
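
To make the “relative position” step concrete, here is a minimal sketch that classifies each glyph as inline, superscript or subscript from its bounding box and emits a TeX string. The `Glyph` record, the 0.4 height fraction and the function names are hypothetical; the actual system is written in Lisp and handles far more relations (fractions, radicals, limits of sums and integrals).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x0: float  # left edge
    y0: float  # bottom edge (PostScript coordinates: y grows upward)
    x1: float  # right edge
    y1: float  # top edge


def relation(base: Glyph, nxt: Glyph, frac: float = 0.4) -> str:
    """Classify nxt relative to base as 'sup', 'sub' or 'inline'."""
    height = base.y1 - base.y0
    if nxt.y0 >= base.y0 + frac * height:  # bottom sits well above base bottom
        return "sup"
    if nxt.y1 <= base.y1 - frac * height:  # top sits well below base top
        return "sub"
    return "inline"


def to_tex(glyphs: List[Glyph]) -> str:
    """Convert a left-to-right run of glyphs into a TeX string."""
    out = []
    base = None
    open_script = None  # 'sup', 'sub' or None
    for g in sorted(glyphs, key=lambda g: g.x0):
        rel = "inline" if base is None else relation(base, g)
        current = None if rel == "inline" else rel
        if current != open_script:
            if open_script is not None:
                out.append("}")  # close the script group that was open
            if current == "sup":
                out.append("^{")
            elif current == "sub":
                out.append("_{")
            open_script = current
        out.append(g.char)
        if rel == "inline":
            base = g  # only inline glyphs serve as the reference baseline
    if open_script is not None:
        out.append("}")
    return "".join(out)
```

Consecutive raised or lowered glyphs are grouped into one script, so a two-digit exponent comes out as `^{12}` rather than two separate superscripts. Glyphs with descenders, or punctuation sitting low on the line, would confuse a rule this simple.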

Possible Future Work

• Better font tools

• Look at more producers of PS (not just TeX and dvips), e.g. Acrobat Distiller.

• Run some tests (NEC) to see if we can extract sufficient formulas to add to the indexing information.

• Examine the issue of “formula similarity”, e.g. parameter substitution, simplification, rearrangement. (This is relatively easy in the context of integration because there is a designated variable of integration.)
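
One small piece of “formula similarity”, matching up to renaming of parameters, can be sketched as follows. The tokenizer, the convention that single letters are variables and longer names are functions, and the name `canonical_form` are all assumptions made for this illustration. The sketch does nothing about simplification or rearrangement, which are the hard parts.

```python
import re

# Assumed tokenization: runs of letters, runs of digits, or any single symbol.
TOKEN = re.compile(r"[A-Za-z]+|\d+|\S")


def canonical_form(formula: str) -> str:
    """Rename single-letter variables to v0, v1, ... in order of first use."""
    names = {}
    out = []
    for tok in TOKEN.findall(formula):
        if len(tok) == 1 and tok.isalpha():
            names.setdefault(tok, f"v{len(names)}")
            out.append(names[tok])
        else:
            out.append(tok)  # numbers, operators and multi-letter names
    return " ".join(out)
```

Two formulas that differ only in the letters chosen then share a canonical string, which could serve as a hash key in an index. Products must be written with an explicit `*`, since the tokenizer reads `xy` as one name.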

Conclusions

• It’s possible to automatically revisit previously typeset documents and invent plausible versions of TeX source-code for some, perhaps much, of published TeX.

• This provides an additional link in a chain which may eventually lead to more widespread semantic encoding of math for indexing and retrieval.

• Given the difficulties, a better route for the future is to have authors or editors use semantic mark-up for “born digital” mathematical documents. Publishers should encourage this kind of work, although standards are currently disappointing.

Another paper, not included

• Submitted to ISSAC-2004

• Author: R. Fateman

Rational Function Computing with Poles and Residues

• Here’s the idea: consider 2 forms for the same rational expression.

Which form is better?

• Generality of representation

• Complexity (Cost) of operations
– Arithmetic (+, *, /)
– Integration, derivatives, limits, series, …
– Numerical evaluation
– Display for human viewing

Keep constant numerators over (powers of) linear denominators (+ polynomial)

• Works for encoding arbitrary rational functions (over complex numbers) in one variable.

• Plausibly requires high-precision floats if you start with ratio of polynomials where the roots of the denominator cannot be expressed as exact rational numbers.
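
For concreteness, the representation described here is the familiar partial-fraction form. The symbols $p$, $a_i$, $m_i$ and $c_{i,k}$ below are notation introduced for this illustration, not taken from the slides:

```latex
f(x) = p(x) + \sum_{i} \sum_{k=1}^{m_i} \frac{c_{i,k}}{(x - a_i)^k}

% Rational poles: everything stays exact.
\frac{1}{x^2 - 1} = \frac{1/2}{x - 1} - \frac{1/2}{x + 1}

% Irrational poles: the a_i and c_{i,k} are no longer rational numbers.
\frac{1}{x^2 - 2} = \frac{1/(2\sqrt{2})}{x - \sqrt{2}} - \frac{1/(2\sqrt{2})}{x + \sqrt{2}}
```

The second example shows where the floats come in: once the denominator has roots such as $\sqrt{2}$, both the poles and the numerators have to be approximated unless algebraic numbers are carried symbolically.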

PRO: Once you have this representation

• Addition of rational functions is essentially free, compared to standard representation since no polynomial GCD is required.
– a/b + c/d is already simplified except for sorting and the possibility that b=d

• Multiplication of rational functions is inexpensive also, again no GCD needed.
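
A minimal sketch of why these operations are cheap, assuming a rational function with no polynomial part is stored as a dictionary from (pole, power) to a constant numerator. The data layout and the names `add`, `mul_simple` and `evaluate` are hypothetical and are not the implementation that was benchmarked. Multiplication is shown only for simple poles, using 1/((x-a)(x-b)) = (1/(a-b)) (1/(x-a) - 1/(x-b)) for a ≠ b.

```python
from fractions import Fraction

# Assumed layout: {(pole, power): c} stands for the sum of c / (x - pole)**power.


def _accumulate(result, key, value):
    """Add value to result[key], dropping the term if it cancels to zero."""
    total = result.get(key, 0) + value
    if total == 0:
        result.pop(key, None)
    else:
        result[key] = total


def add(f, g):
    """Add two pole/residue forms by merging terms; no GCD is computed."""
    result = dict(f)
    for key, coeff in g.items():
        _accumulate(result, key, coeff)
    return result


def mul_simple(f, g):
    """Multiply two forms whose poles are all simple (power 1)."""
    result = {}
    for (a, p), c in f.items():
        for (b, q), d in g.items():
            if p != 1 or q != 1:
                raise ValueError("this sketch only multiplies simple poles")
            if a == b:
                _accumulate(result, (a, 2), c * d)  # same pole: a double pole
            else:
                k = c * d / (a - b)  # split the product into two simple poles
                _accumulate(result, (a, 1), k)
                _accumulate(result, (b, 1), -k)
    return result


def evaluate(f, x):
    """Evaluate the form at x, which must not be a pole."""
    return sum(c / (x - pole) ** power for (pole, power), c in f.items())
```

With `Fraction` numerators and rational poles the arithmetic stays exact. Floating-point or complex poles fit the same dictionary, but then testing whether two poles are equal becomes a tolerance question.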

CON: Do you want to use this representation?

• Division is not fast, so it is more appropriate if division is infrequent.

• If the input is not already in residue/pole form, or if you have to do division, finding zeros introduces approximations [maybe for the first time in a problem].

• Output forms may look longer.

Examples

• Ordinary addition: orders of magnitude faster, e.g. 45,000 times faster.

• Ordinary multiplication: maybe 2X faster

• What about mixtures of + and * together? What important algorithms are there?

• Sparse determinant calculation.

A determinant benchmark

• Consider matrices with entries of this form:

• Determinant of 8X8 matrix in Macsyma 2.4, on a 2.6GHz Pentium 4 computer.
– Using Gaussian Elimination: 112 sec
– Using Minor Expansion: 109 sec
– Using Residues/Poles (75% in bignum arithmetic): 41 sec
– Using Residues/Poles and double-floats: 1.6 sec

Conclusions

• No surprise that avoiding GCDs is a winner.

• Using approximate calculations can provide huge speedups. Do we really need exact computation everywhere we provide it?

• We have a potential application for high-precision zero-finding, as well as non-overflowing software floats (GMP, ARPREC)