Lessons from the trenches: A survey of approaches for large-scale projects

Robert Miner ∗

Design Science, Inc., 413 Wacouta Street, Suite 550, St. Paul, MN, 55101, USA.

Technical documents are generally difficult to author, publish and maintain due to the complexity inherent in mathematical notation, diagrams, and tables. This article surveys approaches for addressing these challenges used by large-scale, Web-based projects involving ongoing workflows. Such workflows fall into four broad categories: the XML Repository model, the TeX Repository model, the Page Layout model, and exotic models.

1 Introduction

Publication of mathematical and scientific material has long been problematic. The basic issue is that while math is like text in some ways, it is more two-dimensional and diagrammatic, and thus difficult to capture in print. Further, notation is so central to mathematics as a thought aid that it resists standardization, and even standard notational forms are complex, hard to linearize, and hard to typeset compared to plain text. The use of the Web as a means of communication inherits these longstanding challenges of mathematical communication and adds several more, though it also promises new approaches and solutions.

2 Large-scale projects

Large-scale, web-based mathematics and science publishing projects are expected to match functionality that is commonplace for plain text information sources. Many areas of functionality that are now considered standard for text publishing projects are challenging to provide for content containing mathematics. These areas include accessibility, searching, interactivity and collaboration tools (forums, whiteboard, chat, etc.). In some cases, this is because new algorithms and techniques must be developed for mathematics, as with accessibility and search. In others, it is because standard software tools and techniques for providing such functionality rarely include support for mathematics.

A hallmark of large projects is that they are ongoing and need a sustainable publication workflow. Consequently, math-capable tools and techniques are required for all phases of traditional publishing workflows: authoring, editing and proofing, conversion, layout and design, validating markup and enforcing style guidelines, composition, and versioning and content management. The ability to set and enforce policies for these kinds of activities provides the kind of quality control and sustainability that is expected of large-scale information resources.

The last decade has seen the emergence of a handful of general types of large-scale, web-based math and science information resources. The backbone of academic and industrial research is formed by digital libraries of research articles. Some are aimed primarily at current material (the ArXiv preprint server, IEEE Xplore, Elsevier ScienceDirect, etc.) while others focus on making legacy literature available (NUMDAM, WDML, JSTOR, etc.). A closely related type of resource is the abstracting and indexing service, such as Zentralblatt and MathSciNet, or the experimental search services Mathdex and MathWeb.

Another successful type of resource, one that aims at a wider audience, is the reference work or encyclopedia. Some notable examples of this kind are MathWorld, the Digital Library of Mathematical Functions, Planet Math, Wikipedia, and the Online Encyclopedia of Integer Sequences. A final type of information resource deserving mention is the online educational resource. These are not so homogeneous in their characteristics, but rather span assessment systems (e.g. Webworks, ETS), course management systems (eCollege, Moodle, Blackboard, etc.), and interactive learning materials (LeActiveMath, Explore Math, Connexions, etc.).

3 Common approaches to technical challenges

Large-scale, web-based Math on the Web projects must address three basic technical challenges if they are to be successful: display of math on the Web, authoring of mathematical content, and creation of a math-aware workflow process and tool set. Sadly, after a decade of work, the most basic of these challenges, displaying notation, remains among the most problematic. Various display strategies are in common use, none entirely satisfactory. In practice, these techniques are often combined, through use of XSL, and/or server-side and client-side content negotiation.

∗ e-mail: [email protected], Phone: +1 651 223 2883

PAMM · Proc. Appl. Math. Mech. 7, 1010505–1010506 (2007) / DOI 10.1002/pamm.200701100

© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

PDF, possibly with HTML abstracts, is typical of journals and is the most prevalent technique. Unfortunately, PDF is among the worst solutions in terms of making content available to machine agents. HTML with images is also common, and has the appeal of nearly universal support in browsers, though the display quality is generally low. Consequently, image-based techniques are often enhanced with the use of CSS and JavaScript. Good examples of this are ASCIIMath and jsMath. After a major push from the World Wide Web Consortium, XHTML+MathML has some currency as a display format, and is supported by Firefox with extra fonts installed or Internet Explorer with Design Science’s MathPlayer extension installed. However, such browser requirements are typically too limiting for publishers seeking to reach general audiences. Thus, publication of XHTML+MathML mostly arises in specialized contexts where high functionality is at a premium, or as a supplement to other methods.
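As an illustration of the server-side content negotiation mentioned earlier, the following is a minimal sketch, assuming a server that can deliver the same article either as XHTML+MathML or as HTML with formula images. The function name `choose_math_format` and the two return labels are hypothetical, and the header parsing is deliberately simplified rather than a full HTTP implementation.

```python
def choose_math_format(accept_header: str) -> str:
    """Pick a display format from an HTTP Accept header (simplified sketch)."""
    # Parse entries such as "text/html;q=0.9" into {media_type: quality}.
    prefs = {}
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        prefs[media_type] = quality

    # Serve XHTML+MathML only when the client explicitly accepts it.
    if prefs.get("application/xhtml+xml", 0.0) > 0.0:
        return "xhtml+mathml"
    # Otherwise fall back to HTML with images of the formulas.
    return "html+images"
```

The fallback is the image-based format because, as noted above, it is the one with nearly universal browser support.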

For authoring content, there are three models in wide usage. The first allows author submissions in a variety of formats, which are then converted to a common format. This is typically done offshore, and mostly involves converting and/or rekeying Word or LaTeX documents into XML. This model is typical of commercial journal production. A second model that is gaining currency is the use of custom-built, browser-based authoring systems. These are the first choice for authoring databases of highly structured content, such as SCORM learning objects, problem banks for online assessment systems, and so on. The last model for authoring content, used primarily by society and small publishers, is to require LaTeX submissions from authors, using publisher-provided templates.

There are four basic models for math-aware workflows in widespread usage. The first is the XML Repository model. Some examples of publishers using this model are AIP, IEEE, Elsevier, Airbus, PubMed, Connexions, and many others. Although the model has been pushed heavily by tool vendors and industry visionaries, implementation has come slowly, but it is now probably dominant. In this model, author submissions are converted to XML. Copy editing and proofing are done directly in XML (using XML editors such as Arbortext or XMetaL with add-on math support such as Design Science MathFlow). Following copy editing, a batch processing step validates content and applies style guidelines (e.g. via XSL and scripts). Finally, content is composed using industry standard software such as XyEnterprise XPP or PTC’s Advanced Print Publisher, which have native support for MathML. Alternatively, mathematics is composed to EPS or other image formats by specialized software, and incorporated into final page content.
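The batch validation step in the XML Repository model can be sketched roughly as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree`, not any publisher's actual pipeline; the function name `check_article` and the house rule that every `math` element must carry an explicit `display` attribute are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

def check_article(xml_text: str) -> list[str]:
    """Return a list of problems found in one article (empty list = passes)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Step 1: the content must at least be well-formed XML.
        return [f"not well-formed: {exc}"]

    problems = []
    for elem in root.iter():
        # ElementTree spells namespaced tags as "{namespace-uri}local-name".
        if elem.tag == "math":
            problems.append("math element outside the MathML namespace")
        elif elem.tag == f"{{{MATHML_NS}}}math":
            # Step 2: hypothetical style guideline, assumed for illustration.
            if elem.get("display") not in ("inline", "block"):
                problems.append("math element without display='inline' or 'block'")
    return problems
```

In a production setting this kind of check would more likely be expressed against a DTD or schema and in XSL, as the text describes; the script form is used here only to keep the example self-contained.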

The second workflow model is the TeX Repository. Examples include MathSciNet, Planet Math, MathWorld, DLMF, ArXiv, and many others. It is most common in specific disciplines where LaTeX is dominant. In this model, authors must know or learn LaTeX. Copy editing is also done in LaTeX. Composition is accomplished via the TeX engine and the use of custom style files.

The third model is a Page Layout model, which is mostly used for highly designed content such as magazines and textbooks. Examples include IEEE Spectrum, and textbooks from large publishers such as Pearson and Houghton-Mifflin. Author content is imported into QuarkXPress or Adobe InDesign. Production is handwork by experts. Composition is accomplished via native tool functionality, often with add-in modules for math support. When other media formats are required, they are generated as necessary from the page layout source, though as a general rule, this doesn’t work particularly well.

Finally, there are a handful of exotic models. These systems tend to aim at special-purpose, streamlined functionality. Examples include Wikipedia, ETS, Webworks, and SCORM. They generally use custom-built, web-based systems, and their repository formats are adapted to their specialized tasks. A custom-built production software and publishing environment is central to their function.

4 Other persistent problem areas

Regardless of the display, authoring and workflow choices a publisher makes, there are still a couple of perennial problem areas. Fonts for mathematics top this list. Most new software tools are now based on Unicode, but Unicode is a moving target, and many good, older math fonts don’t have Unicode encodings. The problem of variant glyphs is pervasive and tool-specific. Adding fonts to client machines is problematic because users may not have the inclination, knowledge or authority, and there are intellectual property issues with distributing fonts. Server-side font management is problematic because Web font technologies are limited and expensive.

Another persistent problem area concerns compound documents, utilizing several kinds of XML markup. In general, there is no single, dominant, standard XML document type for scientific documents. Some tools rely on namespaces, others don’t support them, or interpret them differently. Custom tool setup is often required for each document type. Changes tend to cascade, e.g. a DTD change requires style sheet changes, tool changes, etc.
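To make the compound-document issue concrete, here is a small sketch, again with Python's standard `xml.etree.ElementTree`, that reports which XML namespaces a document mixes. The function name `namespaces_used` and the sample document are illustrative assumptions; the two namespace URIs are the standard ones for XHTML and MathML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def namespaces_used(xml_text: str) -> Counter:
    """Count elements per namespace URI ('' means no namespace)."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree spells namespaced tags as "{namespace-uri}local-name".
        if elem.tag.startswith("{"):
            uri = elem.tag[1:].split("}", 1)[0]
        else:
            uri = ""
        counts[uri] += 1
    return counts

# A compound document: XHTML markup with an embedded MathML island.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>The square of r is
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <msup><mi>r</mi><mn>2</mn></msup>
      </math>
    </p>
  </body>
</html>"""

for uri, count in namespaces_used(doc).items():
    print(uri, count)
```

A namespace-aware tool sees two vocabularies here; a tool that ignores namespaces sees only a flat set of element names, which is one way the interpretation differences described above arise.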

5 Conclusions

Many large-scale Mathematics on the Web projects are in operation. While they are in rough parity with other large-scale disciplinary Web resources, their complexity and expense are higher. Good solutions for Web display of mathematics, fonts and compound document problems remain elusive. Tool support is improving, but much custom integration and development is still generally required.

ICIAM07 Minisymposia – 01 Computing 1010506