tech 802: data, databases & xml
DESCRIPTION
Monday, January 14, 2012 presentation on three data types (unstructured, structured and semi-structured) and how XML plays a role in content management systems, ONIX (bibliographic data sharing), RSS (Really Simple Syndication) and XML-first publishing for ebooks.
TRANSCRIPT
Data, Databases & XML: A Crash Course
Monique Sherrett
monique@boxcarmarketing.com
3 Types of Data
Unstructured Data
• eg. Word documents, PDFs, audio/video files, emails
• No search
• No version control
Structured Data
• eg. Inventory management database, WordPress
• Searchable
• Version and user control (secure access)
• Relationship structures (show everything tagged “winter”)
• Import / Export
• Display options
• Machine readable; run queries against the data
Semi-Structured Data
• eg. XML (HTML, ONIX, RSS)
• Formal/standardized data
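As a rough illustration of "run queries against the data", here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are all made up for the example; they are not WordPress's actual schema.

```python
import sqlite3

# In-memory database with two hypothetical tables: posts and their tags.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tags (post_id INTEGER REFERENCES posts(id), tag TEXT);
INSERT INTO posts VALUES (1, 'Winter reading list'), (2, 'Summer sale'), (3, 'Holiday hours');
INSERT INTO tags VALUES (1, 'winter'), (2, 'summer'), (3, 'winter');
""")

def titles_tagged(conn, tag):
    """Return the titles of all posts carrying the given tag, in id order."""
    rows = conn.execute(
        "SELECT posts.title FROM posts "
        "JOIN tags ON tags.post_id = posts.id "
        "WHERE tags.tag = ? ORDER BY posts.id",
        (tag,),
    )
    return [title for (title,) in rows]

# "Show everything tagged winter"
winter_titles = titles_tagged(conn, "winter")
print(winter_titles)
```

The same question asked of a folder of Word documents would mean opening each file by hand, which is the practical difference between structured and unstructured data.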
Structured Data: WordPress
• Open source content management system (CMS) based on PHP and MySQL
– Open source: source code is freely available, which encourages development by many independent programmers.
– CMS: a database + presentation layer (set of templates)
– MySQL: a type of database
– PHP: a scripting language designed to produce dynamic web pages
• Plugin architecture (Akismet for spam, SEO by Yoast, WP to Twitter, etc.)
• Pages & Posts
• Categories & Tags
Pages vs Posts
Page (~unstructured)
• Static content, won’t change frequently
• eg. About page
• Can be organized manually in a hierarchy: page (parent) and subpages (children)
– About Us > Team; About Us > History
Post (~structured)
• Frequently updated content, dynamically organized in a hierarchy (chronological, category), plus archive
– News articles, event information
– Frequently published in an RSS feed that users subscribe to
Semi-Structured Data: RSS
• Really Simple Syndication or Rich Site Summary
• Publish it. Subscribe to it. Pull it into other websites.
• RSS is a standardized XML file format.
WordPress As Database
• Instead of a series of HTML files, WordPress offers a system that allows for the organization and efficient storage & retrieval of information.
– Structured data can be exported into semi-structured data (RSS, XML)
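To make the export idea concrete, here is a minimal sketch that turns a few post records into an RSS 2.0 document with Python's standard xml.etree.ElementTree module. The post records, site name and URLs are hypothetical, and this is not how WordPress itself generates its feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical post records, standing in for rows in a CMS database.
posts = [
    {"title": "Winter launch",
     "link": "https://example.com/winter-launch",
     "description": "New titles for winter."},
    {"title": "Spring events",
     "link": "https://example.com/spring-events",
     "description": "Readings and signings."},
]

def posts_to_rss(site_title, site_link, posts):
    """Build a minimal RSS 2.0 document from a list of post dictionaries."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 expects a channel to carry a title, link and description.
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_link
    ET.SubElement(channel, "description").text = "Latest posts"
    for post in posts:
        item = ET.SubElement(channel, "item")
        for field in ("title", "link", "description"):
            ET.SubElement(item, field).text = post[field]
    return ET.tostring(rss, encoding="unicode")

feed_xml = posts_to_rss("Example Press", "https://example.com", posts)
print(feed_xml)
```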
RSS is XML
• eXtensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both machine- and human-readable.
• RSS, XHTML (as found in an unzipped EPUB) and ONIX (ONline Information eXchange, a standard for sharing bibliographic data) are some of the hundreds of XML-based languages that have been developed.
• How might we use XML for the Tech Project?
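Because RSS is plain XML, any XML parser can read it. The sketch below pulls titles and links out of a small feed; the feed text is invented for the example, and a real subscriber would first download it from the site's feed URL.

```python
import xml.etree.ElementTree as ET

# A small, made-up RSS 2.0 feed used only for illustration.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Press News</title>
    <link>https://example.com</link>
    <description>Latest posts</description>
    <item><title>Winter launch</title><link>https://example.com/winter-launch</link></item>
    <item><title>Spring events</title><link>https://example.com/spring-events</link></item>
  </channel>
</rss>"""

def item_titles(rss_text):
    """Return (title, link) pairs for every item in the feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in item_titles(rss_text):
    print(title, link)
```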
[Diagram: Current db → Export to XML → Rename / Modify XML → Import from XML → New db]
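The export, rename and import steps in the diagram could look something like the following sketch. The element names, the rename mapping and the sample records are all hypothetical; a real migration would use whatever field names the two databases actually have.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML as exported from the current database.
exported = """<records>
  <record><name>Winter Reads</name><cat>news</cat></record>
  <record><name>About the Press</name><cat>page</cat></record>
</records>"""

# Hypothetical mapping from the old field names to the new database's names.
RENAMES = {"name": "title", "cat": "category"}

def rename_fields(xml_text, renames):
    """Rename any element whose tag appears in the mapping."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag in renames:
            element.tag = renames[element.tag]
    return ET.tostring(root, encoding="unicode")

def import_records(xml_text):
    """Read each record into a dictionary, ready to insert into the new database."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in record}
            for record in root.findall("record")]

converted = rename_fields(exported, RENAMES)
new_rows = import_records(converted)
print(new_rows)
```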
ONIX is XML
• International standard for representing and communicating book and product info in electronic form
– text-readable (human & computer)
– tagged/markup
– transferred by email or FTP (File Transfer Protocol)
– More info: Bisg.org
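For a sense of what tagged bibliographic data looks like, here is a heavily simplified, ONIX-style record and a short script that reads it. The element names follow ONIX 3.0 reference tags as I understand them, but the record omits the namespace, header and most required fields, and the ISBN and title are placeholders; check the ONIX specification before relying on any of this.

```python
import xml.etree.ElementTree as ET

# Simplified ONIX-style fragment (assumption: not a complete, valid ONIX message).
onix_text = """<ONIXMessage>
  <Product>
    <RecordReference>example-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleType>01</TitleType>
        <TitleElement>
          <TitleElementLevel>01</TitleElementLevel>
          <TitleText>An Example Book</TitleText>
        </TitleElement>
      </TitleDetail>
    </DescriptiveDetail>
  </Product>
</ONIXMessage>"""

def read_products(onix_text):
    """Extract the ISBN-13 and title from each Product in the message."""
    root = ET.fromstring(onix_text)
    products = []
    for product in root.findall("Product"):
        isbn13 = None
        for identifier in product.findall("ProductIdentifier"):
            # ProductIDType 15 is the ONIX code for ISBN-13.
            if identifier.findtext("ProductIDType") == "15":
                isbn13 = identifier.findtext("IDValue")
        title = product.findtext(
            "DescriptiveDetail/TitleDetail/TitleElement/TitleText")
        products.append({"isbn13": isbn13, "title": title})
    return products

print(read_products(onix_text))
```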
[Diagram: Publisher db → Export to ONIX & FTP file to Server → Server → Grab file from Server & Import from ONIX → Bookseller db]
EDI: Electronic Data Interchange
• Structured (db to db) transmission of data
• Often in an XML tagged format
Questions on XML?
• Data, database questions?
• Tech project?
WEBCAST
A Roadmap to Efficiently Producing Multi-Format/Multi-Screen eBooks
Lessons from Market Innovators
November 8, 2012
Speakers
• Thad McIlroy, electronic publishing analyst and author, The Future of Publishing
• Stephen Driver, Vice President, Production Services, The Rowman & Littlefield Publishing Group
XML Workflows for eBooks
XML Adoption by Sector
[Chart comparing sectors: STM, Educational, Trade]
XML Defined
XML is:
• A device-independent, system-independent method of storing and processing electronic text
• Markup for form and/or meaning
• A data interchange format used by many applications on the Web
XML Provides Real Solutions
• But it is a big, ugly, unwieldy bear
• And its conceptual metaphors bear little resemblance to book publishing
• It’s based on 25-year-old thinking about technical documents and ecommerce
• Yet it’s the only real game in town
• ONIX book metadata is enabled by XML
The Importance of XML
• XML enables content management
• Separates form from content
• Combines style sheets with the power of databases in an extensible language
• Its long-term killer feature is semantic markup: marking up meaning, making text discoverable
• Future-proofing content
XML Tagging
Semantic tagging requires human judgment but offers the benefit of meaning

    <book price="49.95" ISBN="string" publicationdate="2012-12-09">
      <title>string</title>
      <author>
        <first-name>string</first-name>
        <last-name>string</last-name>
      </author>
      <genre>string</genre>
    </book>
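To show what "the benefit of meaning" buys in practice, here is a minimal sketch that reads the sample book record with Python's standard XML parser. The record is the one from the slide, with straight quotes so that it parses; the variable names are my own.

```python
import xml.etree.ElementTree as ET

# The sample record from the slide, with placeholder "string" values left as-is.
book_xml = """<book price="49.95" ISBN="string" publicationdate="2012-12-09">
  <title>string</title>
  <author>
    <first-name>string</first-name>
    <last-name>string</last-name>
  </author>
  <genre>string</genre>
</book>"""

book = ET.fromstring(book_xml)

# Attributes and child elements can be addressed by name, not by position on the page.
price = float(book.get("price"))
published = book.get("publicationdate")
author_last_name = book.findtext("author/last-name")
genre = book.findtext("genre")

print(price, published, author_last_name, genre)
```

Because each value is labelled, a program can find the price or the author without guessing from layout, which is what makes tagged text discoverable.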
Structured Tagging by Authors?
Typéfi sample approach
If you show this to editors... “They’re going to start drinking at their desks”
Templated Designs
How much book content fits into automatic composition?
The Human Factor
New Internal Skills & Positions
• The production skill set changes substantially
• Much of the existing knowledge base changes or becomes obsolete
• A move from design, composition & production management to content & product architecting and engineering
• There is an enormous training challenge ahead
Key Takeaways
• XML is complex, but packed with value
• XML is not an all-or-nothing deal
• You should start with small steps
• XML’s complexity demands outside help
• Services, consultants, trainers, associations
• The rapid proliferation of output formats can only be mastered with a structured approach like XML
Obstacles to using XML
• XML is intimidating, full of jargon
• We’re editors, not programmers
• And what about the authors?
• You mean I can’t move that line of text half a pica?! And other design concerns
• Editorial, or “my book’s too good for a template”
So how’d we solve it?
• We manipulated XML to our uses, not the other way around
• We still used authors’ Word documents as the source
• Template interiors were something we had already been doing for years
• XML coding was translated into a coding structure virtually all production people know: typesetting short tags
• We adapted existing XML approaches to our specific needs by discarding coding that didn’t fit our content
But weren’t there problems?
A Multi-Channel Workflow Example
1. Word document received from author
2. Word file coded for XML conversion (resembles standard typesetting short tags)
3. Typesetting short tags replaced with XML via conversion process (some file editing required)
4. Final PDF generated after style template applied to XML file.
EPUB, .mobi and WebPDF generated.
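Step 3 of the workflow above, replacing short tags with XML, might be sketched as follows. The short tags (ct, h1, tx), the XML element names and the sample text are all hypothetical; the presentation does not say which codes or conversion tool were actually used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from typesetting short tags to XML element names.
TAG_MAP = {"ct": "title", "h1": "heading", "tx": "para"}

# A coded line starts with a short tag such as <tx>, followed by its text.
LINE_PATTERN = re.compile(r"^<(\w+)>(.*)$")

def short_tags_to_xml(coded_text):
    """Convert short-tagged lines into a single <chapter> XML element."""
    chapter = ET.Element("chapter")
    for line in coded_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if not match:
            # Blank or uncoded lines are skipped in this sketch; in practice
            # they are the kind of thing that needs manual file editing.
            continue
        short_tag, content = match.groups()
        if short_tag not in TAG_MAP:
            raise ValueError(f"Unknown short tag: {short_tag}")
        ET.SubElement(chapter, TAG_MAP[short_tag]).text = content.strip()
    return ET.tostring(chapter, encoding="unicode")

coded = "\n".join([
    "<ct>Getting Started",
    "<tx>XML is just another tool.",
    "<h1>Small Steps",
    "<tx>Start with one title & learn.",
])

xml_out = short_tags_to_xml(coded)
print(xml_out)
```

Building the output with an XML library rather than by string replacement means characters such as & are escaped automatically, so the result stays well-formed.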
Insider Tips
• Know your staff
  Who can adjust, and how will you address those who can’t?
• Know your content
  Using the right tool for the job is critical; not all content is suitable for XML composition
• Be realistic about the learning curve
  If you’re still paper editing, making the leap straight to XML may be too great, so start small
• Be flexible
  You’ll likely revisit several core values of your publishing program; identify the most important things and be honest about the less important ones
Insider Tips, cont.
• XML need not be an off-the-shelf product
  You can and should work to customize it to your own production needs
• See it through
  It’s taken us two years to arrive at a point where we’re comfortable, and we’re still making changes
• Partner with the right vendors
  Find someone willing and capable of adapting to your publishing needs
• When you need a hammer, use a hammer
  Remember XML is just another tool; it shouldn’t be your only tool.
Questions?
What’s Next
Tech Course 802
1. Christine on Tues 15th: coming in to talk templates and WordPress
2. Next Tues 22nd: Chloe and Stacey coming in to talk about ebooks and XML
3. Following Mon 28 and Tues 29: Brenda J Walker and Haig Armen on apps
Tech Project 607
1. This Wed 16th: Content to present assignment to Design & Tech so we can all be on the same page and on Thurs carry on with wireframes/design mockups (Design), platform set up (Tech) and discoverability/ed calendar (Content)
2. Following Wed 23rd: Present to Alan and David designs and ideas so far.