Janez Štebe DDI Experience in ADP (2002)
Arhiv družboslovnih podatkov (ADP)
University of Ljubljana
http://www.adp.fdv.uni-lj.si
MOST (UNESCO) and GESIS workshop, Berlin, 22-24 February 2002
Topics of a presentation
A brief history of technical standards and their influence on the organisation of data archives
The adoption of DDI in 1999
Advantages and disadvantages of using an existing but still emerging standard
What are XML and DDI?
Quick look inside DDI DTD document structure
DDI XML Codebooks production line in ADP
Discussion
A brief history of data archives' technical standards (Tanenbaum, Taylor 1990)
Late 1950s – IBM cards
Easily reproduced, recycled – the advent of DA
1960s – electronic computers – end of storage standards
A task of data conversion and interchange – DA matured
Beginning of the www era in early 90s (DDI Committee, 2001)
CSSDA electronic codebook specification
OSIRIS Codebook Dictionary (SRC, ICPSR)
Standard study description
But lack of coordination resulted in incompatible catalogues
“Midwife function” (Scheuch, 1990)
The role of ZA in the late 1960s, when five new archives were established in Europe:
“offers to share experiences, especially of past errors”
“technical information on data storage and retrieval”
The situation in 1997, when ADP was established
“Multiplicity of classificatory languages, search techniques and standards for documenting data” (DDI Committee, 2001)
Every organisation adopted its own dialect of existing standards
The CESSDA IDC was the lone example of a still-active integration effort
But... DDI was under discussion
March 1999 – DDI Beta version became operable
ADP applied for a grant, which secured six months of intensive learning and practice in producing its own XML codebooks
Results:
1. Successful implementation of first ten XML codebooks
2. An enhanced production line for routine codebook production
2000 - 2001
Preparation of our own XSL for XML Codebook presentation on the internet
March 2000 – DDI DTD Version 1.0 was published
Machine conversion of DDI DTD Beta XML Codebooks into version 1.0
Continuing production of XML Codebooks
NESSTAR
Meanwhile, the NESSTAR tool was being refined in parallel; it promises to add functionality to a growing collection of XML codebooks
End of 2001 – configuration of the ADP NESSTAR server catalogue
Advantages and disadvantages of using an existing but still emerging standard
No need to (re)invent local cataloguing rules
Cooperation in document production (sharing documents between sites)
A danger of being left alone if others do not adopt the same standard
Less capability to add specific emphasis according to local needs
+/ -
Use of existing and emerging software tools suitable for the standard environment
Virtual catalogue
Conversion tools from SPSS and CAI software files
Dependency on others' timetables for tool production
E.g. NESSTAR was late in fully adopting the UTF-8 convention, which was crucial for us
What is XML?
“XML is to a document’s intellectual content what HTML is to the physical structure of that document” (Thomas, Block 2001)
Why XML?
XML documents can be produced without professional or expert knowledge (user-friendly)
One source can be presented in multiple formats, e.g. printed book, internet, etc.
It can be filled in by different authors, each with specialist knowledge of their subject area. All obey the same content structure.
DDI DTD <> XML?
DTD = XML Document Type Definition
DDI DTD = a specific Data Documentation Initiative definition of an XML Codebook
A Codebook XML document must be “well-formed” and “valid”
Well-formed
Any XML document, e.g. HTML, can be well-formed, i.e. in accordance with the XML syntax
Main features:
<tags> must be closed</tags>
Case-sensitive naming (“UPPER” vs. “lower”)
Only one <tag-name ID="id-entry"> per document, i.e. each ID value must be unique
Valid = Well formed +
Conforms to a specific DTD
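The well-formedness rules above can be tried out with a few lines of Python. This is only a minimal sketch: it uses the non-validating parser from the Python standard library, so it checks well-formedness but not validity against the DDI DTD, which would need a validating parser or an XML editor. The three sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise.

    Note: this does NOT check validity against any DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Illustrative fragments, not real ADP codebooks.
ok = "<codeBook><stdyDscr>...</stdyDscr></codeBook>"
unclosed = "<codeBook><stdyDscr>...</codeBook>"               # stdyDscr never closed
wrong_case = "<codeBook><stdyDscr>...</STDYDSCR></codeBook>"  # case mismatch

results = {
    "ok": is_well_formed(ok),
    "unclosed": is_well_formed(unclosed),
    "wrong_case": is_well_formed(wrong_case),
}
print(results)
```

Only the first fragment passes; the other two break the "tags must be closed" and case-sensitivity rules.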
Example: the path in the DOCTYPE declaration calls ...
<!DOCTYPE codeBook SYSTEM "CONFIG10/CODEBOOK-EN.DTD">
<codeBook>
<docDscr> ...
... the file "CONFIG10/CODEBOOK-EN.DTD" (content of the file):
...
<!ELEMENT codeBook (docDscr* , stdyDscr+ , fileDscr* , dataDscr* , otherMat*) >
<!ATTLIST codeBook %a.global; >
...
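To make the content model in the <!ELEMENT codeBook ...> declaration concrete: `*` means "zero or more", `+` means "one or more", and the commas fix the order. The sketch below mimics just this one rule with a regular expression over the names of the top-level child elements. It is a teaching aid under that assumption, not a DTD validator, and the function name is invented here.

```python
import re
import xml.etree.ElementTree as ET

# Content model of codeBook, transcribed from the DTD fragment:
# (docDscr*, stdyDscr+, fileDscr*, dataDscr*, otherMat*)
CODEBOOK_MODEL = re.compile(
    r"(docDscr )*(stdyDscr )+(fileDscr )*(dataDscr )*(otherMat )*"
)

def top_level_order_ok(xml_text: str) -> bool:
    """Check only the order and count of codeBook's direct children."""
    root = ET.fromstring(xml_text)
    if root.tag != "codeBook":
        return False
    # Join the child names into one string, e.g. "docDscr stdyDscr ".
    children = "".join(child.tag + " " for child in root)
    return CODEBOOK_MODEL.fullmatch(children) is not None

good = "<codeBook><docDscr/><stdyDscr/><dataDscr/></codeBook>"
no_study = "<codeBook><docDscr/><dataDscr/></codeBook>"       # stdyDscr+ violated
bad_order = "<codeBook><dataDscr/><stdyDscr/></codeBook>"     # wrong sequence

print(top_level_order_ok(good), top_level_order_ok(no_study),
      top_level_order_ok(bad_order))
```

A document can therefore be well-formed and still invalid, for example when the mandatory stdyDscr section is missing.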
What does it all mean?
You do not have to look in the “machine-readable” “codebook.DTD” file to fill in an XML Codebook:
An XML editor helps to check well-formedness and document validity
It helps in choosing appropriate elements in accordance with the DTD while editing
A “human-readable” Tag Library consists of element definitions with practical examples. It gives you guidance on the type and form of information
The DTD integrates different levels of information in the same document:
docDscr (XML document and sources description)
stdyDscr (Overall study + study level references)
fileDscr (Physical data files)
dataDscr (variables)
otherMat (additional material for variables documentation)
It specifies both...
The content of a catalogue – suitable as input to a virtual catalogue of different sites, produced on various platforms.
The content of a codebook (variable descriptions) – suitable as input to a “virtual library of all individual measurements in the studies in a collection”
The dilemma of the Library vs. Data service concept (Scheuch, 1990)
Library: the unit of storage is the “study”
Data service: the unit of storage is the variable
In a DDI DTD XML codebook you can integrate meta-information about...
Intellectual content of a study
Its scope
Methodological details
Retrieval and dissemination policies
File location and format
(+) References to accompanying documents, e.g.
Reports on methodology,
Publications,
Classification lists,
Questionnaires and similar,
Computer syntax files,
Tables of results, etc.
(+) Hyperlink cross-references inside and outside the document
The use of ID and IDREF attributes
The use of URI attributes
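A small consistency check shows how ID and IDREF attributes hang together: every ID value must be unique, and every reference should point at an existing ID. The following is a sketch only. Without the DTD the parser cannot know which attributes are of type IDREF, so the attribute name `refs` is a hypothetical stand-in rather than a name taken from the DDI DTD, and the sample document is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_links(xml_text, idref_attrs=("refs",)):
    """Return (duplicate IDs, references that point at no ID).

    idref_attrs lists the attributes treated as IDREF(S); "refs" is a
    hypothetical name used only for this illustration.
    """
    root = ET.fromstring(xml_text)
    ids = Counter(el.get("ID") for el in root.iter() if el.get("ID") is not None)
    duplicates = sorted(i for i, n in ids.items() if n > 1)
    dangling = sorted({
        target
        for el in root.iter()
        for attr in idref_attrs
        for target in el.get(attr, "").split()  # IDREFS are space-separated
        if target not in ids
    })
    return duplicates, dangling

sample = """
<codeBook>
  <stdyDscr ID="STUDY1"/>
  <dataDscr>
    <var ID="V1" refs="STUDY1"/>
    <var ID="V2" refs="STUDY1 Q99"/>
    <var ID="V2"/>
  </dataDscr>
</codeBook>
""".strip()

duplicates, dangling = check_links(sample)
print("duplicate IDs:", duplicates)
print("dangling references:", dangling)
```

A validating parser or XML editor performs the same checks automatically once the DTD declares the attribute types.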
To sum up:
XML is similar to HTML in that it is:
Easy to use,
Broadly accessible,
Hyper-textual
In addition it has:
A structure of document content that both computers and humans can read and understand
DDI XML Codebooks production line in ADP
First step:
1. Basic information about the new data set file, the depositor, and accompanying material is first entered in the ADP Inventory book (an ACCESS database)
2. After choosing the best-suited predefined XML DDI Codebook template, we extract the information from the ACCESS database into the draft XML Codebook
3. The resulting codebook is moved to an Internet catalogue for quick information about the new study; viewing is supported by a referenced XSL through IE5 or better.
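The extraction in the first step can be pictured roughly as follows. This is a hedged sketch, not ADP's actual procedure: a plain Python dictionary stands in for one row of the inventory database, its field names and values are invented, and the element names (citation, titlStmt, titl, IDNo, distStmt, depositr) are given as I recall them from the DDI Tag Library and should be checked against the DTD.

```python
import xml.etree.ElementTree as ET

def draft_codebook(record: dict) -> ET.Element:
    """Build a minimal draft codeBook from one inventory record.

    The record keys ("study_id", "title", "depositor") are hypothetical.
    """
    root = ET.Element("codeBook")
    stdy = ET.SubElement(root, "stdyDscr")
    citation = ET.SubElement(stdy, "citation")
    titl_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(titl_stmt, "titl").text = record["title"]
    ET.SubElement(titl_stmt, "IDNo").text = record["study_id"]
    dist_stmt = ET.SubElement(citation, "distStmt")
    ET.SubElement(dist_stmt, "depositr").text = record["depositor"]
    return root

# Invented example row; a real one would come from the inventory database.
record = {
    "study_id": "EXAMPLE01",
    "title": "Družboslovna raziskava (example)",
    "depositor": "Example Institute",
}

# Serialise as UTF-8 so that Slovene characters survive unchanged.
xml_bytes = ET.tostring(draft_codebook(record), encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

The draft contains only what the inventory knows; the full study description is added in the second step.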
Second step: Full Study description
1. The depositor is requested to fill in an MS Word form containing elements that correspond to the DDI DTD study description section
2. The draft XML Codebook from the previous step is edited with the XMetaL® XML editor. Missing pieces of information are added manually
Third step: Codebook Data description generated from SPSS data file
1. The final SPSS data file, if fully labelled, is converted with the NSD XML Generator® into the XML data description section of the DDI Codebook and integrated with the previous study description
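What the third step produces can be sketched as below. This is not the NSD XML Generator, whose internals are not described here; it only illustrates the shape of the output, assuming the variable and value labels have already been read from the SPSS file into a Python list (the input layout and sample variables are invented). The element names var, labl, catgry and catValu are given as I recall them from the DDI Tag Library. The check for missing labels reflects the "if fully labelled" condition.

```python
import xml.etree.ElementTree as ET

def data_description(variables: list) -> ET.Element:
    """Build a dataDscr section from already-extracted variable metadata."""
    data_dscr = ET.Element("dataDscr")
    for v in variables:
        if not v.get("label"):
            raise ValueError(f"variable {v['name']} has no label")
        var = ET.SubElement(data_dscr, "var", {"ID": v["name"], "name": v["name"]})
        ET.SubElement(var, "labl").text = v["label"]
        for value, value_label in v["values"].items():
            if not value_label:
                raise ValueError(f"value {value} of {v['name']} has no label")
            cat = ET.SubElement(var, "catgry")
            ET.SubElement(cat, "catValu").text = str(value)
            ET.SubElement(cat, "labl").text = value_label
    return data_dscr

# Invented example variables.
variables = [
    {"name": "V1", "label": "Sex of respondent", "values": {1: "Male", 2: "Female"}},
    {"name": "V2", "label": "Trust in parliament",
     "values": {1: "None", 2: "Some", 3: "A lot"}},
]

dd = data_description(variables)

# Integrate with an existing study description, as in the step above.
codebook = ET.fromstring("<codeBook><stdyDscr/></codeBook>")
codebook.append(dd)
print(ET.tostring(codebook, encoding="unicode"))
```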
Fourth step: Codebook Data description with full question text
1. For the most important data sets, the full question text is entered into the dataDscr section from the original questionnaire text file
or 2. by using a conversion tool from CAI computer-readable files to DDI XML files.
Finally NESSTAR®
The final two documents, the Slovene and English language DDI XML Codebooks, are converted into a NESSTAR-compliant format and, together with the data file, published in a NESSTAR catalogue.
[Diagram of the ADP production line. Labels: Original paper documents; Free-text documents; stdyDscr form filled in by depositor; SPSS data + labels; CAI questionnaires; Conversion Tools; Codebook.xml (XML Editor) with docDscr, stdyDscr, fileDscr, dataDscr, otherMat...; Code-book.dtd; Tag Library; Codebook.xsl; IE view; Printed codebook; NESSTAR Catalogue + Data Explorer. Readability labels: Computer readable; Human + computer readable; Human readable.]
Common issues in DDI XML codebooks production
1. XML editors do not necessarily support UNICODE
2. The use of entities in an XML document helps to standardise document production and makes it faster and easier to translate into English
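The point about entities can be illustrated with a short sketch: the document body refers to `&archive;`, and only the entity declaration differs between the Slovene and the English version. The entity name, the surrounding elements and the English wording are assumptions made for this example. The standard-library parser used here expands internal entities declared in the DOCTYPE.

```python
import xml.etree.ElementTree as ET

# Shared body: the same for both language versions.
BODY = "<distStmt><distrbtr>&archive;</distrbtr></distStmt>"

def render(entities: dict) -> str:
    """Parse BODY with the given internal entity declarations.

    Returns the text of <distrbtr> after entity expansion.
    """
    decls = "".join(
        f'<!ENTITY {name} "{value}">' for name, value in entities.items()
    )
    doc = f"<!DOCTYPE distStmt [{decls}]>{BODY}"
    return ET.fromstring(doc).findtext("distrbtr")

slovene = render({"archive": "Arhiv družboslovnih podatkov"})
# Illustrative English rendering of the archive name.
english = render({"archive": "Social Science Data Archives"})

print(slovene)
print(english)
```

Recurring phrases are thus typed and translated once, in the entity declarations, rather than in every codebook.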
Conclusions:
The DDI DTD is receiving growing attention in the community, which guarantees the production of new tools to enhance its use
Despite continuing developments and overlapping archival standards, DDI 1.0 as today’s technology promises the longevity of XML Codebook 1.0 documents
The Slovene ADP has taken its experience with DDI as guidance for its own organisation.
Main references
DDI Committee (2001): The Data Documentation Initiative (DDI) Version 1.1: The New Specification for Social Science Metadata. Project Description.
Data Documentation Initiative. A Project of a Social Science Community. (2002) http://www.icpsr.umich.edu/DDI
Scheuch, Erwin K. (1990): From a data archive to an infrastructure for the social sciences. International Social Science Journal, No. 123, pp. 93-111.
Tanenbaum, Eric and Marcia Taylor (1990): Developing social science archives. International Social Science Journal, No. 124?.
Thomas, Wendy L. and William C. Block (2001): An Introduction to the Data Documentation Initiative (DDI). ICPSR OR Meeting 2001. http://www.icpsr.umich.edu/DDI/PAPERS/