Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents
Presented for: APIII - Advancing Practice Instruction and Innovation through Informatics
Marriott City Center, Pittsburgh, PA
Friday, October 10, 2003
Session E2
Perl and Python Programming Workshop
Session Organizers: Jules Berman and Jim Harrison
Jules J. Berman, Ph.D., M.D.
Program Director for Pathology Informatics
Cancer Diagnosis Program
National Cancer Institute
National Institutes of Health
Rockville, MD
Virtually everything presented can be reviewed at your leisure at:
http://65.222.228.150/jjb/tutor.htm
This site contains hundreds of Perl programming tips and scripts.
What is the purpose of XML?
XML allows heterogeneous systems to communicate and exchange their data
It achieves this through metadata (data about data).
It can produce a document that completely describes itself, including all data and all metadata.
COMMON XML TASKS
1. Converting an HTML file to an XML file.
2. Converting an XML file to an HTML file (e.g. making an XML file presentable while preserving its information content)
3. Converting an Excel file to an XML file
4. Converting an XML file to a different data structure (e.g. moving XML into a standard database)
5. Querying an XML file
6. Querying multiple XML files for related information
Let's do a simple conversion of an HTML file to an XML file.
Here's the HTML file (notice that the top header information has been removed):
<body>
<h1>Simple HTML document</h1>
<br>List to follow:
<ul>
<li>First
<li>Second
<li>Third
</ul>
</body>
</html>
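open (TEXT, "html.htm") || die "Cannot open html.htm";       # substitute your HTML page
open (STDOUT, ">html.xml") || die "Cannot open html.xml";    # substitute your output file
print "<?xml version = \"1.0\" encoding = \"ISO-8859-1\"?>\n";
# map each HTML tag name to the XML tag name that replaces it
%dictionary = ("body" => "document", "h1" => "title", "ul" => "list", "ol" => "list");
@keysarray = keys(%dictionary);
# read to end of file; testing for an empty string would stop at the first blank line
while (defined($line = <TEXT>)) {
    $line =~ s/\<\/html\>//;     # drop the closing html tag
    $line =~ s/\n//;
    if ($line =~ /^\<br\>/) {    # a line starting with <br> becomes a <line> element
        $line = "\<line\>$'\<\/line\>";
        print $line;
        next;
    }
    if ($line =~ /^\<li\>/) {    # an unclosed <li> becomes a closed <item> element
        $line = "<item>$'\<\/item\>";
        print $line;
        next;
    }
    foreach $key (@keysarray) {  # rename the remaining opening and closing tags
        $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
    }
    print $line;
}
exit;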
Most important parts of the HTML->XML script:
%dictionary = ("body" => "document", "h1" => "title", "ul" => "list", "ol" => "list");
@keysarray = keys(%dictionary);
foreach $key (@keysarray) {
    $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
}
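The dictionary lookup above can be tried on its own, without any files. The following is a minimal sketch: the sub name rename_tags is hypothetical, and the \b word boundary is an addition not found in the slide's regular expression (it stops "ul" from matching the start of a longer tag name).

```perl
use strict;
use warnings;

# Same mapping as in the slide: HTML tag name => XML tag name.
my %dictionary = (body => "document", h1 => "title", ul => "list", ol => "list");

# Rename every opening or closing tag whose name is a key of %dictionary.
# rename_tags is a hypothetical helper name used only for this sketch.
sub rename_tags {
    my ($line) = @_;
    foreach my $key (keys %dictionary) {
        # (<\/?) captures "<" or "</"; \b (an addition) ends the match at the tag name
        $line =~ s/(<\/?)$key\b/$1$dictionary{$key}/g;
    }
    return $line;
}

print rename_tags("<h1>Simple HTML document</h1>"), "\n";
print rename_tags("<ul>"), "\n";
```

Because none of the replacement names (document, title, list) is itself a key of the dictionary, the order in which keys() returns the tag names does not affect the result.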
Converting an XML file to an HTML file (there are many different ways to do this)
Converting an XML file to an HTML file:
use XML::Parser;    # calls an external module
open (STDOUT, ">output.htm") || die "Cannot open output.htm";

my $parser = XML::Parser->new(
    Handlers => {
        Init  => \&handle_doc_start,
        Final => \&handle_doc_end,
        Start => \&handle_elem_start,
        End   => \&handle_elem_end,
        Char  => \&handle_char_data,
    }
);
my $file = "presum.xml";
$parser->parsefile($file);

# called once, before parsing starts: prints the top of the HTML page
sub handle_doc_start {
    my $header = <<HEADER;
<html>
<head><title>Precancer Classification</title></head>
<body>
<center><h1>Precancer Classification</h1></center>
<br><br>
HEADER
    print $header;
}

# called once, after parsing ends: closes the HTML page
sub handle_doc_end {
    my $header = <<HEADER;
<br>
</body>
</html>
HEADER
    print $header;
}

# called at each opening tag: numbers every <concept> element and starts a list for it
sub handle_elem_start {
    my ($expat, $name, %atts) = @_;
    if ($name eq "concept") {
        $count++;
        print "\<br\><font color=\"#0000ff\">$name $count</font><ul>\n";
        return;
    }
}
The remaining handlers (handle_elem_end and handle_char_data) follow the same pattern.
Remember: Perl XML-related modules can be downloaded and installed at no cost from the www.activestate.com ppm service.
PPM> search xml
Packages available from http://www.ActiveState.com/PPMPackages/5.6:
CGI-Form2XML [1.3] Render CGI form input as XML
CGI-ToXML [0.02] Converts CGI to an XML structure
CGI-XML [0.1] Perl extension for converting CGI.pm variables to/from XML
CGI-XMLForm [0.10] Extension of CGI.pm which reads/generates formated XML.
CGI-XMLPost [1.3] receive XML file as an HTTP POST
DBIx-XML-DataLoader [1.1b]
DBIx-XMLMessage [0.05] XML Message exchange between DBI data sources
DBIx-XML_RDB [0.05] Perl extension for creating XML from existing DBI datasources
Data-DumpXML [1.05] Dump arbitrary data structures as XML
GoXML-XQI [1.1.4] Perl extension for the XML Query Interface at xqi.goxml.com.
HTTP-WebTest-Plugin-XMLReport [1.01] Report plugin for HTTP::WebTest generates output in XML format
Tk-XMLViewer [0.15] Tk widget to display XML
XML-AutoWriter [0.37] DOCTYPE based XML output
XML-Beautify [0.05] Beautifies XML output from XML::Writer (soon to do any XML).
XML-DOM [1.25] A perl module for building DOM Level 1 compliant document structures
XML-DOMHandler [1] Implements a call-back interface to DOM.
XML-DTDParser [1.7] quick and dirty DTD parser
XML-Excel [0.02] Perl extension converting Excel files to XML
XML-Node [0.11] Node-based XML parsing: an simplified interface to XML::Parser
XML-SAX [0.12] Simple API for XML
XML-SAX-Base [1.02] Base class SAX Drivers and Filters
XML-SAX-Builder [0.02] build XML documents using SAX
XML-SAX-Expat [0.37] SAX Driver for Expat
XML-SAX-Machines [0.4] manage collections of SAX processors
XML-SAX-PurePerl [0.80] Pure Perl XML Parser with SAX2 interface
XML-SAX-RTF [0.1] SAX Driver for Microsoft's Rich Text Format (RTF)
XML-SAX-Simple [0.02] SAX version of XML::Simple
XML-SAX-Writer [0.44] SAX2 XML Writer
XML-SAXDriver-CSV [0.07] SAXDriver for converting CSV files to XML
XML-Writer [0.4] Perl extension for writing XML documents.
XML-Writer-String [0.1] Capture output from XML::Writer.
XML-XPath [1.12] a set of modules for parsing and evaluating XPath statements
XML-XPath-Simple [0.05] Very simple interface for XPaths
XML-XPathScript [0.03] Stand alone XPathScript
XML-XQL [0.68] A perl module for querying XML tree structures with XQL
XML-XSLT [0.40] A perl module for processing XSLT
Creating an XML file from an Excel file
1. The example runs on Windows. Because it uses a Windows-based application and the Windows API, it won't work on Linux (not Perl's fault).
2. There are plenty of other approaches that will work on Linux.
3. It also requires Excel to be installed.
4. The complete Perl script is opener7.pl, found in the Perl tutorial: http://65.222.228.150/jjb/tutor.htm
Creates a Windows OLE object for Excel - NON_PERLISH
# "or" rather than "||": "||" binds to the string "Excel.Application", so die would never run
my $app = CreateObject OLE "Excel.Application" or die "Can't open";
$app->Workbooks->Open($xlfile);
Creates the XML tags by collecting a list of the column headers
foreach my $column_place (@column_array) {
    $thing = $app->Range("${column_place}1")->{'Value'};   # header cell in row 1
    if ($thing ne "") {
        $thing =~ s/ /_/g;            # spaces become underscores
        $thing =~ s/[^\w0-9]//g;      # drop characters not allowed in the tag name
        $thing =~ s/2nd/Second/g;
        $nextthing = "$column_place||$thing";   # column letter, separator, tag name
        print "$nextthing\n";
        push(@index, $nextthing);
    }
    else {
        last;                         # first empty header ends the list
    }
}
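Writes each cell of the current row as an XML element, escaping reserved characters
foreach my $arrayvalue (@index) {
    $arrayvalue =~ /\|\|/;
    my $key = $`;       # column letter, before the || separator
    my $value = $';     # tag name, after the || separator
    $thing = $app->Range($key . $row)->{'Value'};
    #substitute &amp; for &
    $thing =~ s/\&/\&amp;/g;
    #substitute &gt; for >
    $thing =~ s/\>/\&gt;/g;
    #substitute &lt; for <
    $thing =~ s/\</\&lt;/g;
    #substitute &apos; for '
    $thing =~ s/\'/\&apos;/g;
    #substitute &quot; for "
    $thing =~ s/\"/\&quot;/g;
    #keep only letters, digits and spaces (this also strips the & and ; of the entities inserted above)
    $thing =~ tr/a-zA-Z0-9 //cd;
    print " \<$value\>$thing\<\/$value\>\n";
}
$row++;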
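The escaping step in the loop above can be isolated as a small helper. This is a sketch rather than the code from opener7.pl: the sub name xml_escape, the tag name and the cell value are all hypothetical, and unlike the loop above it does not filter the text down to letters, digits and spaces afterwards.

```perl
use strict;
use warnings;

# Replace the five characters XML reserves with their predefined entities.
# xml_escape is a hypothetical helper name used only for this sketch.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must run first, or the "&" of later entities is escaped again
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical tag name and cell value, for illustration only.
my ($tag, $cell) = ("Diagnosis", q{Gleason 3+4 & "PIN" <focal>});
print "<$tag>", xml_escape($cell), "</$tag>\n";
```

The order matters: escaping the ampersand last would turn an already inserted &lt; into &amp;lt;.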
BUILDING THE COOPERATIVE PROSTATE CANCER TISSUE RESOURCE TISSUE MICROARRAY FILE
1. Get the xls file with core information
TMACPCTR.XLS 98,816 7-24-03 11:17am A
2. Convert the xls file to an xml file using opener7.pl
OPENER7.PL 3,663 7-24-03 11:39am A
This produces the file block2.xml
BLOCK2.XML 328,263 7-24-03 11:39am A
3. Add header and trailer information to the xml file
Header information is basically:
<?xml version="1.0"?>
<histo>
<tma>
<header>
<title>CPCTR Microarray 1</title>
<creator>CPCTR</creator>
<subject>Tissue Microarrays</subject>
<description>CPCTR TMA XML</description>
<rights>public domain</rights>
<filename>tmacpctr.xml</filename>
</header>
Trailer information is basically:
</core>
</block>
</tma>
</histo>
This produces:
TMACPCTR.XML 331,636 7-28-03 10:58am A
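Step 3 amounts to copying the converted file between a fixed header and a fixed trailer. The sketch below shows one way to do that with core Perl only. The sub name add_header_trailer is hypothetical, the header is shortened, and the body written to the temporary directory is a tiny invented stand-in for block2.xml (it is assumed here that the real body leaves its last block and core elements open, since the trailer closes them).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy $in to $out, writing $header before the body and $trailer after it.
# add_header_trailer is a hypothetical helper name used only for this sketch.
sub add_header_trailer {
    my ($in, $out, $header, $trailer) = @_;
    open(my $src, '<', $in)  or die "Cannot read $in: $!";
    open(my $dst, '>', $out) or die "Cannot write $out: $!";
    print $dst $header;
    print $dst $_ while <$src>;     # copy the body line by line
    print $dst $trailer;
    close($src);
    close($dst) or die "Cannot close $out: $!";
}

# Shortened header; the slide lists the full set of header fields.
my $header  = qq{<?xml version="1.0"?>\n<histo>\n<tma>\n<header>\n}
            . qq{<title>CPCTR Microarray 1</title>\n</header>\n};
my $trailer = "</core>\n</block>\n</tma>\n</histo>\n";

# Invented stand-in for block2.xml, written to a temporary directory.
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/block2.xml") or die "Cannot write sample: $!";
print $fh "<block>\n<core>\n<core_id>1</core_id>\n";
close($fh);

add_header_trailer("$dir/block2.xml", "$dir/tmacpctr.xml", $header, $trailer);
```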
4. Check the validity of the tmacpctr.xml file using validtma.pl
VALIDTMA.PL 9,132 5-21-03 3:06pm A
The TMA validating Perl script can be obtained by going to the TMA specification paper:
The tissue microarray data exchange specification: A community-based, open source tool for sharing tissue microarray data
Jules J Berman, Mary E Edgerton and Bruce A Friedman
BMC Medical Informatics and Decision Making 2003, 3:5
http://www.biomedcentral.com/1472-6947/3/5
The validating protocol produces a screen output that includes:
c:\tmacpctr.xml
Begining to parse c:\tmacpctr.xml now.
Finished. c:\tmacpctr.xml is a valid Tissue Microarray File.
The one-way hash of your file is e2ad62a75974628b7499bd7d771b82f0
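A one-way hash like the one in the screen output can be computed with a core Perl module. The sketch below assumes MD5, only because the value shown is 32 hexadecimal digits long; the slides do not say which algorithm validtma.pl uses. The sub name file_md5 and the temporary sample file are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Return the MD5 digest of a file as 32 hexadecimal characters.
# file_md5 is a hypothetical helper name used only for this sketch.
sub file_md5 {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);                   # hash the raw bytes, not translated line endings
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Small temporary file standing in for c:\tmacpctr.xml.
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "<histo></histo>\n";
close($tmp);

print "The one-way hash of your file is ", file_md5($path), "\n";
```

Changing a single character of the file changes the digest, which is why a hash is useful for confirming that two parties hold the same version of a file.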
Querying an XML file
1. There are many ways. Most people use XSLT (Extensible Stylesheet Language Transformations).
2. When you haven't converted your XML into another data structure (such as a database) and you're querying the XML document directly, a query is the same as a transformation in which you throw everything away except the material that matches your query.
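The idea of a query as a transformation that discards everything except the matches can be sketched in plain Perl. The document, the element names and the sub name query_cases below are all invented for illustration. Regular-expression matching like this only suits simple XML with no nested case elements and no attributes; a parser module is the safer choice for anything more complex.

```perl
use strict;
use warnings;

# Invented sample document; element names and values are for illustration only.
my $xml = <<'XML';
<cases>
<case><id>1</id><diagnosis>adenocarcinoma</diagnosis></case>
<case><id>2</id><diagnosis>benign</diagnosis></case>
<case><id>3</id><diagnosis>adenocarcinoma</diagnosis></case>
</cases>
XML

# Return every <case> element whose <diagnosis> equals $wanted;
# everything else in the document is thrown away.
sub query_cases {
    my ($doc, $wanted) = @_;
    my @hits;
    while ($doc =~ /(<case>.*?<\/case>)/gs) {
        my $case = $1;
        # \Q...\E treats the search term as literal text, not as a pattern
        push @hits, $case if $case =~ /<diagnosis>\Q$wanted\E<\/diagnosis>/;
    }
    return @hits;
}

print "$_\n" for query_cases($xml, "adenocarcinoma");
```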
HETEROGENEOUS XML MERGES/QUERIES
1. Can be thought of as a special form of XSLT
2. Or as a data structure conversion
3. Or as a straightforward Perl programming job
This is where namespaces become important.
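A small sketch of why namespaces matter in a merge: two sources may use the same tag name for different things, and a namespace prefix keeps them apart once they share one document. The fragments, the prefixes, the example.org namespace URIs and the sub name add_prefix are all hypothetical.

```perl
use strict;
use warnings;

# Add a namespace prefix to every element name in a simple XML fragment.
# add_prefix is a hypothetical helper name used only for this sketch.
sub add_prefix {
    my ($fragment, $prefix) = @_;
    # (<\/?) captures "<" or "</"; the element name must start with a letter or underscore
    $fragment =~ s/(<\/?)([A-Za-z_][\w.-]*)/$1${prefix}:$2/g;
    return $fragment;
}

# Both invented sources use <title>, but with different meanings.
my $tma   = "<title>CPCTR Microarray 1</title>";
my $paper = "<title>The tissue microarray data exchange specification</title>";

# Declare both prefixes on a common root element, then merge the fragments.
my $merged = join "\n",
    '<merged xmlns:tma="http://example.org/tma" xmlns:pub="http://example.org/pub">',
    add_prefix($tma, "tma"),
    add_prefix($paper, "pub"),
    '</merged>';
print "$merged\n";
```

After the merge, a query can ask for tma:title or pub:title without confusing the two.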