Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents

Presented for:
APIII - Advancing Practice Instruction and Innovation through Informatics
Marriott City Center, Pittsburgh, PA
Friday, October 10, 2003
Session E2

Perl and Python Programming Workshop
Session Organizers: Jules Berman and Jim Harrison

Jules J. Berman, Ph.D., M.D.
Program Director for Pathology Informatics
Cancer Diagnosis Program
National Cancer Institute
National Institutes of Health
Rockville, MD


Virtually everything presented can be reviewed at your leisure at:

http://65.222.228.150/jjb/tutor.htm

This site contains hundreds of Perl programming tips and scripts.


What is the purpose of XML?

XML allows heterogeneous systems to communicate and exchange their data.

It achieves this through metadata (data about data).

XML can produce an ideal document that completely describes itself, including all data and all metadata.


COMMON XML TASKS

1. Converting an HTML file to an XML file

2. Converting an XML file to an HTML file (e.g. making an XML file presentable while preserving its information content)

3. Converting an Excel file to an XML file

4. Converting an XML file to a different data structure (e.g. moving XML into a standard database)

5. Querying an XML file

6. Querying multiple XML files for related information


Let’s do a simple conversion of an HTML file to an XML file.

Here’s the HTML file (notice that the top header information has been removed):

<body>
<h1>Simple HTML document</h1>
<br>List to follow:
<ul>
<li>First
<li>Second
<li>Third
</ul>
</body>
</html>


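open (TEXT, "html.htm") || die "Cannot open html.htm";     #substitute your html page
open (STDOUT, ">html.xml") || die "Cannot open html.xml";  #substitute your xml output file
print "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n";
#maps each HTML tag name to the XML tag name that replaces it
%dictionary = ("body" => "document", "h1" => "title", "ul" => "list", "ol" => "list");
@keysarray = keys(%dictionary);
while ($line = <TEXT>)      #reads one line at a time until the end of the file
  {
  $line =~ s/\<\/html\>//;  #drops the closing html tag
  $line =~ s/\n//;
  next if ($line eq "");    #skips lines that are now empty
  if ($line =~ /^\<br\>/)
    {
    #text after <br> becomes a line element
    print "<line>$'</line>\n";
    next;
    }
  if ($line =~ /^\<li\>/)
    {
    #text after <li> becomes an item element, with the closing tag HTML omits
    print "<item>$'</item>\n";
    next;
    }
  foreach $key (@keysarray)
    {
    #renames opening and closing tags found in the dictionary
    $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
    }
  print "$line\n";
  }
exit;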


Most important parts of HTML->XML script:

%dictionary = ("body" => "document", "h1" => "title", "ul" => "list", "ol" => "list");
@keysarray = keys(%dictionary);

foreach $key (@keysarray)
  {
  $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
  }
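
To see what the dictionary loop does to a line, here is a minimal, self-contained sketch that applies the same dictionary to two sample lines. The helper name rename_tags and the \b word boundary are additions for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Same tag dictionary as in the slide: HTML tag name => XML tag name
my %dictionary = (body => "document", h1 => "title", ul => "list", ol => "list");

# Hypothetical helper (not in the original script): rename the tags in one line.
sub rename_tags {
    my ($line) = @_;
    foreach my $key (keys %dictionary) {
        # (<\/?) captures "<" or "</" so opening and closing tags are both renamed.
        # \b is an added safeguard so a key only matches a complete tag name.
        $line =~ s/(<\/?)$key\b/$1$dictionary{$key}/g;
    }
    return $line;
}

print rename_tags("<body><h1>Simple HTML document</h1>"), "\n";
# prints: <document><title>Simple HTML document</title>
print rename_tags("</ul></body>"), "\n";
# prints: </list></document>
```

Because the mapping lives in a hash, supporting another tag is a one-line change to %dictionary rather than a new block of code.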


Converting an XML file to an HTML file (there are many, many different ways to do this):

use XML::Parser; #calls an external module
open (STDOUT, ">output.htm");

#registers one subroutine for each kind of parsing event
my $parser = XML::Parser->new( Handlers => {
    Init  => \&handle_doc_start,
    Final => \&handle_doc_end,
    Start => \&handle_elem_start,
    End   => \&handle_elem_end,
    Char  => \&handle_char_data,
    });
my $file = "presum.xml";
$parser->parsefile($file);

#called once, before parsing starts: prints the top of the HTML page
sub handle_doc_start
  {
my $header = <<HEADER;
<html>
<head>
<title>Precancer Classification</title>
</head>
<body>
<center><h1>Precancer Classification</h1></center>
<br><br>
HEADER
  print $header;
  }

#called once, after parsing ends: prints the bottom of the HTML page
sub handle_doc_end
  {
my $header = <<HEADER;
<br>
</body>
</html>
HEADER
  print $header;
  }

#called at every opening tag: numbers each concept and starts a list for it
sub handle_elem_start
  {
  my ($expat, $name, %atts) = @_;
  if ($name eq "concept")
    {
    $count++;
    print "<br><font color=\"0000ff\">$name $count</font><ul>\n";
    return;
    }
  }

Etc., etc., etc.,


Remember: Perl XML-related modules can be downloaded and installed at no cost from the www.activestate.com PPM service.


PPM> search xml
Packages available from http://www.ActiveState.com/PPMPackages/5.6:
CGI-Form2XML [1.3] Render CGI form input as XML
CGI-ToXML [0.02] Converts CGI to an XML structure
CGI-XML [0.1] Perl extension for converting CGI.pm variables to/from XML
CGI-XMLForm [0.10] Extension of CGI.pm which reads/generates formated XML.
CGI-XMLPost [1.3] receive XML file as an HTTP POST
DBIx-XML-DataLoader [1.1b]
DBIx-XMLMessage [0.05] XML Message exchange between DBI data sources
DBIx-XML_RDB [0.05] Perl extension for creating XML from existing DBI datasources
Data-DumpXML [1.05] Dump arbitrary data structures as XML
GoXML-XQI [1.1.4] Perl extension for the XML Query Interface at xqi.goxml.com.
HTTP-WebTest-Plugin-XMLReport [1.01] Report plugin for HTTP::WebTest generates output in XML format
Tk-XMLViewer [0.15] Tk widget to display XML
XML-AutoWriter [0.37] DOCTYPE based XML output
XML-Beautify [0.05] Beautifies XML output from XML::Writer (soon to do any XML).
XML-DOM [1.25] A perl module for building DOM Level 1 compliant document structures
XML-DOMHandler [1] Implements a call-back interface to DOM.
XML-DTDParser [1.7] quick and dirty DTD parser
XML-Excel [0.02] Perl extension converting Excel files to XML
XML-Node [0.11] Node-based XML parsing: an simplified interface to XML::Parser
XML-SAX [0.12] Simple API for XML
XML-SAX-Base [1.02] Base class SAX Drivers and Filters
XML-SAX-Builder [0.02] build XML documents using SAX
XML-SAX-Expat [0.37] SAX Driver for Expat
XML-SAX-Machines [0.4] manage collections of SAX processors
XML-SAX-PurePerl [0.80] Pure Perl XML Parser with SAX2 interface
XML-SAX-RTF [0.1] SAX Driver for Microsoft's Rich Text Format (RTF)
XML-SAX-Simple [0.02] SAX version of XML::Simple
XML-SAX-Writer [0.44] SAX2 XML Writer
XML-SAXDriver-CSV [0.07] SAXDriver for converting CSV files to XML
XML-Writer [0.4] Perl extension for writing XML documents.
XML-Writer-String [0.1] Capture output from XML::Writer.
XML-XPath [1.12] a set of modules for parsing and evaluating XPath statements
XML-XPath-Simple [0.05] Very simple interface for XPaths
XML-XPathScript [0.03] Stand alone XPathScript
XML-XQL [0.68] A perl module for querying XML tree structures with XQL
XML-XSLT [0.40] A perl module for processing XSLT


Creating an XML file from an Excel file

1. The example is done in Windows. Because it uses a Windows-based application and the Windows API, it won’t work in Linux (not Perl’s fault).

2. There are plenty of other approaches that will work in Linux.

3. It also requires Excel to be installed.

4. The complete Perl script is opener7.pl, found in the Perl tutorial: http://65.222.228.150/jjb/tutor.htm


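Creates a Windows OLE object for Excel - NON_PERLISH

#"or" is used rather than "||", which would bind to the string and never reach die
my $app = CreateObject OLE "Excel.Application" or die "Can't open";
$app->Workbooks->Open($xlfile);

Creates the XML tags by collecting a list of the column headers

foreach my $column_place (@column_array)
  {
  #reads the header cell in row 1 of this column
  $thing = $app->Range("${column_place}1")->{'Value'};
  if ($thing ne "")
    {
    $thing =~ s/ /_/g;         #spaces become underscores
    $thing =~ s/[^\w0-9]//g;   #drops characters that cannot appear in a tag name
    $thing =~ s/2nd/Second/g;
    $nextthing = "$column_place||$thing";   #stores column letter and tag name together
    print "$nextthing\n";
    push(@index, $nextthing);
    }
  else
    {
    last;   #the first empty header cell ends the list of columns
    }
  }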
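
The three substitutions above can be tried without Excel. This is a minimal sketch that applies them to made-up column headers; the helper name header_to_tag and the sample headers are assumptions for illustration. One plausible reason for the 2nd-to-Second rule is that an XML tag name may not begin with a digit.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a spreadsheet column header into an XML tag name,
# using the same three substitutions shown in the slide.
sub header_to_tag {
    my ($thing) = @_;
    $thing =~ s/ /_/g;          # spaces become underscores
    $thing =~ s/[^\w0-9]//g;    # drop everything that is not a letter, digit or underscore
    $thing =~ s/2nd/Second/g;   # avoid a tag name that starts with a digit
    return $thing;
}

# Made-up headers, for illustration only
print header_to_tag("2nd Gleason Grade"), "\n";   # prints: Second_Gleason_Grade
print header_to_tag("Patient Age (years)"), "\n"; # prints: Patient_Age_years
```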


Creates a Windows OLE object for Excel - NON_PERLISH

foreach my $arrayvalue (@index)
  {
  #splits "column||tagname" into the column letter and the tag name
  $arrayvalue =~ /\|\|/;
  my $key = $`;
  my $value = $';
  $thing = $app->Range($key . $row)->{'Value'};
  #substitute &amp; for & (must be done first)
  $thing =~ s/\&/\&amp;/g;
  #substitute &gt; for >
  $thing =~ s/\>/\&gt;/g;
  #substitute &lt; for <
  $thing =~ s/\</\&lt;/g;
  #substitute &apos; for '
  $thing =~ s/\'/\&apos;/g;
  #substitute &quot; for "
  $thing =~ s/\"/\&quot;/g;
  #removes all other punctuation, keeping the & and ; of the entities
  $thing =~ tr/a-zA-Z0-9 &;//cd;
  print " <$value>$thing</$value>\n";
  }
$row++;
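
Each of the five predefined XML entities ends in a semicolon, and every occurrence in a cell value has to be replaced, not just the first. A minimal sketch of an escaping routine along those lines follows; the name escape_xml and the sample cell value are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the five characters that XML reserves.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or the & of later entities is escaped again
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Made-up cell value, for illustration only
print escape_xml(q{PSA < 4 & "normal"}), "\n";
# prints: PSA &lt; 4 &amp; &quot;normal&quot;
```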


BUILDING THE COOPERATIVE PROSTATE CANCER TISSUE RESOURCE TISSUE MICROARRAY FILE

1. Get the xls file with core information
TMACPCTR.XLS 98,816 7-24-03 11:17am A

2. Convert the xls file to an xml file using opener7.pl
OPENER7 .PL 3,663 7-24-03 11:39am A
This produces file block2.xml
BLOCK2.XML 328,263 7-24-03 11:39am A

3. Add header and trailer information to the xml file


Header information is basically:

<?xml version="1.0"?>
<histo>
<tma>
<header>
<title>CPCTR Microarray 1</title>
<creator>CPCTR</creator>
<subject>Tissue Microarrays</subject>
<description>CPCTR TMA XML</description>
<rights>public domain</rights>
<filename>tmacpctr.xml</filename>
</header>

Trailer information is basically:

</core>
</block>
</tma>
</histo>

This produces:
TMACPCTR .XML 331,636 7-28-03 10:58am A
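
The slides do not show how the header and trailer were added, and it can be done by hand in a text editor. As a minimal sketch, it could also be scripted as below. The helper name add_header_and_trailer is an assumption, and the header, body and trailer strings are shortened stand-ins for the real file contents.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap an XML body between a header and a trailer.
sub add_header_and_trailer {
    my ($header, $body, $trailer) = @_;
    # Make each part end in exactly one newline before joining them
    for my $part ($header, $body, $trailer) {
        $part =~ s/\s*\z/\n/;
    }
    return $header . $body . $trailer;
}

# Shortened stand-ins, for illustration only
my $header  = qq{<?xml version="1.0"?>\n<histo>\n<tma>};
my $body    = "<block><core><organ>prostate</organ></core></block>";
my $trailer = "</tma>\n</histo>";

print add_header_and_trailer($header, $body, $trailer);
```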


4. Check validity of the tmacpctr.xml file using validtma.pl
VALIDTMA .PL 9,132 5-21-03 3:06pm A

The TMA validating Perl script can be obtained by going to the TMA specification paper:

The tissue microarray data exchange specification: A community-based, open source tool for sharing tissue microarray data
Jules J Berman, Mary E Edgerton and Bruce A Friedman
BMC Medical Informatics and Decision Making 2003 3:5
http://www.biomedcentral.com/1472-6947/3/5

The validating protocol produces a screen output that includes:

c:\tmacpctr.xml
Begining to parse c:\tmacpctr.xml now.
Finished. c:\tmacpctr.xml is a valid Tissue Microarray File.
The one-way hash of your file is e2ad62a75974628b7499bd7d771b82f0
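
The slides do not say which algorithm validtma.pl uses for the one-way hash. The value shown is 32 hexadecimal digits long, which is the length of an MD5 digest, so the sketch below assumes MD5 and uses Perl's Digest::MD5 module. The helper name file_md5_hex is an assumption for illustration.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the MD5 digest of a file as 32 hex digits.
# Assumption: MD5 is the hash in use; the slides do not name the algorithm.
sub file_md5_hex {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # hash the raw bytes, not a line-ending-translated copy
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Usage: perl filehash.pl c:\tmacpctr.xml
print file_md5_hex($ARGV[0]), "\n" if @ARGV;
```

Two people holding the same file get the same hash, so the value can be used to confirm that a shared file has not been altered.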


Querying an XML file

1. There are many, many ways. Most people use XSLT (Extensible Stylesheet Language Transformations).

2. When you haven’t converted your XML into another data structure (like a database structure) and you’re querying the XML document directly, a query is the same as a transformation in which you throw everything away except the stuff that matches your query.
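
The idea of a query as a "throw everything else away" transformation can be sketched in plain Perl. This is an illustration under stated assumptions, not a general XML query tool: the element names and data are made up, the helper name query_cores is hypothetical, and a regular expression like this only copes with simple, non-nested elements.

```perl
use strict;
use warnings;

# Made-up sample document, for illustration only
my $xml = <<'XML';
<histo>
<core><organ>prostate</organ><diagnosis>adenocarcinoma</diagnosis></core>
<core><organ>prostate</organ><diagnosis>normal</diagnosis></core>
<core><organ>kidney</organ><diagnosis>adenocarcinoma</diagnosis></core>
</histo>
XML

# Hypothetical helper: keep only the <core> elements whose <diagnosis>
# equals $wanted, and discard the rest of the document.
sub query_cores {
    my ($doc, $wanted) = @_;
    my @hits;
    while ($doc =~ /(<core>.*?<\/core>)/gs) {
        my $core = $1;    # copy before the next match overwrites $1
        push @hits, $core if $core =~ /<diagnosis>\Q$wanted\E<\/diagnosis>/;
    }
    return @hits;
}

# Prints the first and third <core> elements, one per line
print "$_\n" for query_cores($xml, "adenocarcinoma");
```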


HETEROGENEOUS XML MERGES/QUERIES

1. Can be thought of as a special form of XSLT

2. Or as a data structure conversion

3. Or as a straightforward Perl programming job


This is where namespaces become important.
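
A small sketch of why: two files from different sources may both use a tag such as <title> with different meanings, and a namespace prefix keeps them apart once the files are merged. The helper name add_prefix, the prefixes tma and rpt, and the sample fragments are all assumptions for illustration. In a real document each prefix would also be bound to a namespace URI with an xmlns declaration, which this sketch leaves out.

```perl
use strict;
use warnings;

# Hypothetical helper: put a namespace prefix on every element name in a fragment.
sub add_prefix {
    my ($xml, $prefix) = @_;
    # Matches "<name" and "</name"; declarations (<?...) and comments (<!--) are left alone
    $xml =~ s/<(\/?)([A-Za-z_][\w.-]*)/<$1${prefix}:$2/g;
    return $xml;
}

# Made-up fragments from two sources that both use <title>
my $tma_fragment    = "<title>CPCTR Microarray 1</title>";
my $report_fragment = "<title>Surgical Pathology Report</title>";

print add_prefix($tma_fragment, "tma"), "\n";
# prints: <tma:title>CPCTR Microarray 1</tma:title>
print add_prefix($report_fragment, "rpt"), "\n";
# prints: <rpt:title>Surgical Pathology Report</rpt:title>
```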