Layout analysis on newspaper archives

Vincent Buntinx
vincent.buntinx@epfl.ch
École Polytechnique Fédérale de Lausanne, Switzerland

Frédéric Kaplan
frederic.kaplan@epfl.ch
École Polytechnique Fédérale de Lausanne, Switzerland

Aris Xanthos
aris.xanthos@unil.ch
Université de Lausanne, Switzerland

The study of newspaper layout evolution through historical corpora has been addressed by diverse qualitative and quantitative methods in the past few years (Antonacopoulos et al., 2013; González et al., 2001; Liu et al., 2001; Mitchell and Hong, 2004; Singh and Bhupendra, 2014). The recent availability of large corpora of newspapers is now making the quantitative analysis of layout evolution ever more popular. This research investigates a method for the automatic detection of layout evolution on scanned images with a factorial analysis approach. The notion of eigenpages is defined by analogy with eigenfaces used in face recognition processes. The corpus of scanned newspapers that was used contains 4 million press articles, covering about 200 years of archives. This method can automatically detect layout changes of a given newspaper over time, rebuilding a part of its past publishing strategy and retracing major changes in its history in terms of layout. Besides these advantages, it also makes it possible to compare several newspapers at the same time and therefore to compare the layout changes of multiple newspapers based only on scans of their issues.

Introduction to the Corpus

The corpus consists of digitized facsimiles of two Swiss newspapers, “Journal de Genève” (JDG) from years 1826 to 1997 and “Gazette de Lausanne” (GDL) from years 1804 to 1997. Scanned daily issues of each journal were transcribed using an optical character recognition (OCR) system (Rochat et al., 2016). The entire scanned data weighs more than 20 TB, which puts most usual analysis techniques out of reach for regular desktop computers. This corpus has been the focus of several studies analyzing textual data, such as linguistic changes (Buntinx et al., 2016) and named entity recognition (Ehrmann et al., 2016). An example of different layouts of GDL’s first page is given in Figure 1, which shows the evolution of various features, such as title size and position, fonts and number of columns.

Bitmap Factorial Analysis

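In order to analyze layout evolution, we propose to build a static layout representation for every year in the corpus. Thus, when studying each newspaper’s first page, we define the pixel $t$ of the static representation $P_{y,m}$ of month $m$ of year $y$ as

$$P_{y,m}^{t} = \frac{1}{N_{ym}} \sum_{d=1}^{N_{ym}} P_{y,m,d}^{t}$$

where $N_{ym}$ is the number of issues in month $m$ of year $y$ and $P_{y,m,d}$ is the first page of day $d$ of month $m$ of year $y$. The pixel $t$ of the static representation $P_{y}$ of year $y$ is then defined as

$$P_{y}^{t} = \frac{1}{N_{y}} \sum_{m=1}^{N_{y}} P_{y,m}^{t} = \frac{1}{N_{y}} \sum_{m=1}^{N_{y}} \frac{1}{N_{ym}} \sum_{d=1}^{N_{ym}} P_{y,m,d}^{t}$$

where $N_{y}$ is the number of month representations $P_{y,m}$ in year $y$. A diagram of the process is shown in Figure 2.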
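The two-level averaging defined above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `monthly_mean` and `yearly_mean` are hypothetical, and the bitmaps are assumed to be grayscale arrays of identical size that are already aligned. None of this is specified in the text.

```python
import numpy as np

def monthly_mean(pages):
    """Pixel-wise mean of the first pages of one month (P_{y,m}).

    pages: sequence of N_ym bitmaps, each of shape (height, width).
    """
    return np.mean(np.asarray(pages, dtype=float), axis=0)

def yearly_mean(months):
    """Pixel-wise mean of the monthly representations of one year (P_y).

    months: list of per-month sequences of bitmaps. Months without any
    issue are skipped (an assumption), so N_y counts only the months
    that actually have a representation.
    """
    monthly = [monthly_mean(p) for p in months if len(p) > 0]
    return np.mean(monthly, axis=0)

# Toy example: two months of 2x2 "pages" with different numbers of issues.
jan = [np.zeros((2, 2)), np.ones((2, 2))]  # monthly mean is 0.5 everywhere
feb = [np.ones((2, 2))]                    # monthly mean is 1.0 everywhere
year = yearly_mean([jan, feb])             # mean of the two monthly means
```

Note that the yearly image is a mean of monthly means, so each month carries the same weight regardless of how many issues it contains. In the toy example this gives 0.75 per pixel, whereas pooling the three issues directly would give 2/3.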

Figure 1. Different layouts of GDL in years 1825, 1850 and 1875 (top, left to right), 1925, 1950 and 1975 (bottom, left to right).

Figure 2. Process diagram creating a yearly representation of first page layouts.

These representations give a vision of the mean layout over the course of a given year. Each yearly representation can be projected in a two-dimensional space by performing a principal component analysis (PCA), which finds the directions of maximal variance across the pixels. This method is analogous to the eigenfaces method used for face recognition (Turk and Pentland, 1991a, 1991b). We compute the eigenvectors, which we call eigenpages, as well as the eigenvalues of the covariance matrix of the pixels. The yearly representations are then projected in the two-dimensional space of the two eigenvectors which have the highest eigenvalues. The resulting projections of yearly mean images of JDG and GDL from years 1900 to 1998 are portrayed in Figure 3. In these figures, each point is a yearly image and consecutive years are linked in order to highlight the change over time. The further apart two points are, the bigger the layout changes occurring between the two years. Visual inspection reveals several clusters of years with a similar layout. Furthermore, homogeneous sequences of years may be clustered automatically based on the (unprojected) distance between them, e.g. by computing the distance between year y and y+1 and “cutting” the sequence of years at positions where their distance exceeds an arbitrary threshold.
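As a rough illustration of the eigenpage computation, the sketch below flattens each yearly image into a pixel vector, centers the data and takes the leading eigenvectors of the pixel covariance matrix. It obtains them through a singular value decomposition of the centered data, which yields the same eigenvectors without forming the full pixel-by-pixel covariance matrix. The function name `eigenpages` and the random toy data are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def eigenpages(yearly_images, n_components=2):
    """Return the leading eigenpages and the projection of each year.

    yearly_images: array-like of shape (n_years, height, width).
    """
    X = np.asarray(yearly_images, dtype=float)
    n_years = X.shape[0]
    flat = X.reshape(n_years, -1)            # one row of pixels per year
    centered = flat - flat.mean(axis=0)      # remove the mean page
    # Rows of vt are the eigenvectors of the pixel covariance matrix,
    # sorted by decreasing eigenvalue s**2 / (n_years - 1).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    eigenvalues = s[:n_components] ** 2 / (n_years - 1)
    coords = centered @ components.T         # 2-D position of each year
    pages = components.reshape(n_components, *X.shape[1:])
    return pages, eigenvalues, coords

# Toy data standing in for yearly mean images (10 years of 4x3 bitmaps).
rng = np.random.default_rng(0)
images = rng.random((10, 4, 3))
pages, eigenvalues, coords = eigenpages(images)
```

Each row of `coords` corresponds to one point in a plot such as Figure 3, and each entry of `pages` can be displayed as an image, in the same way eigenfaces are.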

Figure 3. PCA projected results of the yearly representations of first pages of JDG (top, blue) and GDL (bottom, red) from years 1900 to 1998 with clusters obtained by visual inspection.

Discussion

The PCA technique allows us to quantify layout changes by covariance analysis of the pixels of yearly representations. The proportion of covariance information retained by the two-dimensional PCA projection is 73% for JDG and 76% for GDL. Visual interpretation reveals different chronological clusters, which are displayed in Tables 1 and 2 along with their mean positions in the two-dimensional space of eigenpages as well as mean images representing these periods (computed in the same way as yearly images, cf. Figure 2). These mean images reveal the major layout transitions in each journal, which may be summarized as follows:

Journal de Genève (JDG):

● 1900-1915: 6 columns, title above columns 2 to 5, little space between columns.

● 1916-1931: 4 columns, title above columns 1 to 4, more space between columns.

● 1932-1964: 4 columns, change of the layout around the title and the first title position.

● 1965-1968: 4 columns, change of the layout around the title, boxes with black borders begin to appear.

● 1969-1991: 4 columns, total change of the title, title above columns 2 to 4, logo appears, more space between columns and boxes, article titles are bigger.

● 1992-1995: 5 columns, merger of JDG and GDL, big change of layout, boxes inside boxes begin to appear, more stable structure.

● 1996-1998: 6 columns, big change in title font, previous column layout replaced by a more classic one, article titles are placed at the top of the first page.

Gazette de Lausanne (GDL):

● 1900-1945: 6 columns, title above columns 2 to 5, little space between columns.

● 1946-1966: 7 columns, title above columns 2 to 6, more space between columns yielding particularly small column sizes.

● 1967-1970: 5 columns, title above columns 2 to 5, first column begins before the title which is on the right, advertisements placed at the bottom of the page.

● 1971-1973: 6 columns, more classic layout with article titles at the top.

● 1974-1991: 4 columns, lots of space between columns and articles, bigger article titles.

● 1992-1995: 5 columns, merger of JDG and GDL, big change of layout, boxes inside boxes begin to appear, more stable structure.

● 1996-1998: 6 columns, big change in title font, column layout replaced by a more classic one, article titles are placed at the top of the first page.

The automatic clustering method described in the previous section has been applied to the unprojected distances and produces similar clustering results (depending on the threshold parameter). Qualitative analysis confirms that the resulting clusters are all separated by important layout transition phases.
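A minimal sketch of this threshold-based segmentation is given below. It assumes a Euclidean distance between the full pixel vectors of consecutive yearly images, since the text does not name the metric. The function name `segment_years` and the toy images are hypothetical.

```python
import numpy as np

def segment_years(years, yearly_images, threshold):
    """Split a chronological sequence wherever the layout jumps.

    years: list of year labels in chronological order.
    yearly_images: array-like of shape (n_years, height, width).
    threshold: arbitrary cut-off on the distance between year y and y+1.
    """
    flat = np.asarray(yearly_images, dtype=float).reshape(len(years), -1)
    # Unprojected Euclidean distance between each year and the next one.
    gaps = np.linalg.norm(np.diff(flat, axis=0), axis=1)
    clusters, current = [], [years[0]]
    for year, gap in zip(years[1:], gaps):
        if gap > threshold:      # layout transition: close the current period
            clusters.append(current)
            current = []
        current.append(year)
    clusters.append(current)
    return clusters

# Toy example: the "layout" changes abruptly between 1901 and 1902.
years = [1900, 1901, 1902, 1903]
images = [np.zeros((2, 2)), np.zeros((2, 2)), np.ones((2, 2)), np.ones((2, 2))]
periods = segment_years(years, images, threshold=1.0)
```

With these toy images the only distance above the threshold lies between 1901 and 1902, so the sequence is cut into the periods 1900-1901 and 1902-1903. Lowering the threshold produces more, shorter periods, which is why the results depend on this parameter.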

Table 1: Chronological clusters with their mean first page representations and their positions in the axes of PCA eigenpages (JDG).

Table 2: Chronological clusters with their mean first page representations and their positions in the axes of PCA eigenpages (GDL).

This analysis is also useful to compare several newspaper publishing strategies. We projected the two newspapers in the same two-dimensional space representation (presented in Figure 4) using the same method with yearly representations of both journals in order to compare their chronological trajectories. The covariance information retained by this PCA projection is 67%. Visual inspection reveals three main clusters for each journal. Each of these clusters turns out to correspond to groups of clusters that have been detected in the previous projections. We observe that the layout of both journals has evolved in a similar way but with different timescales. GDL is more dispersed than JDG and explored different strategies during the period 1900-1966. However, GDL adopted a style more similar to that of JDG between 1967 and 1973, just before it entered a major layout transition in 1974 (5 years later than JDG).
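The joint projection can be sketched by stacking the yearly images of both newspapers before running the PCA, so that both trajectories share the same pair of axes. The sketch assumes that the bitmaps of the two newspapers have been brought to the same pixel dimensions, and it reads the reported percentage as the share of total variance carried by the two leading components. Both points are assumptions, as are the function name `joint_projection` and the toy data.

```python
import numpy as np

def joint_projection(images_a, images_b, n_components=2):
    """Project two sets of yearly images into one common PCA space."""
    a = np.asarray(images_a, dtype=float).reshape(len(images_a), -1)
    b = np.asarray(images_b, dtype=float).reshape(len(images_b), -1)
    stacked = np.vstack([a, b])                 # years of both newspapers
    centered = stacked - stacked.mean(axis=0)   # common mean page
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    variances = s ** 2
    # Share of the total variance kept by the leading components.
    retained = variances[:n_components].sum() / variances.sum()
    return coords[:len(a)], coords[len(a):], retained

# Toy data: 5 yearly images for one title, 7 slightly brighter ones for the other.
rng = np.random.default_rng(1)
jdg_like = rng.random((5, 4, 3))
gdl_like = rng.random((7, 4, 3)) + 0.5
traj_a, traj_b, retained = joint_projection(jdg_like, gdl_like)
```

Because the axes are fitted on the stacked data, distances between a point of `traj_a` and a point of `traj_b` are directly comparable, which is not the case when each newspaper is projected in its own space.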

Figure 4. PCA projected results of the yearly representations of first pages of JDG (blue) and GDL (red) from years 1900 to 1998 in the same two-dimensional space representation with clusters obtained by visual inspection.

Conclusion

These first results demonstrate a promising method of detecting layout evolution automatically. The method is applicable to a large variety of longitudinal image corpora without any prerequisites, since it only requires images in bitmap format. It makes it possible to compare several corpora and determine periods of layout transitions in a common two-dimensional space for visual interpretation. In addition, unprojected distances can be used to determine layout changes in an entirely automatic fashion, by analyzing the representation space through clustering algorithms. Future work on this method should include the integration of an alignment method in the bitmap preprocessing step, because alignment errors may impact the pixel covariance analysis and eigenpages creation.

Bibliography

Antonacopoulos, A., Clausner, C., Papadopoulos, C., and Pletschacher, S. (2013). ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013. 12th International Conference on Document Analysis and Recognition.

Buntinx, V., Bornet, C., and Kaplan, F. (2016). Studying Linguistic Changes on 200 Years of Newspapers. DH2016, Krakow, Poland, July 11-16.

Ehrmann, M., Colavizza, G., Rochat, Y., and Kaplan, F. (2016). Diachronic Evaluation of NER Systems on Old Newspapers. 13th Conference on Natural Language Processing (KONVENS 2016), Bochum, Germany, September 19-21.

González, J., Rojas, I., Pomares, H., Salmerón, M., Prieto, A., and Merelo, J. J. (2001). Optimization of web newspaper layout in real time. Computer Networks, Volume 36, Issues 2–3, July, Pages 311-321, ISSN 1389-1286, http://dx.doi.org/10.1016/S1389-1286(01)00158-X.

Liu, F., Luo, Y., Yoshikawaf, M., and Dongcheng, H. (2001). A New Component based Algorithm for Newspaper Layout Analysis. 6th International Conference on Document Analysis and Recognition.

Mitchell, P. E., and Hong, Y. (2004). Newspaper layout analysis incorporating connected component separation. Image and Vision Computing, Volume 22, Issue 4, 1 April, Pages 307-317, ISSN 0262-8856, http://dx.doi.org/10.1016/j.imavis.2003.11.001.

Rochat, Y., Ehrmann, M., Buntinx, V., Bornet, C., and Kaplan, F. (2016). Navigating through 200 years of historical newspapers. iPRES, Bern, October 3-6.

Singh, V., and Bhupendra, K. (2014). Document layout analysis for Indian newspapers using contour based symbiotic approach. International Conference on Computer Communication and Informatics (ICCCI-2014), Jan. 03–05, Coimbatore, India.

Turk, M., and Pentland, A. (1991a). Face recognition using eigenfaces. Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–591.

Turk, M., and Pentland, A. (1991b). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3 (1): 71–86. doi:10.1162/jocn.1991.3.1.71. PMID 23964806.
