sandhaus, van valkenburg, cotler; nyt technical team: the future of the past
![Page 1: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/1.jpg)
The Future of The Past
The New York Times and the Challenge of Archives
Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler
The New York Times
@nytarchives
![Page 2: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/2.jpg)
![Page 3: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/3.jpg)
(us)
![Page 4: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/4.jpg)
A Problem of Archives
“How do you faithfully represent information created with one technology using another?”
![Page 5: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/5.jpg)
A Problem We Know Well
• Migrating The Index to The Times Information Bank
• Migrating The Microfilm Archive to TimesMachine
• Migrating Legacy Web Content to Modern Online Presentation (or the challenge of multiple legacy formats)
![Page 6: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/6.jpg)
The Problem By The Numbers
Almost 60,000 Issues Published Since September 18, 1851
![Page 7: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/7.jpg)
The Problem By The Numbers
3,500,000+ Unique Pages Printed Since September 18, 1851
![Page 8: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/8.jpg)
The Problem By The Numbers
15,000,000+ Articles Published Since September 18, 1851
![Page 9: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/9.jpg)
Digital Archives
[Timeline figure: archive coverage by period (1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016). Legend: Full Text NYT5, Full Text NYT4, Abstracts NYT4, Abstracts NYT5.]
![Page 10: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/10.jpg)
The New York Times Information Bank
![Page 11: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/11.jpg)
The Index
Evan Sandhaus
![Page 12: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/12.jpg)
The New York Times Company Archives
![Page 13: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/13.jpg)
The New York Times Company Archives
![Page 14: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/14.jpg)
The New York Times Company Archives
![Page 15: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/15.jpg)
The New York Times Company Archives
![Page 16: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/16.jpg)
The New York Times Company Archives
![Page 17: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/17.jpg)
TimesMachine
![Page 18: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/18.jpg)
The Deep Archive
[Chart: articles per year, 1851 to 2012; y-axis from 0 to 180,000. Series: Scanned Articles, Digital Articles, Blogs. Annotations: ≈75% and ≈25%.]
![Page 19: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/19.jpg)
The Deep Archive
![Page 20: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/20.jpg)
The Numbers
46,592 Issues Published Since September 18, 1851
![Page 21: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/21.jpg)
The Numbers
2,335,446 Unique Pages Printed Since September 18, 1851
![Page 22: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/22.jpg)
The Numbers
11,298,320 Articles Published Since September 18, 1851
![Page 23: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/23.jpg)
The Scanned Archive
![Page 24: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/24.jpg)
The Scanned Archive
Headline
CROWD ROARS THUNDEROUS WELCOME; Breaks Through Lines of Soldiers and Police and Surging to Plane Lifts Weary Flier from His Cockpit AVIATORS SAVE HIM FROM FRENZIED MOB OF 100,000 Paris Boulevards Ring With Celebration After Day and Night Watch -- American Flag Is Called For and Wildly Acclaimed
![Page 25: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/25.jpg)
The Scanned Archive
Lede Paragraph
PARIS, May 21. -- Lindbergh did it. Twenty minutes after 10 o'clock tonight suddenly and softly there slipped out of the darkness a gray-white airplane as 25,000 pairs of eyes strained toward it. At 10:24 the Spirit of St. Louis landed and lines of soldiers, ranks of policemen and stout steel fences went down before a mad rush as irresistible as the tides of the ocean.
![Page 26: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/26.jpg)
The Scanned Archive
“Dirty” ASCII
…Lifte Fro'm His Cockpit. As he was lifted to the ground Lindbergh w as l,-:k:, :::. - hair unkempt, he looked completely worn out. lle h-:: strength enough, however, to smile, and waved his hand to t? ' crowd. Soldiers with fixed bayonets were unable to keep bach the crowd. United States Ambassador Herrick was among the first to welcome and congratulate the hero.s…
![Page 27: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/27.jpg)
The Scanned Archive
Indexing Metadata
• Headings: People, Places, Organizations, Subject
• Abstracts: Concise summary of the facts in the article
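The four ingredients on these slides (headline, lede paragraph, "dirty" ASCII, and indexing metadata) could be held together in one record per article. The sketch below is only an illustration of that idea; the class name `ScannedArticle`, its field names, and the `searchable_text` helper are assumptions, not the schema The Times uses.

```python
from dataclasses import dataclass, field


@dataclass
class ScannedArticle:
    """Hypothetical record for one article in the scanned archive."""

    headline: str
    lede: str
    ocr_text: str  # "dirty" ASCII straight from OCR, errors included
    abstract: str = ""
    # Heading type (People, Places, Organizations, Subject) -> index terms.
    headings: dict[str, list[str]] = field(default_factory=dict)

    def searchable_text(self) -> str:
        """Join the human-written fields, leaving the noisy OCR text out."""
        parts = [self.headline, self.lede, self.abstract]
        for terms in self.headings.values():
            parts.extend(terms)
        return " ".join(p for p in parts if p)
```

Keeping the OCR text in its own field makes it easy to search the clean fields first and fall back to the noisy text only when needed.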
![Page 28: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/28.jpg)
Demo
TimesMachine Version 2.0
![Page 29: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/29.jpg)
Archive Transcription
![Page 30: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/30.jpg)
The Problem
• As a subscriber exclusive, TimesMachine does not appear in Google Search results.
• Lack of full text before 1980 makes it difficult for articles to rank, or even appear, in Google results.
• For example: in 1945 The Times published 161,961 articles, and only a tiny fraction appear in Google results.
![Page 31: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/31.jpg)
The Solution
• Transcribe articles from archival scans and publish these assets as searchable pages on nytimes.com.
• Transcribe and publish 1964 as pilot.
• If that works, transcribe and publish all remaining articles between 1960-1980.
![Page 32: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/32.jpg)
Progress & Results
• All articles between 1960-1980 transcribed.
• All articles between 1970-1979 available on nytimes.com with more to come.
• Google now indexing 672,500 new assets published between 1970-1979!
• Plans to publish 1960-1969, and to monitor performance of new pages.
![Page 33: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/33.jpg)
Online Archive Modernization
![Page 34: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/34.jpg)
Online Archive Modernization
![Page 37: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/37.jpg)
The Initial Solution
print data (XML) → new format for CMS (JSON)
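As a rough illustration of the XML-to-JSON step on this slide, here is a minimal sketch using only the Python standard library. The element names (`headline`, `pubdate`, `body/p`) and the output keys are invented for the example; the real print XML and CMS JSON schemas are not shown in the slides.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical print-archive record; the real schema is not in the slides.
SAMPLE_XML = """
<article>
  <headline>Example Headline</headline>
  <pubdate>2004-01-15</pubdate>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>
"""


def print_xml_to_cms_json(xml_text: str) -> str:
    """Convert one print-archive XML record into a CMS-style JSON string."""
    root = ET.fromstring(xml_text.strip())
    record = {
        "headline": (root.findtext("headline") or "").strip(),
        "publication_date": (root.findtext("pubdate") or "").strip(),
        # One list entry per paragraph, with nested inline markup flattened.
        "body": ["".join(p.itertext()).strip() for p in root.iterfind("body/p")],
    }
    return json.dumps(record, indent=2)
```

Calling `print_xml_to_cms_json(SAMPLE_XML)` returns a JSON object with the headline, date, and a list of paragraphs.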
![Page 38: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/38.jpg)
The Case Of The Missing Articles
![Page 39: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/39.jpg)
The Case Of The Missing Articles
web data (HTML) + print data (XML) → new format for CMS (JSON)
![Page 40: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/40.jpg)
The Case of the Missing Articles
1. What is the complete list of article URLs from 1996-2006?
2. How do we identify which of the missing web articles correspond to existing print articles so that we can combine them and avoid duplicate content?
3. Which articles are web-only and not in our print archive at all, and how do we scrape that page for content & metadata?
4. Can we build a system that will process all the data for each year easily & efficiently?
![Page 41: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/41.jpg)
The Definitive List of Articles
4 different sources:
1. Print archive
2. Site analytics (from the past 6 months)
3. Movie, theater, and restaurant reviews
4. Sitemaps
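Combining the four sources above amounts to taking a de-duplicated union of URLs. The sketch below shows one way that could look; the normalization rules (drop the scheme, query string, leading `www.`, and trailing slash) and the function names are assumptions made for illustration, and the example paths are made up.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def definitive_list(*sources):
    """Union any number of URL sources into one sorted, de-duplicated list."""
    seen = set()
    for source in sources:
        for url in source:
            seen.add(normalize_url(url))
    return sorted(seen)


# Hypothetical inputs standing in for two of the four sources.
print_archive = ["http://www.nytimes.com/2004/01/15/a.html"]
site_analytics = [
    "http://nytimes.com/2004/01/15/a.html?ex=1",
    "http://www.nytimes.com/2004/01/16/b.html",
]
urls = definitive_list(print_archive, site_analytics)
```

Here `urls` holds two entries, because the two spellings of the first article collapse into one.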
![Page 42: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/42.jpg)
The Archive Migration Pipeline For A Given Year
[Pipeline diagram. Nodes: archive XML; definitive list of URLs; extracted URLs; missing URLs; missing HTML; URLs with no article body; XML to HTML matches; unmatched HTML; JSON from XML and HTML; JSON from unmatched HTML; skipped files; JSON with no duplicate]
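One reading of the first few stages in this diagram is a set difference followed by a split on whether a fetched page has any body. The sketch below follows that reading; the function name, the inputs, and the interpretation of each stage are assumptions, since the slides name the stages but do not describe how they are computed.

```python
def plan_year(definitive_urls, extracted_urls, fetched_html):
    """Partition one year's URLs into early pipeline stages.

    definitive_urls: every known article URL for the year.
    extracted_urls:  URLs already present in the archive XML.
    fetched_html:    hypothetical mapping of URL -> article body HTML
                     ("" or absent when no body could be found).
    """
    definitive = set(definitive_urls)
    extracted = set(extracted_urls)

    # URLs the web knows about but the print archive does not.
    missing = definitive - extracted

    # Of those, keep the ones whose page actually has an article body.
    missing_html = {u for u in missing if fetched_html.get(u)}
    no_body = missing - missing_html

    return {
        "missing_urls": sorted(missing),
        "missing_html": sorted(missing_html),
        "urls_with_no_article_body": sorted(no_body),
    }
```

The later stages (matching, JSON generation, de-duplication) would consume `missing_html` together with the archive XML.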
![Page 48: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/48.jpg)
The Archive Migration Pipeline
2004 Articles (116K total)
• Print Archive (56K): 48.3%
• Print Archive and Web (42K): 36.2%
• Web-only (15K): 12.9%
• Bad URLs (3K): 3%
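The shares on this slide can be checked against the rounded counts. The snippet below is just that arithmetic; because the counts are given in thousands, the results are approximate, and the bad-URL share comes out near 2.6%, consistent with the 3% label on the chart.

```python
# Counts in thousands, as labeled on the slide.
counts_2004 = {
    "Print Archive": 56,
    "Print Archive and Web": 42,
    "Web-only": 15,
    "Bad URLs": 3,
}


def shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_2004 = sum(counts_2004.values())  # 116 (thousand)
shares_2004 = shares(counts_2004)
```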
![Page 49: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/49.jpg)
All The Little Things…
• 1996
• Article Matching
• Better URLs
• Quality Assurance
• Next Steps
![Page 50: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/50.jpg)
Article Matching: Fusion
![Page 51: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/51.jpg)
Fusion Explained
web data (HTML) + print data (XML)
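The slides do not say how web and print versions of an article were paired. As a stand-in, the sketch below matches them by headline similarity using `difflib.SequenceMatcher` from the standard library; the function names, the 0.9 threshold, and the input shapes are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(headline: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", headline.lower())
    return " ".join(cleaned.split())


def match_articles(web, print_archive, threshold=0.9):
    """Pair web articles with print articles by headline similarity.

    web:           mapping of URL -> web headline
    print_archive: mapping of print ID -> print headline
    Returns (matches, unmatched): matched URL -> print ID, plus the
    list of web URLs with no sufficiently similar print headline.
    """
    matches, unmatched = {}, []
    for url, web_headline in web.items():
        best_id, best_score = None, 0.0
        for print_id, print_headline in print_archive.items():
            score = SequenceMatcher(
                None, normalize(web_headline), normalize(print_headline)
            ).ratio()
            if score > best_score:
                best_id, best_score = print_id, score
        if best_id is not None and best_score >= threshold:
            matches[url] = best_id
        else:
            unmatched.append(url)
    return matches, unmatched
```

Matched pairs would feed the "JSON from XML and HTML" stage, and unmatched pages the "JSON from unmatched HTML" stage. Comparing every pair is quadratic, so a real run over a full year would likely narrow candidates by publication date first.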
![Page 52: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/52.jpg)
Search Engine Optimization
27iht-scoutus.t.html
![Page 53: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/53.jpg)
Search Engine Optimization
curb-violates-free-speech-supreme-court-rules-72-justices-void-internet.html
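The descriptive file name on this slide looks like a headline turned into a slug. The sketch below shows one way such a slug could be produced; the rules (lowercase, strip punctuation, join words with hyphens, cap the word count) are inferred from the example on the slide, and `slugify` and `max_words` are hypothetical names.

```python
import re


def slugify(headline: str, max_words: int = 10) -> str:
    """Turn a headline into a keyword-rich, URL-safe file name."""
    # Drop everything except letters, digits, and whitespace,
    # so "7-2" becomes "72" as in the slide's example.
    cleaned = re.sub(r"[^a-z0-9\s]", "", headline.lower())
    words = cleaned.split()
    return "-".join(words[:max_words]) + ".html"


# Hypothetical headline fragment, chosen to mirror the slide's slug.
slug = slugify("Supreme Court Rules, 7-2; Justices Void Internet")
```

Here `slug` is `supreme-court-rules-72-justices-void-internet.html`, which gives a search engine far more to work with than an opaque name like `27iht-scoutus.t.html`.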
![Page 54: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/54.jpg)
The Case Of The Missing Sections
![Page 55: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/55.jpg)
The Case Of The Missing Sections
![Page 56: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/56.jpg)
Next Steps
[Timeline figure: archive periods 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016, each marked as Full Text or No Full Text.]
![Page 57: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/57.jpg)
Next Steps
Photos
![Page 58: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/58.jpg)
Next Steps
Digital preservation
![Page 59: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/59.jpg)
To Conclude…
![Page 60: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past](https://reader034.vdocuments.site/reader034/viewer/2022042510/58809ca51a28abd8158b5c5d/html5/thumbnails/60.jpg)
Thank You!
Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler
The New York Times