Els Hoogteijling, ESS Modernisation Workshop, Bucharest, March 2016
Modernisation of Price Collection at Statistics Netherlands
Overview
Why modernise price collection for CPI?
Scanner data
Webscrapers
Robot assisted price collection
Critical success factors and lessons learned
Why modernisation of price collection?
– Dynamics of consumer market
– Internet purchases
– Reduction of administrative burden
– Cost effective
– Improved quality of CPI/HICP
– More detail
Price collection at Statistics Netherlands
Before 2000
– Mainly price collection in shops
– Questionnaires
– Price collection by telephone interviews
2000-2010
– Introduction of scanner data
– Introduction of price collection on internet
– Reduction of price collection in shops
From 2010
– More scanner data
– More internet data
– Registers and administrative data
– Strong reduction of price collection in shops
[Figure: Number of shops visited by interviewers (per month); vertical axis 0 to 8000]
Scanner data, transaction data, administrative data
Started in 2003 - Strong growth from 2010
• Scanner data
o 14 supermarkets; no price collection by interviewers
o 2 DIY-shops; more DIY-shops in 2016/2017
o 2 drugstores; more drugstores in 2016
o 1 department store; more department stores in 2016/2017
• Transaction data travel agencies
• Transaction data fuels
• Registers energy prices
[Figure: Scanner data in CPI-production; horizontal axis 2000 to 2018, vertical axis 0 to 40]
Internet data by webscrapers
Why internet as a data source
Internet robots, how do they work?
Price collection from webshops – clothing
Classification, computation, results
Monitoring
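Robots / crawlers / bots / spiders / scrapers: how do they work? (1)
[Diagram: you give commands to a browser; the browser sends requests over the internet to the website; the website returns code, images, style, data, etc.; the browser shows you the graphical markup]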
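Robots / crawlers / bots / spiders / scrapers: how do they work? (2)
[Diagram: a robot / spider / crawler takes the place of the browser; it handles navigation and sends requests over the internet to the website; the website returns code, images, style, data, etc.; the robot passes data on to you]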
Robots / crawlers / bots / spiders / scrapers: how do they work? (3)
[Diagram: as in (2), with several data flows coming out of the robot; you monitor actively]
Generic software for:
- site navigation
- product details
- monitoring
Webscrapers: legal aspects
– Netiquette
o Robots identify as “CBSBot, Statistics Netherlands”
o Robots operate during night / morning
o Robots minimize load: wait for a second between requests
– Communication
o Statistics Netherlands informs web site owners in case of considerable data retrieval
– Database law / intellectual property rights
o Statistics Netherlands operates under the Dutch statistics law and does not use the data for any purpose other than those specified in that legislation.
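The netiquette rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used at Statistics Netherlands: the identification string and the one-second pause come from the slide, while the collection window (00:00 to 08:00), the function names and the `Throttle` class are hypothetical.

```python
import time
from datetime import datetime, time as dtime

USER_AGENT = "CBSBot, Statistics Netherlands"  # identification string quoted on the slide
MIN_DELAY_SECONDS = 1.0                        # "wait for a second between requests"
# Assumed window; the slide only says "night / morning".
WINDOW_START = dtime(0, 0)
WINDOW_END = dtime(8, 0)


def request_headers():
    """Headers that identify the robot to the website owner."""
    return {"User-Agent": USER_AGENT}


def in_collection_window(now):
    """True if `now` (a datetime) falls inside the assumed night/morning window."""
    return WINDOW_START <= now.time() < WINDOW_END


class Throttle:
    """Enforces a minimum pause between consecutive requests."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep `min_delay` between calls; return the pause."""
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_delay - (self._clock() - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause
```

In use, a robot would call `Throttle.wait()` before every request and pass `request_headers()` to whatever HTTP client it uses; the clock and sleep functions are injectable so the pause logic can be checked without real waiting.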
Webscrapers for CPI
Started in 2012:
‐ robots collect all products and prices from webshops daily
‐ including product description and classification characteristics
‐ major webshops for clothing
Data analysed for some years; classification and methodology developed
Now:
‐ 15 websites are scraped daily
‐ 3 websites are used for computation of CPI
‐ automated collection, monitoring, postprocessing, transport and storage
‐ daily/weekly monitoring
Future:
‐ 20 – 30 websites in 2018?
Price collection from webshops
Inspection of data – very volatile
[Figure: Number of articles; number of articles in sale]
From data to statistics
Challenges:
‐ from volatile data to stable statistics
‐ how to classify multiple, less structured data sources
Seasonal pattern
Product characteristics
From product characteristics to classification
Example of product characteristics
Brand: H&M
Division: women
Type: jacket
Article description: blazer
Fabric: leather
Color: dark grey
Size: medium
……..
Price: € 49.99
Classification per website
DIVISION: men, women, children
LAYER: underwear, upperwear
FUNCTION: regular, special occasions; nightwear; sports
PART OF BODY: upper body, lower body; legs, arms, head, …
[Diagram labels: product characteristics, classification, price]
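A rule-based mapping from scraped product characteristics to the four classification dimensions might look like the sketch below. The slides do not describe the actual classification algorithm, so the lookup tables, the keyword rules and the `classify` function are all hypothetical; only the dimension names and the example article come from the slide.

```python
# Hypothetical lookup rules; a real rule set would be maintained per website.
# "upperwear" is the slide's term, assumed here to mean the outer layer.
LAYER_BY_TYPE = {"jacket": "upperwear", "trousers": "upperwear", "briefs": "underwear"}
BODY_PART_BY_TYPE = {"jacket": "upper body", "trousers": "lower body", "briefs": "lower body"}
FUNCTION_KEYWORDS = {"pyjama": "nightwear", "running": "sports", "wedding": "special occasions"}
DIVISIONS = {"men", "women", "children"}


def classify(article):
    """Map scraped product characteristics to the four classification dimensions.

    A value of None means no rule matched and the article needs manual review.
    """
    division = article.get("division", "").lower()
    article_type = article.get("type", "").lower()
    description = article.get("description", "").lower()

    function = "regular"  # default when no keyword matches
    for keyword, label in FUNCTION_KEYWORDS.items():
        if keyword in description:
            function = label
            break

    return {
        "DIVISION": division if division in DIVISIONS else None,
        "LAYER": LAYER_BY_TYPE.get(article_type),
        "FUNCTION": function,
        "PART OF BODY": BODY_PART_BY_TYPE.get(article_type),
    }


# The example article from the slide.
example = {
    "brand": "H&M", "division": "women", "type": "jacket",
    "description": "blazer", "fabric": "leather",
    "color": "dark grey", "size": "medium", "price": 49.99,
}
result = classify(example)
```

Returning `None` for unmatched dimensions keeps unknown article types visible, which fits the point made elsewhere in the slides that new types of articles have to be monitored.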
Monitoring
– Websites change constantly and unexpectedly
– Monitoring of collected data is a must
o Articles per division
o Articles in sale
o New types of articles
o New structure of website
– DevOps team: Development – IT-experts – Operations (CPI) working closely together
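The monitoring checks listed above can be illustrated with a small comparison of two consecutive scrapes. This is a sketch under assumptions: the record fields (`division`, `type`, `sale`), the function names and the 30% threshold are hypothetical and not taken from the slides. A change in website structure is not detected directly here; it would typically show up as a sharp drop in article counts.

```python
from collections import Counter


def summarise(articles):
    """Count articles per division, articles in sale, and collect the article types."""
    return {
        "per_division": Counter(a["division"] for a in articles),
        "on_sale": sum(1 for a in articles if a.get("sale")),
        "types": {a["type"] for a in articles},
    }


def alerts(previous, current, max_relative_change=0.3):
    """Return warnings when the current scrape differs sharply from the previous one."""
    messages = []

    # Articles per division: flag new divisions and large relative changes.
    divisions = set(previous["per_division"]) | set(current["per_division"])
    for division in sorted(divisions):
        before = previous["per_division"].get(division, 0)
        after = current["per_division"].get(division, 0)
        if before == 0 or abs(after - before) / before > max_relative_change:
            messages.append(f"division {division}: {before} -> {after} articles")

    # Articles in sale: same relative test, ignoring an unchanged count.
    before, after = previous["on_sale"], current["on_sale"]
    if before != after and (before == 0 or abs(after - before) / before > max_relative_change):
        messages.append(f"articles in sale: {before} -> {after}")

    # New types of articles that the classification may not cover yet.
    for new_type in sorted(current["types"] - previous["types"]):
        messages.append(f"new article type: {new_type}")

    return messages
```

An empty list means the scrape looks similar to the previous one; any message is a prompt for a person to look at the website and the robot.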
Some results
[Figure: three monthly index charts, 201412 to 201601, vertical axis 60 to 160. Panels: 031210 Garments for men; 031220 Garments for women; 031230 Garments for infants and children. Legend: Website‐X, all shops]
There is more than webscraping
• Webscrapers are suited for many prices on few sites
• In CPI we also collect few prices from many sites, for example: driving lessons, cinema tickets, pizza delivery services
• Not feasible to build a robot for every single site: too expensive (monitoring, maintenance)
Start of robot assisted data collection
Robot assisted data collection
Robot tool automatically checks whether prices have changed
Traffic light indicates status:
• Green: nothing changed, price is saved in database
• Red: some change, needs the attention of a statistician
• Two clicks to hold old price or store a new one
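The traffic-light logic can be sketched as follows. The slides do not say how the Robot Tool detects a change, so this sketch simply compares the stored price with the newly observed one; the `Check` class, the function names and the use of integer cents are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    item: str
    status: str                    # "green" or "red"
    stored_cents: int              # price currently in the database
    observed_cents: Optional[int]  # price read from the site, None if unreadable


def check_price(item, stored_cents, observed_cents):
    """Green when the observed price equals the stored one, red otherwise."""
    unchanged = observed_cents is not None and observed_cents == stored_cents
    return Check(item, "green" if unchanged else "red", stored_cents, observed_cents)


def resolve(check, keep_old):
    """Price to save: automatic for green, the statistician's choice for red."""
    if check.status == "green" or keep_old or check.observed_cents is None:
        return check.stored_cents
    return check.observed_cents
```

Green items need no human action, so the statistician's time goes only to the red ones, where `keep_old` stands for the choice between holding the old price and storing the new one.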
Robot assisted data collection – impact on organisation
• Specialists who used to collect prices manually from the internet now use the robot tool
• More prices collected in less time (80% productivity improvement)
• Better quality and less rework (reduced chance of making errors)
• Work is more interesting
• No need for organisational changes
Robot assisted data collection – try it yourself
Robot Tool is available on request for other NSIs
http://research.cbs.nl
Critical success factors
• Close cooperation between methodologists, IT and CPI‐experts
• Feeling of urgency, wish to change
• Adapt quickly to changes in the website (using an efficient framework for the robots)
• Automatic classification of the data
• Methodology to calculate price indices
• Patience: you can’t change overnight
• Balance between impatience and cautiousness
Lessons learned
• Monitor the data weekly, even if not in production for CPI
• Implement the new methods gradually, no ‘big bang’
o Learn from the robots in production
o Improve monitoring
o Improve classification algorithms
o Lower risks
• In traditional methods the collection of prices is the end of a process; with internet robots it is just the beginning
Robots are not perfect, neither is price collection in shops
Conclusion
Scanner data and internet data:
• Reduce administrative burden (85% less price collection in shops)
• Cost effective
• Better quality by using millions of prices
• Can be done without large impact on organisation
[Figure: Price collection by weighting share in CPI. Categories: Scanner data; Electronic questionnaire; Rents survey; E‐data energy and fuels; Internet prices and pricelists; E‐data travel agencies; Electronic/paper questionnaires; Price collection by telephone; Price collection in shops]
Thank you for your attention! Questions? Discussion