
EUROPEAN COMMISSION
EUROSTAT

Directorate B: Methodology; Dissemination; Cooperation in the European Statistical System
Unit B-1: Methodology; Innovation in official statistics

ESTAT/B1/WGM
Available in EN only

5th meeting of the Working Group on Methodology (WGM)

Luxembourg, 1 April 2020

Item 1.1 of the agenda

Web Intelligence Hub


EXECUTIVE SUMMARY

1. RECOMMENDATION FOR ACTION

The WGM is invited to discuss the progress in the setting up of the ESS Web Intelligence Hub to collect, process and disseminate statistical information derived from Web data, starting with online job advertisements.

2. BACKGROUND AND BRIEF HISTORY

Following the adoption by the ESSC of the Bucharest memorandum on ‘Official Statistics in a datafied society (Trusted Smart Statistics)’1, the ESSC at its meeting on 16 May 2019 discussed the main strategic orientations allowing the ESS to embrace new data sources for the regular production of European statistics. These build upon the outcomes of the ESS Vision 2020 project BIGD, which explored the use of multiple new data sources and developed proofs of concept for the generation of new statistical products in response to users’ needs. The Big Data ESSnets I and II were instrumental in identifying priority areas and maturing the work.

In 2015-16, the European Centre for the Development of Vocational Training (Cedefop) launched a study to assess the feasibility of developing its own multilingual system for collecting and analysing data from online job advertisements (OJA). The success of the feasibility study prompted Cedefop to pursue this activity further, and in 2017 Cedefop launched a project to develop its own fully-fledged system, collecting information from all 28 EU Member States.

The ESSnet Big Data I collaborated closely with Cedefop, in particular through joint workshops, where developments in the two activities were shared, and through the ESSnet’s use of OJA data from Cedefop. In its final report, the ESSnet Big Data recommended working towards the use of the data collected by the Cedefop OJA data system for the production of official statistics.

Following those recommendations, Eurostat presented to the Steering Group on Big Data and Official Statistics (SG-BD) in February 2019 a proposal to integrate the European system for collecting, processing and analysing online job vacancies, currently being developed by Cedefop, into an ESS information system fed by internet sources. The SG-BD discussed the proposal, highlighting the strategic potential for the ESS of integrating online job advertisements. It raised important questions pertaining to methodology and data quality, IT, architecture and governance that would require thorough analysis of the impact of the envisaged system on the production of statistics. Members recognised and stressed the widened scope of the project to additionally include information on skills. They pointed out the importance of a multi-purpose, modular approach and the need to integrate national and European needs into a common system, which would provide a crucial element for ESS capacity building and reputation. Eurostat undertook to elaborate the proposal further in collaboration with the ESSnet Big Data II work package B (online job advertisements), which has developed options for how such a system should be set up.

At the DIME/ITDG Steering Group meeting in February 2020, Eurostat presented in depth the business case for the Web Intelligence Hub (WIH), which will be part of the Trusted Smart Statistics Centre. The hub builds on the work of the previous ESSnets on Big Data and the work done by Cedefop on skills statistics using online job advertisements. The planned services of the hub were presented. In particular, Eurostat emphasised the possible efficiency gains of a common European platform. Finally, the next steps with an indicative timeline were presented.

1 Bucharest Memorandum on Official Statistics in a Datafied Society (Trusted Smart Statistics), 104th DGINS Conference, Bucharest, 10 and 11 October 2018. Adopted by the ESSC of 12 October 2018; available at: https://europa.eu/!Gw87JQ


The Steering Group members requested clarification about the exact outputs of the hub, the overall governance structure, and data access and privacy requirements. The members emphasised the risk of non-cooperative stakeholders, which should be mitigated by communication. In the discussion, Eurostat explained that the first goal is to set up a working system for online job advertisements in order to profit from synergies. The system should be open to all NSIs of the ESS. It can be extended with other use cases in the future.

In conclusion, the DIME/ITDG SG endorsed the business case of the WIH and discussed options for the governance, which shall be developed further in the process of the WIH implementation. The first use case of the WIH will be on extracting statistical information from online job advertisements.

3. POLICY CONTEXT

In order to reap the benefits of the data revolution and move towards trusted smart statistics, one priority area of the draft multi-annual action plan (MAP) implementing the European Statistics Programme 2021-27 is the establishment of a Trusted Smart Statistics Centre to develop IT, methodological and quality frameworks, guidelines, tools and infrastructure suitable for big data processing. As part of this Centre, the ESS will produce official statistics based on web intelligence. This document proposes the way forward for the web intelligence hub.

Reflecting the importance of jobs and skills for both the Juncker Commission and the Von der Leyen Commission, a first instance of the web intelligence hub focuses on the use of online job advertisement data in order to address some gaps in their statistical measurement. Further priority projects include the collaborative economy, enterprise characteristics (platform data and enterprise websites) and price statistics.

4. CONSEQUENCES FOR NATIONAL STATISTICAL INSTITUTES

Harvesting data from the Web requires development and deployment of new technological and methodological approaches for the ESS. Several National Statistical Institutes in the Member States have already created centres for big data and smart statistics. The Eurostat trusted smart statistics centre with its web intelligence hub will - in close collaboration with the National Statistical Institutes - play an important role in the network of these centres, effectively co-ordinating and providing multi-purpose and multi-domain capabilities to produce European statistics based on new data sources and deploying new statistical processes.

5. OUTSTANDING ISSUES

There is a need to engage subject matter domain experts and to agree an ESS governance structure that reflects the different usages of the web intelligence hub, considering purpose of use (experimentation, collaboration, training, production), actors, and subject domain.

6. RISK ASSESSMENT

If the European Statistical System does not invest in joint capabilities for using data from the internet, there is a risk that only a limited set of Member States will be capable of doing so. Similarly, approaches would differ from country to country, thus posing a risk to data quality and comparability. Participation of the ESS members should be ensured in creating appropriate structures.

7. NEXT STEPS


Eurostat will release in the first quarter an infrastructure addressed to the ESS and opened in a first step to the ESSnet Big Data. In parallel, implementation of producing experimental data for the first use cases based on job advertisement data will start. A sustainable structure for participation in developments and maintenance of the WIH will be prepared.


Trusted Smart Statistics Centre: Web Intelligence Hub

1. CONTEXT

1.1 SITUATION DESCRIPTION AND URGENCY

Trusted Smart Statistics

Following the exploration of the use of new data types and sources for official statistics during 2015-2019, the ESSC at its meeting on 16 May 2019, discussed principles and strategic orientations to guide the path towards their regular implementation as so called trusted smart statistics ([ESSC2019]). New data types and sources require new methodological approaches to transform the raw data into statistical information, to address quality issues and to be jointly processed with traditional ones. At the same time, the integration of non-traditional data sources requires new IT capabilities, enabling the processing of huge amounts of structured and unstructured data and deployment of new statistical methods. As many of the new sources are held in the private sector, alternative scenarios for sustainable and secure use will need to be established. Finally, ESS skills and communication strategies need to match these requirements.

To provide a common umbrella for these developments and cater for synergies and economies of scale, a trusted smart statistics centre (TSSC) is being established. Several National Statistical Institutes in the Member States have already created centres for big data and smart statistics. The ESS trusted smart statistics centre will play an important role in the network of these centres, effectively co-ordinating and providing multi-purpose and multi-domain capabilities to support new statistical processes and statistical outputs, in close collaboration with the National Statistical Institutes.

The centre will be organised into hubs, with each hub specialising in a specific group of data sources with similar characteristics and processing similar types of data. Each hub can serve multiple statistical domains and each statistical domain can be served by several hubs. Construction of the hubs will be output-driven, centring on concrete use cases, which are defined so that they generate new European statistics or contribute to existing ones in line with users’ needs.

European statistics on skills

Reflecting the importance of skills for the Juncker Commission (2014 – 2019), in 2016 the Commission adopted ‘A New Skills Agenda for Europe’ ([EC2016]), with the aim to ensure that the skills available on the EU labour market correspond to the needs of business and the economy in general. The focus on skills continues now with the new Von der Leyen Commission, with the “Political Guidelines for the next Commission 2019 – 2024” ([EC2019]) calling for empowering people through education and skills and recognising that skills and education drive Europe’s competitiveness and innovation.


The New Skills Agenda for Europe requires answers to questions such as where and in what kinds of jobs there is high demand, what skills are demanded and where, how job requirements evolve, what possible career moves and new jobs and skills there are, and what jobs and skills are subject to shortages.

Despite the importance of skills at European level, European official statistics still present some gaps in their statistical measurement. The Eurostat statistical working paper “Statistical approaches to the measurement of skills” ([ESTAT2016]), from 2016 and drawn from the report of an internal Eurostat technical group on the measurement of skills, concludes that existing statistics cover a high number of countries over longer periods, but produce rather aggregate measures of skills. In particular, in the case of skills demand (by employers), it suggests that one way to address this limitation is to exploit new data sources. Finally, it recognises that “Although for the moment in a rather incipient phase, big data has the potential to become an important and relatively inexpensive data source for direct measurement of the skills demanded by businesses and offered by individuals. On the other hand, its use for official statistics still needs to be thoroughly assessed.”

Since 2016, the use of big data, in particular from the World Wide Web, for skills demand measurement has gone through significant development. The ESSnet Big Data I has explored the use of online job advertisements (OJA) published on the Web for the production of statistics. At the same time, Cedefop, the European Centre for the Development of Vocational Training, a European agency, is finalising the building of a pan-European system for the use of OJA data for skills measurement.

Web Data

One of the types of data sources that has been explored in the ESS is the World Wide Web. Two types of information are generated by the use of the Web: content (websites and Web APIs) and traffic (records of when websites are accessed). This information is located in two different types of places: web servers and web clients (web browsers). The content of the Web is located on web servers and is for the most part openly available or available upon registration (and possibly the payment of a fee). Traffic data is normally not openly available and is located both on web servers (in the web logs) and in web clients. Sometimes, web servers make aggregated traffic data produced from their web logs openly available.

The content of the Web is expressed as HTML (HyperText Markup Language), which is stored (or generated) and transmitted by web servers and interpreted and displayed by web browsers. HTML includes both structured and unstructured information. Structured information normally follows a regular structure across the web pages of a website, but it rarely follows standards across websites. Sometimes structured data is tagged within the HTML source code, which makes it easier to identify and extract. Unstructured information is for the most part expressed as natural language textual data.
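As an illustration of how tagged structured information can be pulled out of HTML source code, the following is a minimal sketch using only the Python standard library. The page fragment and its class names (`job-title`, `job-location`, `job-description`) are invented for this example; real websites each use their own markup, which is why extraction rules rarely transfer from one site to another.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the class names are invented for illustration.
SAMPLE_HTML = """
<html><body>
  <h1 class="job-title">Data Engineer</h1>
  <span class="job-location">Luxembourg</span>
  <div class="job-description">We are looking for an engineer with Python and SQL skills.</div>
</body></html>
"""


class TaggedFieldParser(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None  # class name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self._current = cls
            self.fields[cls] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so nested markup inside a field is not supported.
        self._current = None


def extract_fields(html, wanted):
    """Return {class name: stripped text} for the wanted elements found."""
    parser = TaggedFieldParser(wanted)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


fields = extract_fields(SAMPLE_HTML, ["job-title", "job-location", "job-description"])
```

In this sketch the title and location come out as structured values, while the description remains natural language text that needs further processing.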

Traditional statistical methods are not immediately applicable to natural language textual data (NLTD). Natural Language Processing (NLP) is a sub-field of Artificial Intelligence (AI) dedicated to the application of statistical methods to the analysis of this type of data or at least to its transformation into a format to which traditional statistical methods can be applied. There are two main approaches to the analysis of NLTD. The first one is knowledge extraction, which tries to parse natural language sentences as a human does and extract entities and their relationships as a knowledge graph2. The second one is “bag of words”, which takes words simply as tokens and analyses the frequency with which they are used across different “documents” (e.g. tweets, job advertisements). In both approaches, NLTD has to go through a series of operations, from sentence segmentation, word tokenization, lemmatization and removal of stop words to its mathematical representation in the form of TF-IDF (term frequency-inverse document frequency), word2vec or some other representation. The final result is the structured representation of the content of the NLTD or the (automatic) classification of the “document”. The automatic classification of the “documents” is done via machine learning, where a sample of human-classified cases is used to “train” an algorithm (more than a hundred different algorithms exist nowadays), which is then used to automatically classify other, non-classified cases.
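The “bag of words” operations described above can be sketched in a few lines of Python. This is a simplified illustration, not the processing chain of any system mentioned in this document: sentence segmentation and lemmatization are omitted, the stop-word list is a tiny invented sample, and the weighting used is the plain textbook form, term frequency multiplied by log(N / document frequency).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a per-language list.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "we", "with"}


def tokenize(text):
    """Lower-case the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented example "documents" standing in for job advertisements.
ads = [
    "We are looking for a nurse with experience in elderly care",
    "Software developer with experience in Python",
    "Python developer for data analysis",
]
weights = tf_idf(ads)
```

Terms that appear in few documents (such as "nurse") receive a higher weight than terms shared by several documents (such as "experience"), and a term present in every document gets a weight of zero. The resulting vectors are the kind of structured representation on which a classifier could then be trained.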

The use of Web data for statistical purposes

The ESSnet Big Data I launched two pilots in 2016 dedicated to the exploration of this data source, in particular content data. The first examined job advertisements on Web portals for enhancing job vacancies statistics, and the second attempted to extract business data and enterprises’ characteristics from their websites for enhancing business registers and business statistics (e.g. ICT usage statistics). When the ESSnet Big Data I was finalised, there was a call to refocus on implementing the most successful pilots in statistical production. In response to this call, the ESSnet Big Data II, launched at the end of 2018, included a track on implementation. Three use cases were selected for implementation work, two of which were the ones exploring Web data.

The Big Data ESSnets mostly apply a national approach to the introduction of Web data in official statistics: ESS partners explore either national data sources or global sources restricted in scope to specific countries, and develop parallel and to some extent complementary research and development activities.

There have also been European-level initiatives exploring Web data. The Directorate-General for Communications Networks, Content and Technology (DG-CONNECT) of the European Commission launched a series of projects for this purpose. In 2013, it launched "MOVIP - Monitor of Online Vacancies3 for ICT Practitioners" (Figure 1), followed by "Vacancies for ICT- Online RepositorY (VICTORY): Data Collection" in 2015, a project for "data on job vacancies for ICT practitioners (and possibly other professional categories) obtained by crawling job advertisements published online in relevant job advertisement outlets". The latter initiative was repeated in 2016, with VICTORY 2. As a result of VICTORY 2, DG-CONNECT has launched a near real time online dashboard, the Monitor for ICT online vacancies. DG-CONNECT is now reflecting on the next generation of VICTORY.

2 Wikidata is an example of this approach where the content of all Wikipedia articles is converted into a knowledge graph (the Wikidata) which can then be queried.

3 In the past, the term “online job vacancies” was normally used to refer to online advertisements of job vacancies. However, in this document the term “online job advertisement” is preferred, in order to highlight that it refers to a statistical unit different from the one used in job vacancies statistics.


Figure 1 – DG-CNECT’s Monitor for ICT online vacancies (screenshot taken 08/02/2019)

URL: http://www.pocbigdata.eu/monitorICTonlinevacancies/general_info/ (now deactivated)

Figure 2 – Cedefop’s Skills-OVATE: Skills Online Vacancy Analysis Tool for Europe (screenshot taken 30/10/2019)

URL: https://www.cedefop.europa.eu/en/data-visualisations/skills-online-vacancies

Similarly, but with a much larger scope, Cedefop launched in 2014 the "Real-time labour market information on skill requirements: feasibility study and working prototype" (Cedefop, 2016) with the goal of assessing the potential of online job advertisements data for statistics on skills. Following the conclusion that such a data source had potential and could feasibly be explored, Cedefop launched "Real-time Labour Market information on Skill Requirements: Setting up the EU system for online vacancy analysis" with the goal of setting up a fully fledged system to carry out analysis of vacancies and skills based on online job advertisements (Figure 2).

The ESS has followed closely the development of the Cedefop system. The data collected in the scope of the Cedefop feasibility study was used for the first European Big Data Hackathon, run in March 2017. The purpose of the Hackathon was to explore the data source for building innovative statistical products that would help answer a particular policy question: how to design policies for reducing the mismatch between jobs and skills at regional level in the EU. The Hackathon follow-up workshop then brought together participants of the Hackathon, the development team of the Cedefop system and the members of the ESSnet Big Data I OJA work package team to discuss synergies. Until the end of the lifetime of ESSnet Big Data I, the OJA work package team organised several joint workshops with Cedefop and proposed a strategy for ongoing engagement ([ESSNET2018]), in which it recommends:

“It is expected that the first OJV data from the Cedefop vacancy scraping system from selected countries will start becoming available towards the end of 2018. It is hoped that data from the Cedefop will become available for use by NSIs within the ESS and will become an important, and possibly, the main source of OJV data for the ESS. It is also expected that NSIs will contribute by providing statistical expertise as well as other data sources to validate data from the Cedefop system. These data source would include Job Vacancy Survey (JVS) data and OJV data obtained from other sources.”

Eurostat has carried out its own exploration of Web data. Traffic data provided by the Wikimedia Foundation for Wikipedia (Wikipedia page views) has been used to produce experimental statistics4 in the domains of culture statistics5 and urban statistics6. This data source has also been explored for use in the temporal disaggregation of tourism indicators.

Eurostat has also developed a prototype for the extraction of business data from DBpedia and Wikidata, knowledge graphs extracted from Wikipedia, as a complement to increase the amount and timeliness of information feeding the EuroGroups Register.

Finally, it should be noted that Web content data has at the same time been explored in other statistical domains, such as online prices for price statistics.

Skills statistics based on Web data (OJA)

The experience of the ESSnet Big Data with the exploration of OJA data and the development of pan-European systems with that purpose allows us to have a good understanding of what can be produced with such data. The ESSnet Big Data I has successfully used OJA data to nowcast JVS, demonstrating that it may be possible to enhance current job vacancies statistics by providing increased timeliness and granularity (Figure 3).

Figure 3 – Nowcasting job vacancies statistics (JVS) using a naive S-ARIMA model and using a model with OJA as auxiliary data (BG)

4 https://ec.europa.eu/eurostat/web/experimental-statistics/world-heritage-sites

5 https://ec.europa.eu/eurostat/web/products-statistical-books/-/KS-04-15-737

6 https://ec.europa.eu/eurostat/en/web/products-statistical-books/-/KS-01-16-691

The Cedefop system now includes working algorithms to ingest and process Web pages with OJA data and automatic classifiers for 12 variables characterising the vacancy advertised:

- economic activity of the employer (NACE, 1st and 2nd level), with a success rate7 of 96%
- occupation (ISCO, 4th level), with a success rate of 100%
- location (NUTS & LAU), with a success rate of 99%
- education level required (ISCED), with a success rate of 66%
- working hours (“full-time”, “part-time”), with a success rate of 78%
- type of contract (“permanent”, “self-employed”, “temporary”), with a success rate of 83%
- experience required (custom taxonomy with 8 levels), with a success rate of 45%
- salary offered (custom taxonomy with 13 levels), with a success rate of 12%
- skills required (ESCO), with a success rate of 92%
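The success rates quoted above are defined in this document as the percentage of advertisements for which it was possible to extract the variable. A minimal sketch of that calculation follows; the record layout (one dictionary per advertisement, with `None` or a missing key meaning "not extracted") and the sample values are assumptions made for illustration only.

```python
def success_rates(records, variables):
    """Percentage of advertisements with an extracted value for each variable."""
    if not records:
        return {name: 0.0 for name in variables}
    total = len(records)
    return {
        name: 100.0 * sum(1 for r in records if r.get(name) is not None) / total
        for name in variables
    }


# Invented sample: four classified advertisements, salary extracted for one only.
records = [
    {"occupation": "2512", "salary": None},
    {"occupation": "2221", "salary": "band 3"},
    {"occupation": "2512"},
    {"occupation": "5223", "salary": None},
]
rates = success_rates(records, ["occupation", "salary"])
```

Under this definition the rate measures coverage, that is, how often a value could be assigned, and says nothing by itself about whether the assigned value is correct.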

Eurostat can develop new classifiers for additional variables, as needed, based on the raw data (natural language) collected. The raw data can also be used directly, for example in a search engine for initial exploration of new topics that are still not integrated in the existing classifications8.

Besides producing statistics, classified microdata can be made available via a Data Science Lab to users (e.g. in Commission policy DGs and the research community) for detailed data analysis.

Apart from the use of OJA data to enhance job vacancies statistics, particular statistics on skills still need to be agreed upon within the ESS, but possible examples are:

- Incidence (percentage of adverts) for the top 10 skills demanded by occupation
- Incidence (percentage of adverts) for the top 10 skills demanded by region
- Incidence (percentage of adverts) for identified emerging skills
- Skills profiles required by the enterprises
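To make the notion of incidence concrete, the sketch below computes, for each group (occupation or region), the most frequently demanded skills and the percentage of adverts in that group citing each of them. The field names and the sample microdata are hypothetical; the statistics themselves still need to be agreed within the ESS, as stated above.

```python
from collections import Counter, defaultdict


def skill_incidence(adverts, group_key, top_n=10):
    """For each group, return the top_n skills with the percentage of adverts citing them."""
    advert_counts = Counter()
    skill_counts = defaultdict(Counter)
    for ad in adverts:
        group = ad[group_key]
        advert_counts[group] += 1
        # set() so that a skill repeated within one advert is counted once
        skill_counts[group].update(set(ad["skills"]))
    return {
        group: [
            (skill, 100.0 * n / advert_counts[group])
            # most frequent first; ties broken alphabetically for a stable result
            for skill, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ]
        for group, counts in skill_counts.items()
    }


# Invented classified microdata; the codes are illustrative only.
adverts = [
    {"occupation": "2512", "region": "LU00", "skills": ["python", "sql"]},
    {"occupation": "2512", "region": "BE10", "skills": ["python", "java"]},
    {"occupation": "2221", "region": "LU00", "skills": ["patient care"]},
]
by_occupation = skill_incidence(adverts, "occupation", top_n=2)
by_region = skill_incidence(adverts, "region", top_n=10)
```

The same function serves both the by-occupation and the by-region indicator, since only the grouping key changes.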

1.2 IMPACT OF THE SITUATION ON THE ESS

Web content data is normally openly accessible, but systematic access to this data source is not necessarily easy. Despite the enabling legal situation, many websites try to prevent the exhaustive scraping of their content (by blocking access). Reasons might be the high load that scraping places on their web servers or, in the case of Web platforms9, protection of information against automatic extraction by third parties including competitors. For this reason, the sustainable use of Web content data requires the establishment of agreements with the website owners. From the point of view of statistical offices, agreements are preferable as they may include direct access to the database in the backend, facilitating the process of acquiring the data.

Processing the amounts of data acquired from the Web requires specialised big data infrastructure as a necessary condition for integrating web data sources into statistical production. Despite the efforts of the last 5 years to make such infrastructure available to the ESS partners for the purposes of the big data pilots (Big Data Sandbox, BDTI), to date very few NSIs have their own big data infrastructure.

Transforming Web data into official statistics requires specialised skills to develop and maintain web scraping and machine learning algorithms. Despite the efforts of Eurostat in the last 5 years in providing big data training via the ESTP, only a few NSIs have acquired sufficient skills to run a full-fledged production system.

7 Percentage of advertisements for which it was possible to extract the variable.

8 In a recent meeting with DG-CONNECT, it was suggested to make use of such an OJA search engine to get an initial understanding of the needs for artificial intelligence (AI) skills in the labour market.

9 Contrary to simple static websites that normally serve simply as a dissemination channel, Web platforms use websites as an interface through which they provide a service and rely on a database in the backend from which they derive their value.

The challenges faced by European statistics in the use of Web data have to be put in the context of the data market in which official statistics now finds itself. Large players on the Web use the data generated in their activities to offer statistical/analytical services, including to public institutions and for the purpose of policy making. That is the case of, for example, Google, which uses the data generated by the use of its search engine to offer Google Trends (GT). Google Trends has been used at least by the ECB for nowcasting official statistics indicators, so far at an experimental level, but one cannot exclude the possibility of it guiding or at least influencing monetary policies in the future. Another example is LinkedIn, which has been willing to offer analytical services to the European Commission to guide policy making, in particular in the domain of employability of higher education graduates.

The situation concerning the move towards implementing the use of Web data in statistical production can be summarised as a lack of capabilities at Eurostat and in the vast majority of the National Statistical Institutes to use Web data for the production of official statistics. At the same time, there is the opportunity to take advantage of the development work of Cedefop on online job advertisements, to extend that system to other Web data sources, and to advance the use of Web data in official statistics in the domain of skills.

1.3 INTERRELATIONS AND INTERDEPENDENCIES

The use of Web data in official statistics, and the challenges it poses in terms of lack of capabilities, is interrelated with the statistical domains closest to the statistics that can be produced with this data source. For example, online job advertisements relate to labour market statistics, and business data relates to business statistics.

The situation is also interrelated with the IT infrastructure of Eurostat and the NSIs, both in terms of hardware and in terms of the architecture of the statistical production software stack.

2 EXPECTED OUTCOMES

The expected outcomes of this initiative are to:

(1) generate new statistical information from Web data and investigate adherence to ESS quality standards and meeting users’ demands. This comprises the respective methodological developments. The initial focus will be put on the generation of statistical products based on (a) online job advertisements/skills and (b) business data on multinational enterprises. Functionalities and architectures will be developed in a way to prepare for future extensions to further application domains and use cases.

(2) make Eurostat and ESS capabilities for harnessing statistical information from Web data available and operational, based on an EC-supported IT environment. This represents a long-term commitment, since the outcomes reach far beyond the project itself, and implies beneficial changes in the statistical production process at ESS level.

3 POSSIBLE ALTERNATIVES

3.1 ALTERNATIVE A: DO NOTHING

General Description

If we do nothing then there will be at least one European level system to gather and process online job advertisement data, run only by Cedefop, and potentially feeding policy making in the skills domain, without the oversight of European official statistics.

Eurostat would continue funding the development of the use of Web data by the NSIs via grants. Given that only a limited set of Member-States has been participating on the Web data dedicated work packages of the ESSnet Big Data, this would mean that not all Member-States would develop such a capability for several years. We already can observe this development in the domain of price statistics, where there is a growing gap between NSIs capable of using price information from the web and those who do not have the required capabilities.

The funding would also need to be spread across several Member-States and would be dedicated mostly to the development of national capabilities. We can take the investment of Cedefop to build its system (EUR 1M) as a benchmark for the cost of developing Web data capabilities at a fully operational level. At the level of investment of the ESSnet Big Data on OJA (EUR 215 000 for 2 years), it could take several decades to have such a system running in every Member-State.
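The arithmetic behind this estimate can be sketched roughly as follows. The sketch rests on assumptions that are not stated in the text: funding stays constant at the ESSnet level, every fully operational system costs the Cedefop benchmark, and systems are funded one after another with no cost sharing. The helper name `years_to_fund` and the numbers of systems tried are illustrative only.

```python
# Figures quoted in the text.
BENCHMARK_COST_EUR = 1_000_000   # Cedefop investment, used as the benchmark
ESSNET_BUDGET_EUR = 215_000      # ESSnet Big Data investment on OJA
ESSNET_PERIOD_YEARS = 2          # period covered by that investment


def years_to_fund(n_systems: int) -> float:
    """Years needed to fund n_systems at the benchmark cost (hypothetical helper).

    Assumes constant annual funding and no cost sharing between systems.
    """
    annual_funding = ESSNET_BUDGET_EUR / ESSNET_PERIOD_YEARS
    return n_systems * BENCHMARK_COST_EUR / annual_funding


# The numbers of systems below are illustrative, not taken from the text.
for n in (1, 3, 5):
    print(f"{n} system(s): about {years_to_fund(n):.1f} years")
```

Under these assumptions a single system already needs a little over nine years of funding, so even a handful of national systems amounts to several decades.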

While this scenario assumes that Web data capabilities would spread throughout the ESS, the most likely outcome is that some NSIs will invest in Web data, while others will abstain or will try to buy services from the private market with limited quality documentation.

SWOT Analysis

Strengths:
- No running costs for Eurostat.

Weaknesses:
- It would be very expensive to build the capabilities to use Web data in the ESS;
- It would take very long to have the capabilities spread throughout the ESS;
- Capabilities would be spread very unevenly within the ESS;
- Some NSIs would tend to buy services with unclear quality documentation;
- Provision of statistical information on skills would not be possible at EU level at least for a number of years.

Opportunities:
- With rapid IT developments, cheaper solutions could be available in 2-3 years' time.

Threats:
- Policy making done with statistics not produced by the ESS;
- National systems which would produce non-comparable results;
- ESS would be bypassed as provider of statistics on skills (demand).

Qualitative Assessment

This is not really a viable alternative. Given the cost and time required to develop a system for the acquisition and processing of Web data, even on the basis of rough estimates, the development of capabilities to use Web data in European official statistics would take so long that it would render them useless.

3.2 ALTERNATIVE B: WEB DATA USE FOR STATISTICS SOMEWHERE ELSE AT EUROPEAN LEVEL

General Description

In this alternative, the ESS closely follows the development of the system being built by Cedefop, and NSIs simply use it as a data source to produce job vacancies statistics at national level (and potentially statistics on skills). The ESS would not take any ownership of the system. This would mean that the system would not have the opportunity to be extended to other Web data sources, given the more limited scope of action of Cedefop (restricted to the labour market domain). Additional Web data sources would require the development of additional systems, even if part of the processing were similar to that for OJA. The ability of European official statistics to influence the system would be limited, and as a result the input from the Cedefop system into statistical production would not be as good as it could be if the ESS were involved in its running.

SWOT Analysis

Strengths:
- No running costs for Eurostat;
- Lower running costs for the NSIs.

Weaknesses:
- Additional Web data sources require the building of additional systems;
- Limited ability of the ESS to influence the system;
- Data availability would be fragmented across the EU.

Opportunities:
- At least some data on skills could be published by Eurostat.

Threats:
- Cedefop decides in some years not to maintain the system anymore;
- There is no incentive for stakeholders to involve Eurostat.

Qualitative Assessment

Although it would mitigate some of the weaknesses and threats of doing nothing, this alternative is still not ideal.

3.3 ALTERNATIVE C: CREATE A WEB INTELLIGENCE HUB AT EUROPEAN LEVEL

General Description

The recommended alternative is to establish a Web Intelligence Hub (WIH) for the development and provision to Eurostat and the rest of the ESS of fundamental building blocks for harvesting information from the web. In order to do so, it would set up the building blocks required for the collection and processing of data for specific use cases, gradually building up the WIH capabilities.

Eurostat, and the NSIs via the ESSnet Big Data II, collaborate closely with Cedefop with a view to adapting the system developed by Cedefop for use in official statistics. This will require making it sustainable and aligning it with official statistics quality requirements. Preparatory work has already started, comprising contacts with Cedefop to establish a possible partnership and with the Directorate-General for Informatics of the European Commission (DIGIT) on the design and acquisition of IT infrastructure.

The use cases selected for the launch of the WIH are data from online job advertisements and business data of multinational enterprises.

The ESSnet Big Data engaged in a project to explore which statistics can be derived from such online job advertisement data. Based on a partnership with Cedefop, there is now an opportunity for establishing a joint system for processing and analysing online job advertisements with the potential to serve multiple uses in the domain of labour market analysis and official statistics.

The sustainable use of business data from the Web, in particular on multinational enterprises, now requires its integration in a dedicated platform such as the WIH.

Further use cases will then be added over time, progressively increasing the set of capabilities of the WIH.

The WIH will also be an important infrastructure which can be used to support the activities on measuring the collaborative economy in case they involve raw data, e.g. via web scraping or direct access via APIs. This concerns in particular data management, processing specific to Web-scraped data and text analytics.

In order to establish the WIH, this project will build on the methodological developments of the ESSnet Big Data I web-scraping work packages (online job advertisements and enterprise websites) and on the system already under development by Cedefop. It will generalise the architecture of the system developed by Cedefop, making it extendable to all web data sources (and to other hubs of the Trusted Smart Statistics Centre). Finally, it will port the Cedefop system to the new TSSC/WIH architecture.

The TSSC and the WIH represent a long-term commitment, effectively launching a new process in the European Statistics production chain. Therefore, the time horizon of the outcomes goes far beyond this project itself.

SWOT Analysis

Strengths:
- Allows the ESS to deliver very fast;
- The system would be ready for easy extension to additional data sources and use cases;
- Building a single system covering all Member-States could save considerable cost of building such infrastructures at the level of the ESS.

Weaknesses:
- The system would be complex.

Opportunities:
- The ESS would demonstrate its capability to use new data sources for filling gaps in statistical information.

Threats:
- The European Commission IT infrastructure is not capable of handling such a system.

Qualitative Assessment

This alternative presents clear advantages and provides a system which is prepared for further future developments.

To conclude, based on the above analysis of alternatives, the solution proposed is to create a Web Intelligence Hub. This alternative would allow Eurostat and the ESS to put into production statistics based on Web data (as experimental statistics at first) and at the same time build an infrastructure that is ready to be extended to other Web data sources and use cases.

4 SOLUTION DESCRIPTION

4.1 LEGAL/POLICY BASIS

The work implements the Bucharest Memorandum on “Official statistics in a datafied society” adopted by the ESSC at its meeting on 12 October 2018 ([ESSC2018]).

It also forms part of establishing the trusted smart statistics centre, one of Eurostat's top priorities for the next Commission ([ESTAT2019]).

4.2 BENEFITS

The impact on ESS processes and organisation is the following:
- improved ESS reputation in terms of its ability to harness new data sources and innovate its processes and product portfolio;
- improved ESS capability and efficiency with regard to generating statistical information from data on the Web;
- improved ability of the ESS and Eurostat to serve Commission services' needs.

4.3 SUCCESS CRITERIA

The initiative will be considered a success if:
- Experimental statistics based on Web data collected and processed are disseminated;
- ESS members actively participate in the design, implementation and use of the web intelligence hub;
- The infrastructure proves to be cost-efficient and sustainable;
- The project transitions into a process for the dissemination of statistical information from Web data.


4.4 SOLUTION IMPACT

The proposed solution will have the following impact in the ESS:
- The NSIs will use the services of the Hub for the acquisition of the data, processing, information extraction and production of statistics. The NSIs can use the complete pipeline or only parts of it (e.g. NSIs may acquire data themselves and use the services of the Hub just to do the data processing).
- The NSIs may deploy at national level some of the components (i.e. services) of the Hub and operate them in conjunction with the remaining services of the Hub. The Hub would then be a federated network of shared services deployed at European and National level.
- The services of the Hub will evolve within the ESS with the NSIs improving machine-learning algorithms and building new automatic classifiers.
- The NSIs will participate in the running of use cases in the Hub by providing support in the identification of sources and obtaining training data for the machine learning algorithms.
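The idea that NSIs can use either the complete pipeline or only parts of it can be illustrated with a minimal sketch. It is not the Hub's design: the stage names (`acquire`, `process`, `extract`, `produce`), the toy advertisements and the toy skills list are all hypothetical, and each stage is a stand-in for a real service.

```python
from collections import Counter


def acquire(_: list) -> list:
    """Stand-in for data acquisition (web scraping or API access)."""
    return ["Data Analyst wanted: Python, SQL", "Nurse needed; SQL a plus"]


def process(ads: list) -> list:
    """Stand-in for text processing: lower-case and tokenise each advertisement."""
    cleaned = (ad.lower().replace(":", " ").replace(",", " ").replace(";", " ")
               for ad in ads)
    return [text.split() for text in cleaned]


def extract(token_lists: list) -> list:
    """Stand-in for information extraction: keep tokens found in a toy skills list."""
    skills = {"python", "sql"}
    return [[t for t in tokens if t in skills] for tokens in token_lists]


def produce(skill_lists: list) -> list:
    """Stand-in for statistics production: count advertisements mentioning each skill."""
    counts = Counter(s for skills in skill_lists for s in set(skills))
    return sorted(counts.items())


def run(stages, data=None):
    """Apply the chosen stages in order; each stage consumes the previous output."""
    data = [] if data is None else data
    for stage in stages:
        data = stage(data)
    return data


# Complete pipeline, as if every stage were a Hub service.
full = run([acquire, process, extract, produce])

# Partial use: an NSI supplies its own data and uses only the later stages.
own_ads = ["Statistician: R, Python"]
partial = run([process, extract, produce], own_ads)

print(full)
print(partial)
```

Because every stage has the same list-in, list-out shape, any stage could in principle be run nationally or at the Hub without changing the others, which is the property a federated network of shared services relies on.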

4.5 DELIVERABLES

This initiative would have the following deliverables:

1. A system providing capabilities to Eurostat and the ESS to use Web data to produce statistics, comprising:
   a. processes for web scraping and the use of APIs;
   b. conditions and agreements with trans-national players, such as platform owners, web aggregators etc. and templates for respective national agreements;
   c. portfolio of multi-purpose text processing and analytics services (text parsing, mining, classification, interpretation);
   d. mechanisms and technologies to assure data confidentiality;
   e. a governance structure;

2. Experimental statistics from online job advertisements (on skills and enhanced job vacancies statistics)

3. A new tool to analyse multinational enterprises

4.6 ASSUMPTIONS

Project assumptions related to:

Business: Members of the ESS will be willing to buy in. Partners will be found to develop further use cases.

Technology: The demands on IT infrastructure will be high and it is assumed that DIGIT will be able to deliver an appropriate infrastructure and service.

Resources: Appropriate resources are put at the disposal of the project.

4.7 CONSTRAINTS

- the scarce availability of resources in the statistical production units;
- European Commission rules and policies, in particular concerning the provision of on-premises infrastructure;
- established rules (and legislation) at national level restricting the integration of a non-national system in the processing of confidential statistical data.

4.8 RISKS

The main risks for the project are:
- Missing buy-in from domain units;
- Inability to establish agreements with key data holders (i.e. web platforms);
- Inability to reconcile the aggregated data compiled from the web with existing official statistics;
- Inability of ESS members to use the web intelligence hub (e.g. due to lack of necessary skills, IT infrastructure);
- Mobility of key staff.

4.9 ROADMAP

The major milestones of the project are the following.

Date          Milestone

Governance
02/2020       Discussion at DIME / ITDG Steering Group
05/2020       ESSC
06/2021       Launch of WIH centre of excellence

WIH Platform (WIP)
03/2020       Functional architecture and detailed roadmap
06/2020       WIP preproduction release
12/2020-22    WIP annual releases
10/2023       WIP final release (end of project, transfer to process)

WIH Use Cases

Use case 1 (OJA)
06/2020       Setting up an OJA production sandbox (training, experimentation)
09/2020       Porting of Cedefop system to Commission - Eurostat (as is)
12/2020       Experimental statistics related to skills and vacancies
12/2021       Adaptation of Cedefop system to the WIP architecture
From 2022     Production of statistics

Use case 2 (SmartData4MNEs)
12/2021       Release use case

Additional use cases …


4.10 SYNERGIES AND INTERDEPENDENCIES

The Web Intelligence Hub is interrelated with the other hubs of the Trusted Smart Statistics Centre. Although the hubs specialise in different types of data sources, some components and the governance structure will be shared between the hubs.

5 GOVERNANCE

The WIH is developed centrally by Eurostat to provide important capabilities to be shared between the members of the ESS, and this should be reflected in its future governance. In addition, it will be integrated in the overall governance of the Trusted Smart Statistics Centre (TSSC) and the existing governance of the TSS, including the DIME / ITDG, the Working Group on Methodology and its TSS Task Force.

The WIH will be a statistical infrastructure dedicated to the acquisition of data from the Web, its processing and the extraction of information for producing statistics in several domains. Sharing the workload of acquiring and processing data from particular data sources for the production of different statistical products (e.g. OJA data for enhanced job vacancies statistics and skills statistics) has great potential for efficiency gains. In order to involve the statistical domains responsible for the production of the statistics, WIH user groups may be created.

The WIH can be used for experimentation, collaboration, training and production of statistics at different levels (European, national, regional). Depending on the context, it will be necessary to define a model for covering the costs of the various activities.


Appendix 1: References and Related Documents

[ID] Reference or Related Document. Source or Link/Location

[ESSC2018] Bucharest Memorandum on ‘Official Statistics in a datafied society (Trusted Smart Statistics)’. https://ec.europa.eu/eurostat/web/ess/-/dgins2018-bucharest-memorandum-adopted

[ESSC2019] Implementation of the Bucharest Memorandum on ‘Official Statistics in a datafied society (Trusted Smart Statistics)’ – Trusted Smart Statistics Strategy and Roadmap, Document ESSC 2019/40/7. https://europa.eu/!bG33tw

[EC2019] Political Guidelines for the next Commission 2019 – 2024. https://ec.europa.eu/commission/sites/beta-political/files/political-guidelines-next-commission_en.pdf

[EC2016] A New Skills Agenda for Europe. https://ec.europa.eu/transparency/regdoc/rep/1/2016/EN/1-2016-381-EN-F1-1.PDF

[ESTAT2016] Statistical approaches to the measurement of skills. https://ec.europa.eu/eurostat/documents/3888793/7753369/KS-TC-16-023-EN-N.pdf

[ESTAT2019] Eurostat Briefing Book.

[ESSNET2018] Web scrapping / Job vacancies – Strategy for ongoing engagement. https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/e/e0/WP1_SGA2_Deliverable_1_1_1.0docx.pdf