Job Vacancies Experiment, Boro Nikić, Satellite Workshop on Big Data, NTTS 2015
2
Job Vacancies experiment (1)
- Idea for the experiment: Rome Workshop (May 2014)
- Started with identifying websites which advertise jobs and searching for available APIs for those websites
- UNECE Task Team consisted of representatives from Austria, Hungary, Italy, Netherlands, Sweden and Slovenia
3
Job Vacancies experiment (2)
Goals:
- Overview of the methodologies of calculation of JV statistics at NSIs
- Identification of possible web scraping tools
- Determination of BD methodology of calculation of JV statistics
- Testing the BD quality indicators proposed by the UNECE Quality Task Team
Overview of the methodologies of calculation of JV statistics at NSIs
EU regulation prescribes the publication of quarterly statistics on JV data:
- Totals of advertised JV at national level
- Totals on domains defined by size of units
- Totals on domains defined by NACE activity groups
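Assuming each advertised vacancy record carries the size class and NACE activity group of the advertising unit, the three kinds of totals listed above reduce to simple aggregation. The following is only a sketch: the records, the size-class labels, the NACE section letters and the field names are hypothetical and do not come from the experiment.

```python
from collections import Counter

# Hypothetical vacancy records (illustrative values only, not real data).
ADS = [
    {"nace_section": "C", "size_class": "50-249", "vacancies": 3},
    {"nace_section": "C", "size_class": "10-49", "vacancies": 1},
    {"nace_section": "M", "size_class": "1-9", "vacancies": 2},
    {"nace_section": "L", "size_class": "1-9", "vacancies": 1},
]


def jv_totals(ads):
    """Return (national total, totals by size class, totals by NACE section)."""
    by_size = Counter()
    by_nace = Counter()
    for ad in ads:
        by_size[ad["size_class"]] += ad["vacancies"]
        by_nace[ad["nace_section"]] += ad["vacancies"]
    national = sum(ad["vacancies"] for ad in ads)
    return national, dict(by_size), dict(by_nace)


national, by_size, by_nace = jv_totals(ADS)
print(national, by_size, by_nace)
```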
Documents on Wiki: http://www1.unece.org/stat/platform/pages/viewpageattachments.action?pageId=100303739&metadataLink=true
4
Identification of web scraping tools
Tools:
http://www.irobotsoft.com/
https://www.kimonolabs.com
5
Aim of the Irobot tool
• IRobotSoft for Visual Web Scraping
• IRobotSoft is a visual Web robot software for Web scraping and Web automation. With IRobotSoft, you can scrape tons of data from the deep Web with a single click! You don't need to have computer skills to do this! IRobotSoft is for Everyone! Follow our discussions and become a Web geek!
• for novice data collectors
• for Web testers
• for data experts
Link: http://www.irobotsoft.com/
6
Basic Steps
1. Define the name of the Irobot
2. Define the name of the Task
3. Copy and paste the link of desired website into the URL
4. Start Recording Actions
5. Give names to the „scraped“ variables
6. Save the variables
7. Use the option „Repeat Property“
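The steps above are carried out in the IRobotSoft graphical interface. As a rough illustration of what such a recorded robot does, the Python sketch below extracts named variables from every repeated job listing on a page, which is loosely what naming the variables and applying "Repeat Property" achieve. The sample HTML, the class names and the field names are invented for this example; they are not taken from any real job portal or from IRobotSoft.

```python
from html.parser import HTMLParser

# Invented sample page standing in for a job portal (assumption).
SAMPLE_PAGE = """
<ul>
  <li class="job"><span class="title">Welder</span>
      <span class="employer">Example d.o.o.</span>
      <span class="town">Kranj</span></li>
  <li class="job"><span class="title">Accountant</span>
      <span class="employer">Sample d.o.o.</span>
      <span class="town">Ljubljana</span></li>
</ul>
"""

# Names given to the "scraped" variables (hypothetical).
FIELDS = {"title", "employer", "town"}


class JobAdParser(HTMLParser):
    """Collect one dict of named variables per repeated <li class="job">."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record currently being filled
        self._field = None    # variable whose text is being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "job":
            self._current = {}
        elif tag == "span" and css_class in FIELDS and self._current is not None:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._current[self._field] = self._current[self._field].strip()
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def scrape(html):
    parser = JobAdParser()
    parser.feed(html)
    parser.close()
    return parser.records


for record in scrape(SAMPLE_PAGE):
    print(record)
```

In practice the page would be downloaded from the portal URL rather than read from a string, and the selectors would have to follow that portal's actual markup.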
7
Determination of BD methodology of calculation of JV statistics (1)
- Cleaning of data
- Methodology for the replacement of existing statistics (at the level of the NSI)
- Methodology for the calculation of new statistics (at the level of the NSI)
- Methodology for the calculation of new statistics (international level)
8
Determination of BD methodology of calculation of JV statistics (2)
All the documentation about the experiment can be found at:
http://www1.unece.org/stat/platform/pages/viewpageattachments.action?pageId=100303739&metadataLink=true
Document:
Information which could be extracted from the Slovenian Websites and the proposed statistics for the job vacancies.doc
10
Determination of BD methodology of calculation of JV statistics (3)
One of the steps in the statistical processing of JV data is assigning the ID of the Legal Unit (LeU) from the Business Register.
Linking the ID to the „scraped“ unit enables us to get information about the activity and size of the LeU (according to the number of employees).
11
„Scraped“ data
Name_LeU | Tel_numb | Mob_numb | Town | Street | Street_numb | Postal_code
AR PLANE d.o.o. | 03-809-4100 | 040 383840 | Bistrica ob Sotli | | |
Savatech, d.o.o. | | | Kranj | | |
ARENDA d.o.o. | | | Ljubljana | | |
Knauf Insulation d.o.o | 04 5114 219 | | Škofja Loka | Trata | 32 | 4220
AVIAT d.o.o. | | | Trzin | | |
VIP Virant d.o.o | | | Komenda | | |
12
„Matched“ data
iskani | Name_LeU | Town_BR | id | complete_name | nace_code | adressVID | dist1
1 | AR PLANE d.o.o. | BISTRICAOBSOTLI | | | | |
1 | AR PLANE d.o.o. | ZAGAJ | 3290476000 | AR PLANE, korporacijsko upravljanje in pravna pisarna, d.o.o. | 70.220 | 1474238 | 0
1 | APLANE d.o.o. | SOLKAN | 3307611000 | Letalska družba APLANE d.o.o. | 30.300 | 1034269 | 8
1 | ARTPLANET | SLOVENSKABISTRICA | 3498417000 | ARTPLANET, zavod za razvoj umetnosti, kulture in kakovosti življenja, Slovenska Bistrica | 72.200 | | 15
1 | ARTPLAN, d.o.o. | KRANJ | 6188265000 | ARTPLAN, proizvodnja in trgovina d.o.o. | 31.010 | 2429891 | 21
1 | ARPLAN, ANŽE REZAR s.p. | PROSENIŠKO | 3761843000 | ARPLAN, projektiranje, inženiring, svetovanje in storitve v gradbeništvu, ANŽE REZAR s.p. | 71.129 | 2315474 | 25
1 | AL PLANET, Dejan Janež s.p. | SEŽANA | 3356892000 | AL PLANET, Stavbno pohištvo iz aluminija, Dejan Janež s.p. | 25.120 | 930791 | 26
1 | AR-AL NET d.o.o. | ČENTIBA | 6072526000 | AR-AL NET, trgovina in posredništvo d.o.o. | 47.910 | | 28
1 | ARTLINE d.o.o. | MENGEŠ | 5333644000 | ARTLINE, studio za oblikovanje, d.o.o. | 73.110 | 1417055 | 28
2 | Savatech, d.o.o. | KRANJ | | | | |
2 | SAVATECH d.o.o. | KRANJ | 1661205000 | SAVATECH družba za proizvodnjo in trženje gumenotehničnih proizvodov in pnevmatike, d.o.o. | 22.190 | 2404555 | 0
2 | SAITECH d.o.o. | CELJE | 5311292000 | SAITECH podjetje za trgovino in storitve d.o.o. | 43.290 | 1428363 | 21
2 | SAVA TMC, d.o.o. | LJUBLJANA | 1893718000 | SAVA TURIZEM - TMC, podjetje za upravljanje dejavnosti turizem, d.o.o. | 70.100 | 2585325 | 21
2 | ASTECH d.o.o. | LOGATEC | 1661078000 | ASTECH d.o.o., Inženiring in servisiranje strojnih instalacij | 43.220 | 1617965 | 25
2 | AVTECH D.O.O. | VIDRGA | 3282058000 | AVTECH, SVETOVANJE, ZASTOPSTVO, PROIZVODNJA, D.O.O. | 70.220 | 284552 | 25
2 | SANOTECHNIK d.o.o. | MARIBOR | 5850908000 | SANOTECHNIK trgovsko podjetje d.o.o. | 46.730 | 1490149 | 27
3 | ARENDA d.o.o. | LJUBLJANA | | | | |
3 | ARENDA d.o.o. | LJUBLJANA | 1629417000 | ARENDA, nepremičninska družba, d.o.o. | 68.200 | 1242548 | 0
3 | OPTIKA ARENA d.o.o. | MARIBOR | 1873512000 | OPTIKA ARENA, družba za trgovino in storitve d.o.o. | 47.781 | 499981 | 10
3 | PEKARNA ARENA d.o.o. | LJUBLJANA | 3918076000 | PEKARNA ARENA, pekarstvo in trgovina, d.o.o. | 10.710 | 2313488 | 10
3 | ARENA SERVIS d.o.o. | OSLUŠEVCI | 6318797000 | ARENA SERVIS, izposojanje šotorov, šankov in gostinske opreme ter gostinske storitve, d.o.o. | 77.390 | | 10
3 | ADENDA d.o.o. | MIREN | 5743729000 | ADENDA d.o.o. grafične storitve in oblikovanje | 18.130 | 1365580 | 16
3 | AGENDA d.o.o. | MARIBOR | 5656222000 | AGENDA komunikacijski in informacijski inženiring d.o.o. | 62.020 | 163187 | 16
3 | RANDA d.o.o. | LJUBLJANA | 6011624000 | RANDA gradbeništvo, storitve in prevozi d.o.o. | 41.200 | 1890496 | 20
3 | AGENDA 2003 d.o.o. | LJUBLJANA | 1824775000 | AGENDA 2003 premoženjsko svetovanje in računovodske storitve d.o.o. | 69.200 | 63849 | 24
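The matched table ranks Business Register candidates for each scraped name by a distance score (dist1), with 0 indicating an exact match. The slides do not say which distance measure was used, so the sketch below substitutes plain Levenshtein edit distance on normalised names as a stand-in; its scores will not reproduce the dist1 values shown. The IDs, names and NACE codes are copied from three rows of the table, while the function names and the normalisation rule are assumptions.

```python
def normalize(name):
    """Uppercase the name and keep only letters and digits (assumed rule)."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# (id, Name_LeU, Town_BR, nace_code) taken from the matched table above.
REGISTER = [
    ("1661205000", "SAVATECH d.o.o.", "KRANJ", "22.190"),
    ("5311292000", "SAITECH d.o.o.", "CELJE", "43.290"),
    ("1661078000", "ASTECH d.o.o.", "LOGATEC", "43.220"),
]


def best_match(scraped_name, register):
    """Return (distance, id, register name, nace_code) of the closest unit."""
    target = normalize(scraped_name)
    scored = [
        (levenshtein(target, normalize(name)), unit_id, name, nace)
        for unit_id, name, _town, nace in register
    ]
    return min(scored)


print(best_match("Savatech, d.o.o.", REGISTER))
```

A production procedure would presumably also compare the town or address and apply a distance threshold before accepting a match, since the closest candidate is not always the correct unit.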
13
Testing the BD quality indicators proposed by the Quality Team
The quality framework consists of three quality hyperdimensions: input, throughput and output.
http://www1.unece.org/stat/platform/pages/viewpageattachments.action?pageId=101158888&metadataLink=true
14
Conclusions (1)
BD could be used as a source:
• for new types of statistics
• for existing statistics
• for validation of existing statistics
In case of scraping of JV data:
• Change of mode of collection
• Validation of data collected in the traditional way (administrative sources, questionnaire)
• Flash statistics
15
Conclusions (2)
Before the JV BD source is employed in regular statistical production, the scraping tools, the data manipulation procedures and the resulting statistics must be carefully tested over a period of at least one year in order to ensure the stability of sources and statistics.
More about the experiment can be found at
http://www1.unece.org/stat/platform/display/BDP/Sandbox+Task+Team