Interpreting Open Data

Post on 07-Jul-2015





Talk from vienna.rb


1. Interpreting Open Data. Tomáš Kramár, @tkramar

2. Open Data
   "Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control." Public money was used to fund the work and so it should be universally available. -- Wikipedia

3. Web is full of Open Data

4. ... it's just not the easiest kind of data to work with.

5. Open Data in Slovakia
   • register of companies
   • register of freelancers
   • information from Statistical office
   • procurements
   • governmental contracts
   • tax debts
   • financial reports
   • ...

6. Ideas?

7. Not so fast!

8. Barrier #1: Accessibility

9. You will rarely find structured data

10. Parsing HTML is your best bet

11. Perils of Web scraping
   • Parsing HTML
      • Plenty of edge cases
      • Malformed HTML
   • Crawling whole site
      • ideal case: list of all records with links
      • usually: deep Web, hidden behind a search form

12. Sometimes you can try different IDs

13. Sometimes you need to break a CAPTCHA

14. And sometimes you need to be more creative

15. Barrier #2: Data quality
   • Noisy data
   • Deliberately missing data
   • Linking data within and between datasets

16. Notable OpenData projects

17. - Social network of Slovak companies
   • companies register, public procurements, debts, internet domains and other public data.
   • 300K companies and 500K+ people
   • current and historic data about companies in aggregated and usable form
   • visualizes connections of people and companies as network (graphs) for deeper insights.

18. - analyzing contracts
   • 300K+ scanned documents (over 1TB of raw data)
   • various formats
   • multiple sources
   • fulltext and advanced faceted search
   • automatic analysis of contract problematicity
   • in-browser document viewer, visual annotations of any part of the contract
   • embedded by largest Slovak online newspaper (

19. - API for structured public data
   • Scraping, cleaning and deduplication of many public and private data sources scattered through the web.
   • Identifying and linking subjects from different datasets.
   • REST-based API for searching 1.2M+ subjects in realtime.

20. - transparent court decisions
   • data from Ministry of Justice
   • completely unusable site, search takes minutes to complete
   • in a usable form
   • fulltext search, statistics, visualisations

21. - public procurements
   • OLAP tool built on top of Slovak procurements
   • Slice the data

22. Pattern?
   Interpret meaningless data.
   Link it.
   Find the (hidden) connections.

23. Questions?
   How open is data in Austria?
   What's your idea?
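
The slides name two recurring steps without showing them: parsing HTML that may be malformed (slides 10 and 11) and linking records between datasets despite noisy data (slide 15). The following is a minimal Python sketch of both, not the speaker's implementation. Everything in it is an assumption made for illustration: the sample HTML, the company names, the `tax_debts` records, and the helper names `TableRowParser`, `normalize_name` and `link_by_name` are all hypothetical, and the name normalization is deliberately crude.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect <td> texts grouped by <tr>, tolerating missing close tags."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # a new <tr> implicitly ends the previous row
            self._row = []
        elif tag == "td":
            self._close_cell()  # a new <td> implicitly ends the previous cell
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()


# Assumed legal-form suffixes ("s.r.o.", "a.s."), matched after punctuation removal.
LEGAL_FORMS = re.compile(r"\b(s\s*r\s*o|a\s*s)\b")


def normalize_name(name):
    """Strip diacritics, case, punctuation and legal-form suffixes."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = re.sub(r"[^a-z0-9 ]", " ", ascii_name.lower())
    cleaned = LEGAL_FORMS.sub(" ", cleaned)
    return " ".join(cleaned.split())


def link_by_name(scraped_rows, other_records):
    """Join scraped (name, id) rows to records from another dataset by normalized name."""
    index = {normalize_name(record["name"]): record for record in other_records}
    links = []
    for row in scraped_rows:
        if len(row) != 2:
            continue  # skip rows that do not look like (name, id)
        name, company_id = row
        match = index.get(normalize_name(name))
        if match is not None:
            links.append((company_id, name, match))
    return links


# Hypothetical scraped page: the first row never closes its <td> or <tr>.
html = """
<table>
<tr><td>Ukážka, s.r.o.<td>12345678
<tr><td>Príklad a.s.</td><td>87654321</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(html)
parser.close()
rows = parser.rows

# Hypothetical second dataset that spells the same company differently.
tax_debts = [
    {"name": "PRIKLAD, A.S.", "debt_eur": 1500},
    {"name": "Iná firma s.r.o.", "debt_eur": 20},
]
links = link_by_name(rows, tax_debts)
print(rows)
print(links)
```

Matching on a normalized name is the simplest possible linking key. It will merge distinct companies that share a name and miss companies whose names differ by more than spelling, so a real pipeline would prefer an official identifier wherever the source exposes one.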