ntc16 - open data and open source data science
TRANSCRIPT
3 Affordable Solutions
Open Data and Open Source Data Science for You
March, 2016
© TechSoup Global | All rights reserved
Introduction
Paula Alves @LadyData
Steph Nagoski @InformationChef
Session ID 104 #16NTCopendata
Materials & Collaboration Notes: http://po.st/opendata-16NTC
Evaluation Link: http://po.st/fUt2gY
** WARNING: This presentation exposes information that you may find disturbing.
Outline
Data Wrangling, Merging
Small Data Problems
Open Data Examples
Online Abuse management - Social and Technical, together
Abusive Community Analysis - Example using Reddit data
Bot Detection & Usage
Data Wrangling / Data Merging Tools
Cleaning and merging multiple data sources: databases, CSV, txt files, JSON, XML, web services & Open Data files
Trifacta www.trifacta.com/trifacta-wrangler/
OpenRefine - previously Google Refine http://openrefine.org/
Microsoft offerings you might already have: SSIS & Azure Data Factory
Other options include Crowdflower for data cleansing & tagging: http://crowdflower.com
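To make the "cleaning and merging" step concrete, here is a minimal sketch in plain Python rather than in any of the tools above. The sample data, the `org_id` key, and the function names are all hypothetical, chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical sample inputs; real data would come from files or web services.
CSV_TEXT = """org_id,name,city
101, Food Bank East ,Oakland
102,Youth Arts Lab,san francisco
"""
JSON_TEXT = '[{"org_id": "101", "donors": 250}, {"org_id": "103", "donors": 40}]'


def clean(value):
    """Collapse stray whitespace and normalize capitalization."""
    return " ".join(value.split()).title()


def merge_sources(csv_text, json_text, key="org_id"):
    """Left-join JSON records onto cleaned CSV rows using a shared key."""
    extra = {str(rec[key]): rec for rec in json.loads(json_text)}
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {k: clean(v) for k, v in row.items()}
        match = extra.get(record[key], {})
        # Copy over every extra field except the join key itself.
        record.update({k: v for k, v in match.items() if k != key})
        merged.append(record)
    return merged


if __name__ == "__main__":
    for record in merge_sources(CSV_TEXT, JSON_TEXT):
        print(record)
```

Rows with no match in the second source are kept without the extra fields, which is usually the safer default when the sources disagree about coverage.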
Data Wrangling - Trifacta
Data Wrangling - OpenRefine
Clean, merge, and transform data – for JavaScript developers
Crowdflower
A tool to enrich your data through technical and crowdsourced tagging, flagging, and manual review.
Big Data? We all hope we grow that big. For now…
Small Data Problem Examples
San Francisco Health Improvement Partnership - Alcohol Policy Partnership Working Group, w/Trifacta: https://jrnew.shinyapps.io/sfhip-app/
Question: Is neighborhood crime correlated with alcohol sales?
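A question like this one usually starts with a simple correlation check. The sketch below computes a Pearson correlation coefficient by hand. The per-neighborhood numbers are made up for illustration and are not taken from the SFHIP project, and the function name is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up values, one entry per hypothetical neighborhood.
    alcohol_outlets = [2, 5, 1, 8, 4]
    crime_incidents = [30, 55, 20, 90, 45]
    print(round(pearson(alcohol_outlets, crime_incidents), 3))
```

A value near 1 or -1 indicates a strong linear relationship, but it does not show that one factor causes the other.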
Small Data Problem Examples
Bosnia and Herzegovina electoral data w/Google Refine https://www.youtube.com/watch?v=BcxgAOCFppY
Southern Poverty Law Center hate group list https://www.splcenter.org/hate-map
Conversion therapy source list http://www.truthwinsout.org/ex-gay-consumer-fraud-division/
Govt Data sources - 18F - College Information https://collegescorecard.ed.gov/search/?major=computer&sort=advantage:desc
Outline
Data Wrangling, Merging
Small Data Problems
Open Data Examples
Online Abuse management - Social and Technical, together
Abusive Community Analysis - Example using Reddit data
Bot Detection & Usage
Reusable Open Data Analysis
DataKind - http://www.datakind.org/blog/open-data-in-action-our-top-25
Reusable Open Data Analysis
CivicTech – Trends in Civic Tech Investment tool: http://knightfoundation.org/features/civictech/
Reusable Open Data Analysis
Data For Good: http://datalook.io/non-techies/
Library of reusable projects, with a focus on non-tech users!
Open Data Formats -> Open Data Services
18F - GSA branch committed to open development & open data https://18f.gsa.gov/
Open Data Maker: convert CSV files to an extensible open API w/analytics https://github.com/18F/open-data-maker
First large example of use of OpenDataMaker API: https://collegescorecard.ed.gov/
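The idea of turning a CSV file into a queryable service can be sketched in a few lines. This is a conceptual illustration only, not Open Data Maker's implementation or its API. The sample rows, field names, and function names are hypothetical.

```python
import csv
import io

# Hypothetical rows standing in for a published open data CSV file.
CSV_TEXT = """name,state,size
Example College,CA,1200
Sample University,NY,15000
Demo Institute,CA,800
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def query(rows, **filters):
    """Return rows matching every filter, much like ?state=CA on a web API."""
    return [
        row for row in rows
        if all(row.get(field) == str(value) for field, value in filters.items())
    ]


if __name__ == "__main__":
    rows = load_rows(CSV_TEXT)
    for row in query(rows, state="CA"):
        print(row["name"])
```

A real service would add a web layer, paging, and documentation on top of this kind of filtering, which is the work a CSV-to-API tool takes off your hands.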
Free Speech, and Groups that may disagree w/you
#BlackLivesMatter
Feminist Frequency - Media Criticism from Feminist perspective
Jewish and Islamic communities
Disability Organizations
Reproductive Health and Women’s rights
Any nonprofit that advocates for oppressed minorities
Handling Online Abuse 1
Crowdsourcing support/handling: Online Abuse Prevention Initiative (OAPI) http://onlineabuseprevention.org/
Projects: https://github.com/oapi
Hollaback’s new Heartmob https://iheartmob.org/
Shared Block Lists - https://blocktogether.org/
Hiding blocked users from Twitter Search http://blog.randi.io/2016/01/13/hiding-blocked-users-from-twitter-search/
GoodGame AutoBlocker https://github.com/freebsdgirl/ggautoblocker
Outline
Data Wrangling, Merging
Small Data Problems
Open Data Examples
Online Abuse management - Social and Technical, together
Abusive Community Analysis - Example using Reddit data
Bot Detection & Usage
Reddit Common Terms in Offensive Thread
http://reddit.com/r/WhiteRights
Top 25 Most Frequent Words
Sample from Top 50 Bigrams in Reddit dataset
Word1    Word2     Rank
bin      laden     5
ann      coulter   9
jim      crow      14
hip      hop       18
pearl    harbor    22
nelson   mandela   27
martin   luther    39
charlie  hebdo     40
bernie   sanders   48
anglo    saxon     50
Code Examples of Reddit Analysis
Placeholder
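The slide itself gives no code, so the following is only a minimal sketch of how word and bigram frequency lists like those on the preceding slides could be produced. It is not the presenters' actual analysis. The stopword list, function names, and sample comments are assumptions, and loading comments from Reddit is left out.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real analyses use a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(comments, n=25):
    """Count non-stopword tokens across all comments, most frequent first."""
    counts = Counter()
    for comment in comments:
        counts.update(w for w in tokenize(comment) if w not in STOPWORDS)
    return counts.most_common(n)


def top_bigrams(comments, n=50):
    """Count adjacent word pairs within each comment, most frequent first."""
    counts = Counter()
    for comment in comments:
        tokens = tokenize(comment)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


if __name__ == "__main__":
    # Neutral placeholder comments, not taken from the Reddit dataset.
    sample = ["Open data helps open government", "Open data is useful"]
    print(top_words(sample, n=3))
    print(top_bigrams(sample, n=3))
```

Bigrams are counted within a single comment only, so the last word of one comment is never paired with the first word of the next.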
Handling Online Abuse: Bots
Bot Detection: http://www.erinshellman.com/bot-or-not/
Handling Online Abuse: Bots
Productized simple analysis of Twitter bots: https://www.twitteraudit.com/
Takeaways
Many tools for merging, cleaning & preparing your data for analysis are now accessible to end-users, many of them open source or free for nonprofits.
Accessing Open Data through API-based applications is more efficient: the data is centrally updated and fresher, performance is better, and the experience is end-user focused.
Lots of tools are available to help monitor and manage Social Media.
Advanced Data Science tools to detect problems are starting to be used in more end-user friendly ways.
What do YOU think?
Collaborative Q&A Session