our data ourselves, pydata 2015

Our Data, Ourselves -The Data Democracy Deficit Department of Digital Humanities Giles Greenway Tobias Blanke Jenifer Pybus Mark Cote

Upload: kingsbsd

Post on 10-Aug-2015

TRANSCRIPT

Page 1: Our Data Ourselves, Pydata 2015

Our Data, Ourselves

-The Data Democracy Deficit

Department of Digital Humanities

Giles Greenway

Tobias Blanke

Jenifer Pybus

Mark Cote

Page 2: Our Data Ourselves, Pydata 2015

A “mobile-data commons”?

• Most of us leave behind a data-trail created by our mobile devices.

• Usually, it returns to us as targeted adverts (See Private Eye's “Malgorithms”...)

• How aware of this are mobile device users?

• How else might this data be used?

• Can we build a “mobile data commons”?

Page 3: Our Data Ourselves, Pydata 2015

Can we build a “mobile-data commons”?

• Can we capture the data our devices leak with an app?

• No.

• This would require rooting the 'phones. An Android phone is a Linux system, where the end user typically doesn't have admin rights.

• If the app reaches a mass audience, we cannot expect users to root their phones. Some rooting software contains malware, and we cannot ensure that users root their devices safely.

• For a technical description of the Android permissions system and Android malware, watch: http://tinyurl.com/weidmandroid

Page 4: Our Data Ourselves, Pydata 2015

What can we do then? -MobileMiner

Log:

• When apps access the internet.

• Cell-tower IDs.

• Wireless networks.

• When apps send notifications.

Full description of the app: http://tinyurl.com/miningmobileyouth

Phones with the app pre-loaded were issued to 20 young developers from Young Rewired State.

Page 5: Our Data Ourselves, Pydata 2015

(Young Coders: Attitudes Vary!)

• ~20 Young coders were issued with Android smartphones with our MobileMiner app installed.

• Invited to participate in hack-days and focus-groups.

“If you have nothing to hide you have nothing to fear...”

“Privacy is attached to other people... so if someone you agree to connect with is open then you can be accessed through them cause it's kind of herd thing, you've all got to do it otherwise, one person is in trouble.”

“People don't realise how large their digital footprint’s actually are...”

“Being of kind of this generation and being tech savvy we have some control because we know how to have control...”

Page 6: Our Data Ourselves, Pydata 2015

What can we do then? Network usage

• The Android API provides network traffic data on a per-app basis.

• Sample this every half second.

• Each app corresponds to a user in the underlying Linux system and has its own Dalvik virtual machine.

• The API can identify the PID of each running app.

• Poll /proc/<pid>/net/tcp every half second.

• Obtain the port and IP address of each network socket.

sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
12: 4F01A8C0:E1D0 B422C2AD:0050 01 00000000:00000000 02:000003A3 00000000 1000 0 154153 2 0000000000000000 23 4 28 10 -1
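The socket-table row above can be decoded along the following lines. This is a minimal sketch rather than MobileMiner's own code: it assumes an IPv4 table read on a little-endian device, where each address is printed as eight hex digits in reversed byte order followed by a hex port. The function names are illustrative.

```python
import socket
import struct

def decode_endpoint(field):
    """Decode an 'ADDRESS:PORT' hex field from /proc/<pid>/net/tcp (IPv4)."""
    hex_ip, hex_port = field.split(":")
    # On little-endian hardware the 32-bit address is printed byte-reversed,
    # so re-pack it little-endian to recover network byte order.
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)

def parse_tcp_line(line):
    """Return (local, remote, uid) for one row of the socket table."""
    fields = line.split()
    # fields[0] is the slot ("12:"), fields[1] and fields[2] are the local
    # and remote endpoints, and fields[7] is the owning Linux uid.
    return decode_endpoint(fields[1]), decode_endpoint(fields[2]), int(fields[7])

row = ("12: 4F01A8C0:E1D0 B422C2AD:0050 01 00000000:00000000 "
       "02:000003A3 00000000 1000 0 154153 2 0000000000000000 23 4 28 10 -1")
local, remote, uid = parse_tcp_line(row)
print(local, remote, uid)
```

For the row on the slide this yields a local socket at 192.168.1.79:57808 connected to 173.194.34.180 on port 80.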

Page 7: Our Data Ourselves, Pydata 2015

What can we do then? GSM cells

• Full GPS is too invasive, and consumes power.

• Avoid use of Google location API.

• OpenCellId provides locations of (many) cell towers.

• http://opencellid.org
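Turning logged cell IDs into approximate positions might look like the sketch below. It assumes the column names of the OpenCellId CSV export (radio, mcc, net, area, cell, ..., lon, lat); the two rows are made up for illustration and `load_towers` / `locate` are hypothetical helper names.

```python
import csv
import io

# Made-up rows for illustration only; the header mirrors the column names
# assumed for the OpenCellId CSV export.
SAMPLE = """radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
GSM,234,10,1234,5678,0,-0.1160,51.5115,1000,5,1,0,0,0
GSM,234,10,1234,5679,0,-0.1200,51.5200,1500,3,1,0,0,0
"""

def load_towers(handle):
    """Index tower positions by (mcc, net, area, cell)."""
    towers = {}
    for row in csv.DictReader(handle):
        key = tuple(int(row[name]) for name in ("mcc", "net", "area", "cell"))
        towers[key] = (float(row["lat"]), float(row["lon"]))
    return towers

def locate(towers, mcc, net, area, cell):
    """Return (lat, lon) for a logged cell, or None if the tower is unknown."""
    return towers.get((mcc, net, area, cell))

towers = load_towers(io.StringIO(SAMPLE))
print(locate(towers, 234, 10, 1234, 5678))
print(locate(towers, 234, 10, 1234, 9999))  # a cell missing from the table
```

Returning None for unknown cells keeps the gaps in coverage explicit instead of silently dropping them.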

Page 8: Our Data Ourselves, Pydata 2015

Getting hold of the data: CKAN

Page 9: Our Data Ourselves, Pydata 2015

Getting hold of the data: CKAN

● The “Drupal of data”...

● Needs Postgres and Apache Solr.

● Based on Pylons.

● Datastore plugin provides an API for uploading data.

● Runs in a virtualenv.

● “Out of the box” solution.

● Provides basic search, filtering, plotting and maps.

Page 10: Our Data Ourselves, Pydata 2015

Getting hold of the data: CKAN

Page 11: Our Data Ourselves, Pydata 2015

CKAN: Writing plugins:

import ckan.plugins as plugins

class MobileMinerPlugin(plugins.SingletonPlugin):
    # Register custom auth functions and API actions with CKAN.
    plugins.implements(plugins.IAuthFunctions)
    plugins.implements(plugins.IActions)

    def get_auth_functions(self):
        # The miner_auth_* functions are not shown on the slide.
        return {'miner_update': miner_auth_update,
                'miner_register': miner_auth_register}

    def get_actions(self):
        # Map action names, as exposed by the API, to their implementations.
        return {'miner_update': miner_datastore_update,
                'miner_register': miner_datastore_register}

Page 12: Our Data Ourselves, Pydata 2015

CKAN: Writing plugins:

import datetime
import random

import ckanapi
import ckan.plugins as plugins

# ckan_url, api_key, resources and user_exists() are assumed to be
# defined elsewhere in the extension (not shown on the slide).

@plugins.toolkit.side_effect_free
def miner_datastore_register(context, data):
    # Reject requests that lack either required field.
    missing = [field for field in ['androidid', 'version']
               if not data.get(field, False)]
    if missing:
        raise plugins.toolkit.ValidationError(
            {'message': 'Not specified: ' + ', '.join(missing)})
    # Draw random 32-bit IDs until an unused one is found.
    newUser = False
    while not newUser:
        uid = abs(random.getrandbits(32))
        newUser = not user_exists(uid)
    local = ckanapi.RemoteCKAN(ckan_url, apikey=api_key)
    result = local.action.datastore_upsert(
        resource_id=resources['user'],
        records=[{'uid': uid, 'androidid': data['androidid'],
                  'version': data['version'],
                  'time': datetime.datetime.now().isoformat()}],
        method='insert')
    return uid
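From the client side, an action registered through IActions is reachable under CKAN's action API at /api/3/action/<name>. The sketch below only builds such a request, using the standard library; the host name, Android ID and version string are placeholders, and `build_register_request` is a hypothetical helper, not part of MobileMiner.

```python
import json
import urllib.request

def build_register_request(base_url, android_id, version, api_key=None):
    """Build (but do not send) a POST request for the miner_register action."""
    body = json.dumps({"androidid": android_id, "version": version}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # CKAN reads the caller's API key from the Authorization header.
        headers["Authorization"] = api_key
    url = base_url.rstrip("/") + "/api/3/action/miner_register"
    return urllib.request.Request(url, data=body, headers=headers)

# Placeholder host and values, for illustration only.
request = build_register_request("http://ckan.example.org", "abc123", "1.0")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

CKAN wraps an action's return value in a JSON envelope, so the new uid would arrive under the "result" key of the response.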

Page 13: Our Data Ourselves, Pydata 2015

CKAN integrates Celery...

• Celery: a distributed task queue.

• www.celeryproject.org

• Compose tasks across multiple machines.

• Monitor tasks with “Flower”.

• “Hooray, we can do proper data-science!”

Page 14: Our Data Ourselves, Pydata 2015

CKAN integrates Celery...

• Celery: a distributed task queue.

• www.celeryproject.org

• Compose tasks across multiple machines.

• Monitor tasks with “Flower”.

• “Hooray, we can do proper data-science!”

• “...unless we default to SQLalchemy as the broker!”

“Using a database as a message queue is not recommended, but can be sufficient for very small installations.” -Celery documentation

# ckan/lib/celery_app.py

default_config = dict(
    BROKER_BACKEND='sqlalchemy',
    BROKER_HOST=sqlalchemy_url,
    CELERY_RESULT_DBURI=sqlalchemy_url,
    CELERY_RESULT_BACKEND='database',
    CELERY_RESULT_SERIALIZER='json',
    CELERY_TASK_SERIALIZER='json',
    CELERY_IMPORTS=[],
)

Page 15: Our Data Ourselves, Pydata 2015

CKAN and Python 3: (We've got 5 years)

• Installation guide specifies Python 2.6 or 2.7.

• There's a road-map, but Python 3 isn't a priority!

• Pyramid supports Python 3.

• A Pylons to Pyramid migration guide was written in 2011.

• How badly will extensions break?

Page 16: Our Data Ourselves, Pydata 2015

The Data: Cell Towers.

• What does the trail of cell towers reveal about users? Can we cluster them?

• Devices connect to towers because of network traffic or cost, not just proximity.

• Not all cells are known to OpenCellId.

• Density varies.

Page 17: Our Data Ourselves, Pydata 2015

Cell Towers: K-Means

• Clusters should be convex.

• Clusters should be compact.

• Space should be of reasonably low dimensionality.

• Euclidean distance should make sense (sklearn enforces this).

Page 18: Our Data Ourselves, Pydata 2015

Cell Towers: K-Means

• Try [lat, lon] as feature vectors.

• Increase K until mean centroid distance is within 90% of the value for the previous K.

• Trails of points from journeys are split across multiple clusters.
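The rule for choosing K can be sketched as follows. This is a pure-Python stand-in for sklearn's KMeans so that the example is self-contained, with deterministic farthest-point seeding. It reads "within 90%" as "at least 0.9 times the previous value", which is one interpretation of the slide, and it treats [lat, lon] as plain Euclidean coordinates.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Seed with the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    mean_distance = sum(
        min(math.dist(p, c) for c in centroids) for p in points) / len(points)
    return centroids, mean_distance

def stopping_k(mean_distances, ratio=0.9):
    """Given mean distances for K = 1, 2, ..., return the first K whose
    value is at least `ratio` times the value for K - 1."""
    for k in range(2, len(mean_distances) + 1):
        if mean_distances[k - 1] >= ratio * mean_distances[k - 2]:
            return k
    return len(mean_distances)

def choose_k(points, max_k=10, ratio=0.9):
    """Run K-means for K = 1..max_k and apply the stopping rule."""
    distances = [kmeans(points, k)[1] for k in range(1, max_k + 1)]
    return stopping_k(distances, ratio)

# Two well-separated toy blobs standing in for [lat, lon] pairs.
blobs = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, mean_distance = kmeans(blobs, 2)
print(sorted(centroids), round(mean_distance, 3))
```

On clean, evenly spaced toy data the mean distance keeps falling quickly as K grows, so `choose_k` is more informative on noisy real trails than on the blobs used here.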

Page 19: Our Data Ourselves, Pydata 2015

Cell Towers: K-Means

• Try [lat, lon, d_lat/dt, d_lon/dt] as feature vectors.

• K is reduced, trails of points coalesce.
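Building the four-dimensional feature vectors could look like this sketch, which assumes each fix is a (time in seconds, lat, lon) tuple and uses a backward finite difference for the rates of change. The name `velocity_features` is illustrative, and in practice the velocity columns would probably need rescaling before clustering, since they are in different units from the positions.

```python
def velocity_features(fixes):
    """fixes: list of (t_seconds, lat, lon) tuples sorted by time.

    Returns [lat, lon, d_lat/dt, d_lon/dt] for every fix after the first.
    """
    features = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(fixes, fixes[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        features.append([lat1, lon1, (lat1 - lat0) / dt, (lon1 - lon0) / dt])
    return features

# A toy trail: one move, then a stationary sample.
trail = [(0, 0.0, 0.0), (2, 1.0, 4.0), (4, 1.0, 4.0)]
print(velocity_features(trail))
```

Points along one journey share a similar velocity, which is why they are more likely to fall into the same cluster than with position alone.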

Page 20: Our Data Ourselves, Pydata 2015

Cell Towers: Spatial-Temporal Clusters?

• Can we localize events in space and time?

• Is day-of-the-week (vertical axis) a useful feature?

• No cluster that spans all weekdays is credible as a daily commute.

Page 21: Our Data Ourselves, Pydata 2015

Cell Towers: Spatial-Temporal Clusters?

• Does adding the hour as a feature help?

• No cluster that spans 9-5 is found.

• Stop abusing K-means with categorical variables!

Page 22: Our Data Ourselves, Pydata 2015

Cell Tower Clusters: Keep it simple.

• On how many distinct days is each cluster visited?

• What is the range of days of occupation?

• Is the cluster occupied more on weekdays or weekends?

• What is the range of times of day when the cluster is occupied?

• occupied at night all days == home

• occupied 9-5 on weekdays == work / school

• (OpenStreetMap correctly identified schools, WIFI is also a clue.)

• student, two visits to multiple cities weeks apart == university open-days and interviews.
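These questions translate into a few summary statistics per cluster. The sketch below assumes each cluster is simply a list of visit timestamps; the night and office hours and the 0.5 / 0.9 thresholds are illustrative guesses, not values from the talk.

```python
from datetime import datetime

def summarise(visits):
    """Answer the slide's questions for one cluster's visit timestamps."""
    days = sorted({v.date() for v in visits})
    hours = [v.hour for v in visits]
    return {
        "distinct_days": len(days),
        "day_range": (days[-1] - days[0]).days + 1,
        "weekday_share": sum(d.weekday() < 5 for d in days) / len(days),
        "hour_range": (min(hours), max(hours)),
        "night_share": sum(h >= 22 or h < 6 for h in hours) / len(hours),
        "office_share": sum(9 <= h < 17 for h in hours) / len(hours),
    }

def label(summary):
    # Thresholds are illustrative guesses, not values from the talk.
    if summary["night_share"] >= 0.5:
        return "home"
    if summary["office_share"] >= 0.5 and summary["weekday_share"] >= 0.9:
        return "work / school"
    return "other"

# 1-7 June 2015 ran Monday to Sunday.
home = [datetime(2015, 6, day, 23) for day in range(1, 8)]
work = [datetime(2015, 6, day, hour) for day in range(1, 6) for hour in (9, 13, 16)]
print(label(summarise(home)), "|", label(summarise(work)))
```

Unlike feeding day and hour into K-means, these statistics treat time as something to count and compare rather than as a coordinate.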

Page 23: Our Data Ourselves, Pydata 2015

Giving back the data.

• Give users a copy of the CKAN instance to play with.

• Access data via IPython notebooks.

• Include multiple services, libraries, etc...

• Produce a virtual machine that is easy to modify, document and distribute.

• Dockerfiles specify images that instantiate containers.

• “boot2docker” for Mac/Windows is just a re-branded VirtualBox.

-Use Docker in VirtualBox to distribute the container.

Page 24: Our Data Ourselves, Pydata 2015

Giving back the data. -It works!

Page 25: Our Data Ourselves, Pydata 2015

Docker: Criticisms...

• “sudo wget http://notdodqy.org/install.sh | /bin/sh”

• Show us the dockerfile!

• “That's not proper sysadmin!” http://iops.io/blog/docker-hype

• “What about OpenStack?”

• For distributing canned systems, none of these apply.

• But, supervisord doesn't quite work in Python3!

Page 26: Our Data Ourselves, Pydata 2015

The Data: App Activity.

• Is network activity a proxy for app usage?

• The more Twitter friends, the more notifications.

[Figure: Twitter Network Degree vs Notifications. Horizontal axis: Number of Notifications. Vertical axis: friends / followers count. Series: Friends, Followers.]

Page 27: Our Data Ourselves, Pydata 2015

The Data: App Activity.

• Is network activity a proxy for app usage?

• Some games make sense...

Page 28: Our Data Ourselves, Pydata 2015

The Data: App Activity.

• Is network activity a proxy for app usage?

• ...others, not so much:

Page 29: Our Data Ourselves, Pydata 2015

The Line! What is it doing?

Page 30: Our Data Ourselves, Pydata 2015

The Line! AndroidManifest.xml

<receiver android:enabled="true" android:name="com.simplecreator.app.RemoteNotificationReceiver">

<intent-filter>

<action android:name="cn.jpush.android.intent.REGISTRATION"/>

<action android:name="cn.jpush.android.intent.UNREGISTRATION"/>

<action android:name="cn.jpush.android.intent.MESSAGE_RECEIVED"/>

<action android:name="cn.jpush.android.intent.NOTIFICATION_RECEIVED"/>

<action android:name="cn.jpush.android.intent.NOTIFICATION_OPENED"/>

<action android:name="cn.jpush.android.intent.ACTION_RICHPUSH_CALLBACK"/>

<category android:name="com.onetouchgame.TheLine"/>

</intent-filter>

</receiver>

<service android:name="com.umeng.update.net.DownloadingService" android:process=":DownloadingService"/>

<activity android:name="com.umeng.update.UpdateDialogActivity" android:theme="@android:style/Theme.Translucent.NoTitleBar"/>

• The app receives intents from the push notification service jpush.cn. Umeng is a mobile analytics service.

• Is that why it had open sockets on port 3000?

apktool d com.onetouchgame.TheLine.apk
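Once apktool has decoded the manifest to plain XML, component and intent names from outside the app's own package can be listed with the standard library. This is a minimal sketch: the enclosing manifest and application elements in the sample are assumed, only part of the excerpt above is reproduced, and `third_party_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes android:name under the full namespace URI.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Cut-down manifest based on the excerpt above; the wrapper elements are assumed.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <application>
    <receiver android:enabled="true"
              android:name="com.simplecreator.app.RemoteNotificationReceiver">
      <intent-filter>
        <action android:name="cn.jpush.android.intent.REGISTRATION"/>
        <action android:name="cn.jpush.android.intent.MESSAGE_RECEIVED"/>
        <category android:name="com.onetouchgame.TheLine"/>
      </intent-filter>
    </receiver>
    <service android:name="com.umeng.update.net.DownloadingService"/>
  </application>
</manifest>"""

def third_party_names(manifest_xml, own_package):
    """List android:name values that do not belong to the app's own package."""
    root = ET.fromstring(manifest_xml)
    names = {
        element.get(ANDROID_NS + "name")
        for element in root.iter()
        if element.get(ANDROID_NS + "name")
    }
    return sorted(name for name in names if not name.startswith(own_package))

for name in third_party_names(MANIFEST, "com.onetouchgame"):
    print(name)
```

Grouping the output by package prefix (cn.jpush, com.umeng, and so on) gives a quick inventory of the third-party services bundled with an app.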

Page 31: Our Data Ourselves, Pydata 2015

The Line! Examining the source-code:

Look for PhoneStateListeners and LocationListeners:

if (paramLocation != null) {
    d1 = paramLocation.getLatitude();
    d2 = paramLocation.getLongitude();
    boolean bool1 = d1 < 29.999998211860657D;
    ...

Classes provided by tencent.com (a mobile ad service) reference latitude and longitude.

Classes provided by jpush.cn and umeng.com also reference LocationListeners.

dex2jar.sh com.onetouchgame.TheLine
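Searching for the listener classes can be automated. The sketch below assumes the jar produced by dex2jar has already been decompiled into a directory of .java files with a decompiler of your choice; `find_listeners` is a hypothetical helper and the demonstration files are made up.

```python
import os
import re
import tempfile

PATTERN = re.compile(r"\b(LocationListener|PhoneStateListener)\b")

def find_listeners(source_dir):
    """Map each .java file under source_dir to the listener classes it mentions."""
    hits = {}
    for folder, _, files in os.walk(source_dir):
        for filename in files:
            if not filename.endswith(".java"):
                continue
            path = os.path.join(folder, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                found = set(PATTERN.findall(handle.read()))
            if found:
                hits[os.path.relpath(path, source_dir)] = sorted(found)
    return hits

# Tiny demonstration on a throwaway directory with made-up source files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "Tracker.java"), "w", encoding="utf-8") as handle:
        handle.write("class Tracker implements LocationListener {}")
    with open(os.path.join(tmp, "Game.java"), "w", encoding="utf-8") as handle:
        handle.write("class Game {}")
    result = find_listeners(tmp)
print(result)
```

The package path of each hit shows which vendor's classes are asking for location or phone state.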

Page 32: Our Data Ourselves, Pydata 2015

Docker: The Droid Destruction Kit!

• Can we put Android reversal and traffic capture tools into the hands of beginners?

• Many tools require building from source.

• “docker-ubuntu-vnc-desktop” puts an LXDE desktop in the user's browser.

• “Masterclass” on app reversal held by Darren Martyn (http://insecurety.net/) of Xiphos Research: http://www.xiphosresearch.com

Page 33: Our Data Ourselves, Pydata 2015

Docker: The Droid Destruction Kit!

Page 34: Our Data Ourselves, Pydata 2015

Docker: The Droid Destruction Kit!

Page 35: Our Data Ourselves, Pydata 2015

Download our app: http://kingsbsd.github.io/MobileMiner

Follow us on Twitter: @KingsBSD

Read our blog: http://big-social-data.net/

Slideshare: http://www.slideshare.net/kingsBSD/