Analysing GitHub commits with R Barbara Fusinska @BasiaFusinska barbarafusinska.com

Upload: barbara-fusinska

Post on 27-Jul-2015


TRANSCRIPT

Page 1: Analysing GitHub commits with R

Analysing GitHub commits with R

Barbara Fusinska

@BasiaFusinska

barbarafusinska.com

Page 2: Analysing GitHub commits with R

About me

Programmer

Manager

Sweet tooth

Page 3: Analysing GitHub commits with R

Goals

Analysing GitHub data

Methods of data exploration & analysis

Learn R

Page 4: Analysing GitHub commits with R

Agenda

• Data analysis

• R basics

• Data exploration & processing

• Big Data

• Outside GitHub:

  – Data visualisation

  – Libraries & tools

Page 5: Analysing GitHub commits with R

Data analysis process

Raw Data Processed Data

Data Analysis & Visualization

Exploratory Data Analysis

Page 6: Analysing GitHub commits with R

What do we want to find out?

Page 7: Analysing GitHub commits with R

GitHut Visualisation

http://githut.info/

Page 8: Analysing GitHub commits with R

GitHub Archive

https://www.githubarchive.org/

Page 9: Analysing GitHub commits with R

Google BigQuery

https://cloud.google.com/bigquery/what-is-bigquery

Page 10: Analysing GitHub commits with R

GitHub API

https://developer.github.com/v3/

Page 11: Analysing GitHub commits with R

DEMO: R basics

• RStudio

• Data types

• Data filtering
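A minimal sketch of the data types and filtering topics from this demo, using base R only. The data frame, its column names and its values are made up for illustration and are not taken from the talk.

```r
# Hypothetical sample data; the repository names and counts are made up.
commits <- data.frame(
  repo     = c("alpha/app", "beta/lib", "gamma/site", "delta/tool"),
  language = c("R", "Python", NA, "R"),
  pushes   = c(12L, 3L, 7L, 0L),
  stringsAsFactors = FALSE
)

# Data types: each column is an atomic vector with its own class.
column_classes <- sapply(commits, class)

# Logical indexing: keep rows with at least 5 pushes.
busy <- commits[commits$pushes >= 5, ]

# Combined conditions; which() drops rows where the test evaluates to NA.
busy_r <- commits[which(commits$language == "R" & commits$pushes >= 5), ]

# subset() is a more readable alternative for interactive work.
r_repos <- subset(commits, language == "R", select = c(repo, pushes))
```

Wrapping the condition in which() matters here because comparing NA with "R" yields NA, and indexing a data frame with NA would otherwise produce a row of missing values.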

Page 12: Analysing GitHub commits with R

DEMO: Active repositories

• What is an active repository?

• Where can we get data from?

• How to get useful information?

• How to handle missing data?

Page 13: Analysing GitHub commits with R

GitHub Archive CreateEvent

Page 14: Analysing GitHub commits with R

GitHub Archive PushEvent

Page 15: Analysing GitHub commits with R

GitHub Archive PullRequestEvent

Page 16: Analysing GitHub commits with R

DEMO: Processing various data sources

• Active repositories – from Create, Push and PullRequest events

• Missing language information
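One possible way to combine the three event types and deal with missing language information is sketched below in base R. The sample records, the column names and the "at least two events" definition of an active repository are assumptions made for this example, not the definition used in the demo.

```r
# Hypothetical event extracts; all names and values are illustrative.
creates <- data.frame(repo = c("a/x", "b/y"), stringsAsFactors = FALSE)
pushes  <- data.frame(repo = c("a/x", "a/x", "c/z"), stringsAsFactors = FALSE)
pulls   <- data.frame(repo = c("b/y", "c/z"),
                      language = c("Python", NA),
                      stringsAsFactors = FALSE)

# Stack the three sources with common columns.
events <- rbind(
  data.frame(repo = creates$repo, type = "CreateEvent", stringsAsFactors = FALSE),
  data.frame(repo = pushes$repo, type = "PushEvent", stringsAsFactors = FALSE),
  data.frame(repo = pulls$repo, type = "PullRequestEvent", stringsAsFactors = FALSE)
)

# Count events per repository.
activity <- as.data.frame(table(repo = events$repo),
                          responseName = "events", stringsAsFactors = FALSE)

# "Active" here means at least min_events events; the threshold is an assumption.
min_events <- 2
active <- activity[activity$events >= min_events, ]

# Left join: repositories without a known language are kept and get NA.
languages <- unique(pulls[!is.na(pulls$language), c("repo", "language")])
active <- merge(active, languages, by = "repo", all.x = TRUE)

# Make the gap explicit instead of silently dropping rows.
active$language[is.na(active$language)] <- "Unknown"
```

Keeping unmatched rows with all.x = TRUE means a repository seen only through events that carry no language still counts as active.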

Page 17: Analysing GitHub commits with R

Big Data in R

• What’s Big Data anyway?

• R processes data in memory

• Bring down only the data you need

• Stream the data from a database
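Because R holds data in memory, one common workaround is to read a source in fixed-size chunks and keep only running totals. The sketch below does this for a small CSV file with base R; the file contents, the function name count_by_language and the chunk size are hypothetical, and a real database stream would use a database driver rather than a file connection.

```r
# Write a small hypothetical CSV so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("repo,language",
             "a/x,R", "b/y,Python", "c/z,R", "d/w,Go", "e/v,R"), path)

count_by_language <- function(path, chunk_size = 2) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  totals <- integer(0)
  repeat {
    # Only chunk_size rows are held in memory at any time.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    counts <- table(chunk$language)
    # Fold this chunk's counts into the running totals.
    for (lang in names(counts)) {
      previous <- if (lang %in% names(totals)) totals[[lang]] else 0L
      totals[lang] <- previous + as.integer(counts[[lang]])
    }
  }
  totals
}

language_totals <- count_by_language(path)
```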

Page 18: Analysing GitHub commits with R

R, Hadoop, and How They Work Together

• Hadoop Streaming – utilities available as R scripts

• ORCH (Oracle R Connector for Hadoop) – provides access to a Hadoop cluster from R

• RHIPE (R and Hadoop Integrated Programming Environment) – techniques designed for analysing large sets of data

• RHadoop – R packages that allow users to manage and analyse data with Hadoop

https://blog.udemy.com/r-hadoop/

Page 19: Analysing GitHub commits with R

Azure Machine Learning Studio

http://channel9.msdn.com/Blogs/Windows-Azure/R-in-Azure-ML-Studio

Page 20: Analysing GitHub commits with R

DEMO: Processing multiple files

• Reading files line by line

• Reading multiple files
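Both topics of this demo can be sketched with base R connections and list.files(). The directory, the file names and the one-event-type-per-line layout below are invented for the example; the real GitHub Archive files are JSON and would need a JSON parser.

```r
# Create two small hypothetical files in a temporary directory.
demo_dir <- file.path(tempdir(), "events-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("PushEvent", "CreateEvent"),
           file.path(demo_dir, "2015-01-01-0.txt"))
writeLines(c("PushEvent", "", "PullRequestEvent"),
           file.path(demo_dir, "2015-01-01-1.txt"))

# Read one file line by line, skipping blank lines.
read_event_types <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  types <- character(0)
  while (length(line <- readLines(con, n = 1)) > 0) {
    if (nzchar(line)) types <- c(types, line)
  }
  data.frame(file = rep(basename(path), length(types)),
             type = types, stringsAsFactors = FALSE)
}

# Find every matching file, read each one, and stack the results.
files <- sort(list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE))
all_events <- do.call(rbind, lapply(files, read_event_types))
type_counts <- table(all_events$type)
```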

Page 21: Analysing GitHub commits with R

DEMO: Plotting in R

http://www.harding.edu/fmccown/r/
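As a small illustration of base R plotting, the sketch below draws a bar chart of repository counts per language. The numbers are made up, and the chart is written to a temporary PDF file so that the code also runs without a display.

```r
# Hypothetical counts of active repositories per language (made-up numbers).
repos <- c(JavaScript = 120, Python = 95, R = 40, Go = 25)

# Open a PDF graphics device; plotting calls draw into this file.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 7, height = 5)

# barplot() returns the x positions of the bar midpoints.
bar_positions <- barplot(sort(repos, decreasing = TRUE),
                         main = "Active repositories by language",
                         xlab = "Language", ylab = "Repositories",
                         col = "steelblue")

# Close the device so the file is written to disk.
dev.off()
```

In RStudio the pdf() and dev.off() calls can be dropped, in which case the chart appears in the Plots pane.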

Page 22: Analysing GitHub commits with R

Libraries & Tools

• DataScience

• JSON processing

• HTTP requests

• Summarising

• Plotting
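The slide does not name specific packages. As one example of JSON processing, the sketch below assumes the third-party jsonlite package is installed and parses a made-up record shaped roughly like a GitHub event; the field values are invented.

```r
library(jsonlite)  # third-party package, assumed installed: install.packages("jsonlite")

# A made-up event in the general shape of a GitHub event record.
event_json <- '{"type": "PushEvent",
                "repo": {"name": "octocat/hello-world"},
                "created_at": "2015-01-01T15:00:00Z",
                "payload": {"size": 2}}'

# JSON objects become named lists; nested objects become nested lists.
event <- fromJSON(event_json)

event_type   <- event$type
repo_name    <- event$repo$name
commit_count <- event$payload$size

# Base R converts the ISO 8601 timestamp into a date-time value.
created <- as.POSIXct(event$created_at,
                      format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
```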

Page 23: Analysing GitHub commits with R

DEMO: Downsampling

https://github.com/javiljoen/LTTB
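Assuming LTTB in the linked repository stands for the Largest-Triangle-Three-Buckets downsampling algorithm, the idea can be sketched in base R as follows. This is an illustrative implementation written for this transcript, not the code from the linked repository; the function name lttb and the sample series are hypothetical, and x is assumed to be sorted in increasing order.

```r
# Keep n_out points: the first, the last, and one point per bucket in between.
lttb <- function(x, y, n_out) {
  n <- length(x)
  stopifnot(length(y) == n)
  if (n_out >= n || n_out < 3) return(data.frame(x = x, y = y))

  every <- (n - 2) / (n_out - 2)  # bucket width, excluding the two end points
  keep <- integer(n_out)
  keep[1] <- 1L
  a <- 1L                         # index of the most recently kept point

  for (i in 0:(n_out - 3)) {
    # Indices of the current bucket.
    lo <- floor(i * every) + 2
    hi <- floor((i + 1) * every) + 1
    bucket <- lo:hi

    # Average point of the next bucket (the last point for the final bucket).
    next_hi <- min(floor((i + 2) * every) + 1, n)
    avg_x <- mean(x[(hi + 1):next_hi])
    avg_y <- mean(y[(hi + 1):next_hi])

    # Area of the triangle formed by the kept point, each candidate,
    # and the next bucket's average; keep the candidate with the largest area.
    area <- abs((x[a] - avg_x) * (y[bucket] - y[a]) -
                (x[a] - x[bucket]) * (avg_y - y[a])) / 2
    a <- bucket[which.max(area)]
    keep[i + 2] <- a
  }
  keep[n_out] <- n
  data.frame(x = x[keep], y = y[keep])
}

# Made-up series: a smooth wave with one spike at x = 50.
x <- 1:100
y <- sin(x / 10)
y[50] <- 25
small <- lttb(x, y, n_out = 10)
```

Choosing the largest triangle tends to preserve visually important points such as peaks, which a plain every-nth-point sample could miss.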

Page 24: Analysing GitHub commits with R

DEMO: Anomaly detection

https://github.com/twitter/AnomalyDetection
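To show the general idea without depending on the linked package, the sketch below flags outliers with a much simpler technique: a robust z-score based on the median and the median absolute deviation. It is not the algorithm implemented by the AnomalyDetection package, and the data, the function name flag_anomalies and the 3.5 threshold are assumptions made for this example.

```r
# Hypothetical daily commit counts with one injected spike (made-up data).
set.seed(42)
commits_per_day <- rpois(60, lambda = 20)
commits_per_day[45] <- 90

flag_anomalies <- function(values, threshold = 3.5) {
  centre <- median(values)
  # mad() is scaled by default to be comparable to a standard deviation.
  spread <- mad(values)
  if (spread == 0) return(rep(FALSE, length(values)))
  # Flag values that lie more than `threshold` robust deviations from the median.
  abs(values - centre) / spread > threshold
}

anomalous_days <- which(flag_anomalies(commits_per_day))
```

Using the median and MAD instead of the mean and standard deviation keeps the spike itself from inflating the baseline it is compared against. Unlike this sketch, a seasonal method would also account for weekly or daily cycles in commit activity.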

Page 25: Analysing GitHub commits with R

To summarize…

• GitHub Archive, GitHub API

• R language – RStudio, types, data exploration, I/O operations, etc.

• Big Data in R

• 3rd party & built-in libraries

• Visualization

Page 26: Analysing GitHub commits with R

Thank you

Barbara Fusinska

@BasiaFusinska

https://github.com/BasiaFusinska/RTalk