Mining the DBA Stack Exchange
Business Intelligence and Big Data Analytics Project
The case of Stack Exchange - Data Administration
Lamprini Koutsokera
Alexandros Lattas
Working Space
Data Acquisition
41.779 Posts | 22.390 Users | 123.697 Posts History | 69.185 Comments | 148.425 Votes | 42.127 Badges
XML to CSV Converter (Online tool)
447.603 rows
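The slides state that an online XML to CSV converter was used. As an alternative, the same conversion can be sketched in a few lines of Python, assuming the usual Stack Exchange dump layout in which every record is a `<row .../>` element whose attributes are the columns. The function name and the sample data are illustrative only.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_text, columns):
    """Convert <row .../> elements into CSV text with the given columns."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in root.iter("row"):
        # Attributes missing from a row become empty cells.
        writer.writerow({c: row.attrib.get(c, "") for c in columns})
    return out.getvalue()


# Illustrative sample, not taken from the real dump.
sample = (
    '<users>'
    '<row Id="1" DisplayName="Ann" Reputation="101"/>'
    '<row Id="2" DisplayName="Bob"/>'
    '</users>'
)
print(xml_rows_to_csv(sample, ["Id", "DisplayName", "Reputation"]))
```

For files the size of the full dump, `ET.iterparse` would avoid loading the whole document into memory.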
Data Cleansing - Adjustment
Comments & Post History & Posts: Users without Id but with Display Name -> Guest Users
Post History: Users without Id & Display Name -> 10.039 rows deleted
Votes -> 12.207 rows deleted
Badges -> 213 rows deleted -> 73 distinct badges remained
Primary & Foreign keys
5% of data deleted
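The user-related cleansing rules can be sketched as follows. This is a minimal illustration, assuming each record is a dict with `UserId` and `UserDisplayName` fields as in the Stack Exchange dump. The `GUEST_ID` placeholder and the function name are hypothetical and not taken from the slides.

```python
GUEST_ID = -1  # hypothetical placeholder id assigned to guest users


def clean_rows(rows):
    """Apply the guest-user and deletion rules to a list of row dicts.

    Returns (kept_rows, number_of_deleted_rows).
    """
    kept, deleted = [], 0
    for row in rows:
        user_id = (row.get("UserId") or "").strip()
        name = (row.get("UserDisplayName") or "").strip()
        if user_id:
            # Varchar to numeric: store the id as an integer.
            kept.append(dict(row, UserId=int(user_id)))
        elif name:
            # No Id but a Display Name: treat as a guest user.
            kept.append(dict(row, UserId=GUEST_ID))
        else:
            # Neither Id nor Display Name: drop the row.
            deleted += 1
    return kept, deleted


rows = [
    {"Id": "1", "UserId": "42", "UserDisplayName": ""},
    {"Id": "2", "UserId": "", "UserDisplayName": "anon"},
    {"Id": "3", "UserId": "", "UserDisplayName": ""},
]
kept, deleted = clean_rows(rows)
print(len(kept), deleted)
```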
Varchars to Numerics
postHistoryTypes | postTypes | voteTypes
age | reputation | views
Tables/dimensions creation
(1)
(2)
(3)
Star - Snowflake Schema
Fact Metrics
Total Comment Score | Posts Edits | Users Participated | Score | View Count | Answer Count | Comment Count | Favorite Count
Cube Creation
Dimensions
Users (Age, Reputation, Views) | Badge Types | Post Types | Post History Types | Creation Date | Votes Types | Tags
Measurements
Bridge Tables
Posts | Post History | Bridge Tags | Votes | Badges
Fact Table + Posts
Posts
Bridge Tags Tags Post History Post History Types
Votes Votes Types
Users Badges Badges Types
Dimension Usage
Stack Exchange in Metrics
Top 10 Tags
Wednesday 3:00 p.m. | Age Group 25-34
Posts through months
Posts through countries
United States: 3.525 posts
India: 1.648 posts
United Kingdom: 1.857 posts
Canada: 1.473 posts
Data Transformation
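postid | firebird | checkpoint | warning | oracle-apex | aggregation | subquery
16956  | 0        | 0          | 0       | 1           | 0           | 0
21733  | 0        | 0          | 0       | 0           | 0           | 0
35756  | 0        | 0          | 0       | 0           | 0           | 0
44484  | 1        | 0          | 0       | 0           | 0           | 0
43484  | 0        | 0          | 0       | 0           | 0           | 0
40422  | 0        | 0          | 0       | 0           | 0           | 0
44726  | 0        | 0          | 0       | 0           | 0           | 0
35932  | 0        | 0          | 0       | 0           | 0           | 1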
13.608 Posts – 694 Tags
Tag separation into distinct words
<sql-server><aggregation>
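The transformation from a tag string such as `<sql-server><aggregation>` to the 0/1 matrix above can be sketched as below. The function names are illustrative, and the sample tag strings are assumptions built only from the non-zero cells shown in the table (the real posts may carry additional tags).

```python
import re


def split_tags(tag_string):
    """Split '<sql-server><aggregation>' into ['sql-server', 'aggregation']."""
    return re.findall(r"<([^<>]+)>", tag_string)


def one_hot(posts):
    """Build a binary post-by-tag matrix.

    posts maps a post id to its raw tag string.
    Returns (sorted tag columns, {post id: list of 0/1 flags}).
    """
    columns = sorted({t for s in posts.values() for t in split_tags(s)})
    matrix = {}
    for post_id, tag_string in posts.items():
        present = set(split_tags(tag_string))
        matrix[post_id] = [1 if t in present else 0 for t in columns]
    return columns, matrix


# Assumed tag strings, for illustration only.
posts = {16956: "<oracle-apex>", 44484: "<firebird>", 35932: "<subquery>"}
columns, matrix = one_hot(posts)
print(columns)
for post_id, flags in matrix.items():
    print(post_id, flags)
```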
Data Mining
Clustering Association Rules
Scalable EM
30% testing set – 70% training set
default number of clusters: 10
min. support 0.01 min. confidence 0.1
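To make the two thresholds concrete, here is a minimal sketch of how support and confidence are computed for rules between pairs of tags. It is a plain-Python illustration, not the tool used in the project; the function name and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.01, min_confidence=0.1):
    """Return (lhs, rhs, support, confidence) for tag pairs passing both thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for tags in transactions:
        items = sorted(set(tags))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of posts that contain both tags
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Share of posts with lhs that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Highest confidence first, ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical tag sets, one per post.
transactions = [
    ["mysql", "replication"],
    ["mysql", "performance"],
    ["mysql", "replication"],
    ["sql-server", "backup"],
]
for rule in pair_rules(transactions):
    print(rule)
```

With the slide's thresholds (0.01 and 0.1), a tag pair has to appear in at least 1% of the posts, and the rule's right-hand side in at least 10% of the posts that contain its left-hand side.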
Cluster Mapping – Posts View
13.608 Posts
3.343 score | 6.556 edits | 1.035.024 views | 609 favorites | 8.847 users participated
8.700 score | 13.654 edits | 1.695.060 views | 1.065 favorites | 20.637 users participated
7.999 score | 12.364 edits | 2.067.306 views | 1.028 favorites | 19.521 users participated
6.436 score | 15.655 edits | 2.818.903 views | 1.391 favorites | 18.741 users participated
5.078 score | 7.016 edits | 948.036 views | 1.038 favorites | 11.936 users participated
3.294 score | 6.939 edits | 1.538.607 views | 497 favorites | 8.914 users participated
Cluster Mapping – Users View
6.534 Users
11.347 badges | 475.314 reputation | 42.600 views | 56.657 upvotes | 2.907 downvotes | 25-34 age group
29.844 badges | 1.605.644 reputation | 131.913 views | 205.183 upvotes | 9.812 downvotes | 25-34 age group
27.052 badges | 1.355.876 reputation | 128.337 views | 177.444 upvotes | 6.503 downvotes | 25-34 age group
25.612 badges | 1.005.826 reputation | 81.750 views | 75.049 upvotes | 2.308 downvotes | 25-34 age group
13.754 badges | 709.640 reputation | 55.846 views | 90.959 upvotes | 3.421 downvotes | 25-34 age group
21.083 badges | 1.332.268 reputation | 81.289 views | 163.349 upvotes | 6.008 downvotes | 25-34 age group
Association Rules
backup
sqlserver
index
mysql
replication
performance
optimization
database-design
Map Reduce
Cleansing
XML Files
Posts & Users
(&).*?(;)
^((?!AboutMe=).)*$
Reducer
Mapper #1
Mapper #2
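The slides do not say what Mapper #1, Mapper #2 and the Reducer compute. The sketch below assumes a word count over one XML attribute (`Body` for posts, `AboutMe` for users) and runs locally in plain Python rather than on a cluster. It reuses the two cleansing patterns from the slide. All function names and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

ENTITY = re.compile(r"(&).*?(;)")                # pattern from the slide: strips entities such as &amp;
NO_ABOUT_ME = re.compile(r"^((?!AboutMe=).)*$")  # pattern from the slide: matches lines without AboutMe=


def mapper(line, attribute):
    """Emit (word, 1) for every word in the given attribute of one <row .../> line."""
    match = re.search(attribute + r'="([^"]*)"', line)
    if not match:
        return
    text = ENTITY.sub(" ", match.group(1))
    for word in re.findall(r"[a-z#+]+", text.lower()):
        yield word, 1


def reducer(pairs):
    """Sum the counts emitted by the mappers for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Illustrative input lines, not taken from the real dump.
lines = [
    '<row Id="1" AboutMe="Java &amp; SQL developer&#xA;Java fan" />',
    '<row Id="2" DisplayName="x" />',
]
# Cleansing: drop user rows that have no AboutMe attribute.
kept = [line for line in lines if not NO_ABOUT_ME.match(line)]
counts = reducer(pair for line in kept for pair in mapper(line, "AboutMe"))
print(counts)
```

In a real Hadoop job the shuffle phase groups pairs by key before the reducer runs; here a single in-memory reducer plays both roles.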
Map Reduce Results
Posts | Users | Posts further analysis
Body About Me
• Key • Value • Default • Clustering • Slave • Physical • Node
• Logging • Relationship • C • Dynamic • Language
Tags’ description enhancement
DBs’ problem solving
Graph DBs
Programming Languages
Visualization
Users’ background exploration
• Developer • Software • Web • Programming • Server • Engineer • SQL
• Java • C# • PHP • Microsoft • Linux
Skills Knowledge
Interests Knowledge
Job recommendation
“without”
• Without Time Zone • Without Restarting • Without using SQL
Timestamp type without losing timezone information.
Related to Oracle and PostgreSQL. MySQL deals with it automatically.
Practical Implications
Insights for Solutions & Improvements
Targeted Marketing actions per DB Product
Insights on customer behavior per DB Product
Improve SE's data-driven decision-making process
Improve the quality of descriptive tags