"building data foundations and analytics tools across the product" by crystal widjaja...
TRANSCRIPT
Building Data Foundations and Analytics Tools Across the Product
Who am I?
● Started at GO-JEK in July 2015 as the first “data” hire
First day: Creating a Data Dictionary without any reference tables
Yesterday: Discussions for a more advanced experimentation platform, prototyping Growth ROI formulas, QAing new data marts
Agenda
● Infrastructure for Scale
● Data Model Foundations
● Tools for Business Users
Infrastructure
GO-JEK Data Today
~27% GROWING DATA VOLUME PER MONTH*
>5000 METABASE CARDS AND TABLEAU SHEETS
>450 AVG DAILY BUSINESS USERS ON INTERNAL DATA TOOLS
4 FULL-TIME DATA WAREHOUSE DEVELOPERS
>30 BI DATA ANALYSTS
100s OF MICROSERVICES ACROSS GO-JEK
*This is only business metrics data collected by BI
GO-JEK Data Today
“The choices you made were the right choices given the facts that you had at the time.”
- Ajey Gore, CTO at GO-JEK
Storage
crontabs are fun
Data Modeling
More data to more people
Staging Layer: RAW dataset
Integration Layer: Fact/Dimension dataset
Access Layer: Summary and roll-up data
Datamart Layer: Product-specialized dataset
Current Data Architecture
Why?
1. Transparency
2. Standardization
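The four layers can be sketched as a chain of small, inspectable transformations. This is a minimal illustration in Python, not GO-JEK's actual pipeline: the record fields (`order_id`, `driver_id`, `service`, `status`), the sample rows, and the function names are all hypothetical.

```python
from collections import Counter

# Staging layer: raw records exactly as received (hypothetical fields and values).
staging = [
    {"order_id": "10001", "driver_id": "458", "service": "ride", "status": "COMPLETED "},
    {"order_id": "10002", "driver_id": "458", "service": "send", "status": "completed"},
    {"order_id": "10003", "driver_id": "456", "service": "ride", "status": "cancelled"},
]


def to_integration(raw_rows):
    """Integration layer: typed, standardized fact rows."""
    return [
        {
            "order_id": int(r["order_id"]),
            "driver_id": int(r["driver_id"]),
            "service": r["service"],
            "status": r["status"].strip().lower(),
        }
        for r in raw_rows
    ]


def to_access(fact_rows):
    """Access layer: roll-up of completed orders per driver."""
    completed = (r["driver_id"] for r in fact_rows if r["status"] == "completed")
    return dict(Counter(completed))


def to_datamart(fact_rows, service):
    """Datamart layer: the slice of facts one product team cares about."""
    return [r for r in fact_rows if r["service"] == service]


facts = to_integration(staging)
completed_per_driver = to_access(facts)
ride_mart = to_datamart(facts, "ride")

print(completed_per_driver)                 # {458: 2}
print([r["order_id"] for r in ride_mart])   # [10001, 10003]
```

Because each layer is derived only from the one before it, anyone can trace a summary number back to the raw rows (transparency), and every team reads the same cleaned values (standardization).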
“Can I get a list of all full-time drivers? I want to [give them a reward | put them on a beta group | interview them | …]”
What qualities make a driver a “full-time driver”?
● # of days the driver logs into the app in a week
● # of minutes a driver spends on a booking
● # of bookings a driver does per day on avg in the past X weeks
● # of minutes a driver spends logged into the app per day
● # of completed bookings a driver does in a particular service
● most common hour the driver logs into the app in the past month
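The point of the question is that "full-time" has no single definition. A minimal sketch, assuming invented per-driver metrics and two hypothetical definitions (neither threshold comes from the talk), shows how the same facts yield different lists:

```python
# Hypothetical factual metrics per driver (values invented for illustration).
drivers = [
    {"driver_id": 456, "days_logged_in_past_week": 7, "avg_bookings_per_day": 3.0},
    {"driver_id": 457, "days_logged_in_past_week": 4, "avg_bookings_per_day": 12.0},
    {"driver_id": 458, "days_logged_in_past_week": 6, "avg_bookings_per_day": 10.0},
]


def full_time_by_presence(driver, min_days=6):
    """One possible definition: logs in on most days of the week."""
    return driver["days_logged_in_past_week"] >= min_days


def full_time_by_volume(driver, min_bookings=8.0):
    """Another possible definition: completes many bookings per day."""
    return driver["avg_bookings_per_day"] >= min_bookings


by_presence = [d["driver_id"] for d in drivers if full_time_by_presence(d)]
by_volume = [d["driver_id"] for d in drivers if full_time_by_volume(d)]

print(by_presence)  # [456, 458]
print(by_volume)    # [457, 458]
```

Storing the underlying facts rather than a precomputed "full-time" flag lets each requester apply the definition that fits their purpose.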
Keep the First Data Layer Factual
● Star Schema
● Advantages
○ Clean and structured model
Merchant Dimension
id  nama           kategori_merchant
1   Warung Bu Iis  TRADISIONAL
Customer Dimension
id   nama  nomor_telepon
123  Jo    628112345678
Driver Dimension
id   nama  jenis_kelamin
456  Asep  M
457  Doni  M
458  Siti  F
Order Fact
id     id_customer  id_driver  id_merchant
10001  123          458        1
Item Fact
id   id_order  nama_item     harga
101  10001     Nasi Goreng   30000
102  10001     Es Teh Manis  5000
Driver Search Fact
id  id_driver  nama  status
1   456        Asep  Rejected
2   457        Doni  Rejected
3   458        Siti  Accepted
● Disadvantages
○ Difficult to do data discovery for non-technical users
○ Needs a lot of joins, resulting in high computational resource needs
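The join cost is easy to see with the sample rows from the slides. The sketch below loads them into an in-memory SQLite database and answers one simple question: who ordered, who delivered, from which merchant, and for how much. The snake_case table names are an assumption made for this example; the column names and values are taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE merchant_dimension (id INTEGER PRIMARY KEY, nama TEXT, kategori_merchant TEXT);
CREATE TABLE customer_dimension (id INTEGER PRIMARY KEY, nama TEXT, nomor_telepon TEXT);
CREATE TABLE driver_dimension   (id INTEGER PRIMARY KEY, nama TEXT, jenis_kelamin TEXT);
CREATE TABLE order_fact (id INTEGER PRIMARY KEY, id_customer INTEGER,
                         id_driver INTEGER, id_merchant INTEGER);
CREATE TABLE item_fact  (id INTEGER PRIMARY KEY, id_order INTEGER,
                         nama_item TEXT, harga INTEGER);

INSERT INTO merchant_dimension VALUES (1, 'Warung Bu Iis', 'TRADISIONAL');
INSERT INTO customer_dimension VALUES (123, 'Jo', '628112345678');
INSERT INTO driver_dimension VALUES (456, 'Asep', 'M'), (457, 'Doni', 'M'), (458, 'Siti', 'F');
INSERT INTO order_fact VALUES (10001, 123, 458, 1);
INSERT INTO item_fact VALUES (101, 10001, 'Nasi Goreng', 30000),
                             (102, 10001, 'Es Teh Manis', 5000);
""")

# Four joins are needed just to describe a single order in readable terms.
row = conn.execute("""
SELECT o.id, c.nama, d.nama, m.nama, SUM(i.harga)
FROM order_fact o
JOIN customer_dimension c ON c.id = o.id_customer
JOIN driver_dimension   d ON d.id = o.id_driver
JOIN merchant_dimension m ON m.id = o.id_merchant
JOIN item_fact          i ON i.id_order = o.id
GROUP BY o.id, c.nama, d.nama, m.nama
""").fetchone()

print(row)  # (10001, 'Jo', 'Siti', 'Warung Bu Iis', 35000)
```

A non-technical user has to know all five tables and their keys to write this query, which is the data discovery problem named above.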
App Login Data, Bid Data, Completed Booking Data, Income Data, Driver Profile Data
Factual Activity Data
Daily Partition of Driver Activity and Profile Data in Denormalized & Nested Form
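One way to picture a denormalized, nested daily partition is a single record per driver per day that carries the profile and every activity list inline, so no joins are needed to read it. The sketch below is an assumption about the shape, not the actual GO-JEK schema: the class name, field names, date, and values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DriverDailyActivity:
    """One row per driver per day; nested lists replace joins to other tables."""
    partition_date: str
    driver_id: int
    profile: dict
    logins: list = field(default_factory=list)
    bids: list = field(default_factory=list)
    completed_bookings: list = field(default_factory=list)
    income: int = 0


# Illustrative record; the date and all values are invented.
record = DriverDailyActivity(
    partition_date="2018-01-01",
    driver_id=458,
    profile={"nama": "Siti", "jenis_kelamin": "F"},
    logins=[{"hour": 7, "minutes_online": 180}, {"hour": 16, "minutes_online": 120}],
    bids=[{"id_order": 10001, "status": "Accepted"}],
    completed_bookings=[{"id_order": 10001, "service": "food"}],
    income=12000,
)

# A daily fact can be read straight off the record.
minutes_online = sum(session["minutes_online"] for session in record.logins)
print(minutes_online)  # 300
print(json.dumps(asdict(record), indent=2))
```

The trade-off is duplication (the profile repeats in every daily row) in exchange for cheap, join-free reads.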
The Data Model
avg_minutes_online_past_3_days total_minutes_online_past_3_days
avg_minutes_online_past_7_days total_days_active_past_3_days
avg_minutes_online_past_30_days total_orders_completed_past_7_days
avg_income_past_3_days total_orders_completed_past_30_days
avg_income_past_7_days total_services_completed_past_7_days
total_completed_ride_past_7_days total_completed_send_past_7_days
for each driver_id...
…and +200 other data points
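Fields such as `avg_minutes_online_past_3_days` can be derived from daily facts with a trailing-window calculation. The sketch below is one plausible reading, under two assumptions not stated in the talk: the window includes the as-of day, and the average divides by calendar days rather than active days. The dates and minutes are invented, and `window_features` is a hypothetical helper.

```python
from datetime import date, timedelta

# Hypothetical daily facts for one driver_id: date -> minutes online.
daily_minutes = {
    date(2018, 1, 1): 300,
    date(2018, 1, 2): 0,
    date(2018, 1, 3): 240,
    date(2018, 1, 4): 360,
}


def window_features(daily, as_of, days):
    """Total, average, and active-day count over the `days` days ending at `as_of`."""
    window = [daily.get(as_of - timedelta(days=n), 0) for n in range(days)]
    total = sum(window)
    return {
        f"total_minutes_online_past_{days}_days": total,
        f"avg_minutes_online_past_{days}_days": total / days,
        f"total_days_active_past_{days}_days": sum(1 for m in window if m > 0),
    }


features = window_features(daily_minutes, as_of=date(2018, 1, 4), days=3)
print(features)
# {'total_minutes_online_past_3_days': 600,
#  'avg_minutes_online_past_3_days': 200.0,
#  'total_days_active_past_3_days': 2}
```

Calling the same helper with `days=7` or `days=30` produces the other window sizes, which keeps every derived field traceable to the same factual input.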
Tools for Scale
Lifecycle of a Data Point
One Week Old
One Month Old
3 Months Old
Let Analysts Define Events
Sample Events to Save on Costs
Better sample that data point...
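The talk does not say how events are sampled. One common approach, shown here purely as an illustration, is deterministic hash-based sampling: hashing an event key means the same event is always kept or always dropped, regardless of which service evaluates it. The function name `keep_event`, the event IDs, and the 10% rate are hypothetical.

```python
import hashlib


def keep_event(event_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of events, keyed on event_id."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


events = [f"event-{n}" for n in range(10_000)]
kept = [e for e in events if keep_event(e, 0.10)]

print(len(kept) / len(events))  # close to 0.10
```

When analysing sampled data, counts need to be scaled by `1 / sample_rate` to estimate the true totals.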
Take Away
● Build for the infrastructure you have, not what you think you’ll have
● Build simple step-by-step data models with transparency
● Build tools that work for all the different stages of the company