Teradata Big Data London Seminar
DESCRIPTION
Unified Data Architecture - Teradata presentation on the topic of Big Data and Apache Hadoop.
TRANSCRIPT
UNIFIED DATA ARCHITECTURE
Chris Hillman Teradata Principal Data Scientist
2 4/23/12 Teradata Confidential
Need for a Unified Data Architecture for New Insights
Enabling Any User for Any Data Type from Data Capture to Analysis
Java, C/C++, Python, R, SAS, SQL, Excel, BI, Visualization
Discover and Explore
Reporting and Execution in the Enterprise
Capture, Store and Refine
Audio/Video, Images, Docs, Text, Web & Social, Machine Logs, CRM, SCM, ERP
Confidential and proprietary. Copyright © 2012 Teradata Corporation.3
AUDIO & VIDEO | IMAGES | TEXT | WEB & SOCIAL | MACHINE LOGS | CRM | SCM | ERP
DISCOVERY PLATFORM
INTEGRATED DATA WAREHOUSE
UNIFIED DATA ARCHITECTURE
LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
Engineers | Data Scientists | Business Analysts | Front-Line Workers | Customers / Partners | Quants | Operational Systems | Executives
• Single View of Your Business
• Cross-Functional Analysis
• Shared Source for Analytics
• Load Once, Use Many Times
• Highest Business Value
• Lowest Total Cost of Ownership
• Fastest Time-to-Market For New Apps
Requirements for an Integrated Data Warehouse
Business Analysts | Knowledge Workers | Customers/Partners | Marketing | Executives | Front-line Workers | Operational Systems
DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
INTEGRATED DATA WAREHOUSE
Requirements of a Discovery Platform
DATA SOURCES
Structured Data | Multi-Structured Data | Non-Relational Data | OLTP DBMSs
DISCOVERY
Discovery Platform
DISCOVERY TOOLS
SQL | MapReduce | Statistical Functions
USERS
Data Scientist | Business Analyst
• Structured and multi-structured data
• Doesn’t require extensive data modeling
• Doesn’t balance the books
• Data completeness can be good enough
• No stringent SLAs
• Fraud patterns
• Customer behavior
• Digital marketing optimization
• Supply chain and supply line sensors
AUDIO & VIDEO | IMAGES | TEXT | WEB & SOCIAL | MACHINE LOGS | CRM | SCM | ERP
DISCOVERY PLATFORM
CAPTURE | STORE | REFINE
INTEGRATED DATA WAREHOUSE
UNIFIED DATA ARCHITECTURE
Big Data Analytics
Big Data Management
LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
Engineers | Data Scientists | Business Analysts | Front-Line Workers | Customers / Partners | Quants | Operational Systems | Executives
E-MAIL STORE SVP SURVEY ON-LINE BRANCH DATA CALL CENTER ATM PROFILE
Golden Path Application Submit
Fraud Sentiment Analysis
Multi-Channel Customer Behavior
Channel Hopping
Attrition Paths
Fraudulent Paths
Digital Marketing Attribution
Productionize
Analytic Score with Path Variable
Event Triggers
Marketing Integration
Customer Behavior Analysis
MySpending Report
Customer Segmentation
Credit Risk Analysis
Customer Profitability
Portfolio Analysis
DISCOVERY PLATFORM
CAPTURE | STORE | REFINE
INTEGRATED DATA WAREHOUSE
TERADATA UNIFIED DATA ARCHITECTURE
LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
Engineers | Data Scientists | Business Analysts | Front-Line Workers | Customers / Partners | Quants | Operational Systems | Executives
Consumerization
Sessionization
Cross Platform Aggregation
AUDIO & VIDEO | IMAGES | TEXT | WEB & SOCIAL | MACHINE LOGS | CRM | SCM | ERP
DISCOVERY PLATFORM
CAPTURE | STORE | REFINE
INTEGRATED DATA WAREHOUSE
LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
Engineers | Data Scientists | Business Analysts | Front-Line Workers | Customers / Partners | Quants | Operational Systems | Executives
SQL-H
TERADATA UNIFIED DATA ARCHITECTURE
SQL-H In Action
Join Teradata, Hadoop, Aster tables; feed into MapReduce
SELECT qrd_focus_area_path, count(*)
FROM nPath(
  ON (
    SELECT * FROM
      ( SELECT * FROM load_from_teradata(
          ON mr_driver TDPID('dbc')
          USERNAME('name1') PASSWORD('password1')
          QUERY('SELECT * FROM owner.prod_own_fact') ) ) AS td
      JOIN owner.prod_dim proddim ON td.prod_id = proddim.product_id
      JOIN
      ( SELECT * FROM load_from_hadoop(
          ON mr_driver SERVER('10.10.3.139')
          USERNAME('name2') DBNAME('repair')
          TABLENAME('transaction') ) ) AS sqlh
      ON sqlh.prod_ident_nbr = proddim.id )
  PARTITION BY party_id, prod_id ORDER BY repair_dt
  MODE (OVERLAPPING)
  PATTERN ('REPAIR{3}')
  SYMBOLS (event = 'REPAIR' AS REPAIR)
  RESULT (ACCUMULATE(qrd_focus_area OF ANY(REPAIR)) AS qrd_focus_area_path)
) n
GROUP BY 1 ORDER BY 2 DESC;
SQL manipulation for calculation
TD Connector to get OWNERSHIP data
Any path you want, specified with the power of regular expressions!
Hadoop Connector to get WARRANTY data
Include local Aster tables in JOIN
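As a rough illustration of what the REPAIR{3} pattern in OVERLAPPING mode asks for, the Python sketch below partitions rows by (party_id, prod_id), orders them by repair_dt, and collects the qrd_focus_area values of every run of three consecutive REPAIR events. It is not how Aster executes nPath; the function name repair_paths and the sample rows are invented for the example.

```python
from collections import Counter, defaultdict


def repair_paths(rows, run_length=3):
    """Count focus-area paths over runs of `run_length` consecutive REPAIR events."""
    partitions = defaultdict(list)
    for row in rows:
        # Counterpart of PARTITION BY party_id, prod_id
        partitions[(row["party_id"], row["prod_id"])].append(row)

    counts = Counter()
    for events in partitions.values():
        # Counterpart of ORDER BY repair_dt (ISO date strings sort chronologically)
        events.sort(key=lambda r: r["repair_dt"])
        # Overlapping mode: a match may start at every position
        for start in range(len(events) - run_length + 1):
            window = events[start:start + run_length]
            if all(e["event"] == "REPAIR" for e in window):
                counts[tuple(e["qrd_focus_area"] for e in window)] += 1
    return counts


# Invented sample rows, for illustration only
sample_rows = [
    {"party_id": 1, "prod_id": 10, "repair_dt": "2012-01-05", "event": "REPAIR", "qrd_focus_area": "engine"},
    {"party_id": 1, "prod_id": 10, "repair_dt": "2012-02-11", "event": "REPAIR", "qrd_focus_area": "engine"},
    {"party_id": 1, "prod_id": 10, "repair_dt": "2012-03-02", "event": "REPAIR", "qrd_focus_area": "brakes"},
    {"party_id": 1, "prod_id": 10, "repair_dt": "2012-04-20", "event": "REPAIR", "qrd_focus_area": "brakes"},
    {"party_id": 2, "prod_id": 10, "repair_dt": "2012-01-07", "event": "REPAIR", "qrd_focus_area": "gearbox"},
    {"party_id": 2, "prod_id": 10, "repair_dt": "2012-02-01", "event": "SERVICE", "qrd_focus_area": "gearbox"},
    {"party_id": 2, "prod_id": 10, "repair_dt": "2012-03-03", "event": "REPAIR", "qrd_focus_area": "gearbox"},
    {"party_id": 2, "prod_id": 10, "repair_dt": "2012-04-04", "event": "REPAIR", "qrd_focus_area": "gearbox"},
]

path_counts = repair_paths(sample_rows)
for path, n in path_counts.most_common():
    print(path, n)
```

Because the windows overlap, four repairs in a row yield two paths for the first party, while the SERVICE event breaks the run for the second party so it yields none.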
AUDIO & VIDEO | IMAGES | TEXT | WEB & SOCIAL | MACHINE LOGS | CRM | SCM | ERP
DISCOVERY PLATFORM
CAPTURE | STORE | REFINE
INTEGRATED DATA WAREHOUSE
LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
VIEWPOINT | SUPPORT
Engineers | Data Scientists | Business Analysts | Front-Line Workers | Customers / Partners | Quants | Operational Systems | Executives
TERADATA UNIFIED DATA ARCHITECTURE
Aster Connector for Hadoop
Teradata Connector for Hadoop
Aster Teradata Connector
SQL-H
Aster Loader
Teradata Loader
When to Use Which? The best approach by workload and data type
Processing as a Function of Schema Requirements and Stage of Data Pipeline
Stage of data pipeline: Low Cost Storage and Fast Loading | Data Pre-Processing, Refining, Cleansing | “Simple math at scale” (Score, filter, sort, avg., count...) | Joins, Unions, Aggregates | Analytics (Iterative and data mining) | Reporting
Stable Schema: Teradata/Hadoop | Teradata | Teradata | Teradata | Teradata | Teradata
Evolving Schema: Hadoop | Aster / Hadoop | Aster / Hadoop | Aster | Aster | Aster
Format, No Schema: Hadoop | Hadoop | Hadoop | Aster | Aster | Aster
Aster (SQL + MapReduce Analytics)
Aster (MapReduce Analytics)
Financial Analysis, Ad-Hoc/OLAP, Enterprise-Wide BI and Reporting, Spatial/Temporal, Active Execution
Interactive Data Discovery, Web Clickstream, Set-Top Box Analysis, CDRs, Sensor Logs, JSON
Social Feeds, Text, Image Processing, Audio/Video Storage and Refining, Storage and Batch Transformations
UDA IN PRACTICE IPTV QUALITY OF SERVICE
Starting point: Complaints Data
Churners – and data quality
CREATE DIMENSION TABLE wrk.npath_reboot_5events AS
SELECT path, COUNT(*) AS path_count
FROM nPath(
  ON wrk.w_event_f
  PARTITION BY srv_id
  ORDER BY evt_ts DESC
  MODE (NONOVERLAPPING)
  PATTERN ('X{0,5}.reboot')
  SYMBOLS (true AS X, evt_name = 'REBOOT' AS reboot)
  RESULT (FIRST(srv_id OF X) AS srv_id,
          ACCUMULATE(evt_name OF ANY(X, reboot)) AS path)
)
GROUP BY 1;

SELECT * FROM GraphGen(
  ON (SELECT * FROM wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
  PARTITION BY 1
  ORDER BY path_count DESC
  item_format('npath')
  item1_col('path')
  score_col('path_count')
  output_format('sankey')
  justify('right')
);
Note the number of paths with a reboot following another reboot!
What events lead up to a reboot?
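One way to see what the X{0,5}.reboot pattern in NONOVERLAPPING mode picks out is to encode each service's events as a string and run an ordinary regular expression over it. The Python sketch below does that. It assumes events are read in chronological order so that each path ends at a reboot, and the event names other than REBOOT are invented. It illustrates the matching idea only and is not Aster's implementation.

```python
import re
from collections import Counter, defaultdict

# Up to five events of any kind (X), followed by a reboot (R)
PATTERN = re.compile(r".{0,5}R")


def reboot_paths(events):
    """events: iterable of (srv_id, evt_ts, evt_name) tuples."""
    by_service = defaultdict(list)
    for srv_id, evt_ts, evt_name in events:
        by_service[srv_id].append((evt_ts, evt_name))

    counts = Counter()
    for rows in by_service.values():
        rows.sort()  # assumption: chronological order within each service
        names = [name for _, name in rows]
        # One character per event: R for a reboot, x for anything else
        encoded = "".join("R" if name == "REBOOT" else "x" for name in names)
        # finditer yields non-overlapping matches, scanning left to right
        for match in PATTERN.finditer(encoded):
            counts[tuple(names[match.start():match.end()])] += 1
    return counts


# Invented sample events, for illustration only
sample_events = [
    (1, 1, "SYNC_LOSS"), (1, 2, "REBOOT"), (1, 3, "REBOOT"),
    (2, 1, "CRC_ERROR"), (2, 2, "REBOOT"),
]
paths = reboot_paths(sample_events)
for path, n in paths.items():
    print(" > ".join(path), n)
```

Because X matches any event, including a reboot, the greedy match for the first service swallows both reboots into a single path, which is the kind of reboot-after-reboot path the slide points out.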
There appears to be an issue with the data on 30th September and beyond: the reboot data for October seems to have been aggregated and added to 30th September.
View events data in Tableau
• Remove paths with all reboots and exclude data from 30th September
It would appear that events with suffixes 1 and 2 can be added together
Address data quality
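The clean-up steps on this slide could look roughly like the Python sketch below. Several details are assumptions and not taken from the slides: the cutoff is treated as "drop everything dated on or after the given day", the year 2012 is a placeholder, and event names are assumed to carry their suffix as a trailing _1 or _2.

```python
import re
from datetime import date


def clean_events(events, cutoff):
    """Keep events dated before `cutoff` and merge names differing only by a _1/_2 suffix."""
    cleaned = []
    for evt_date, evt_name in events:
        if evt_date >= cutoff:
            continue  # assumption: data from the cutoff day onwards is unreliable
        # Assumed naming convention: SYNC_LOSS_1 and SYNC_LOSS_2 both become SYNC_LOSS
        cleaned.append((evt_date, re.sub(r"_[12]$", "", evt_name)))
    return cleaned


def drop_all_reboot_paths(paths):
    """Remove paths that contain nothing but REBOOT events."""
    return [p for p in paths if any(name != "REBOOT" for name in p)]


# Invented sample data, for illustration only
sample_events = [
    (date(2012, 9, 29), "SYNC_LOSS_1"),
    (date(2012, 9, 29), "SYNC_LOSS_2"),
    (date(2012, 9, 30), "REBOOT"),
]
sample_paths = [("REBOOT", "REBOOT"), ("SYNC_LOSS", "REBOOT")]

kept_events = clean_events(sample_events, cutoff=date(2012, 9, 30))
kept_paths = drop_all_reboot_paths(sample_paths)
print(kept_events)
print(kept_paths)
```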
Size of Node = number of customers
Width of Edge = number of errors
SELECT * FROM graphgen(
  ON (SELECT DISTINCT dmt_act_dslam, nra_id,
             nbr_of_srvid, errorspersrv, nbr_of_dslam
      FROM wrk.srvid_dslam_err)
  PARTITION BY 1
  ORDER BY errorspersrv
  item_format('cfilter')
  item1_col('dmt_act_dslam')
  item2_col('nra_id')
  score_col('errorspersrv')
  cnt1_col('nbr_of_srvid')
  cnt2_col('nbr_of_dslam')
  output_format('sigma')
  directed('false')
  width_max(10)
  width_min(1)
  nodesize_max(3)
  nodesize_min(1)
);
Visualise as a Graph using Aster GraphGen
Synch Issues by Hub Type
Error and Complaint rates by equipment type
UDA IN PRACTICE PREDICTIVE MODELS
CREATE TABLE wrk.cih_dshb_ads AS
SELECT srv_id, sav_flag, offer, inseecode, code_postal, libelle,
  nom_dep, nom_region, longitude, latitude,
  coalesce(topo_nra, 'Unknown') AS topo_nra,
  topo_dslam,
  coalesce(iad_hardwareversion, 'Unknown') AS iad_hardwareversion,
  coalesce(iad_manufacturer, 'Unknown') AS iad_manufacturer,
  coalesce(iad_modelname, 'Unknown') AS iad_modelname,
  coalesce(iad_modemfirmwareversion, 'Unknown') AS iad_modemfirmwareversion,
  coalesce(iad_productclass, 'Unknown') AS iad_productclass,
  coalesce(iad_provisioningcode, 'Unknown') AS iad_provisioningcode,
  coalesce(iad_softwareversion, 'Unknown') AS iad_softwareversion,
  coalesce(iad_vendorconfigfiledescription_1, 'Unknown') AS iad_vendorconfigfiledescription_1,
  coalesce(iad_vendorconfigfilename_1, 'Unknown') AS iad_vendorconfigfilename_1,
  coalesce(iad_vendorconfigfilenumbofentries, 0) AS iad_vendorconfigfilenumbofentries,
  coalesce(iad_vendorconfigfileversion_1, 'Unknown') AS iad_vendorconfigfileversion_1,
  coalesce(iad_x_000e50_boardversion, 'Unknown') AS iad_x_000e50_boardversion,
  coalesce(stb_description, 'Unknown') AS stb_description,
  coalesce(stb_devicestatus, 'Unknown') AS stb_devicestatus,
  coalesce(stb_gwinfoproductclass, 'Unknown') AS stb_gwinfoproductclass,
  coalesce(stb_hardwareversion, 'Unknown') AS stb_hardwareversion,
  coalesce(stb_manufacturer, 'Unknown') AS stb_manufacturer,
  coalesce(stb_productclass, 'Unknown') AS stb_productclass,
  coalesce(stb_softwareversion, 'Unknown') AS stb_softwareversion,
  dev_iad_uptime_diff, dsl_showtime_diff, dev_stb_uptime_diff,
  kpi_iad_uptime, kpi_iad_synctime, kpi_stb_uptime,
  dev_iad_uptime, dsl_showtime, dev_stb_uptime,
  dsl_downstr_att, dsl_downstr_cur, dsl_downstr_max,
  kpi_voip_nb_dropped_calls_diff, kpi_voip_nb_dropped_calls,
  kpi_dsl_nb_crc, kpi_dsl_dscurrate_ratio_qualite,
  kpi_voip_tx_appels_coupes, kpi_voip_qualite, kpi_voip_qualite_diff,
  kpi_iptv_plr_nb_bon, kpi_iptv_plr_nb_moyen,
  kpi_iptv_conso_heures, kpi_iptv_packetslosts, kpi_iptv_packetsreceived,
  kpi_dsl_dscurrate_before, kpi_dsl_dscurrate_after
FROM wrk.cih_dshb_bis
WHERE network = 'BYT'
  AND stb_manufacturer IS NOT NULL
  AND topo_dslam IS NOT NULL;
Input Data
SELECT * FROM forest_drive(
  ON (SELECT 1) PARTITION BY 1
  DATABASE('beehive') USERID('beehive') PASSWORD('beehive')
  INPUTTABLE('wrk.cih_dshb_tree_in')
  OUTPUTTABLE('wrk.cih_dshb_tree_out')
  RESPONSE('sav_flag')
  NUMERICINPUTS('KPI_SIGNAL')
  CATEGORICALINPUTS('offer', 'nom_dep', 'nom_region', 'topo_nra', 'topo_dslam',
    'iad_modemfirmwareversion', 'iad_vendorconfigfiledescription_1',
    'iad_x_000e50_boardversion', 'stb_description', 'stb_productclass',
    'stb_softwareversion', 'topo_dslam_brand')
  NUMTREES(4)
);
Decision Trees
CREATE TABLE wrk.cih_dshb_model (PARTITION KEY(class)) AS
SELECT * FROM naiveBayesReduce(
  ON (SELECT * FROM naiveBayesMap(
        ON (SELECT * FROM wrk.cih_dshb_ads_in_11 WHERE kpi_iad_uptime IS NOT NULL)
        RESPONSE('sav_flag')
        NUMERICINPUTS('dev_iad_uptime', 'dsl_showtime', 'dev_stb_uptime',
          'dsl_downstr_att', 'dsl_downstr_cur', 'dsl_downstr_max',
          'kpi_voip_nb_dropped_calls_diff', 'kpi_voip_nb_dropped_calls',
          'kpi_dsl_nb_crc', 'kpi_dsl_dscurrate_ratio_qualite',
          'kpi_voip_tx_appels_coupes', 'kpi_voip_qualite', 'kpi_voip_qualite_diff',
          'kpi_iptv_plr_nb_bon', 'kpi_iptv_plr_nb_moyen', 'kpi_iptv_plr_nb_mauvais',
          'kpi_iptv_packetslosts', 'kpi_iptv_packetsreceived',
          'kpi_stb_uptime', 'kpi_iad_synctime', 'kpi_iad_uptime')
        CATEGORICALINPUTS('offer', 'nom_dep', 'nom_region', 'topo_nra', 'topo_dslam',
          'iad_modemfirmwareversion', 'iad_vendorconfigfiledescription_1',
          'iad_x_000e50_boardversion', 'stb_description', 'stb_productclass',
          'stb_softwareversion', 'topo_dslam_brand')
      ))
  PARTITION BY class
);
Naïve Bayes
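The map and reduce split used above can be pictured with a small Python sketch: a map step emits counts per (class, attribute, value) for each input row, and a reduce step sums them into a model. This is only a conceptual illustration restricted to categorical inputs with add-one smoothing. The function names nb_map, nb_reduce and nb_predict and the sample rows are invented, and nothing here reflects how the Aster functions are implemented internally.

```python
import math
from collections import Counter


def nb_map(row, response, categorical):
    """Map step: emit ((class, attribute, value), 1) pairs for one input row."""
    label = row[response]
    yield (label, None, None), 1  # contributes to the class count
    for attr in categorical:
        yield (label, attr, row[attr]), 1


def nb_reduce(pairs):
    """Reduce step: sum the counts emitted for each key."""
    model = Counter()
    for key, n in pairs:
        model[key] += n
    return model


def nb_predict(model, row, categorical):
    """Return the class with the highest log-probability for `row`."""
    classes = {key[0]: n for key, n in model.items() if key[1] is None}
    total = sum(classes.values())
    best_label, best_score = None, -math.inf
    for label, class_count in classes.items():
        score = math.log(class_count / total)
        for attr in categorical:
            distinct_values = {key[2] for key in model if key[1] == attr}
            count = model.get((label, attr, row[attr]), 0)
            # Add-one smoothing so unseen values do not zero out the score
            score += math.log((count + 1) / (class_count + len(distinct_values)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training rows, for illustration only
training = [{"offer": "A", "sav_flag": 1}] * 3 + [{"offer": "B", "sav_flag": 0}] * 3
model = nb_reduce(pair for row in training for pair in nb_map(row, "sav_flag", ["offer"]))
print(nb_predict(model, {"offer": "A"}, ["offer"]))
```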
CREATE TABLE wrk.cih_svm_train2 DISTRIBUTE BY HASH(srv_id) AS
SELECT srv_id, 'topo_nra_insee' AS attr, topo_nra_insee::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'code_postal' AS attr, code_postal::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'kpi_iad_uptime_avg' AS attr, kpi_iad_uptime_avg::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'dev_iad_uptime_diff_avg' AS attr, dev_iad_uptime_diff_avg::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'kpi_voip_nb_dropped_calls_diff_avg' AS attr, kpi_voip_nb_dropped_calls_diff_avg::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'sav_nb_contacts' AS attr, sav_nb_contacts::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'nb_tr' AS attr, nb_tr::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train
UNION ALL
SELECT srv_id, 'kpi_dsl_nb_crc_avg' AS attr, kpi_dsl_nb_crc_avg::varchar AS attr_value, sav_all_tgt FROM wrk.cih_sav_train;

/* Run SVM */
CREATE TABLE wrk.cih_svm_model3 (PARTITION KEY(vec_index)) AS
SELECT vec_index, avg(vec_value) AS vec_value
FROM svm(
  ON wrk.cih_svm_train2
  PARTITION BY srv_id
  OUTCOME('sav_flag')
  ATTRIBUTE_NAME('attr')
  ATTRIBUTE_VALUE('attr_value')
)
GROUP BY vec_index;
Support Vector Machine
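The chain of UNION ALL statements that builds the training table is an unpivot: each wide row becomes one (srv_id, attr, attr_value, target) row per attribute. A minimal Python sketch of the same reshaping is shown below; the function name unpivot and the sample row are invented for the example.

```python
def unpivot(rows, key, target, attributes):
    """Turn each wide row into one long row per attribute, with the value cast to text."""
    long_rows = []
    for row in rows:
        for attr in attributes:
            long_rows.append({
                key: row[key],
                "attr": attr,
                "attr_value": str(row[attr]),  # counterpart of the ::varchar cast
                target: row[target],
            })
    return long_rows


# Invented sample row, for illustration only
wide_rows = [{"srv_id": 7, "code_postal": 75001, "nb_tr": 3, "sav_all_tgt": 1}]
long_rows = unpivot(wide_rows, key="srv_id", target="sav_all_tgt",
                    attributes=["code_postal", "nb_tr"])
for r in long_rows:
    print(r)
```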
Lift Chart to View Predictive Model Performance
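For reference, the numbers behind a cumulative lift chart can be computed as in the Python sketch below: rank cases by model score, then compare the hit rate in the top buckets with the overall hit rate. The function name cumulative_lift, the bucket count and the sample scores are assumptions for illustration and do not come from the model shown in the slides.

```python
def cumulative_lift(scores, labels, buckets=10):
    """Return the cumulative lift after each bucket, highest scores first."""
    if len(scores) != len(labels) or len(scores) < buckets:
        raise ValueError("need one label per score and at least one case per bucket")
    # Rank the true labels by descending model score
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    total = len(ranked)
    base_rate = sum(ranked) / total
    if base_rate == 0:
        raise ValueError("no positive cases, lift is undefined")
    lifts = []
    for b in range(1, buckets + 1):
        cutoff = total * b // buckets
        hit_rate = sum(ranked[:cutoff]) / cutoff
        lifts.append(hit_rate / base_rate)
    return lifts


# Invented scores and outcomes, for illustration only
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
sample_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
lifts = cumulative_lift(sample_scores, sample_labels, buckets=5)
print([round(x, 2) for x in lifts])
```

A lift above 1 in the first buckets means the model concentrates the positive cases among its highest-scored customers; the last bucket always has a lift of 1 because it covers everyone.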