data mining la plata 11 nov 2002 - census


Page 1: Data Mining La Plata 11 Nov 2002 - CENSUS

Census Data Analysis & Data Mining

Data Mining

María del Rosario Bruera

IBM Scholars Program

Questions and answers

Questions:
· What is the value of our customers?
· Which customers are most likely to defect?
· Which products are sold together? …

Answers:
· They are in the user's data
· Special tools are needed to find them

Page 2: Data Mining La Plata 11 Nov 2002 - CENSUS

Business Intelligence

“An umbrella that covers a set of concepts and methodologies whose mission is to improve business decision-making by relying on facts and fact-based systems”

Howard Dresner, Gartner Group […]

B.I.: resources and tools

· Data sources: warehouses, data marts, etc.
· Data administration tools
· Extraction and query tools
· Modeling tools (Data Mining)

Page 3: Data Mining La Plata 11 Nov 2002 - CENSUS

What is Data Mining? (1997)

· Data Mining is the process of exploring and analyzing data, by automatic or semi-automatic means, in order to obtain meaningful patterns and business rules.
· Michael Berry, Gordon Linoff, Data Mining for Marketing, Sales and Customer Support, Wiley, USA […]

Reflections (2000)

· ... we like the notion that patterns must be meaningful …
· ... If there is one thing we reject, it is the phrase “by automatic or semi-automatic means”; not because it is untrue (without automation it is impossible to mine large amounts of data), but because we believe too much emphasis has been placed on automation and not enough on the exploration and analysis stages
· ... Data Mining is a process ... [attribution illegible in the source]

Page 4: Data Mining La Plata 11 Nov 2002 - CENSUS

What Data Mining is NOT

· It is not a product bought off the shelf, but a discipline that must be mastered.
· It is not an instant solution to business problems.
· It is not an end in itself, but a process that helps find solutions to business problems.

Pillars of the Data Mining process

· Data
· Algorithms and techniques
· Modeling practices

Page 5: Data Mining La Plata 11 Nov 2002 - CENSUS

Disciplines that come together

· Artificial Intelligence
· Statistics
· Decision support technologies (OLTP)
· Hardware and software technologies

Historical perspective

Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data

Page 6: Data Mining La Plata 11 Nov 2002 - CENSUS

Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data

Stages in the Data Mining process

· Identify the business problem
· Transform the data into information
· Act on the results
· Measure the results of the actions

Page 7: Data Mining La Plata 11 Nov 2002 - CENSUS

The Mining Process

Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data

The Data Analyst

· Is the link between the information technology areas and the business areas
· Translates information requirements into questions suitable for analysis with the mining tools.
· Feeds new data cleaning and data validation criteria back into the company's Data Warehouse.

Page 8: Data Mining La Plata 11 Nov 2002 - CENSUS

The Data Analyst

[Diagram labels: Information technology; Business users]

The Data Analyst

[Diagram labels: Business goals, Task Discovery; Data Warehouse; Data Discovery; Mining Data; Data Cleaning; Data Transformation; Data Analysis; Data Modelling; Insights, Knowledge Discovery]

Page 9: Data Mining La Plata 11 Nov 2002 - CENSUS

Required skills

· Data manipulation (SQL)
· Knowledge of mining techniques and exploratory analysis
· Ability to communicate (interpret) business problems
· Creativity

Data Mining Team

Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data

Page 10: Data Mining La Plata 11 Nov 2002 - CENSUS

Project costs

Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data

Origin of the data

· Relational databases
· Data Warehouses
· Data Marts and OLAP
· Other formats: Excel, ASCII files, surveys, census data, etc.

Page 11: Data Mining La Plata 11 Nov 2002 - CENSUS

Types of data sources

· Transactional: e.g. operations carried out with a credit card
· Relational: e.g. the structure of the products offered by the bank
· Demographic: e.g. characteristics of the household

The shape of the data for Data Mining

· The data are organized as a flat table made up of rows and columns.
· Rows: the unit of analysis. For example: an account, a ticket
· Columns: the attributes of each unit of analysis. For example: frequency of credit card use

Page 12: Data Mining La Plata 11 Nov 2002 - CENSUS

Characteristics of data tables for Data Mining

· All the data must be in a single table or “view” of the database
· Each row must correspond to an instance that is relevant to the business
· Columns with no variability must be ignored
· Columns with a unique value for each case must be ignored (e.g. account number)
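The two "must be ignored" rules above can be sketched in a few lines of Python. This is only an illustration: the function name `columns_to_ignore`, the sample rows and the column names are hypothetical and do not come from the slides.

```python
def columns_to_ignore(rows):
    """Split the columns of a flat table into those to ignore.

    rows is a list of dicts, one dict per unit of analysis (assumed layout).
    Returns (constant, identifier_like) lists of column names.
    """
    constant, identifier_like = [], []
    if not rows:
        return constant, identifier_like
    for name in rows[0]:
        distinct = {row[name] for row in rows}
        if len(distinct) == 1:
            constant.append(name)         # no variability at all
        elif len(distinct) == len(rows):
            identifier_like.append(name)  # a different value for every case
    return constant, identifier_like


# Hypothetical sample: one row per account.
accounts = [
    {"account_no": 1001, "country": "AR", "card_uses_per_month": 4},
    {"account_no": 1002, "country": "AR", "card_uses_per_month": 9},
    {"account_no": 1003, "country": "AR", "card_uses_per_month": 4},
]
constant, identifier_like = columns_to_ignore(accounts)
print(constant)         # ['country']
print(identifier_like)  # ['account_no']
```

A continuous measurement can also happen to be unique in every row, so the second list is best treated as a set of candidates to review rather than columns to drop automatically.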

Data quality

· The success of Data Mining activities is directly related to the QUALITY of the data.
· Missing data (“missings”) and out-of-range data (“outliers”) must be identified.

Page 13: Data Mining La Plata 11 Nov 2002 - CENSUS

Data quality

· It is often necessary to pre-process the data before passing them to the analysis model. Pre-processing may include transformations, reductions or combinations of the data.
· The semantics of the data should guide the choice of a suitable representation, and the merits of the chosen representation bear directly on the quality of the model and of the subsequent results.

Problems with the data

· Too much data
  - corrupt or noisy data
  - redundant data (requires factorization)
  - irrelevant data
  - excessive volume of data (sampling)
· Too little data
  - missing attributes (missings)
  - missing values
  - small volume of data
· Fractured data
  - incompatible data
  - multiple data sources

Page 14: Data Mining La Plata 11 Nov 2002 - CENSUS

Data preparation

[Diagram: Data Warehouse → Select → Selected Data → Transform → Transformed Data → Mine → Extracted Information → Assimilate → Assimilated Information]

Data Warehouse

· Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions
  Bill Inmon ([…])
· A copy of transaction data specifically structured for query and analysis.
  Ralph Kimball

Page 15: Data Mining La Plata 11 Nov 2002 - CENSUS

Data Marts

· Technically, a data mart is a subset of the DW oriented to a specific business purpose: marketing, finance, production, etc.
· The term is also used for alternative solutions to a corporate DW that are smaller, cheaper and quicker to implement.

Data Warehouse architecture

[Diagram labels: Operational and external data; DW; Metadata; Report, Query, EIS; OLAP; Data Mining. One further label is illegible in the source.]

Page 16: Data Mining La Plata 11 Nov 2002 - CENSUS

Tools for exploiting the DW

· Visualization tools
· Reporting
· OLAP
· Data Mining

OLAP

· On-Line Analytical Processing
· OLAP tools make it possible to build multidimensional views of the DW in order to optimize performance
· They are supported by DW management engines that allow these “cubes” to be built

Page 17: Data Mining La Plata 11 Nov 2002 - CENSUS

OLAP

· Useful and powerful tools for accessing databases and Data Warehouses and obtaining information “reports”.
· OLAP technology complements Data Mining activities and goes beyond what SQL can do

Data Mining and OLAP

· Reporting, OLAP and query tools work well for building descriptive and retrospective models that confirm or reject the user's prior hypotheses

Page 18: Data Mining La Plata 11 Nov 2002 - CENSUS

Data Mining and OLAP

· Data Mining tools make it possible to find non-obvious patterns in the large volumes of information held in the DW and to propose predictive models

What is Statistics

· It is the discipline that extracts general information from specific data.
· It is the study of stability in variation
· It is the art of examining, summarizing and drawing conclusions from data.

Page 19: Data Mining La Plata 11 Nov 2002 - CENSUS

Data Mining and Statistics

· Statistical methods are at the heart of many data mining techniques.
· Many of these techniques were originally designed for confirmatory purposes.
· Exploratory statistics emerged in the […] with the contributions of J. Tukey

Data Mining and Statistics

· In Data Mining no a priori assumptions are made about the nature of the variables or the relationships between them (normality, linearity, etc.)
· For Data Mining, statistical algorithms are adapted to process large volumes of data

Page 20: Data Mining La Plata 11 Nov 2002 - CENSUS

Data Mining and AI

· Artificial Intelligence enters Data Mining through artificial neural networks
· They are used to build non-linear predictive models that learn through training and that resemble models of networks of biological neurons.

Neural networks

· Neural networks are well suited to predictive problems.
· A problem appropriate for a neural network has three characteristics:
  - The INPUTS are clearly understood
  - The OUTPUT is clearly understood
  - There are enough examples (experience) to train the network

Page 21: Data Mining La Plata 11 Nov 2002 - CENSUS

Neural models

· A neural network does not produce explicit rules that describe the model
· A neural model is only as good as the data set used to train the network
· The model is static and must be explicitly updated by adding recent examples and retraining the network to keep it current and useful

Neural models

· Neural models can tackle a wide variety of problems and produce good results even in complex domains with continuous and categorical variables
· They are appropriate for classification and prediction tasks when the model's results matter more than understanding how the model works.

Page 22: Data Mining La Plata 11 Nov 2002 - CENSUS

Customer Relationship Management

· It is the process that manages the relationship between the company and its customers
· For it to succeed, customers' consumption and behavior patterns must be identified

Data Mining - CRM

· Data Mining is used to systematize the search for predictors of customer behavior during campaign design
· It is also applied to measure campaign results and feed them back into the CRM

Page 23: Data Mining La Plata 11 Nov 2002 - CENSUS

Typical Data Mining problems

· Classification
· Estimation
· Prediction
· Grouping based on association rules
· Clustering
· Description and visualization, etc.

Clustering problem

Group customers by their indicators R (Recency), F (Frequency), M (Amount), etc. into segments of homogeneous behavior.

Result: Heavy, Medium, Light customers, etc.
- […]% of billing is concentrated in the Heavy cluster ([…]% of customers).
- Heavy customers are married, with children, self-employed, with an income above […].

Page 24: Data Mining La Plata 11 Nov 2002 - CENSUS

Classification problem

Classify a new customer, according to their sociodemographic profile, as a potential Heavy, Medium or Light customer.

Estimation problem

Estimate the consumption of a given category of items by a group of customers in the next quarter.

Estimate the potential LTV (Life Time Value) of a new customer

Page 25: Data Mining La Plata 11 Nov 2002 - CENSUS

Prediction problem

Predict the loss of a customer (churning, attrition)
- For a mobile phone company
- For an AFJP (pension fund administrator)
- For a credit card

Association problem

Find the rules that determine cross-traffic between products for a bank's customers. For example: “When a customer becomes active in Savings Accounts, the next product in which they become active is Personal Loans. This pattern occurs in […]% of cases.”

Page 26: Data Mining La Plata 11 Nov 2002 - CENSUS

Visualization problem

Use geolocation software (GIS) to represent the distribution of customers within the catchment area of a retailer's branches.

Common problems

· Characterizing customer profiles to define up-selling and cross-selling actions
· Tracking campaigns and predicting response / non-response
· Credit card consumption basket and fraud prevention
· Churn prediction models

Page 27: Data Mining La Plata 11 Nov 2002 - CENSUS

Common problems

· Mileage and loyalty programs
· Consolidating in-house databases with external sources
· Web mining and analysis of traffic and use of e-business resources
· Defining sampling frames for market research and customer satisfaction surveys.

Choosing the model for Data Mining

· Main objectives of the Data Mining process:
  - prediction
  - description
· The method to use depends on the objectives of the analysis, but also on the quality and quantity of the available data

Page 28: Data Mining La Plata 11 Nov 2002 - CENSUS

Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data

How to select a potential DM application

Practical considerations:
· Potentially significant impact (cost-benefit ratio)
· There is no other alternative
· There is institutional support
· There are no legal impediments to using the information

Page 29: Data Mining La Plata 11 Nov 2002 - CENSUS

How to select a potential DM application

Technical considerations:
· Sufficient data is available
· Attributes are relevant
· Low noise levels in the data
· The confidence level required of the results is specified
· Prior knowledge exists

Model evaluation

· How well does the model fit?
· Is its description of the observed data correct?
· How much confidence can be placed in its predictions?
· How understandable is the model?

Page 30: Data Mining La Plata 11 Nov 2002 - CENSUS

The measures

· The agreement of a predictive model with reality is measured by the error rate, that is, the percentage of cases that were classified or predicted incorrectly.
· For this purpose, validation and testing data are kept, and the model must be applied to them periodically as a control.
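As a minimal sketch of the error rate described above, the following Python function compares known outcomes from a validation set with the model's predictions. The function name `error_rate` and the sample labels are illustrative assumptions, not part of the slides.

```python
def error_rate(actual, predicted):
    """Fraction of cases whose prediction differs from the known outcome."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one case is required")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical validation set: true segment versus segment predicted by the model.
validation_actual = ["Heavy", "Light", "Medium", "Light", "Heavy"]
validation_predicted = ["Heavy", "Light", "Light", "Light", "Medium"]
rate = error_rate(validation_actual, validation_predicted)
print(f"{rate:.0%}")  # 40%
```

Re-running the same calculation on fresh validation data at regular intervals is one way to carry out the periodic control the slide calls for.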

The measures

· For descriptive models, a good rule is one that gives the most understandable information with the shortest rule “length”.
· Ultimately, the most important measure of effectiveness is return on investment

Page 31: Data Mining La Plata 11 Nov 2002 - CENSUS

A successful project

· A single project leader
· A multidisciplinary team made up of people from IT and from the business areas
· Business units are involved from the start
· IT is involved from the start
· A small pilot project that demonstrates the advantages of Data Mining

The new technologies

Page 32: Data Mining La Plata 11 Nov 2002 - CENSUS

Web Mining

· It is the discovery of meaningful patterns through analysis of the structure, content and usage of the Web

Web Mining Taxonomy

[Diagram: Web Mining divides into Web content, Web structure and Web usage]

Page 33: Data Mining La Plata 11 Nov 2002 - CENSUS

Web mining results

· […]% of the visitors who access www.ibm.com/redbooks also access www.ibm.com/software/data/iminer/fordata
· Entry and exit points

Web mining results

· Link analysis and sequential patterns of page links
· E-commerce customer segmentation
· Product basket
· etc.

Page 34: Data Mining La Plata 11 Nov 2002 - CENSUS

Text Mining

· New tools designed to extract information from “unstructured” documents: organizing, segmenting and indexing them.

Text Mining problems

· Automatic routing of emails according to their content
· Automatic classification of intranet documents
· Searching for information in documents in several languages at once.

Page 35: Data Mining La Plata 11 Nov 2002 - CENSUS

Text Mining problems

· Content analysis of Web pages
· Organization of Web search services
· Extraction of summary concepts from documents dealing with the same subject.

Conclusions

Page 36: Data Mining La Plata 11 Nov 2002 - CENSUS

Why Data Mining

· Data Mining is an effective tool for answering complex Business Intelligence questions
· The available tools make it possible to automate part of the task of finding the behavior patterns hidden in the data
· But …

What cannot be automated (yet)

· Choosing the business problems that are candidates for Data Mining
· Identifying and collecting the data that contain the information sought
· Massaging and treating the data so that patterns can be searched for
· Designing and computing derived variables

Page 37: Data Mining La Plata 11 Nov 2002 - CENSUS

What cannot be automated (yet)

· The action plan that, building on the model's results, delivers the ROI
· Measuring the success of the actions taken on the basis of the results provided by Data Mining

Conclusions

· Make Data Mining a part of your business project.
· Include Data Mining in the “culture” of your organization.

Page 38: Data Mining La Plata 11 Nov 2002 - CENSUS

Examples with DB2 Intelligent Miner for Data

Techniques used

· Clustering (segmentation)
· Product basket
· Decision tree
· Neural network as a predictive model

Page 39: Data Mining La Plata 11 Nov 2002 - CENSUS

What is “clustering”?

· It is the partition of the set of individuals into subsets that are as homogeneous as possible.
· The objective is to maximize the similarity of the individuals within a cluster and to maximize the differences between clusters.

Applications of the technique

· Database segmentation
· Fraud detection
· Defect detection

Page 40: Data Mining La Plata 11 Nov 2002 - CENSUS

Objectives

· Determine the optimal number of clusters
· Assign each individual to a single cluster
· Assess the impact of the variables on cluster formation
· Understand the “profile” of each cluster

Similarity measures

· Categorical variables (nominal and ordinal scales): two values are similar if they are equal.
· Numeric variables (metric scales): the algorithm determines their difference expressed in standard deviation units.

Page 41: Data Mining La Plata 11 Nov 2002 - CENSUS

Similarity example

Name           Sex         Marital status   Place       Similarity
Juan           M           C                Cap.Fed     0.33
Maria          F           C                GBA         0.33
Not evaluated  Different   Equal            Different
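The 0.33 in the example is consistent with counting the share of evaluated categorical attributes on which the two individuals agree (one of three). The Python sketch below reproduces that reading and adds the standard-deviation-based difference mentioned for numeric variables; both helper names and the numeric sample values are assumptions for illustration, not the exact calculation used by the tool.

```python
def categorical_similarity(a, b, attributes):
    """Share of the listed attributes on which two individuals have equal values."""
    matches = sum(1 for name in attributes if a[name] == b[name])
    return matches / len(attributes)


def standardized_difference(x, y, std_dev):
    """Difference between two numeric values, in standard deviation units."""
    return abs(x - y) / std_dev


juan = {"sex": "M", "marital_status": "C", "place": "Cap.Fed"}
maria = {"sex": "F", "marital_status": "C", "place": "GBA"}

similarity = categorical_similarity(juan, maria, ["sex", "marital_status", "place"])
print(round(similarity, 2))  # 0.33

# Hypothetical ages 45 and 40, with an assumed standard deviation of 10 years.
print(standardized_difference(45, 40, 10))  # 0.5
```

The name column is left out of the attribute list, matching the "not evaluated" entry in the example.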

Condorcet criterion

· It is a similarity measure that ranges between […] and […]
· A value of […]: individuals are placed in the clusters at random
· A value of […]: all individuals within each cluster are identical, and there are no individuals with those characteristics outside the cluster.
· Usual minimum Condorcet: […]

Page 42: Data Mining La Plata 11 Nov 2002 - CENSUS

The problem

The task is to segment the customer database of a credit card, based on the customers' consumption indicators, in order to identify the highest-value segment.

The available data

· The following variables are derived from the database of customer transactions for the last year:
  - Frequency of card use: calculated as the mean number of days between transactions.
  - Average monthly transaction balance in […]
  - Average amount per transaction
  - Number of services paid by automatic debit
  - Sociodemographic data: sex, age, marital status, occupation, children
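The frequency variable, defined above as the mean number of days between transactions, could be derived along these lines. This is a sketch: `mean_days_between` and the sample dates are hypothetical, and how adjustments or automatic debits should count is left open.

```python
from datetime import date


def mean_days_between(transaction_dates):
    """Mean gap in days between consecutive transactions, or None if fewer than two."""
    ordered = sorted(transaction_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical transaction dates for one account.
dates = [date(2002, 1, 5), date(2002, 1, 15), date(2002, 2, 4)]
frequency = mean_days_between(dates)
print(frequency)  # 15.0
```

With this definition a lower value means more frequent use, which is worth keeping in mind when reading cluster profiles.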

Page 43: Data Mining La Plata 11 Nov 2002 - CENSUS

Data preparation

· Define the unit of analysis: account or card?
· Define what a transaction is, e.g. how are adjustments (negative amounts) treated?
· Define derived variables: how do automatic debits count towards frequency?

Data preparation

· Describe the variables to be included in the model in order to:
  - Calculate measures of position and dispersion
  - Identify skewed distributions
  - Identify missings
  - Identify incorrect or out-of-range values
  - Identify outliers
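The checks in this list can be sketched with Python's standard `statistics` module. Several choices here are assumptions rather than anything stated in the slides: the function name `describe`, the valid range, the 1.5 × IQR rule for outliers, and the crude "mean greater than median" test for right skew.

```python
import statistics


def describe(values, valid_range):
    """Summarize one numeric variable before modeling (illustrative checks only)."""
    low, high = valid_range
    present = [v for v in values if v is not None]
    in_range = [v for v in present if low <= v <= high]

    mean = statistics.mean(in_range)
    median = statistics.median(in_range)
    q1, _, q3 = statistics.quantiles(in_range, n=4)
    iqr = q3 - q1

    return {
        "missing": len(values) - len(present),
        "out_of_range": [v for v in present if not low <= v <= high],
        "mean": mean,
        "median": median,
        "std_dev": statistics.stdev(in_range),
        "right_skewed": mean > median,  # crude asymmetry indicator (assumption)
        "outliers": [v for v in in_range
                     if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr],
    }


# Hypothetical ages: None is a missing value, 250 is not a valid age.
ages = [34, 41, None, 38, 45, 29, 250, 40, 43, 36, 95]
summary = describe(ages, valid_range=(18, 110))
print(summary["missing"], summary["out_of_range"], summary["outliers"])
# 1 [250] [95]
```

Out-of-range values are separated from outliers on purpose: the first are data errors, while the second are valid but unusual cases that may deserve their own treatment.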

Page 44: Data Mining La Plata 11 Nov 2002 - CENSUS

General descriptive statistics

[Figure: “Estadisticas Cluster 0”, 100.00% of the population; distribution charts for scios, edad, estado_civil (Divorciado/Viudo, Casado, Soltero), ocup (Cuenta Propia, Relacion dependencia, No trabaja), sexo (Femenino, Masculino), hijos (No, Si), avgtckt, frecu and pesos]

Segmentation criteria

· The variables that describe consumption behavior are taken as “active” variables.
· The sociodemographic attributes are taken as supplementary variables.

Page 45: Data Mining La Plata 11 Nov 2002 - CENSUS

Credit Card

[Figure: clustering result with three clusters: cluster 1 (55% of the population), cluster 2 (27%) and cluster 0 (18%). Each cluster row shows charts for scios, sexo, estado_civil, ocup, hijos, frecu, pesos, avgtckt and edad.]

Credit Card: Cluster 2 (27.21% of the population)

· They have 4 or more automatic debits
· Married
· Self-employed
· Male
· Balance >>>
· With children
· Frequent use
· Age 40-45
· Ticket >>>

Page 46: Data Mining La Plata 11 Nov 2002 - CENSUS

Pareto

[Figure: chart of % Cuentas (% of accounts) and % Suma Saldo (% of total balance) for Cluster 0, Cluster 1 and Cluster 2; axis from 0 to 120]

Decision trees

· These techniques are used for prediction and classification.
· The result is a set of “rules” that explain the behavior of one variable (TARGET) in terms of others (PREDICTORS).
· In this example they are used to “explain” the clusters.

Page 47: Data Mining La Plata 11 Nov 2002 - CENSUS

Algorithms

· CHAID (Chi-Squared Automatic Interaction Detection)
· CRT (Classification and Regression Tree)
· C[…], Quest and others
· Intelligent Miner uses a variant of CRT

Behavior tree

If the customer has 4 or more automatic debits and a balance > $ 727, then the probability of belonging to cluster 2 is 99%
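A path through a decision tree reads naturally as an if/then rule. The sketch below encodes only the single rule quoted above; the function name is hypothetical, and returning `None` for customers the rule does not cover is an assumption, since the rest of the tree is not shown.

```python
def cluster_2_probability(automatic_debits, balance):
    """Probability of belonging to cluster 2 when the quoted rule applies, else None."""
    if automatic_debits >= 4 and balance > 727:
        return 0.99  # figure quoted for this branch of the tree
    return None      # other branches are not given in the source


print(cluster_2_probability(5, 900))  # 0.99
print(cluster_2_probability(3, 900))  # None
```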

Page 48: Data Mining La Plata 11 Nov 2002 - CENSUS

Sociodemographic tree

Market Basket Analysis

· The problem: find the association rules that organize the orders for extra “toppings” at a pizzeria, based on the analysis of a set of […] sales tickets.

Page 49: Data Mining La Plata 11 Nov 2002 - CENSUS

The Data Mining table

· Ticket id
· Product code:
  - […] Mushrooms
  - […] Pepperoni
  - […] Cheese
  - […] Beer
  - […] Soft drink
  - […] Other drink

Purpose of MBA

· Generate rules of the form:
  - IF condition THEN result
· Example:
  - IF product A and product C THEN product B

Page 50: Data Mining La Plata 11 Nov 2002 - CENSUS

Types of rules

· Useful / actionable: rules that carry good-quality information and can be translated into business actions.
· Trivial: rules already known in the business because they occur so often
· Inexplicable: arbitrary curiosities with no practical application

Problems with MBA

· A large number of items in the analysis set increases computation time exponentially
· Criteria must be defined for selecting the best rules
· Building a product taxonomy is important

Page 51: Data Mining La Plata 11 Nov 2002 - CENSUS

How good is a rule?

· Measures that rate a rule:
  - Support
  - Confidence
  - Lift (Improvement)

Support

· It is the number (%) of transactions in which the rule is found:
  - E.g. “If A then B” is present in 4000 of 10000 transactions.
  - Support (A/B): 40%

Page 52: Data Mining La Plata 11 Nov 2002 - CENSUS

Confidence

· Number (%) of transactions that contain the rule, relative to the number of transactions that contain the conditional clause
  - E.g. for the previous case, if A is present in 6000 transactions (60%)
  - Confidence (A/B) = 40% / 60% ≈ 67%

Lift (Improvement)

· Predictive power of the rule:
  - Improvement = p(A/B) / (p(A) * p(B))
  - E.g. p(A/B) = 40%; p(A) = 60%; p(B) = 30%
  - Improv (A/B) = 40% / (60% * 30%) = 2.22
  - Greater than 1: the rule has predictive value
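For reference, the three measures can be written in standard probability notation. This restatement assumes that what the slides write as p(A/B) is the share of transactions containing both A and B, written here as P(A ∩ B); the numbers are the ones used in the examples above, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
\operatorname{support}(A \Rightarrow B)    &= P(A \cap B) = \frac{4000}{10000} = 0.40 \\
\operatorname{confidence}(A \Rightarrow B) &= \frac{P(A \cap B)}{P(A)} = \frac{0.40}{0.60} \approx 0.67 \\
\operatorname{lift}(A \Rightarrow B)       &= \frac{P(A \cap B)}{P(A)\,P(B)} = \frac{0.40}{0.60 \times 0.30} \approx 2.22
\end{align*}
```

Lift can equally be read as confidence divided by P(B): a value of 1 means A tells us nothing about B, and a value below 1 means B is less likely when A is present.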

Page 53: Data Mining La Plata 11 Nov 2002 - CENSUS

Worked example

Basic data

Mushrooms  Pepperoni  Cheese  Count
Yes        Yes        Yes       100
Yes        Yes        No        400
Yes        No         Yes       300
Yes        No         No        100
No         Yes        Yes       200
No         Yes        No        150
No         No         Yes       200
No         No         No        550
TOTAL                          2000

Page 54: Data Mining La Plata 11 Nov 2002 - CENSUS

Rules

Rule                               Count  Support  Confidence  Lift
Mushrooms                            900     0.45
Pepperoni                            850     0.43
Cheese                               800     0.40
Mushrooms --> Pepperoni              500     0.25        0.56  1.31
Mushrooms --> Cheese                 400     0.20        0.44  1.11
Cheese --> Pepperoni                 300     0.15        0.38  0.88
Mushrooms + Pepperoni --> Cheese     100     0.05        0.20  0.50
Mushrooms + Cheese --> Pepperoni     100     0.05        0.25  0.59
Cheese + Pepperoni --> Mushrooms     100     0.05        0.33  0.74

Legend: rules that can be discarded for low support; significant rules
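The rule measures can be recomputed directly from the eight ticket counts of the basic data. The Python sketch below is one way to do it; the English item names (mushrooms for Hongos, cheese for Queso), the dictionary layout and the helper names are assumptions made for the example.

```python
# Ticket counts keyed by (mushrooms, pepperoni, cheese) presence flags.
ITEMS = ("mushrooms", "pepperoni", "cheese")
tickets = {
    (True, True, True): 100,
    (True, True, False): 400,
    (True, False, True): 300,
    (True, False, False): 100,
    (False, True, True): 200,
    (False, True, False): 150,
    (False, False, True): 200,
    (False, False, False): 550,
}
TOTAL = sum(tickets.values())


def count(items):
    """Number of tickets that contain every item in the given tuple."""
    positions = [ITEMS.index(item) for item in items]
    return sum(n for combo, n in tickets.items()
               if all(combo[p] for p in positions))


def rule_metrics(body, head):
    """Return (support, confidence, lift) for the rule body --> head."""
    both = count(body + head)
    support = both / TOTAL
    confidence = both / count(body)
    lift = confidence / (count(head) / TOTAL)
    return support, confidence, lift


support, confidence, lift = rule_metrics(("mushrooms",), ("pepperoni",))
print(round(support, 2), round(confidence, 2), round(lift, 2))  # 0.25 0.56 1.31
```

Calling `rule_metrics` with a two-item body, for example `("cheese", "pepperoni")` and `("mushrooms",)`, gives the measures for the three-item rules in the same way.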

Another MBA example

· The association is posed between pizza toppings and drinks.
· Rule charts make it possible to identify visually the rules with good support, confidence and lift

Page 55: Data Mining La Plata 11 Nov 2002 - CENSUS

Census Data Analysis & Data Mining

Census Data Analysis & Data Mining

Rules

Support (%)  Confidence (%)  Type  Lift    Rule body                              Rule head
 3.1746      80.0000         +     1.7800  [Mushrooms]+[Other beverage]       ==> [Pepperoni]
16.6667      81.8200         +     1.7200  [Beer]+[Pepperoni]                 ==> [Mushrooms]
13.0688      78.4100         +     1.6500  [Beer]+[Cheese]                    ==> [Mushrooms]
16.6667      63.0000         .     1.5400  [Mushrooms]+[Pepperoni]            ==> [Beer]
29.8413      72.8700         +     1.5300  [Beer]                             ==> [Mushrooms]
29.8413      62.6700         +     1.5300  [Mushrooms]                        ==> [Beer]
13.0688      61.7500         .     1.5100  [Mushrooms]+[Cheese]               ==> [Beer]
 9.0476      57.0000         +     1.4000  [Pepperoni]+[Cheese]               ==> [Soda]
 3.0159      57.0000         .     1.3900  [Mushrooms]+[Pepperoni]+[Cheese]   ==> [Beer]
 6.9312      56.9600         .     1.3500  [Mushrooms]+[Soda]                 ==> [Cheese]
 9.0476      56.4400         .     1.3300  [Soda]+[Pepperoni]                 ==> [Cheese]
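A rule list like the one above is usually narrowed down by setting minimum values for support, confidence and lift. The Python sketch below applies such a filter to the eleven rules from the table. The thresholds (10%, 70%, 1.5) and the names `Rule` and `select_rules` are assumptions chosen for illustration, not values from the presentation.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    support: float      # percent
    confidence: float   # percent
    lift: float
    body: str
    head: str


# Rows copied from the rule table (item names in English).
RULES = [
    Rule(3.1746, 80.00, 1.78, "Mushrooms+Other beverage", "Pepperoni"),
    Rule(16.6667, 81.82, 1.72, "Beer+Pepperoni", "Mushrooms"),
    Rule(13.0688, 78.41, 1.65, "Beer+Cheese", "Mushrooms"),
    Rule(16.6667, 63.00, 1.54, "Mushrooms+Pepperoni", "Beer"),
    Rule(29.8413, 72.87, 1.53, "Beer", "Mushrooms"),
    Rule(29.8413, 62.67, 1.53, "Mushrooms", "Beer"),
    Rule(13.0688, 61.75, 1.51, "Mushrooms+Cheese", "Beer"),
    Rule(9.0476, 57.00, 1.40, "Pepperoni+Cheese", "Soda"),
    Rule(3.0159, 57.00, 1.39, "Mushrooms+Pepperoni+Cheese", "Beer"),
    Rule(6.9312, 56.96, 1.35, "Mushrooms+Soda", "Cheese"),
    Rule(9.0476, 56.44, 1.33, "Soda+Pepperoni", "Cheese"),
]


def select_rules(rules, min_support, min_confidence, min_lift):
    """Keep rules that meet all three minimums, strongest lift first."""
    kept = [
        r for r in rules
        if r.support >= min_support
        and r.confidence >= min_confidence
        and r.lift >= min_lift
    ]
    return sorted(kept, key=lambda r: r.lift, reverse=True)


# Hypothetical thresholds: support >= 10%, confidence >= 70%, lift >= 1.5
selected = select_rules(RULES, min_support=10.0, min_confidence=70.0, min_lift=1.5)
for r in selected:
    print(f"{r.body} ==> {r.head}: support {r.support:.1f}%, "
          f"confidence {r.confidence:.1f}%, lift {r.lift:.2f}")
```

With these thresholds the first rule in the table is dropped despite having the highest lift, because its support is only about 3%.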


Web Mining

· The problem: analyze the transactions and the profile of the users of the Web site of an online retailer.


Applied models

· Association of visited pages, product basket

· User profile: demographic clustering

· Potential buyers: decision tree


Page association

[Table of page association rules; contents not recoverable from the extraction.]


Clustering result: Business cluster

· High revenue
· Low COMMUNICATION
· Low FUN
· High AGE
· High rate in REGION 6 = Frankfurt
· Most are male


Clustering result: Fun cluster

· 10% of all users
· Low revenue
· High COMMUNICATION
· High FUN
· Low AGE
· High rate in REGION 5 = Cologne
· Most are female


Classification result

IF the interest in INFORMATION is very low (nearly 0) AND in COMMUNICATION high (with at least an access rate of 5) THEN the visitor will probably not buy (95.5%).
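A classification rule of this kind translates directly into a predicate that can be applied to each visitor. The following Python sketch is only an illustration: the function name, the argument names, and the cutoff used for "nearly 0" (`near_zero=0.1`) are assumptions, since the slide does not give an exact value or the units of the interest measures.

```python
# Share of visitors matching the rule who do not buy, as stated in the rule.
NOT_BUY_RATE = 0.955


def probably_will_not_buy(information_interest, communication_rate, near_zero=0.1):
    """Apply the IF ... AND ... THEN rule to one visitor.

    information_interest -- interest in INFORMATION (assumed numeric)
    communication_rate   -- access rate for COMMUNICATION
    near_zero            -- hypothetical cutoff for "very low (nearly 0)"
    """
    return information_interest <= near_zero and communication_rate >= 5


# Hypothetical visitors, for illustration only.
print(probably_will_not_buy(0.0, 7))   # matches both conditions
print(probably_will_not_buy(3.0, 7))   # INFORMATION interest is not near zero
print(probably_will_not_buy(0.0, 2))   # COMMUNICATION access rate below 5
```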


Click sequence

· In 17.2% of all transactions, the user goes to GOURMET.html and then sends two emails.

· In 56.9% of all transactions, the user first goes to SPORTS.html, then uses the chat as a communication medium, and finally turns his attention to Fashion.

· In 25.9% of all transactions, the user first goes to womens-fashion.html, then sends a postcard, and goes back to womens-fashion.html again.
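The percentages in sequence patterns like these are the share of sessions in which the steps occur in the given order. The Python sketch below shows one way to compute that share. The sessions and event labels are made-up toy data, not the data behind the slide, and the names `contains_in_order` and `sequence_share` are illustrative.

```python
def contains_in_order(session, pattern):
    """True if every step of `pattern` occurs in `session`, in that order.

    Other events may occur in between; the iterator is consumed as each
    step is found, so later steps must appear after earlier ones.
    """
    events = iter(session)
    return all(step in events for step in pattern)


def sequence_share(sessions, pattern):
    """Fraction of sessions that contain the pattern in order."""
    hits = sum(contains_in_order(s, pattern) for s in sessions)
    return hits / len(sessions)


# Hypothetical click streams, for illustration only.
SESSIONS = [
    ["SPORTS.html", "chat", "Fashion"],
    ["GOURMET.html", "email", "email"],
    ["womens-fashion.html", "postcard", "womens-fashion.html"],
    ["SPORTS.html", "email", "chat", "Fashion"],
]

print(sequence_share(SESSIONS, ["SPORTS.html", "chat", "Fashion"]))
print(sequence_share(SESSIONS, ["GOURMET.html", "email", "email"]))
```

This version allows unrelated clicks between the steps of a pattern; a stricter variant would require the steps to be adjacent.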


Early detection of arrears

· The problem: identify in advance the customers most likely to fall into arrears, so that preventive collection and recovery actions can be brought forward.


Possible solutions

· Rules to identify the customer segments most prone to arrears

· Delinquency risk scoring


Applicable models

· For the rules: classification tree

· For the scoring: neural network model


Tree on a 50/50 sample


Delinquent customers

60-day arrears, region 90-98, 9.39% of the population

[Figure: segment profile over the variables MORA 60, VIP CUSTOMER, LATE FEES PAID, 30 DAYS, OVER CREDIT LIMIT, CREDIT SCORE, CUSTOMER AGE, CREDIT LIMIT, INCOME, MEMBER (MONTHS), # PURCHASES / WEEK, CASH LIMIT.]


Non-delinquent customers

60-day arrears, region 0-2, 6.81% of the population

[Figure: segment profile over the variables MORA 60, LATE FEES PAID, 30 DAYS, VIP CUSTOMER, OVER CREDIT LIMIT, CUSTOMER AGE, CREDIT SCORE, MEMBER (MONTHS), INCOME, CASH LIMIT, CREDIT LIMIT, # PURCHASES / WEEK.]


Verification

[Figure: predicted score (vertical axis, -0.2 to 1.2) by actual arrears (YES / NO).]

The score predicted by the network is clearly differentiated between delinquent customers and payers.


References

· IBM
  - www.ibm.com/software/data/iminer/fordata
  - www.dmg.org
  - www.ibm.com/redbooks
· The Data Mine: www.the-data-mine.com
· KDD Mine: www.kdnuggets.com
· chb@census.com.ar