distributed data mining in credit card fraud detection

Upload: pankaj-gorasiya

Post on 17-Oct-2015


Presentation on Distributed Data Mining in Credit Card Fraud Detection

INTRODUCTION

Data: Data are any facts, numbers, or text that can be processed by a computer.
E.g. sales, cost, inventory, forecast

Information: The patterns, associations, or relationships among all this data can provide information.
E.g. analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends.
E.g. summary information on retail supermarket sales can be analysed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

What is Data Mining?

Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. That information can be used to increase productivity.

Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

How Does Data Mining Work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyses relationships and patterns in stored transaction data based on open-ended user queries.

Generally, any of four types of relationships are sought:

- Classes: Stored data is used to locate data in groups. E.g. a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. E.g. data can be mined to identify market segments or consumer affinities.
- Associations:
- Sequential Patterns: Data is mined to anticipate behaviour patterns and trends. E.g. an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
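The backpack example can be made concrete with a small calculation. The sketch below estimates how often a backpack appears in baskets that already contain sleeping bags and hiking shoes (the "confidence" of that rule). The basket data and the function names `support` and `confidence` are hypothetical and chosen only for illustration; this is not necessarily how a production mining tool computes it.

```python
# Hypothetical purchase baskets echoing the outdoor-retailer example.
baskets = [
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes"},
    {"sleeping bag", "tent"},
    {"hiking shoes", "backpack"},
    {"tent"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from basket counts."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


rule_conf = confidence({"sleeping bag", "hiking shoes"}, {"backpack"}, baskets)
print(f"confidence = {rule_conf:.2f}")
```

With this toy data, three baskets contain both sleeping bags and hiking shoes, and two of those also contain a backpack, so the estimated likelihood is two in three.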

Data Mining Consists of Five Major Elements:

- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyse the data by application software.
- Present the data in a useful format, such as a graph or table.

Data Mining Process

- Understand the application domain
- Collect and create the target dataset
- Clean and transform the target dataset
- Select features, reduce dimensions
- Apply data mining algorithms
- Interpret, evaluate, and visualize patterns
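As a rough illustration of the middle steps of this process (clean, select features, apply an algorithm, interpret), here is a minimal sketch on made-up records. The record fields (`id`, `age`, `amount`), the plausible age range, and the "more than three times the median" flagging rule are all assumptions made for the example, not part of the presentation.

```python
from statistics import median

# Hypothetical target dataset: one duplicate, one implausible age, one missing amount.
records = [
    {"id": 1, "age": 34, "amount": 20.0},
    {"id": 1, "age": 34, "amount": 20.0},
    {"id": 2, "age": 200, "amount": 25.0},
    {"id": 3, "age": 41, "amount": None},
    {"id": 4, "age": 29, "amount": 30.0},
    {"id": 5, "age": 52, "amount": 22.0},
    {"id": 6, "age": 38, "amount": 500.0},
]


def clean(records):
    """Drop exact duplicates and implausible ages; replace missing amounts with 0."""
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        age = rec.get("age")
        if age is None or not 0 <= age <= 120:  # assumed plausible range
            continue
        amount = rec["amount"] if rec.get("amount") is not None else 0.0
        cleaned.append({**rec, "amount": amount})
    return cleaned


def flag_unusual(records, factor=3.0):
    """Toy 'mining' step: flag amounts above `factor` times the median amount."""
    threshold = factor * median(r["amount"] for r in records)
    return [r["id"] for r in records if r["amount"] > threshold]


cleaned = clean(records)
flagged = flag_unusual(cleaned)
print("kept ids:", [r["id"] for r in cleaned])
print("flagged ids:", flagged)
```

A real project would replace the median rule with an actual mining algorithm; the point here is only the order of the steps.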

DISTRIBUTED DATA MINING

The continuous developments in information and communication technology have recently led to the appearance of distributed computing environments, which comprise several different sources of large volumes of data and several computing units. The most prominent example of a distributed environment is the Internet, where increasingly more databases and data streams appear that deal with several areas, such as meteorology, oceanography, economy and others.

Distributed Data Mining (DDM) is concerned with the application of the classical data mining procedure in a distributed computing environment, trying to make the best of the available resources (communication network, computing units and databases).

Data mining takes place both locally at each distributed site and at a global level, where the local knowledge is fused in order to discover global knowledge.
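The local-then-global idea can be sketched in a few lines. In this simplified example each site summarises its own transactions, and a merger step fuses the summaries into a global model without seeing the raw records. The nearest-mean classifier, the `LocalModel` class, and the sample data are assumptions chosen for brevity; real DDM systems typically exchange richer models.

```python
from dataclasses import dataclass


@dataclass
class LocalModel:
    """Summary mined at one site: per-label record count and amount total."""
    count: dict
    total: dict


def mine_locally(transactions):
    """Local step: a site summarises its own (amount, label) records."""
    count, total = {}, {}
    for amount, label in transactions:
        count[label] = count.get(label, 0) + 1
        total[label] = total.get(label, 0.0) + amount
    return LocalModel(count, total)


def merge(models):
    """Global step: fuse local summaries into global mean amounts per label."""
    count, total = {}, {}
    for model in models:
        for label in model.count:
            count[label] = count.get(label, 0) + model.count[label]
            total[label] = total.get(label, 0.0) + model.total[label]
    return {label: total[label] / count[label] for label in count}


def classify(amount, global_means):
    """Assign the label whose global mean amount is closest to `amount`."""
    return min(global_means, key=lambda label: abs(amount - global_means[label]))


# Hypothetical data held at two sites; only the summaries are shared.
site_a = [(20.0, "legit"), (35.0, "legit"), (900.0, "fraud")]
site_b = [(25.0, "legit"), (1100.0, "fraud"), (1000.0, "fraud")]

global_means = merge([mine_locally(site_a), mine_locally(site_b)])
print(global_means)
print(classify(950.0, global_means))
```

Only counts and totals cross the network here, which is what keeps communication cost low compared with shipping every record to one place.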

The first phase normally involves the analysis of the local database at each distributed site. Then, the discovered knowledge is usually transmitted to a merger site, where the integration of the distributed local models is performed. The results are transmitted back to the distributed databases, so that all sites become updated with the global knowledge.

In the heterogeneous case, the attributes differ among the distributed databases. In certain applications a key attribute might be present in the heterogeneous databases, which will allow the association between tuples. In other applications the target attribute for prediction might be common across all distributed databases.

One trend that can be noticed during the last years is the implementation of DDM systems using emerging distributed computing paradigms such as Web services, and the application of DDM algorithms in emerging distributed environments, such as mobile networks, sensor networks, grids and peer-to-peer networks.

ASPECTS OF DATA MINING

- Uncertainty handling
- Dealing with missing values
- Dealing with noisy data
- Efficiency of the algorithm used
- Constraining discovered knowledge to only useful or interesting knowledge
- Size and complexity of data
- Data selection
- Understandability of discovered knowledge
- Consistency between data and discovered knowledge

LOSSES DUE TO FRAUD

Bank-wise Cyber Fraud Data

ICICI Bank customers have been the biggest victims of cyber frauds. In the 4 years from 2009 to 2012, ICICI Bank alone reported 34,918 cases amounting to 74.25 crore rupees.

American Express ranked 2nd by value of cyber frauds over the same 4 years (2009 to 2012), at 26 crore rupees, roughly one third of ICICI Bank's total.

Citibank came 3rd, reporting 24 crore rupees' worth of cyber frauds, followed by Axis (15.9 crore) and HSBC (13.8 crore).

Credit Card & Debit Card Fraud Statistics World Wide

Between July 2005 and mid-January 2007, a breach of systems at TJX Companies exposed data from more than 45.6 million credit cards. Albert Gonzalez is accused of being the ringleader of the group responsible for the thefts.

In 2012, about 40 million sets of payment card information were compromised by a hack of Adobe Systems.

In August 2009 Gonzalez was also indicted for the biggest known credit card theft to date: information from more than 130 million credit and debit cards was stolen at Heartland Payment Systems, retailers 7-Eleven and Hannaford Brothers, and two unidentified companies.

In July 2013, press reports indicated four Russians and a Ukrainian were indicted in New Jersey for what was called the largest hacking and data breach scheme ever prosecuted in the United States.

Between Nov. 27, 2013 and Dec. 15, 2013 a breach of systems at Target Corporation exposed data from about 40 million credit cards. The information stolen included names, account number, expiry date and card security code.

From 16 July to 30 October 2013, a hacking attack compromised about a million sets of payment card data stored on computers at Neiman-Marcus.

Largest Credit Card Data Breaches

Country                 Cardholders Affected (Overall)   Cardholders Affected (Last 5 Years)
United States           42%                              37%
Mexico                  44%                              37%
United Arab Emirates    36%                              33%
United Kingdom          34%                              31%
Brazil                  33%                              30%
Australia               31%                              30%
China                   36%                              27%
India                   37%                              27%
Singapore               26%                              23%
Italy                   24%                              22%
South Africa            25%                              20%
Canada                  25%                              19%
France                  20%                              18%
Indonesia               18%                              14%
Sweden                  12%                              11%
Germany                 13%                              10%
The Netherlands         12%                              8%

APPLICATION OF DATA MINING

- Financial Data Analysis: E.g. loan payment prediction, customer credit policy
- Retail Industry: E.g. sales, customer, product, region, effectiveness of sales campaign, customer loyalty. It enables companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics.
- Telecom Industry:
- Biological Industry: Identifying co-occurring gene sequences and linking genes to different stages of disease development
- Scientific Application: Accumulation of huge volumes of high-dimensional data, stream data and heterogeneous data
- Intrusion Detection:

Modeling Strategies

Data mining strategies fall into two broad categories: supervised learning and unsupervised learning.

Supervised learning applies when there is a target variable with known values, about which predictions will be made by using the values of other variables as input.

Unsupervised learning applies when there is no target variable with known values, but input variables do exist.

Modeling Objectives and Data Mining Techniques

Modeling objective: Prediction
Supervised: Regression and logistic regression; neural networks; decision trees
Unsupervised: Not feasible
Note: Targets can be binary, interval, nominal, or ordinal.

Prediction algorithms determine models or rules to predict continuous or discrete target values for given input data. For example, a prediction problem could attempt to predict the value of the S&P 500 Index, given some input data such as a sudden change in a foreign exchange rate.

Modeling objective: Classification
Supervised: Decision trees; neural networks; discriminant analysis
Unsupervised: Clustering (K-means, etc.); neural networks; self-organizing maps (Kohonen networks)
Note: Targets can be binary, nominal, or ordinal.

Classification algorithms determine models to predict discrete values for given input data. A classification problem might involve trying to determine whether a transaction represents fraudulent behavior based on indicators such as the type of establishment at which the purchase was made, the time of day the purchase was made, and the amount of the purchase.

Modeling objective: Exploration
Supervised: Decision trees
Unsupervised: Principal components; clustering (K-means, etc.)
Note: Targets can be binary, nominal, or ordinal.

Exploration uncovers dimensionality in input data. For example, trying to uncover groups of similar customers based on spending habits for a large, targeted mailing is an exploration problem.

Modeling objective: Affinity
Supervised: Not applicable
Unsupervised: Associations; sequences; factor analysis

Affinity analysis determines which events are likely to occur in conjunction with one another. Retailers use affinity analysis to analyze product purchase combinations.

Techniques for fraud detection

If-Then rules (Expert rules)

The purpose is to use facts and rules, taken from the knowledge of many human experts, to help make decisions.

Examples of rules:
- More than 4 ATM transactions in one hour?
- More than 2 transactions in 5 minutes?
- Magnetic stripe transaction then internet transaction?

Problems with rules:
- New fraud patterns are not detected
- Only simple rules can be created

Advantages of rules:
- Easy to implement
- Very easy to interpret
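The three example rules are simple enough to write directly as code, which also shows why such rules are easy to implement and interpret. This is a minimal sketch: the transaction record shape (`time`, `channel`) and the function names are hypothetical, and the third rule is read here as "the previous transaction was magnetic stripe and the new one is internet", which is an assumption about what the slide intends.

```python
from datetime import datetime, timedelta


def count_in_window(times, end, window):
    """Count timestamps in the interval (end - window, end]."""
    return sum(end - window < t <= end for t in times)


def check_rules(history, new_tx):
    """Return the expert rules triggered by `new_tx`.

    Each transaction is a dict with a hypothetical shape:
    {"time": datetime, "channel": "atm" | "magstripe" | "internet"}.
    """
    txs = history + [new_tx]
    now = new_tx["time"]
    alerts = []

    atm_times = [t["time"] for t in txs if t["channel"] == "atm"]
    if count_in_window(atm_times, now, timedelta(hours=1)) > 4:
        alerts.append("more than 4 ATM transactions in one hour")

    all_times = [t["time"] for t in txs]
    if count_in_window(all_times, now, timedelta(minutes=5)) > 2:
        alerts.append("more than 2 transactions in 5 minutes")

    previous = history[-1]["channel"] if history else None
    if previous == "magstripe" and new_tx["channel"] == "internet":
        alerts.append("magnetic stripe transaction then internet transaction")

    return alerts


start = datetime(2015, 10, 17, 12, 0)
history = [{"time": start + timedelta(minutes=m), "channel": "atm"} for m in (0, 1, 2, 3)]
new_tx = {"time": start + timedelta(minutes=4), "channel": "atm"}
print(check_rules(history, new_tx))
```

The sketch also makes the stated weakness visible: a fraud pattern that matches none of the hard-coded conditions produces no alert at all.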

Predictive modeling

Predictive modeling is the use of statistical and mathematical techniques to discover patterns in data in order to make predictions, forecasting probabilities and trends. A predictive model is made up of a number of predictors, which are variable factors that are likely to influence future behavior or results. In marketing, for example, a customer's gender, age, and purchase history might predict the likelihood of a future sale.
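To show what "a model made up of predictors" can look like in the fraud setting, the sketch below fits a tiny logistic regression by gradient descent. Logistic regression is one of the supervised techniques named earlier in the deck, but the two predictors (a scaled amount and a night-time flag), the training data, and the learning-rate and epoch settings are all assumptions made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Estimated probability that a transaction with predictors `x` is fraud."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, target in zip(X, y):
            err = predict(weights, bias, x) - target
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical predictors: [amount scaled to 0..1, 1 if made at night else 0].
X = [[0.02, 0], [0.05, 0], [0.10, 1], [0.03, 0],
     [0.90, 1], [0.80, 1], [0.95, 0], [0.70, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = fraud (made-up labels)

weights, bias = fit(X, y)
p_large = predict(weights, bias, [0.85, 1])
p_small = predict(weights, bias, [0.04, 0])
print(f"large night-time purchase: {p_large:.2f}")
print(f"small daytime purchase:   {p_small:.2f}")
```

Unlike fixed if-then rules, the weights here are learned from labelled history, so retraining on newer data can pick up patterns nobody wrote down in advance.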

Need of Data Mining

In the field of information technology we have a huge amount of data available that needs to be turned into useful information. This information can then be used for various applications such as market analysis, fraud detection, customer retention, production control, science exploration etc.

- Identify unexpected shopping patterns in supermarkets.
- Optimize website profitability by making appropriate offers to each visitor.
- Predict customer response rates in marketing campaigns.
- Define new customer groups for marketing purposes.
- Predict customer defections: which customers are likely to switch to an alternative supplier in the near future.
- Distinguish between profitable and unprofitable customers.
- Improve yields in complex production processes by finding unexpected relationships between process parameters and defect rates.
- Identify "wedge issues" and target political campaigns.
- Identify suspicious (unusual) behavior, as part of a fraud detection process.

Advantages of Data Mining

Data mining helps people to answer questions that they might not have even thought about.

The information extracted from raw data is usually in hidden form and could go unrevealed if proper data mining techniques are not used.

It helps companies to get information that they can use effectively to stand out from the competition. Quick and correct access to useful information, which lets companies concentrate more on decision making and other important processes, has made data mining so efficient and popular.

Different industries or organizations use data mining to its maximum strength, and what they try to get from their data could be market trends, industry research, sales promotion, competitor analysis, medical research etc.

Retailers can get to know useful and correct trends about their customers and their purchasing behavior. This knowledge can be utilized to market the product in a better way, attract targeted customers more, come up with products that would be liked by customers, manage supermarket shelves or space in a better way, introduce coupons or discount offers on certain products, increase sales, set price strategies and so on.

All retailers or organizations that concentrate more on customer satisfaction go for data mining techniques.

In law enforcement, data mining is helpful to identify criminal suspects by analyzing crime type, behaviour, habits etc. of other criminals who are already in the list.

In healthcare, data mining techniques are used to identify certain diseases and to decide the treatment methods that are effective.

Financial institutions like banks and credit companies use data mining to identify fraudulent customers and fraudulent medical claims, for risk management, and so on.

Weather forecasting is an area where data mining is widely used. In the identification, classification and age determination of sky objects, data mining plays a major role. In the development of new medicines, data mining is used to foresee the effectiveness of the developed medicines.

Limitation of Data Mining

Quality of data is the most important challenge faced in data mining. As everything is done on data, the outcome is mostly affected by the quality of the data. Completeness, reliability and accuracy of data contribute to data quality. As thousands of records are usually analyzed and summarized for decision making, if anything goes wrong in the data, then all steps in the knowledge discovery process would be badly affected.

Duplicate records, missing data values, unneeded data fields, lack of proper data standards, lack of timely data updates, human errors etc. could affect the quality of data and thus the data mining process.

Removal of duplicate records, entering appropriate values for missing records (0 rather than making an entry null), removal of unneeded data fields, identifying and removing logically wrong values (200 as age, 01/01/1100 as birth date etc.), standardizing data formats, updating data fields in a timely manner etc. are completed as part of the data cleaning process.

Interoperability is another major data mining issue. As data could be collected from heterogeneous resources, types of data would be different and it would be practically impossible to standardize all these different kinds of data. Different databases or data mining software need to be interoperable so that data can be analysed and summarized correctly to make the best use of data mining.

Suppose a government sets out to share the information of different government departments in order to improve inter-department collaboration. As the existing databases of different departments would be of different types, the project would have to overcome the issues of interoperability.

As larger amounts of private and sensitive information about companies or individuals have to be stored and used for different data mining activities, security and privacy have become a major issue to be addressed before data mining becomes completely mature. This could also lead to illegal access to confidential data and to disclosure of implicit details of individuals or companies which they do not actually want to come out.

The correct selection of data mining method is very important to get correct results.

Performance issues are also to be resolved, as performance is the most expected factor in data mining.

THANK YOU