DATA MINING: SAMPLE SET OF CARS
NAME: TOSHAN MAJUMDAR
Temp: data frame with 35 cars on 11 characteristics.
mpg : Miles/(US) gallon
cyl : Number of cylinders
disp : Displacement (cu.in.)
hp : Gross horsepower
drat : Rear axle ratio
wt : Weight (1000 lbs)
qsec : 1/4 mile time
vs : V/S
am : Transmission (0 = automatic, 1 = manual)
gear : Number of forward gears
carb : Number of carburetors
Steps of the Knowledge Discovery in Databases Process
STEP I: Data cleaning
Data cleaning: Data cleaning is the step where noise and irrelevant data are removed from the large data set. This is a very important preprocessing step because the outcome depends on the quality of the selected data. As part of data cleaning, you might have to remove duplicate records, enter logically correct values for missing records, remove unnecessary data fields, standardize the data format, update data in a timely manner, and so on.
Explanation: In our data table of cars we replaced the missing values (NA) with the mean of the corresponding column and removed the redundant tuples, since both are considered noise and irrelevant data.
SCREENSHOT:
1) After clearing NA values:
2) After removing redundant car details:
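A minimal sketch of this cleaning step, using the built-in mtcars data as a stand-in for our table (the variable name teem matches the later code; the two injected NA values are for illustration only):

```r
# Stand-in for the raw car table; inject two missing values to demonstrate
teem <- mtcars
teem$mpg[c(3, 7)] <- NA

# 1) Replace every NA in a numeric column with that column's mean
for (j in seq_along(teem)) {
  miss <- is.na(teem[[j]])
  teem[[j]][miss] <- mean(teem[[j]], na.rm = TRUE)
}

# 2) Drop redundant (duplicate) tuples
teem <- teem[!duplicated(teem), ]
```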
Data transformation: With the help of dimensionality reduction or transformation methods, the number of effective variables is reduced and only useful features are selected to depict the data more efficiently, based on the goal of the task. In short, the data is transformed into an appropriate form, making it ready for the data mining step.
It is possible to find correlations between columns either by manually checking for any noticeable relationships or by applying the correlation function cor( ) to the entire table.
In order to apply the cor( ) function, the following steps must be implemented:
1) Convert the entire table into a data frame
2) Remove all non-numeric columns
3) Pass the reduced data frame as the parameter of cor( )
A matrix is formed between each and every column/variable of the table. The magnitude of a value corresponds to the strength of the correlation (higher magnitude, stronger correlation), and its sign corresponds to a positive (direct) or negative (inverse) correlation.
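The three steps above can be sketched on the built-in mtcars data (which has the same columns as our table):

```r
df  <- as.data.frame(mtcars)        # 1) ensure we have a data frame
num <- df[sapply(df, is.numeric)]   # 2) keep only the numeric columns
cm  <- cor(num)                     # 3) correlation matrix of every column pair

# Example entry: mpg vs wt is strongly negative (heavier cars use more fuel)
cm["mpg", "wt"]
```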
SCREENSHOT:
1) Find the correlation of all cars with respect to their characteristics
2) Dimensionality Reduction
We can apply dimensionality reduction to the table by removing columns which are strongly correlated (high magnitude) with another column, since they carry largely redundant information.
CODE:
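One way to sketch this reduction on mtcars: scan the correlation matrix and drop one column from each pair whose absolute correlation exceeds a threshold (the 0.9 cutoff here is an assumption, not a value fixed by the report):

```r
cm <- cor(mtcars)
threshold <- 0.9          # assumed cutoff for "strongly correlated"
drop <- c()
for (i in seq_len(ncol(cm) - 1)) {
  for (j in (i + 1):ncol(cm)) {
    # keep column i, mark column j for removal
    if (abs(cm[i, j]) > threshold) drop <- union(drop, colnames(cm)[j])
  }
}
reduced <- mtcars[, setdiff(colnames(mtcars), drop)]
```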
SMOOTHING BY MEANS:
Binning is a technique used to handle noisy data. In binning by means, the data set is first sorted and then partitioned into equal-frequency bins; each value is then replaced by the mean of its bin.
SCREENSHOT:
STEP II: Data Representation
Explanation:
In our code we have applied the nrow(), ncol(), dim() and str() functions, which report the number of rows, the number of columns, the dimensions of the table, and some useful information about each column, respectively.
SCREENSHOT:
Statistical Description of data:
Explanation: In our code we have applied the summary() function, which displays the minimum, quartiles, median, mean and maximum of every characteristic (mpg, wt, etc.) across all cars.
Mining Interesting Patterns:
The distance value is a prominent tool in data analysis and pattern matching as it quantifies the dissimilarity between sample data for numerical computation. The following is the formula for calculating Euclidean distance:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
Explanation: We have applied the distance computation between all pairs of cars from the table. Thus the result is a 32X32 matrix with the i-th row and j-th column giving information about the dissimilarity between the i-th car model and the j-th car model.
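The pairwise computation can be sketched with base R's dist(); as.matrix() expands the result into the full square matrix described above:

```r
# Euclidean distance between every pair of cars (rows of mtcars)
d <- as.matrix(dist(mtcars, method = "euclidean"))

# Entry [i, j] is the dissimilarity between car model i and car model j;
# near-identical models give a small value
d["Mazda RX4", "Mazda RX4 Wag"]
```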
SCREENSHOT:
STEP III: DATA CLASSIFICATION AND CLUSTERING
Naive Bayes classifier:
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
Explanation: In our data table we use the vs and gear values of the cars and find the transmission class (am = 0 or 1) which fits them best.
CODE:
Temp <- teem
# Class counts: with plyr's count(), np[1,2] is the number of automatic
# cars (am = 0) and np[2,2] is the number of manual cars (am = 1)
np <- count(teem[1:33, 10])

# Counters for each (feature value, class) combination
vs0am0 <- 0; vs1am0 <- 0; vs0am1 <- 0; vs1am1 <- 0
gr1am0 <- 0; gr1am1 <- 0; gr2am0 <- 0; gr2am1 <- 0
gr3am0 <- 0; gr3am1 <- 0; gr4am0 <- 0; gr4am1 <- 0
gr5am0 <- 0; gr5am1 <- 0

for (i in 1:33) {
  if (teem[i, 10] == "1") {              # manual cars (am = 1)
    if (teem[i, 9] == "0")  { vs0am1 <- vs0am1 + 1 }
    if (teem[i, 9] == "1")  { vs1am1 <- vs1am1 + 1 }
    if (teem[i, 11] == "1") { gr1am1 <- gr1am1 + 1 }
    if (teem[i, 11] == "2") { gr2am1 <- gr2am1 + 1 }
    if (teem[i, 11] == "3") { gr3am1 <- gr3am1 + 1 }
    if (teem[i, 11] == "4") { gr4am1 <- gr4am1 + 1 }
    if (teem[i, 11] == "5") { gr5am1 <- gr5am1 + 1 }
  } else {                               # automatic cars (am = 0)
    if (teem[i, 9] == "0")  { vs0am0 <- vs0am0 + 1 }
    if (teem[i, 9] == "1")  { vs1am0 <- vs1am0 + 1 }
    if (teem[i, 11] == "1") { gr1am0 <- gr1am0 + 1 }
    if (teem[i, 11] == "2") { gr2am0 <- gr2am0 + 1 }
    if (teem[i, 11] == "3") { gr3am0 <- gr3am0 + 1 }
    if (teem[i, 11] == "4") { gr4am0 <- gr4am0 + 1 }
    if (teem[i, 11] == "5") { gr5am0 <- gr5am0 + 1 }
  }
}

# Turn the counts into conditional probabilities P(feature value | class)
vs0am1 <- vs0am1 / np[2, 2]; vs0am0 <- vs0am0 / np[1, 2]
vs1am1 <- vs1am1 / np[2, 2]; vs1am0 <- vs1am0 / np[1, 2]
gr1am1 <- gr1am1 / np[2, 2]; gr1am0 <- gr1am0 / np[1, 2]
gr2am1 <- gr2am1 / np[2, 2]; gr2am0 <- gr2am0 / np[1, 2]
gr3am1 <- gr3am1 / np[2, 2]; gr3am0 <- gr3am0 / np[1, 2]
gr4am1 <- gr4am1 / np[2, 2]; gr4am0 <- gr4am0 / np[1, 2]
gr5am1 <- gr5am1 / np[2, 2]; gr5am0 <- gr5am0 / np[1, 2]

# Prior probabilities of the two classes
prob1 <- np[2, 2] / 32
prob0 <- np[1, 2] / 32

# Posterior scores: prior times the feature conditionals for each class
xprob1 <- prob1 * vs1am1 * gr1am1 * gr2am1 * gr3am1 * gr4am1 * gr5am1
xprob0 <- prob0 * vs1am0 * gr1am0 * gr2am0 * gr3am0 * gr4am0 * gr5am0

if (xprob0 > xprob1) { classifier <- c("0") }
if (xprob0 < xprob1) { classifier <- c("1") }
classifier
SCREENSHOT:
HIERARCHICAL CLUSTER ANALYSIS
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
Explanation: In order to perform hierarchical cluster analysis, we must follow these steps:
1) Construct the distance matrix of the entire data table
2) Construct a dendrogram, which displays the hierarchical relationship among the various vehicles, using a plot of hclust
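The two steps above can be sketched in base R (mtcars standing in for our table; hclust uses complete linkage by default):

```r
d  <- dist(mtcars)   # 1) distance matrix of the whole table
hc <- hclust(d)      # 2) agglomerative clustering (complete linkage)
plot(hc)             # dendrogram of the car models

# Optionally cut the tree into a fixed number of clusters
cutree(hc, k = 3)
```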
SCREENSHOT:
STEP IV: KNOWLEDGE CONSOLIDATION AND REPRESENTATION:
Knowledge consolidation: This is the final step in Knowledge Discovery in Databases (KDD). The knowledge discovered is consolidated and represented to the user in a simple and easy-to-understand format. Mostly, visualization techniques are used to help users understand and interpret the information.
Explanation: We have represented the data regarding the properties of various cars in boxplot format as well as various histograms.
Plotting Representations:
SCREENSHOT:
With mpg and disp:
With mpg, disp and gear:
Boxplot of all variables:
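The representations above can be sketched with base R graphics (mtcars standing in for our table; scale() is an assumption used here so that columns with very different units fit on one boxplot):

```r
# Boxplot of all variables on a common, standardized scale
boxplot(scale(mtcars), main = "Boxplot of all variables", las = 2)

# Histogram of one characteristic
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
```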