
Bayesian Visual Analysis of the Indian Labour Market

Kaushal Paneri, TCS Research, New Delhi

[email protected]

Karamjit Singh, TCS Research, New Delhi

[email protected]

Aditeya Pandey, TCS Research, New Delhi

[email protected]

Geetika Sharma, TCS Research, New Delhi

[email protected]

1. INTRODUCTION

The IKDD CODS Data Challenge presents an opportunity to explore the dynamics that might influence the Indian labour market by analysing employment data. The data challenge asks participants to learn a prediction model for salaries, understand the key dependencies and present their insights through visualizations.

We use TCS Research’s iFuse platform to derive insights and make predictions from the data provided. iFuse is a web-based visual analytics platform with built-in machine-learning capabilities based on Bayesian graphical models. We use iFuse to learn new models using domain knowledge, statistically validate hypotheses, and analyse data as well as models and model predictions using a variety of visualization techniques. In particular, we have used the following iFuse features to address the data challenge:

1) Bayesian Network Models: Bayesian model learning in iFuse first learns which attributes are most relevant for predicting a target, which in our case is the salary of each individual. Next, an efficiently executable Bayesian network is learned on this feature subset (via an MST embedded in a graph derived from pair-wise mutual information values). The iFuse platform uses exact inference accelerated by an SQL engine, which internally performs query optimization analogous to many poly-tree based exact inference techniques. Running model inference for each record of the test data generates salary predictions.

2) Visual Model Inference: Even before applying a Bayesian model to test data, the iFuse platform allows visual execution of model inference on training and validation sets to understand the impact of each feature on the target attribute (salary in this case) and the target's sensitivity to it. A user can issue a query to examine which ranges or values of features affect salary, aiding a qualitative understanding of the causes of higher or lower salary values. Note that such visual queries use model inference, so for large data volumes they are more efficient than directly querying the data (though this is not an issue in the case of the CoDS data challenge).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

CoDs Data Challenge 2016 Pune, India

© 2016 ACM. ISBN 123-4567-24-567/08/06. . . $15.00

DOI: 10.475/123 4

3) Visual Analytics: The iFuse platform possesses visual analytics capabilities that provide a more comprehensive view and generate meaningful insights. We provide interactive visualizations such as motion charts for multivariate time-series data, which can visually represent up to 4 temporal data attributes using position, size and colour of circles; bubble map charts, which visualize data about specific locations on a map (in our case, a map of India); parallel coordinates, which can precisely depict multi-dimensional data in one view; and linked views with probabilistic querying to explore the data and obtain answers to important questions. These additional visualizations lend further insight into factors affecting the labour market.

We summarise the key findings from our analysis. First, the designations with the highest head counts are software engineer, software developer and systems engineer. Second, there is a positive correlation between salary and both 12th percentage and CGPA. Third, through Bayesian network learning, we discovered a strong correlation between high salaries and high English scores, while logical test score and CGPA did not seem to affect salary as much. Further, we found no correlation between English scores and Domain scores. Fourth, we found a decreasing trend in salaries being offered to candidates in recent years.

We motivate the use of Bayesian networks for attempting the challenge in section 2 and describe our iFuse platform in section 3. In section 4 we report insights from an exploratory visual analysis of the dataset, and we describe our prediction results in section 5. Finally, we present recommendations in section 6 and conclude in section 7.

2. BAYESIAN GRAPHICAL MODELS

A Bayesian network is a graphical structure that allows representation of and reasoning about an uncertain domain. It is a representation of a joint probability distribution (JPD) and consists of two components. The first component, G, is a directed acyclic graph whose vertices correspond to random variables. The second component, the conditional probability table (CPT), describes the conditional distribution of each variable given its parents. The CPT of a node gives the probability of each value the node can take for every combination of values of its parents. Considering a BN consisting of N random variables X = (X1, X2, ..., XN), the general form of the joint probability distribution of a Bayesian network can be represented as in Equation 1, which encodes the BN property that each node is conditionally independent of its non-descendants given its parents, where Pa(Xi) is the set of parents of Xi.

$$P(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid Pa(X_i)) \qquad (1)$$
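As an illustration of Equation 1, the following minimal sketch evaluates the joint probability of one assignment in a small network. The three-node structure, the binary value sets and all probabilities are hypothetical values chosen for the example; they are not estimates from the challenge data.

```python
# Hypothetical network: English -> Salary <- CGPA, all variables binary.
# Every probability below is a made-up illustration value.
parents = {"English": [], "CGPA": [], "Salary": ["English", "CGPA"]}

# CPTs: tuple of parent values -> {node value: probability}
cpt = {
    "English": {(): {"low": 0.6, "high": 0.4}},
    "CGPA": {(): {"low": 0.5, "high": 0.5}},
    "Salary": {
        ("low", "low"): {"low": 0.9, "high": 0.1},
        ("low", "high"): {"low": 0.8, "high": 0.2},
        ("high", "low"): {"low": 0.6, "high": 0.4},
        ("high", "high"): {"low": 0.4, "high": 0.6},
    },
}


def joint_probability(assignment):
    """Equation 1: multiply P(X_i | Pa(X_i)) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p


example = {"English": "high", "CGPA": "low", "Salary": "high"}
print(joint_probability(example))  # product 0.4 * 0.5 * 0.4
```

Because each factor is a lookup in a single node's CPT, the cost of evaluating one assignment grows with the number of nodes rather than with the size of the full joint table.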

A Bayesian network (BN) fills a role very similar to other machine learning algorithms such as artificial neural networks (ANN), decision trees or support vector machines (SVM). However, a BN has several advantages over some of these algorithms. First, a BN can handle missing values well, whereas many machine learning algorithms require incomplete data to be eliminated or extrapolated. Second, a BN can be queried, unlike algorithms such as SVM or ANN; for example, what is the expected salary of an individual given his/her degree CGPA, 10th marks, etc.? To address the problem of analysing labour markets posed by the Challenge, we use our iFuse platform, which can learn efficient Bayesian networks on the data attributes most relevant to the target variable, salary. Further, iFuse can run model inference on each record of test data to generate the expected value of salary as a prediction, as well as perform conditional queries on the network to provide profile recommendations.

3. IFUSE: A VISUAL BAYESIAN FUSION PLATFORM

In this section, we describe iFuse, our visual Bayesian data fusion platform, through its three main features. The motivation for building a visual analytics platform using Bayesian networks as the sole machine learning engine came from the success, in both research and business terms, of our earlier platform Fusion Workbench [4], which was built for visual analytics over sensor data using multiple machine learning techniques.

3.1 Keyword-based Search

Our platform provides keyword-based search over datasets. When datasets are added to the platform, tags based on column headers are automatically generated and used for indexing files. Users may add their own tags as well and enter them as search keywords for later retrieval. Datasets are represented as tiles on the search page, as shown in figure 1, with tag clouds providing a sense of what is in the data. The tiles may be ‘flipped’ by double-clicking on them to see a complete list of attribute names (figure 2).
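The tag-based indexing described above could be sketched as a simple inverted index from tags to dataset names. This is only an illustrative assumption about how such a search might work; the function names, the lower-casing rule and the sample datasets are hypothetical and do not describe iFuse's actual implementation.

```python
from collections import defaultdict


def build_index(datasets, user_tags=None):
    """Map each lower-cased tag to the set of datasets carrying it.

    datasets:  {dataset name: list of column headers} (auto-generated tags)
    user_tags: optional {dataset name: list of extra user-supplied tags}
    """
    index = defaultdict(set)
    for name, headers in datasets.items():
        for tag in headers:
            index[tag.strip().lower()].add(name)
    for name, tags in (user_tags or {}).items():
        for tag in tags:
            index[tag.strip().lower()].add(name)
    return index


def search(index, keyword):
    """Return the sorted names of datasets tagged with the keyword."""
    return sorted(index.get(keyword.strip().lower(), set()))


# Hypothetical datasets for illustration.
datasets = {"ds1": ["Designation", "Salary", "CGPA"], "ds3": ["JobCity", "Salary"]}
index = build_index(datasets, user_tags={"ds3": ["cities"]})
print(search(index, "salary"))
```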

3.2 Exploratory Data Visualization

Exploratory data analysis is required to get a better understanding of the data before Bayesian modelling can be done. Further, it can lead to insights not derivable from Bayesian reasoning. iFuse provides many visualizations for exploratory analysis, as described below.

• Motion Charts: A multi-dimensional visualization which can visually represent up to 4 temporal data attributes using position, size and colour of circles, as shown in figure 4. Motion animation of the circles is used to depict changes in data over time.

• Parallel Coordinates: Another multi-dimensional data visualization, which allows a larger number of data attributes to be visualised together. Attributes are represented as multiple parallel vertical or horizontal axes and a data point is represented as a polyline connecting points on each attribute axis, as shown in figure 3. The order of axes may be changed by dragging, and attributes can be deleted from or added to the plot.

• Bubble Map Charts: Plot bubbles or circles at geographical locations on a map with data attributes mapped to properties of the bubble such as size and colour. A bubble map is shown in figure 7.

• Cartograms: Use maps to visualise data about regions such as countries and states. Colour and rubber-sheet distortions in area proportional to data values allow easy comparison of spatial data.

Apart from these well-known data charts, we also have our own visualization designs for specific purposes, such as querying Bayesian networks, described later.

Data tiles display icons for the visualizations associated with them. If multiple visualizations can be drawn for a single dataset, an icon for each is displayed and clicking it opens the selected visualization. All visualizations open in the ‘Compare View’ page. A list of thumbnails is displayed on the left side of the page, which users can use to re-order visualizations vertically, close them or open them in full-screen mode.

3.3 Visual Bayesian Fusion

iFuse supports and utilises Bayesian models at multiple levels.

3.3.1 Model Learning

Firstly, users can build Bayesian networks by selecting relevant attributes from different datasets joined inside the system. We provide a visual interface for this, as shown in figure 2. Data attributes for model learning can be selected from flipped data tiles and are added to the attribute cart. The user then chooses the ‘Request Network’ option, selects the target variable and triggers the network learning module in the backend. It returns the top few networks, and the user can choose which ones to save in the platform.

3.3.2 Model Inferencing

Once a network has been saved, it can be used to perform visual model inferencing using what we call a ‘Linked Query View’, figure 11. This is an interactive linked view designed specifically to query Bayesian networks. The user selects n attributes from the network to query and these are visualized in an n × n chart grid with attributes repeated horizontally and vertically. Charts along the diagonal show the probability distribution of the corresponding attribute as a bar chart (figure 11). Above the diagonal are scatter plots of the data with the row and column attributes on the x and y axes of the plot, respectively. These provide a view of the data used to build the network and can bring out pair-wise correlations between attributes.

In order to query the network, users can select ranges for multiple attributes by clicking on the appropriate bars in the bar charts. This constrains the attribute to lie in the range selected by the user. On hitting the query button, a conditional query is executed on the network using Bayesian inference. The conditional distributions of the other attributes are computed and the bar charts are updated accordingly. We provide a comparison view with the initial and conditional distributions overlaid in different colours so that changes in the distributions can be perceived easily.
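The conditional query described above can be illustrated with a brute-force sketch that sums the joint distribution over all assignments consistent with the selected evidence. This is a swapped-in technique for clarity: iFuse uses SQL-accelerated exact inference, whereas the enumeration below is only practical for tiny networks. The network shape, variable names and probabilities are hypothetical.

```python
from itertools import product

# Hypothetical network Salary -> English, Salary -> Twelfth; made-up numbers.
values = {"Salary": ["low", "high"], "English": ["low", "high"], "Twelfth": ["low", "high"]}
p_salary = {"low": 0.8, "high": 0.2}
p_english = {"low": {"low": 0.7, "high": 0.3}, "high": {"low": 0.3, "high": 0.7}}
p_twelfth = {"low": {"low": 0.6, "high": 0.4}, "high": {"low": 0.2, "high": 0.8}}


def joint(a):
    """Joint probability of a full assignment under the factorisation above."""
    s = a["Salary"]
    return p_salary[s] * p_english[s][a["English"]] * p_twelfth[s][a["Twelfth"]]


def query(target, evidence):
    """Return P(target | evidence) by enumerating all consistent assignments."""
    names = list(values)
    scores = {v: 0.0 for v in values[target]}
    for combo in product(*(values[n] for n in names)):
        a = dict(zip(names, combo))
        if all(a[k] == v for k, v in evidence.items()):
            scores[a[target]] += joint(a)
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}


prior = query("English", {})                       # initial distribution
posterior = query("English", {"Salary": "high"})   # after selecting a salary bar
print(prior, posterior)
```

Comparing `prior` and `posterior` corresponds to the comparison view, where the initial and conditional distributions are drawn on top of each other.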

3.3.3 Model-based Prediction

iFuse provides a visual interface for model-based prediction using parallel coordinates. The user selects a network to be used for prediction via imputation and a dataset with the target variable missing. We use a horizontal parallel coordinates plot to differentiate it from the exploratory parallel coordinates visualization and to suggest a network structure, which is usually drawn top-down, even though the edges have no directionality in this case. The value of the attribute to be imputed is initially 0 for all data points, as shown in figure 3 (a). Clicking the ‘Impute’ button fires the imputation module at the backend, and the lines for the imputed values move to their positions along the axis, figure 3 (b).
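A minimal sketch of prediction via imputation follows, assuming the prediction is the expected value of salary under the posterior over salary bins (the paper describes the prediction as an expected value; the binning, bin midpoints and the `toy_posterior` stand-in for network inference are hypothetical).

```python
# Hypothetical salary bins and midpoints (rupees); illustration values only.
BIN_MIDPOINTS = {"low": 200000.0, "mid": 400000.0, "high": 800000.0}


def expected_value(posterior, midpoints=BIN_MIDPOINTS):
    """Expected salary under a posterior distribution over salary bins."""
    return sum(p * midpoints[b] for b, p in posterior.items())


def impute(records, posterior_fn, target="Salary"):
    """Fill the missing target of each record with the posterior expected value."""
    filled = []
    for record in records:
        evidence = {k: v for k, v in record.items() if k != target}
        filled.append({**record, target: expected_value(posterior_fn(evidence))})
    return filled


def toy_posterior(evidence):
    # Stand-in for Bayesian network inference: a hand-written lookup on one feature.
    if evidence.get("English") == "high":
        return {"low": 0.2, "mid": 0.3, "high": 0.5}
    return {"low": 0.5, "mid": 0.3, "high": 0.2}


test_records = [{"English": "high", "Salary": None}, {"English": "low", "Salary": None}]
imputed = impute(test_records, toy_posterior)
print([r["Salary"] for r in imputed])
```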

4. EXPLORATORY VISUAL ANALYSIS OF CODS DATASET

In this section we report the results from an exploratory visual analysis of the CoDs data. We prepared four datasets by selecting attributes from the dataset such as Job city, designation, CGPA, Salary and Quant. One of these attributes was selected as a key to group by, and the rest of the attributes were averaged. Details of each dataset are given below.

DS 1: Key: Job Designation. Attributes: Salary, 10Per, 12Per, CGPA, Domain, English, Quant, Logic.

We further cleaned this data by fixing typos and merging similar designations such as ‘technical lead’ and ‘tech lead’. This reduced the number of unique designations to 270.

DS 2: Key: Degree, Year. Avg. Attributes: Salary, 10Per, 12Per, CGPA, Domain, English, Quant, Logic.

DS 3: Key: Job City. Avg. Attributes: Salary, 10Per, 12Per, CGPA, Domain, English, Quant, Logic.

DS 4: Key: Job City. Avg. Attributes: Salary, Total Jobs, Number of Males, Number of Females, Scores on Computer Programming, CSE, ECE, Mechanical, Telecom and so on.
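The dataset preparation above (merging similar designations, grouping by a key and averaging the remaining attributes) can be sketched as follows. The alias table, column names and sample rows are hypothetical and serve only to show the shape of the computation.

```python
from collections import defaultdict

# Hypothetical alias table for merging similar designations.
ALIASES = {"tech lead": "technical lead"}


def normalise(designation):
    """Lower-case, collapse whitespace and map known aliases to one name."""
    name = " ".join(designation.lower().split())
    return ALIASES.get(name, name)


def group_average(rows, key, attributes):
    """Group rows by the normalised key and average each attribute per group."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        k = normalise(row[key])
        counts[k] += 1
        for a in attributes:
            sums[k][a] += row[a]
    return {
        k: {"count": counts[k], **{a: sums[k][a] / counts[k] for a in attributes}}
        for k in counts
    }


# Made-up rows for illustration.
rows = [
    {"Designation": "Tech Lead", "Salary": 600000, "CGPA": 8.0},
    {"Designation": "technical lead", "Salary": 800000, "CGPA": 7.0},
    {"Designation": "Software  Engineer", "Salary": 300000, "CGPA": 7.5},
]
ds1 = group_average(rows, "Designation", ["Salary", "CGPA"])
print(ds1)
```

The per-group `count` is kept alongside the averages because the motion charts in this section map designation counts to circle size.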

Insights from DS 1: Figure 4 (a) shows a view of DS 1 with a circle plotted for each designation, with the circle radius and x axis showing the count of test takers with a particular designation and salary on the y axis. Given that there were 270 unique designations, it is not surprising that the majority of the designations had counts below 25.

Figure 4 (b) shows the same data filtered to show only the popular designations. As expected, software engineer, software developer and system engineer have the largest counts. Further, web developers, lecturers and customer care executives are lower on the salary scale, software engineers are in the middle, and data scientists, automation engineers and senior software developers are higher.

Figure 4 (c) and (d) map salary to the x axis, designation count to the size of the circle and, on the y axis, the average 12th standard percentage and college CGPA, respectively. As is clear from the charts, each of these is positively correlated with salary.

Answer to Challenge question: Figure 5 (a) maps salary to the x axis, English score to the y axis and Domain score to the size of the circles. We observe that both English and Domain scores are low for IT support and customer care executives, while the English score is high but the Domain score is low for people in business-related roles. Also, we observe that among IT jobs, pure development jobs such as Java developer and software developer have lower English scores than engineering jobs such as software engineer, but there is little variation in their Domain scores. Thus, there is no correlation between English scores and Domain scores. Finally, in figure 5 (b) we put salary on the y axis and find no obvious correlation with Domain score, as both big and small circles are in the same salary range, but a slight positive correlation with English score. This answers the specific question posed by the challenge about whether candidates with high English scores also have high Domain skills, and their effect on salary.
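The correlations above were read off the charts visually. One way to check such an impression numerically, offered here only as a sketch, is the Pearson correlation coefficient over the per-designation averages. The numbers below are made up for illustration and are not taken from the challenge data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up per-designation averages, for illustration only.
english = [450, 480, 510, 540, 570]
domain = [0.60, 0.55, 0.62, 0.58, 0.61]
salary = [250000, 280000, 300000, 330000, 360000]

print(round(pearson(english, domain), 2), round(pearson(english, salary), 2))
```

A coefficient near 0 would support a "no correlation" reading, while a value near 1 would indicate a strong positive linear relationship.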

Insights from DS 2: In figure 6 (a) we plot the average salary being offered to candidates (y axis) over the years (x axis) for which data was available. We observe a decreasing trend in recent years.

Insights from DS 3: In figure 6 (b) we plot the average salary for different cities in a bar chart. We observe that cities such as Bangalore, Mumbai, Gurgaon and Hyderabad have higher average salaries, while Faridabad, Bhubaneswar and Calcutta have lower average salaries.

Insights from DS 4: In figure 7 we show visualizations using bubble map charts, with data attributes mapped to the size and colour of the circle for each job city in India. In each plot, colour is mapped to total jobs, clearly showing Bangalore with the largest number of jobs, with Noida, Hyderabad, Pune and Chennai in tow. The charts in (a) and (b) map size to the number of males and females in the city, respectively. Compared with the chart for males, the chart for females shows a sharp decrease in both the number of cities where candidates are placed and the number of candidates.

Next, we show the average scores on various specializations for each city. As is clear from figure 8 (a) and (b), computer programming scores are high all over India, whereas candidates with high CSE scores are mostly in the extreme northern and southern cities. Similarly, while ECE scores (c) were high for most job cities, high Telecom scores (d) were found in the extreme northern and southern cities.

5. PREDICTION FOR TEST DATA

We now describe in detail our technique for prediction using Bayesian networks.

5.1 Feature Selection

We select the top K features based on the mutual information of each feature with the target variable. Mutual information between continuous-continuous and continuous-discrete variable pairs is calculated using the Non-parametric Entropy Estimation Toolbox (NPEET) [5]. This toolbox implements the mutual information estimators of [1], which are based on entropy estimates from k-nearest neighbour distances.
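The ranking step can be sketched as follows. Note that this example swaps in a simpler plug-in (frequency-count) estimator that assumes all variables have already been discretised; it is not the k-nearest-neighbour estimator of [1] that NPEET provides. Column names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def top_k_features(columns, target, k):
    """Rank features by mutual information with the target and keep the top k."""
    scores = {
        name: mutual_information(col, columns[target])
        for name, col in columns.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up discretised columns: "English" mirrors "Salary", "Noise" is independent.
columns = {
    "Salary": [0, 0, 1, 1],
    "English": [0, 0, 1, 1],
    "Noise": [0, 1, 0, 1],
}
print(top_k_features(columns, "Salary", 1))
```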

5.2 Bayesian Structure Learning

Once we have identified the subset of features based on mutual information, we learn an efficiently executable Bayesian network on these top K features, including the target variable. We call it the Minimum Spanning Tree Network (MSTN). We learn the structure of the MSTN with the following approach:

1. Start with the subset of K features, which may include both continuous and discrete variables.

2. Learn the minimum spanning tree (MST) on the feature subset using pairwise mutual information as a threshold.

3. Initialize each edge to a random direction.

4. Flip each edge direction to enumerate the $2^{K-1}$ directed graphs and calculate the cross entropy of each graph.

5. Choose the graph with the least cross entropy.

Once we have learned the structure, we learn the CPT of each node in the network.
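The structure search above can be sketched as follows. Two assumptions are made loudly here: (i) the spanning tree prefers edges with higher mutual information (equivalently, a minimum spanning tree on negated mutual information), since the exact edge weighting is not specified above; and (ii) `toy_score` is a hypothetical stand-in for the cross entropy of a directed graph, which would in practice be computed against the data. Node names and mutual information values are made up.

```python
from itertools import product


def spanning_tree_by_mi(nodes, mi):
    """Kruskal's algorithm over edges sorted by decreasing mutual information."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    tree = []
    for (u, v), _w in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # adding the edge does not create a cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree


def orientations(tree):
    """Yield all 2^(K-1) directed versions of a tree with K-1 undirected edges."""
    for flips in product([False, True], repeat=len(tree)):
        yield [(v, u) if f else (u, v) for (u, v), f in zip(tree, flips)]


def best_structure(tree, score):
    """Pick the orientation with the lowest score (e.g. cross entropy)."""
    return min(orientations(tree), key=score)


def toy_score(edges):
    # Hypothetical stand-in for cross entropy: counts edges leaving "Salary".
    return sum(1 for src, _dst in edges if src == "Salary")


nodes = ["Salary", "English", "CGPA", "Twelfth"]
mi = {
    ("Salary", "English"): 0.30, ("Salary", "Twelfth"): 0.25,
    ("English", "CGPA"): 0.10, ("Salary", "CGPA"): 0.08,
    ("English", "Twelfth"): 0.07, ("CGPA", "Twelfth"): 0.05,
}
tree = spanning_tree_by_mi(nodes, mi)
best = best_structure(tree, toy_score)
print(tree, best)
```

Every orientation of a tree is acyclic, so no cycle check is needed during enumeration; the exhaustive search is feasible only because K is kept small by feature selection.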

5.3 Predicting salary

We use the MSTN learned on the relevant feature subset to predict the salary of each test record, using the rest of the features in the network as evidence. Figure 10 shows the MSTN learned on the feature subset. We use exact inference accelerated by an SQL engine, which internally performs query optimization analogous to many poly-tree based exact inference techniques. Apart from the MSTN, we also use a Naive Bayes network on the same feature subset to predict salary.

5.4 Results and Analysis

Table 1 shows the root mean square error (RMSE) of predictions made on the training data by 5-fold cross-validation. It compares the RMSE of several approaches: Naive Bayes, MSTN, Regression Tree and Random Forest. Prediction using the MSTN in the iFuse platform achieves a lower RMSE than the other standard approaches, namely Regression Tree (split at <10 instances) [2], Random Forest (random 5 attributes per split, 10 trees, prune at <5 instances) [3] and Naive Bayes. Table 2 shows the RMSE on the leaderboard using iFuse (MSTN network) and the best RMSE (top ranker) on the leaderboard.

A visualization of the predicted salary on test data created by splitting the training data in a 70-30 (%) ratio is shown in figure 3. The error in prediction can be seen in (b) on the last two axes, actual salary and imputed salary. We show the predicted salary for the test data provided by the Challenge in figure 9.

We observe that salary is highly skewed (e.g. only 1% of salaries are greater than 10 lakh (10L)), and a single model cannot handle the skew in the target variable. This is one possible reason for the high RMSE. We recommend using an ensemble of Bayesian networks trained on different salary segments. However, we did not try this technique because it is not yet implemented in the iFuse platform, and we wanted to use only the platform rather than try all possible techniques outside its capabilities.
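For reference, the evaluation metric and the cross-validation loop can be sketched as below. The fold assignment, the averaging of per-fold RMSE values and the `fit_mean` baseline are assumptions made for the example; the exact protocol used for Table 1 is not specified beyond 5-fold cross-validation.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def k_fold_rmse(xs, ys, fit, k=5):
    """Average RMSE over k folds (assumes len(xs) >= k).

    fit(train_x, train_y) must return a callable predict(x).
    """
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th record goes to this fold
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        held_out = sorted(test_idx)
        scores.append(rmse([ys[i] for i in held_out], [predict(xs[i]) for i in held_out]))
    return sum(scores) / k


def fit_mean(train_x, train_y):
    # Hypothetical baseline model: always predict the mean training salary.
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


# Made-up data for illustration.
xs = list(range(10))
ys = [300000.0, 320000.0, 280000.0, 310000.0, 1200000.0,
      290000.0, 305000.0, 315000.0, 295000.0, 300000.0]
print(k_fold_rmse(xs, ys, fit_mean))
```

Because RMSE squares the residuals, a few very high salaries, such as the single outlier in the sample data, can dominate the score, which is consistent with the skew discussed above.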

Table 1: RMSE of 5-fold cross-validation on training data

Algorithm            RMSE
Random Forest        197747
Regression Tree      212799
iFuse-MSTN           163371
iFuse-Naive Bayes    183271

Table 2: RMSE on leaderboard

iFuse        13920.8
Best RMSE    13182.6

6. BAYESIAN ADVISOR

We have used the model inferencing feature of iFuse to answer the Recommendation question posed by the challenge. As shown in figure 11, a network with target attribute salary and the attributes English score, logical test score, 12th percentage and college CGPA has been created. The original probability distributions are shown on the diagonal as bar charts with yellow bars. Salary is plotted on a log10 scale.

We consider the case when a candidate is interested in getting a very high salary and wants to know the ideal profile for it. This may be done in iFuse’s Linked Query View by selecting the last two bars on the salary distribution, as shown in figure 11 (a), and hitting the query button in the menu. This triggers a conditional query at the backend, which is resolved using inferencing; the conditional probability distributions of the remaining attributes are computed and displayed with red bars in the linked query view. A comparison mode is available to help understand the exact changes in the distributions. We have used this mode in the figure and observe a significant rise in the probability of the second-last bin for English score, indicating that a candidate must have a high English score to get a high salary. Additionally, we find that although the distributions of logical test score and CGPA do not change much, the probability of 12th percentage increases significantly for the higher range bins. Thus, for a very high salary, English score and 12th percentage must be high.

Next, we consider the case when a candidate is willing to lower the salary expectation to the mid to high range, figure 11 (b). In this case the distribution of English score changes only slightly for the higher bins, while for 12th percentage, probabilities for the mid to high bins increase significantly.

Finally, we consider the case when a candidate is interested in a mid to high salary but has a low English score. Such a query may be performed by selecting the appropriate bars on the distributions of both salary and English score, as shown in figure 11 (c). This causes the distribution of logical ability to shift to the middle range, while the distribution of 12th percentage shifts significantly to the higher bins. Thus, for a high salary with a low English score, one must have a good logical test score and a very good 12th standard percentage.

In this manner a candidate may impose conditions on any number of variables in the network and get answers to how his/her profile should change in order to meet the salary goal.

7. CONCLUSIONS

To conclude, we have met the objectives set out by the CoDs Data Challenge using our iFuse platform, built for visual Bayesian data fusion. We have demonstrated how the platform may be used to perform exploratory visual analysis on the raw data, gather useful insights and obtain a deeper understanding of the data. Further, we have shown how Bayesian network models are utilised in our platform for salary prediction and for providing profile recommendations.

8. REFERENCES

[1] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

[2] R. J. Lewis. An introduction to classification and regression tree (CART) analysis. In Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, California, pages 1–14, 2000.

[3] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[4] G. Sharma, G. Shroff, A. Pandey, B. Singh, G. Sehgal, K. Paneri, and P. Agarwal. Multi-sensor visual analytics supported by machine-learning models. In ICDM Workshop on Data Analytics meets Visual Analytics, 2015.

[5] G. Ver Steeg. Non-parametric entropy estimation toolbox (NPEET). 2000.

9. APPENDIX - IMAGES

Figure 1: Search page showing CoDs Data Challenge datasets

Figure 2: Data Tile Flipped View and Attribute Selection for Network Creation

(a) Before Imputation

(b) After Imputation. Last two axes visualise error between actual and predicted salary.

Figure 3: Parallel Coordinates Plot for Salary Prediction using Imputation on test data created from 30% of the training data.

(a) Complete Data

(b) Filtered on popular designations

(c) Positive Correlation of 12th Percentage and Salary

(d) Positive Correlation of College CGPA and Salary

Figure 4: Insights from Data set 1 using motion charts

(a) No Correlation between English and Domain score

(b) Positive correlation between Salary and English score but not with Domain score

Figure 5: Answer to Challenge Question about correlation between Domain score, English score and Salary

(a) DS 2 Yearly average salary

(b) DS 3 City wise Salary

Figure 6: Insights from Data sets 2 and 3

(a) Size: Number of Males, Colour: Total Jobs

(b) Size: Number of Females, Colour: Total Jobs

Figure 7: Comparison between number of male and female candidates and placement cities using Data set 4 and Bubble Map charts

(a) Size: Avg. Computer Prog. Score, Colour: Total Jobs

(b) Size: Avg. CSE Score, Colour: Total Jobs

(c) Size: Avg. ECE Score, Colour: Total Jobs

(d) Size: Avg. TeleCom Score, Colour: Total Jobs

Figure 8: Insights from Data set 4 using Bubble Map charts

Figure 9: Salary Imputation for Test dataset provided by Challenge

Figure 10: Minimum Spanning Tree Network Learned on Feature Subset

(a) Query for high salary

(b) Query for mid to high salary

(c) Query for high salary and low English score

Figure 11: Recommendation using Linked Query View
