
A Survey on Contribution of Data Mining in Social Media

Prajkta P. Chapke 1 and Dr. Anjali B. Raut 2

1 Research Scholar, Computer Science and Engineering, H.V.P.M's, C.O.E.T., Amravati (MS), India [email protected]

2 Professor, Computer Science and Engineering, H.V.P.M's, C.O.E.T., Amravati (MS), India [email protected]

Abstract. Although data mining emerged long ago, it continues to grow and take new forms in almost all fields, including computer science, health care, finance and social media. Many researchers have worked in data mining, sharpening its edges with different tools and techniques. Various data mining algorithms have been used, each improving on earlier work. Nowadays social media has become one of the important ways of expressing ideas, views and emotions regarding a topic, situation or scenario. This paper covers the development and growth of data mining in various fields, especially in social media, and how it has become a booming area for researchers.

Keywords: Data mining, algorithms, social media, data mining technique.

1 Introduction

The computerization of our society has substantially enhanced our capabilities for both generating and collecting data from diverse sources [Han and Kamber]. Every hour, a tremendous amount of data is generated in every sector. This explosive growth of data has created a need for new techniques and automated tools that can intelligently assist in transforming the vast amounts of data into useful knowledge. We are surrounded by various types of data generated from multiple fields such as finance, medicine, marketing, education and science. All this data is in raw format, which may not be directly useful to an individual, a researcher or a company. But once summarised, classified and distributed, it might help many users of this information.

Today, the use of social networks is growing ceaselessly and rapidly. More striking is the fact that these networks have become a substantial pool of unstructured data belonging to a host of domains, including business, government and health. The increasing reliance on social networks calls for data mining techniques that can help restructure the unstructured data and place it within a systematic pattern [5].

Journal of Seybold Report ISSN NO: 1533-9211

VOLUME 15 ISSUE 9 2020 Page: 632

P. Ristoski and H. Paulheim [6] have shown how Linked Open Data can be used at various stages of building content-based recommender systems. Their survey shows that, while numerous interesting research works have been performed, the full potential of the Semantic Web and Linked Open Data for data mining and KDD is still to be unlocked.

To extract the text information of a webpage, the position of the text can be accurately located by using multiple features of the text and the rules of webpage design. Based on these characteristics, Chongjun Wang and Peng Wei [11] proposed a method for extracting webpage text information based on multi-feature fusion.

The importance of users' sentiments has been realized by the business sector in the last decade. Since then, social media platforms and other websites have been used to extract users' opinions about products. This practice is called sentiment analysis or opinion mining. Opinion mining means identifying, extracting and understanding the user's attitude or opinion by analysing text. The process usually involves natural language processing, statistical analysis and machine learning techniques [18].
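As a minimal sketch of the opinion-mining idea described above, the following Python fragment labels a short text by counting words from a tiny sentiment lexicon. The word lists and the function name `sentiment` are hypothetical and chosen only for illustration; the systems surveyed later rely on much larger lexicons or on trained models.

```python
# Hypothetical toy lexicon; real systems use far larger resources or learned models.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment(text):
    """Label text as positive, negative or neutral by counting lexicon words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A purely lexicon-based count ignores negation and sarcasm, which is one reason statistical and machine learning techniques are usually combined with it.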

2 Techniques in Data Mining

Nowadays we all have access to more data than ever before. But making use of structured and unstructured data to implement improvements can be extremely challenging, because the sheer amount of information can diminish the benefits of having all that data.

Data mining is the process of detecting patterns in data to obtain insights relevant to one's needs. It is essential for both business intelligence and data science. Many data mining techniques can be used to turn raw data into actionable insights. They range from cutting-edge artificial intelligence to the basics of data preparation, both of which are key to maximizing the value of data investments. The techniques include:

1. Tracking patterns

2. Classification




3. Association

4. Outlier detection

5. Clustering

6. Regression

7. Prediction

8. Sequential patterns

9. Decision trees

10. Statistical techniques

11. Visualization

12. Neural networks

13. Machine learning and artificial intelligence

3 Algorithms in Data Mining

As mentioned in earlier sections, a voluminous amount of data is generated every day in every sector, in different types such as audio, video, 3D, social media posts, geospatial and other complex types, and this sheer volume is one of the issues in processing it. Such data is diverse, unstructured and fast changing, and it is not easy to categorize or organize. To meet this challenge, a range of automatic methods for extracting information are available; these are the data mining algorithms.

Fig. 1. Data Mining Algorithms

3.1 C4.5 Algorithm




C4.5 is a data mining algorithm used to generate a classifier in the form of a decision tree from a set of data that has already been classified. A classifier here refers to a data mining tool that takes data to be classified and tries to predict the class of new data.

Every data point has its own attributes. The decision tree created by C4.5 poses a question about the value of an attribute and, depending on those values, the new data gets classified. The training dataset is labelled with classes, making C4.5 a supervised learning algorithm. Decision trees are easy to interpret and explain, and C4.5 is relatively fast.
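C4.5 chooses which attribute to ask about by comparing gain ratios. The sketch below computes that criterion for categorical attributes only; it is not a full C4.5 implementation (no continuous attributes, no pruning), and the toy attributes `outlook` and `windy` are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attr, divided by the split information."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels):
    """Attribute the root node of the tree would test."""
    return max(rows[0].keys(), key=lambda a: gain_ratio(rows, labels, a))


# Hypothetical labelled training data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
```

On this data `best_attribute(rows, labels)` selects `outlook`, because splitting on it separates the two classes completely while `windy` carries no information about the class.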

3.2 K-means Algorithm

K-means is a clustering algorithm that creates k groups from a set of objects based on the similarity between objects. Group members are not guaranteed to be exactly alike, but they will be more similar to each other than to non-members. In its standard implementations, k-means is an unsupervised learning algorithm, as it learns the clusters on its own without any external information.
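The grouping step can be sketched with the usual alternating procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points). This is a minimal illustration for 2-D points; the fixed iteration count and the caller-supplied starting centroids are simplifying assumptions, whereas practical implementations usually pick the starting centroids randomly and stop on convergence.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster 2-D points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical data with two visually obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No class labels are passed in; the two groups emerge only from the distances between points.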

3.3 Support Vector Machines

In terms of tasks, a support vector machine (SVM) works similarly to the C4.5 algorithm, except that SVM doesn't use decision trees at all. SVM learns from the dataset and defines a hyperplane to classify data into two classes. In two dimensions, a hyperplane is simply a line, described by a linear equation. When the classes cannot be separated in the original space, SVM can project the data to higher dimensions (the kernel trick). Once projected, SVM finds the best hyperplane, the one with the largest margin, to separate the data into the two classes.
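The effect of projecting data to a higher dimension can be illustrated without training anything. In the sketch below, 1-D points that no single threshold can separate become linearly separable after the illustrative mapping x -> (x, x^2). The hyperplane weights are hand-picked for the example, not learned; an actual SVM would find the maximum-margin hyperplane from the data.

```python
def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def project(x):
    """Map a 1-D value to 2-D as (x, x^2); an illustrative feature map."""
    return (x, x * x)


# 1-D toy data: outer points are class +1, inner points class -1,
# so no single threshold on x separates the classes.
data = [(-2.0, 1), (-1.0, -1), (1.0, -1), (2.0, 1)]

# Hand-picked (not learned) hyperplane in the projected space: x^2 - 2.5 = 0.
w, b = (0.0, 1.0), -2.5
predictions = [classify(w, b, project(x)) for x, _ in data]
```

Every point lands on the correct side of the hyperplane in the projected space, which is the situation the projection step is meant to create.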

3.4 Apriori Algorithm

The Apriori algorithm learns association rules, a data mining technique for discovering correlations between variables in a database. It is applied to a database containing a large number of transactions. Apriori is used for discovering interesting patterns and mutual relationships, and hence is treated as an unsupervised learning approach. Though the algorithm is regarded as effective, it can consume a lot of memory and disk space and take a lot of time, because it generates many candidate itemsets and scans the database repeatedly.
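The frequent-itemset search at the heart of Apriori can be sketched as follows. The sketch covers only the itemset-mining stage, not rule generation, and the basket contents and the absolute `min_support` threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the database counts every candidate of size k.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that contain
        # an infrequent subset (the Apriori property).
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]
frequent = apriori(baskets, min_support=3)
```

Each level requires another scan of all transactions, which is where the time and memory cost mentioned above comes from.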

3.5 Expectation-Maximization Algorithm




Expectation-Maximization (EM) is used as a clustering algorithm for knowledge discovery, just like the k-means algorithm. The EM algorithm works in iterations to maximize the likelihood of the observed data. In each iteration it estimates the parameters of a statistical model with unobserved (latent) variables that is assumed to have generated the observed data. EM is again unsupervised learning, since it is used without any labelled class information.
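A minimal sketch of the two alternating steps, assuming the data come from a mixture of two 1-D Gaussians. The starting means, the fixed number of iterations and the sample values are choices made for this illustration only.

```python
import math


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, means, iterations=50):
    """Fit a two-component 1-D Gaussian mixture, starting from the given means."""
    means = list(means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(2)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate the parameters from the weighted points.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances


# Hypothetical unlabelled sample drawn around two centres.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(data, means=[0.0, 6.0])
```

Unlike k-means, each point belongs to both clusters with some probability, and the cluster memberships are the unobserved variables the model reasons about.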

3.6 PageRank Algorithm

PageRank is a link analysis algorithm that determines the relative importance of an object linked within a network of objects. Link analysis is a type of network analysis that explores the associations among objects. Search engines use it to determine the relative importance of a webpage and rank it accordingly. PageRank is treated as an unsupervised learning approach, as it determines relative importance just by considering the links and doesn't require any other inputs.
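The ranking can be sketched as a simple iteration over the link graph. The damping factor of 0.85, the iteration count and the four-page graph are assumptions for this example; the sketch also assumes every linked page appears as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked from A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
```

Only the link structure is used: page C ends up most important because three pages point to it, and page D least important because none do.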

3.7 AdaBoost Algorithm

AdaBoost is a boosting algorithm used to construct a classifier, a data mining tool that takes data and predicts its class based on the inputs. Boosting is an ensemble learning approach that runs multiple learning algorithms and combines them. Boosting algorithms take a group of weak learners and combine them into a single strong learner. A weak learner classifies data with low accuracy, only somewhat better than chance. After the user specifies the number of rounds, each successive AdaBoost iteration re-weights the training examples so that the next learner concentrates on those misclassified so far, and gives each selected learner a weight according to its accuracy. This makes AdaBoost an elegant way to auto-tune a classifier. AdaBoost is flexible and versatile, as it can incorporate most learning algorithms and handle a large variety of data.
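A minimal sketch of the boosting rounds, assuming one-dimensional inputs, labels of +1 and -1, and decision stumps (single-threshold rules) as the weak learners. The data and the choice of three rounds are made up for the example.

```python
import math


def stump_predict(threshold, polarity, x):
    """A decision stump: predicts polarity if x < threshold, else -polarity."""
    return polarity if x < threshold else -polarity


def adaboost(xs, ys, rounds):
    """Return the ensemble as a list of (alpha, threshold, polarity) tuples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    thresholds = sorted(set(xs)) + [max(xs) + 1.0]
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error under the current weights.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                err = sum(w for w, x, y in zip(weights, xs, ys)
                          if stump_predict(t, polarity, x) != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified points and lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(t, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify correctly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
```

Each individual stump misclassifies at least two of the six points, yet the weighted vote of the three stumps classifies all of them correctly.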

3.8 kNN Algorithm

kNN (k-nearest neighbours) is a lazy learning algorithm used for classification. A lazy learner does little during training except store the training data; it starts classifying only when new unlabelled data is given as input. C4.5, SVM and AdaBoost, on the other hand, are eager learners that build the classification model during training itself. Since kNN is given a labelled training dataset, it is treated as a supervised learning algorithm.
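The lazy behaviour is easy to see in code: there is no training function at all, only a stored list of labelled points that is consulted at prediction time. The sketch assumes Euclidean distance, k = 3 and made-up data; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """training is a list of (point, label) pairs; return the majority label of the k nearest."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical labelled training data, stored as-is.
training = [
    ((1.0, 1.0), "spam"), ((1.5, 2.0), "spam"), ((2.0, 1.0), "spam"),
    ((6.0, 6.0), "ham"), ((7.0, 6.5), "ham"), ((6.5, 7.0), "ham"),
]
```

For instance, `knn_predict(training, (1.2, 1.1))` returns `"spam"`, because the three closest stored points all carry that label.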

3.9 Naive Bayes Algorithm




Naive Bayes is not a single algorithm, though it works efficiently as one; it is a family of classification algorithms. The assumption shared by the family is that every feature of the data being classified is independent of all other features, given the class. Naive Bayes is provided with a labelled training dataset from which it constructs its probability tables, so it is treated as a supervised learning algorithm.
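A minimal sketch of one member of the family, for categorical features: training only builds count tables, and prediction multiplies the per-feature probabilities (added in log space here) under the independence assumption. The add-one smoothing and the weather-style data are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Build the count tables: class counts and per-class, per-feature value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> Counter of values
    vocab = defaultdict(set)             # feature index -> set of observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the class with the highest posterior score for the given row."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # log P(class) + sum of log P(feature value | class), with add-one smoothing.
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(label, i)][value] + 1
            score += math.log(numerator / (count + len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
```

With these tables, `predict(model, ("rainy", "mild"))` returns `"yes"` and `predict(model, ("sunny", "hot"))` returns `"no"`.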

3.10 CART Algorithm

CART stands for classification and regression trees. It is a decision tree learning algorithm that gives either regression or classification trees as output. In CART, each decision node has precisely two branches. Just like C4.5, CART is also a classifier. The regression or classification tree model is constructed from a labelled training dataset provided by the user; hence it is treated as a supervised learning technique.
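The two-branch structure comes from how CART chooses a split: for classification it looks for the single threshold that minimises the weighted Gini impurity of the two resulting groups. The sketch below finds one such split on one numeric feature; it does not grow a whole tree, and the feature `ages` and labels `bought` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_binary_split(values, labels):
    """Find the threshold on one numeric feature that minimises weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical labelled data.
ages = [22, 25, 30, 41, 47, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_binary_split(ages, bought)
```

Here the chosen threshold is 35.5 with impurity 0.0, so the node would send records with age up to 35.5 down one branch and all others down the second branch.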

4 Data Mining in Social Media

| S. No. | Author | Paper Title | Algorithm / Tools | Advantages / Limitations |
|---|---|---|---|---|
| 1 | Zhang et al. | Multi-modal Sentiment Classification via Semi-supervised Learning | Semi-supervised learning algorithm | Greatly advances the state of the art of multi-modal sentiment classification by leveraging unlabelled data. Does not perform other multi-modal tasks, such as sarcasm detection and personality recognition. |
| 2 | Li et al. | Mining Heterogeneous Influence and Indirect Trust for Recommendation | Collaborative filtering technique; social recommendation model ReHI | Proposed method performs better than state-of-the-art recommendation models, especially for cold-start users. A small number of distrust relationships have great impact on recommendation. |
| 3 | Azwa Abdul Aziz, Andrew Starkey, Elissa Nadia Madi | Predicting Supervise Machine Learning Performances for sentiment analysis using contextual-based approaches | SML model | Finds the relationship between words and sources, which can provide a mechanism to predict SML model performance. Relationship in CA can be used to further understand the structured knowledge of the data. |
| 4 | Monali Bordoloi, Saroj Kr. Biswas | Keyword extraction from micro-blogs using collective weight, Social Network Analysis and Mining | A novel unsupervised graph-based keyword extraction method called keywords from collective weights (KCW) | The most important keywords extracted by KCW model are certainly important for a particular topic in comparison with other existing methods. Semantic methods for keyword extraction can be explored along with the proposed model to get better results. |
| 5 | Halgurd S. Maghdid, Member, IEEE | Web News Mining Using New Features: A Comparative Study | K-Nearest-Neighbor (k-NN), decision tree and deep-learning recurrent neural network (such as Long Short-Term Memory 'LSTM') | Using new features such as location and time information with the web news mining techniques. Combining different classification techniques to further improvements within the lower number of web news documents and it will take less time. |
| 6 | Chongjun Wang, Peng Wei | A novel web page text information extraction method | WFFTE (Webpage multi-feature fusion text extraction) | The method has universality and high accuracy for the text information extraction of single text and multi-text web pages. There are many other features for webpage texts, and especially the multi-text body has its own uniqueness. |
| 7 | Dr. Ritu Bhargava, Abhishek Kumar, Sweta Gupta | Collaborative methodologies for pattern evaluation for web personalization using Semantic Web Mining | DBpedia and Linked MDB algorithms | Improves the efficiency of classification results with more accuracy comparatively. Better future prospects in domain of data mining especially in semantic web mining for better efficiency of rank predictor algorithm. |
| 8 | Rinkal Sardhara, Kamaljit I. Lakhataria | Impact of Different Domain Inlink, Outlink and Rechability on Relevance of Web Page Using Correlation | Pearson correlation technique | Helps to reduce mutual reinforcement effect on web page ranking. Reachability parameter can be used to select one particular link from the multiple links from the same domain. |
| 9 | Bharti Pooja M., Prof. Tushar J. Raval | Improving Web Page Access Prediction using Web Usage Mining and Web Content Mining | Least Frequent Ones Leave strategy, First Come First Leave strategy, Older than Timeframe Leave strategy | Three session discarding strategies are used to reduce prediction time and improve prediction accuracy. Automation is needed in this process. |
| 10 | Roberto Saia and Salvatore Carta | Introducing a Vector Space Model to Perform a Proactive Credit Scoring | Linear Dependence Based (LDB) approach | The best state-of-the-art approaches of credit scoring such as random forests, even using only past non-default cases. Does not include the default past cases in the training process. |
| 11 | M. Alshammari et al. | Mining Semantic Knowledge Graphs to Add Explainability to Black Box Recommender Systems | A novel approach for latent factor-based black box recommendation model | Generates both accurate predictions and semantically rich explanations that justify the predictions. |
| 12 | Nehal Mohamed Ali, Marwa Mostafa Abd El Hamid and Aliaa Youssif | SENTIMENT ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING MODELS | Deep learning, LSTM, CNN | The proposed deep learning techniques (MLP, CNN, LSTM, and CNN_LSTM) have outperformed SVM, Naïve Bayes, RNTN. |
| 13 | S. Kausar et al. | Sentiment Polarity Categorization Technique for Online Product Reviews | Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, Gradient Boosting and Sequence | Review-level categorization show promising outcomes as the accuracy. Automated sentiment analysis is helpful for analyzing big textual information, it still has limitation. |
| 14 | Jundong Chen, Md Shafaeat Hossain, Huan Zhang | Analyzing the sentiment correlation between regular tweets and retweets | Sentiment analysis and machine learning | Provides instructional information for modeling information propagation in human society. To apply a better sentiment model for well handling negative sentiment instead of just simply summing up the positive and negative sentiments. |
| 15 | Tao You, Yamin Li, Bingkun Sun, and Chenglie Du | Multi-source Data Stream Online Frequent Episode Mining | Episode mining | Widely used in the field of smart transportation and sensors. |
| 16 | Harish Kumar, Anuradha, A.K.Solanki, Krishna Kant Singh | Progressive Machine Learning Approach with WebAstro for Web Usage Mining | Clustering technique with WEBASTRO tool | Ideal approach for web mining and predicting users behaviour for next visit and automatic website modification. To explore the use of these techniques in automated software for predicting their next visit. |
| 17 | L. Jain, R. Katarya and S. Sachdeva | Recognition of opinion leaders coalitions in online social network using game theory | Game theory-based Opinion Leader Detection (GOLD) algorithm | Strengthen the power of the coalition and produced the synergetic outcome. The innovative computational intelligence techniques along with the evolutionary game theory approach to detect the promising opinion leader in social networks. |
| 18 | Victoria Kayser and Erduana Shala | Scenario development using web mining for outlining technology futures | Web and text mining | The rapid overview with the visualizations remarkably reduces the reading effort. Other data sources should be explored. |
| 19 | T Mustaqim et al 2020 | Twitter text mining for sentiment analysis on government's response to forest fires with vader lexicon polarity detection and k-nearest neighbor algorithm | VADER lexicon polarity detection and K-Nearest Neighbors | The sentiment analysis process can almost run automatically. |

5 Conclusion

This work summarizes recent developments in social media network analysis. Literature works are reviewed and a comparative study of work on social networks is presented. Various data mining techniques and algorithms, and especially the growth and development of data mining in social media, have been the focus. This work helps in studying the relevance of data mining techniques and ideas for social network analysis, and reviews the related literature concerning web mining and social networks.




References

1. Sherry Y. Chen and Xiaohui Liu, “The contribution of data mining to information

science”, Journal of Information Science, 30 (6) 2004, pp. 550–558, 2004.

2. Salvador García, Julián Luengo, Francisco Herrera, “Tutorial on practical tips of the

most influential data preprocessing algorithms in data mining”,

http://dx.doi.org/10.1016/j.knosys.2015.12.006 0950-7051/© 2015.

3. Shuja Mirza, Dr. Sonu Mittal, Dr. Majid Zaman, “A Review of Data Mining Literature”.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No.

11, November 2016.

4. Mohammad Noor Injadat, Fadi Salo and Ali Bou Nassif, “Data Mining Techniques in

Social Media: A Survey”, Neurocomputing,

http://dx.doi.org/10.1016/j.neucom.2016.06.045, 2016.

5. Zhang et al.: “Multi-modal Sentiment Classification via Semi-supervised Learning”,

Citation information: DOI 10.1109/ACCESS.2020.2969205,IEEE Access, 2016.

6. P. Ristoski, H. Paulheim, “Semantic Web in data mining and knowledge discovery: A comprehensive survey”, Web Semantics: Science, Services and Agents on the World Wide Web (2016), http://dx.doi.org/10.1016/j.websem.2016.01.001, 2016.

7. Li et al: “Mining Heterogeneous Influence and Indirect Trust for Recommendation”

(December 2019) Citation information: DOI 10.1109/ACCESS.2020.2968102,IEEE

Access. 2017.

8. Azwa Abdul Aziz , Andrew Starkey , Elissa Nadia Madi, “Predicting Supervise Machine

Learning Performances for sentiment analysis using contextual-based approaches”, DOI

10.1109/ACCESS.2019.2958702, IEEE Access. 2017.

9. Monali Bordoloi , Saroj Kr. Biswas, “Keyword extraction from micro-blogs using

collective weight, Social Network Analysis and Mining” (2018) 8:58

https://doi.org/10.1007/s13278-018-0536-8. 2018.

10. Halgurd S. Maghdid, Member, IEEE, “Web News Mining Using New Features: A

Comparative Study”, Citation information: DOI 10.1109/ACCESS.2018.2890088, IEEE

Access, 2018.

11. Chongjun Wang, Peng Wei, “A novel web page text information extraction method”,

2019 IEEE 3rd Information Technology,Networking,Electronic and Automation Control

Conference (ITNEC 2019), 2019.

12. Dr. Ritu Bhargava, Abhishek Kumar, Sweta Gupta, “Collaborative methodologies for

pattern evaluation for web personalization using Semantic Web Mining”, Second

International Conference on Smart Systems and Inventive Technology (ICSSIT 2019) IEEE

Xplore Part Number: CFP19P17-ART; ISBN:978-1-7281-2119-2, 2019.

13. Rinkal Sardhara, Kamaljit I. lakhataria, "Impact of Different Domain Inlink,Outlink and

Rechability on Relevance of Web Page Using Correlation", Proceedings of the International

Conference on Intelligent Computing and Control Systems (ICICCS 2019) IEEE Xplore

Part Number: CFP19K34-ART; ISBN: 978-1-5386-8113-8, 2019.

14. Bharti Pooja M., Prof. Tushar J. Raval, "Improving Web Page Access Prediction using Web Usage Mining and Web Content Mining", Proceedings of the Third International Conference on Electronics Communication and Aerospace Technology (ICECA 2019), IEEE Conference Record # 45616; IEEE Xplore ISBN: 978-1-7281-0167-5, 2019.

15. Roberto Saia and Salvatore Carta, "Introducing a Vector Space Model to Perform a

Proactive Credit Scoring ", Springer Nature Switzerland AG 2019 A. Fred et al. (Eds.):

IC3K 2016, CCIS 914, pp. 125–148, 2019. https://doi.org/10.1007/978-3-319-99701-8_6,

2019.

16. M. Alshammari et al.: “Mining Semantic Knowledge Graphs to Add Explainability to

Black Box Recommender Systems”, 10.1109/ACCESS.2019.2934633, VOLUME 7, 2019.

17. Nehal Mohamed Ali, Marwa Mostafa Abd El Hamid and Aliaa Youssif, "SENTIMENT

ANALYSIS FOR MOVIES REVIEWS DATASET USING DEEP LEARNING

MODELS", International Journal of Data Mining & Knowledge Management Process

(IJDKP) Vol.9, No.2/3, May 2019.

18. S. Kausar et al.: “Sentiment Polarity Categorization Technique for Online Product

Reviews”, DOI 10.1109/ACCESS.2019.2963020, VOLUME 8, 2020.

19. Jundong Chen, Md Shafaeat Hossain, Huan Zhang, “Analyzing the sentiment correlation between regular tweets and retweets”, https://doi.org/10.1007/s13278-020-0624-4, © Springer-Verlag GmbH Austria, part of Springer Nature 2020.

20. Tao You, Yamin Li, Bingkun Sun, and Chenglie Du, "Multi-source Data Stream Online

Frequent Episode Mining", DOI 10.1109/ACCESS.2020.2997337,IEEE Access, 2020.

21. Harish Kumar, Anuradha, A.K.Solanki, Krishna Kant Singh, "Progressive Machine

Learning Approach with WebAstro for Web Usage Mining", International Conference on

Computational Intelligence and Data Science (ICCIDS 2019), 10.1016/j.procs.2020.03.351,

2020.

22. L. Jain, R. Katarya and S. Sachdeva, "Recognition of opinion leaders coalitions in

online social network using game theory ", Knowledge-Based Systems 203 (2020) 106158,

2020.

23. Victoria Kayser and Erduana Shala,"Scenario development using web mining for

outlining technology futures", Technological Forecasting & Social Change,

https://doi.org/10.1016/j.techfore.2020.120086 , 2020.

24. T Mustaqim et al 2020, "Twitter text mining for sentiment analysis on government's response to forest fires with vader lexicon polarity detection and k-nearest neighbor algorithm", J. Phys.: Conf. Ser. 1567 032024, 2020.


