


International Journal of Applied Sciences & Engineering

www.ijapscengr.com ISSN 2308-5088 [email protected]

RESEARCH ARTICLE

Efficient Data Filtering Algorithm for Big Data Technology in Telecommunication Network

1James Agajo, 2Nosiri Onyebuchi, 3Okhaifoh J and 4Godwin Okpe

1School of Engineering, Federal University of Technology, Minna, Nigeria; 2Department of Electrical/Electronic Engineering, Federal University of Technology, Owerri, Nigeria; 3Federal University of Petroleum Resources, Effurun, Delta State, Nigeria; 4Programmes Department, National Board for Technical Education, Kaduna, Nigeria

ARTICLE INFO

ABSTRACT

Received: February 12, 2015
Revised: March 20, 2015
Accepted: June 22, 2015

Efficient data filtering for Big Data technology in telecommunication is a concept aimed at effectively filtering desired information for preventive purposes. The challenges posed by the unprecedented rise in the volume, variety and velocity of information have necessitated the exploration of various methods. Big Data, which simply refers to data sets so large and complex that traditional data processing tools and technologies cannot cope with them, is considered. A process for examining such data to uncover hidden patterns was developed. This was achieved through an algorithm comprising several stages: an artificial neural network, a backtracking algorithm, depth-first search, branch and bound, dynamic programming and an error check. The algorithm developed gave rise to the flowchart, with each block representing a sub-algorithm.

Key words: Big data, Filtering, Variability, Velocity, Volume

*Corresponding Address: James Agajo [email protected]

Cite This Article as: Agajo J, N Onyebuchi, J Okhaifoh and G Okpe, 2015. Efficient data filtering algorithm for big data technology in telecommunication network. Inter J Appl Sci Engr, 3(1): 19-26. www.ijapscengr.com

INTRODUCTION

Background

Since the emergence of satellite technology, the world has experienced a massive explosion of data transfer from one point to another, with an endless demand for increased bandwidth to accommodate some very pressing services such as: i. Facebook ii. Twitter iii. Skype iv. Voice calls v. Video calls vi. Google Plus vii. Flickr viii. LinkedIn ix. Amazon x. etc.

All of these add to the volume of data transferred between source and sink. This large chunk of data has become too difficult for present-day conventional tools to manage; hence the need for Big Data techniques has become necessary.

Big data is a broad term for data sets so large or complex that traditional data processing applications cannot handle them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set [1].

Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reductions and reduced risk (Big Data, 2015). Big Data simply refers to data sets that are so large and complex that traditional data processing tools and technologies cannot cope with them. The process of examining such data to uncover hidden patterns in them is referred to as Big Data Analytics (Christopher S., 2014).

There is much to gain from exploiting big data in telecommunications: it can increase revenue and reduce customer churn and operating costs.

Communications service providers all over the globe are seeing an unprecedented rise in volume, variety and velocity of information ("big data") due to next generation mobile network rollouts, increased use of smart phones and rise of social media (Keith C.C. Chan, 2013).

Service providers who can tackle the big data challenge will differentiate themselves from competitors, gain market share and increase revenue and profits with innovative new services. The essence of this paper is to come up with ways of effectively filtering desired information for preventive purposes in areas such as: i. Gang/drug offences ii. Violent crime iii. Cyber crime


iv. Advance fee fraud v. Cultism, etc.

Crime intelligence extension

A robust e-crime detection technique will lower risk, detect fraud and monitor cyber security in real time. Big Data will augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types (e.g. social media, emails, sensors, telco) and sources of under-leveraged data, significantly improving intelligence, security and law enforcement insight. Data could be filtered with the aid of keywords.

A keyword is a word that is reserved by a program because the word has a special meaning. Keywords can be commands or parameters [5].

Instead of chasing and finding criminals on the street, Big Data provides a medium through which certain actors can be detected within a telecommunication network, and subsequently arrested, with the aid of data.
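As an illustration of such keyword-based filtering, here is a minimal sketch (the keyword list, messages and matching rule are hypothetical, not the authors' implementation):

# Minimal keyword filter over a stream of messages (illustrative sketch).
# The keywords and messages are hypothetical examples.
KEYWORDS = {"fraud", "attack", "transfer"}

def matches_keywords(message: str, keywords: set) -> bool:
    """Return True if any reserved keyword appears in the message."""
    words = set(message.lower().split())
    return bool(words & keywords)

stream = [
    "routine network maintenance tonight",
    "move the transfer before the audit",
]
flagged = [m for m in stream if matches_keywords(m, KEYWORDS)]
print(flagged)  # ['move the transfer before the audit']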

Successfully harnessing big data can help service providers achieve three critical objectives for telecommunications transformation: deliver smarter services that generate new sources of revenue; transform operations to achieve surveillance and service excellence; and build smarter networks to drive consistent, high-quality data capture.

Data mining, by contrast, refers to the activity of going through a collection of data looking for relevant or pertinent information. This type of activity is a good example of the old axiom "looking for a needle in a haystack." The idea is that businesses collect massive sets of data that may be homogeneous or automatically collected. Decision-makers need access to smaller, more specific pieces of data from those large sets. Data mining can help to uncover pieces of information that will inform transparency and help chart the course for business. Data mining is also relevant in cloud computing (Amir Gandomi, 2015).

Data mining

Data mining can involve the use of different kinds of software packages such as analytics tools. It can be automated, or it can be largely labor-intensive, where individual workers send specific queries for information to an archive or database. Generally, data mining refers to operations that involve relatively sophisticated search operations that return targeted and specific results.

Cloud computing

Cloud computing is a general term for anything that involves delivering hosted services over the Internet, as shown in figures 1 and 1.1. These services are broadly divided into three categories: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). Cloud computing could also be among the factors responsible for the increase in data volume.

For big data, we see cloud computing as a veritable tool for enhancing data filtering techniques. Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business, and society, as the Internet has become. Why? More data may lead to more accurate analyses, more accurate search may lead to more confident results, and better decisions can mean greater operational efficiencies, cost reductions and reduced risk. Analysis of data sets can find new correlations, to "spot business trends, prevent diseases, combat crime and so on" [1]. Scientists, practitioners of media and advertising, and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-science work, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research [3].

Figure 1: Network of data transmission


Fig. 1.1: Cloud computing

Fig. 1.3: Data range

Fig. 2: Current Cisco Visual Networking Index

Big data is characterised by three V's.

Volume: Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

Velocity: Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

Variety: Data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.

Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data involved.

Complexity: Today's data comes from multiple sources, and it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control.

Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial (remote sensing) platforms, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created [9]. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

The real issue is not that you are acquiring large amounts of data; it is what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable: 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making. For instance, by combining big data and high-powered analytics, it is possible to optimize routes for many thousands of package delivery data points in the course of transmission.

IP traffic

The current Cisco Visual Networking Index (VNI) forecast projects global IP traffic to nearly triple from 2014 to 2019. See Appendix A for a detailed summary. Overall IP traffic is expected to grow to 168 exabytes per month by 2019, up from 59.9 exabytes per month in 2014, a CAGR of 23 percent, as shown in figure 2.
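The quoted growth rate can be checked directly from the two traffic figures: over the five years from 2014 to 2019,

CAGR = (168 / 59.9)^(1/5) − 1 ≈ 1.229 − 1 ≈ 0.23,

i.e. roughly 23 percent per year, consistent with the forecast above.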

The New York Times provides the following data on its log files:

• 50 GB of uncompressed log files
• 10 GB of compressed log files
• 0.5 GB of processed log files
• 50-100M clicks
• 4-6M unique users
• 7,000 unique pages with more than 100 hits
• Index size: 2 GB
• Pre-processing and indexing time: ~10 min on a workstation (4 cores & 32 GB), ~1 hour on EC2 (2 cores & 16 GB)


Fig. 3: Intercontinental network for Data route Transmission 1 Source: Marko Grobelnik Jozef Stefan (2012).

Fig. 4: Intercontinental network for Data route Transmission 2 Source: Marko Grobelnik Jozef Stefan (2012)

Fig. 5: Branch of distributed data Sensing device

The diagrams in figures 3 and 4 are optimised route diagrams for communication channels through which data is transmitted. It is estimated that in one minute on the Internet, 640 terabytes of data are transferred, 100k tweets are posted, and 204 million e-mails are sent.

MATERIALS AND METHODS

This work develops an algorithm which filters specified data out of a large chunk of data. It involves searching for the correct data within a particular search space and applying the principle of backtracking; the data to be searched could be keywords numbering more than a million, so large and complex is the pool. By adopting Depth First Search (DFS), the bank of data is first subjected to a data search with the aid of distributed data sensing devices (software) which cover the large pool of data; the sensing devices are distributed within the search space of the pool. Recall that Big Data simply refers to data sets so large and complex that traditional data processing tools and technologies cannot cope with them. We adopt the backtracking procedure to realise the first round of data filtering within the search space.

To be explicit, the following methods will be used in this work: 1. Neural Network 2. Backtracking Algorithm 3. Depth First Search (DFS) 4. Branch and Bound 5. Dynamic Programming 6. Error detection

Branch and bound

The Branch and Bound method will be used to distribute the search space among the data sensing nodes, in what is called the distributed data sensing method.

The distributed sensing method will be guided by the principle of branch and bound. A branch-and-bound algorithm consists of a systematic enumeration of data solutions by means of state space search: the set of candidate solutions is thought of as forming a rooted tree with the full set at the root. The algorithm explores branches of this tree, which represent subsets of the solution set. The branches are well described by the diagram shown in figure 5.

The diagram shows the branches assigned to the various distributed data sensing systems. Since the Big Data pool is quite large and complex, the bounding method becomes necessary; it specifically assigns sensing boundaries to the data sensing devices (Miller et al., 2012).

The bounding method is a design paradigm for discrete and combinatorial optimization problems. A branch-and-bound algorithm consists of a systematic enumeration of the solutions by means of state space search: the set of data solutions is thought of as forming a rooted tree with the full set at the root.

S = {S_i, S_j, S_k, …, S_L} .......................................... (1)

S stands for the big data bank, while each data sensing device has a specific boundary it can cover, represented by subsets of S:

S_i ⊂ S, S_j ⊂ S, S_k ⊂ S, S_L ⊂ S

These subsets describe the bound that each data detecting sensor can cover.
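As an illustration of how such bounded subsets might be searched, here is a minimal branch-and-bound sketch; the per-region upper bounds, scoring function and data are hypothetical assumptions, not the authors' implementation:

# Branch-and-bound sketch over a partitioned data pool. Each sensing node
# owns one bounded region and exposes a cheap precomputed upper bound on
# relevance (here, a cached count); regions that cannot beat the best
# score found so far are pruned. Names and data are illustrative only.
from typing import List, Tuple

def score(record: str, keyword: str) -> int:
    """Toy relevance score: number of keyword occurrences in the record."""
    return record.lower().count(keyword)

def best_match(regions: List[Tuple[int, List[str]]], keyword: str) -> str:
    """regions: (upper_bound, records) pairs, one per sensing node."""
    best, best_score = None, -1
    # Visit the most promising regions first so pruning bites early.
    for bound, records in sorted(regions, key=lambda r: r[0], reverse=True):
        if bound <= best_score:        # bound step: skip the whole region
            continue
        for record in records:         # branch step: search inside region
            s = score(record, keyword)
            if s > best_score:
                best, best_score = record, s
    return best

regions = [
    (0, ["routine traffic", "status ok"]),
    (2, ["fraud alert: possible fraud", "benign message"]),
]
print(best_match(regions, "fraud"))  # -> 'fraud alert: possible fraud'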

When these devices sense the data bank, note that not all the data is useful; the system is made to get rid of the unwanted data by using the principle of backtracking.


The backtracking algorithm enumerates a set of partial results (data) that, in principle, could be completed in various ways to give all the possible solutions to the given problem. The completion is done incrementally, by a sequence of candidate extension steps.

Conceptually, the partial results are represented as the nodes of a tree structure, the potential search tree. Each partial candidate is the parent of the candidates that differ from it by a single extension step; the leaves of the tree are the partial candidates that cannot be extended any further.

The backtracking algorithm traverses this search tree recursively, from the root down, in depth-first order. At each node c, the algorithm checks whether c can be completed to a valid solution.

If it cannot, the whole sub-tree rooted at c is skipped. Otherwise, the algorithm (1) checks whether c itself is a valid solution, and if so reports it to the admin and (2) recursively enumerates all sub-trees of c. Therefore, the actual search tree that is traversed by the algorithm is only a part of the potential tree. The total cost of the algorithm is the number of nodes of the actual tree times the cost of obtaining and processing each node. This fact should be considered when choosing the potential search tree and implementing the skipping test.

The first source sensing node starts the sensing process by picking up the particular keywords it can handle; this is done by all the sensors at the source. But since the data bank is so large, accuracy becomes a key issue, and an artificial neural network becomes necessary: a neural network is set up to further filter the information so as to get the right kind of data we are looking for. Each source sensing device relays its signal to the next sensing node for further filtering; the information to be sensed is X, with a weight level W.

Design methodology

This work adopts a neural network using the forward propagation method. The sensing nodes are referred to as R; these include the source nodes and the intermediary nodes, while F is the final sink node, where all data converges and is collected after filtering and processing.

We come up with a design where the data is a function of weight w and input value x:

x_1 - w_1
x_2 - w_2
...
x_n - w_n

A forward propagation technique will require a transition of the neural set for inter-node communication, considering the values of x as inputs and w as weights.

We therefore express the process of data transmission between nodes as

X11W1  X12W2  X13W3  ...  X1nWn
X21W1  X22W2  X23W3  ...  X2nWn
X31W1  X32W2  X33W3  ...  X3nWn
  .      .      .           .
Xm1W1  Xm2W2  Xm3W3  ...  XmnWn

In this work we refer to the reduced functional device (RFD) as the source and intermediary nodes, which form the input, and the full functional device (FFD) as the sink. We therefore substitute the input x in the neural set with R, while F is substituted for the FFD.

R = RFD, F = FFD. We refer to data from an RFD as R, so X = R.

R11W1  R12W2  R13W3  ...  R1nWn
R21W1  R22W2  R23W3  ...  R2nWn
R31W1  R32W2  R33W3  ...  R3nWn
  .      .      .           .
Rm1W1  Rm2W2  Rm3W3  ...  RmnWn

Fig. 6: Neural Network Matrix for WSN

The neural set above directly substitutes R for x; the matrix represents the solution for the equation in figure 6. The matrix in the equation can then be expressed as logistic regression, as in figure 7. Note that this model is only implementable when the addressing mode is suited to a linear routine process. The neural network will look like this:

Fig. 7: Recurrent Neural Network Model for WSN

In this feed-forward arrangement, the first column is the input R with its corresponding weight W, followed by the repeater nodes, referred to as intermediate nodes. They could also be termed hidden layers, which range between 1, l and L.

These layers can be represented with a black box.

This is logistic regression; it has the same structure as the linear model, where the inputs are combined linearly using weights and summed into a signal which passes through a soft threshold.

The eventual equation becomes


F(R) = F_NN( y(R−1), y(R−2), y(R−3), …, y(R−n) ) .......................................... (2)
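A minimal sketch of this forward propagation, with inputs combined linearly by weights and passed through a tanh soft threshold (the paper's θ = tanh); the layer sizes, weights and values are illustrative assumptions, not taken from the paper:

# Forward propagation sketch: each layer computes R_out = tanh(W . R_in),
# i.e. a weighted sum of the previous layer's outputs passed through the
# soft threshold theta = tanh. Layer sizes and weights are illustrative.
import math
from typing import List

def layer_forward(r_in: List[float], weights: List[List[float]]) -> List[float]:
    """One layer: signal s_j = sum_i w_ij * r_i, output R_j = tanh(s_j)."""
    return [math.tanh(sum(w * r for w, r in zip(row, r_in))) for row in weights]

def forward(r: List[float], layers: List[List[List[float]]]) -> List[float]:
    """Propagate the input through every layer, from source R to sink F."""
    for weights in layers:
        r = layer_forward(r, weights)
    return r

# Two inputs -> one hidden layer of two nodes -> one sink node F.
layers = [
    [[0.5, -0.2], [0.1, 0.4]],  # hidden layer weights (2x2)
    [[0.3, 0.7]],               # sink layer weights (1x2)
]
print(forward([1.0, 0.5], layers))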

Fig. 8: Black box for hidden layers (intermediary node)

S_j = signal; W_ij = weight

Fig. 9: Layer representation in NN Model

The idea is to evolve an equation to compute the data packet error between a transmitting node and a receiving node. Data loss occurs when one or more packets of data travelling across the nodes fail to reach their destination. The equation draws a relationship between the source node and the sink node:

F(R) = θ( Σ_i w_i R_i ) .......................................... (3)

A linear model is proposed, which is logistic regression: the inputs are combined linearly using weights and summed into a signal. This model is meant to implement a genuine probability; the error measure will be based on the likelihood measure given below:

∏_n P(F_n | R_n)

where P is the probability, maximised between the output F and the input R. The error measure is then derived as follows.

Here L = layers, d = dimension, and j = output; note that w refers to the data weight.

R_j^(l) = θ(s_j^(l)) = θ( Σ_i w_ij^(l) R_i^(l−1) )

The operation of the above equation shows how the new value of the input R is obtained from the previous R; the process is recursive. The soft threshold is θ(s) = tanh(s), and all weights are represented by w = w_ij^(l).

The error between transmitter and receiver can be defined as

e(w) = (d_ij − R_ij)^2 .......................................... (4)

where i is the row and j the column of the equation; R stands for the receiving and transmitting nodes, and d is the range between the transmitted and received data.

To find the error measurement we can evaluate ∂e(w)/∂w_ij^(l) one by one, analytically or numerically:

∇e(w) = ∂e(w)/∂w_ij^(l)

A trick for efficient computation is the chain rule; to solve for the gradient we write

∂e(w)/∂w_ij^(l) = ∂e(w)/∂s_j^(l) × ∂s_j^(l)/∂w_ij^(l)

This is the estimated error between two nodes. The diagram is an illustration of the transmission process; in between are intermediary nodes, though not shown. S_j = signal; W_ij = weight.

For every piece of data transmitted between two sensors, we can compute the error in the data as it is routed from layer L(i=1) to L(i=n); layer L represents the various layers of inter-sensor nodes between the source and the sink.

To compute the data error for sensors that are interlinked wirelessly, from the source node through the intermediary nodes to the sink node, we advance the same chain-rule expression:

∂e(w)/∂w_ij^(l) = ∂e(w)/∂s_j^(l) × ∂s_j^(l)/∂w_ij^(l)
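To make the chain-rule computation concrete, here is a small sketch for a single weight on one tanh node, checking the analytic derivative against a numerical estimate; the node, error form and values are illustrative assumptions:

# One tanh node: s = w * r, output = tanh(s), error e(w) = (d - tanh(s))^2.
# Analytic gradient by the chain rule: de/dw = de/ds * ds/dw, where
# de/ds = -2 (d - tanh(s)) * (1 - tanh(s)^2) and ds/dw = r.
import math

def error(w: float, r: float, d: float) -> float:
    return (d - math.tanh(w * r)) ** 2

def grad_analytic(w: float, r: float, d: float) -> float:
    s = w * r
    return -2.0 * (d - math.tanh(s)) * (1.0 - math.tanh(s) ** 2) * r

def grad_numeric(w: float, r: float, d: float, h: float = 1e-6) -> float:
    return (error(w + h, r, d) - error(w - h, r, d)) / (2 * h)

w, r, d = 0.8, 1.5, 0.9
print(grad_analytic(w, r, d))  # chain-rule derivative
print(grad_numeric(w, r, d))   # should agree to ~6 decimal places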

Implementation

This work adopts a backtracking method to realize data recovery and to ensure that filtering is accurate. P stands for the pool of data and c stands for the data to be retrieved; the backtracking algorithm takes the following form. In order to apply backtracking to a specific class of problems, one must provide the data P for the particular instance of the problem that is to be solved, and six procedural parameters: root, reject, accept, first, next, and output. These procedures should take the instance data P as a parameter and should do the following:

root(P): return the partial candidate at the root of the search tree.
reject(P,c): return true only if the partial candidate c is not worth completing.
accept(P,c): return true if c is a solution of P, and false otherwise.
first(P,c): generate the first extension of candidate c.
next(P,s): generate the next alternative extension of a candidate, after the extension s.
output(P,c): use the solution c of P, as appropriate to the application.

The backtracking algorithm then reduces to the call bt(root(P)), where bt is the following recursive procedure:

procedure bt(c)
  if reject(P,c) then return
  if accept(P,c) then output(P,c)
  s ← first(P,c)
  while s ≠ Λ do
    bt(s)
    s ← next(P,s)
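A runnable sketch of the generic bt procedure above, instantiated on a toy keyword-retrieval problem; the alphabet, target and reject/accept definitions are hypothetical illustrations, not the paper's actual filter:

# Generic backtracking skeleton bt(c), as in the procedure above, applied
# to a toy problem: build the target keyword "bad" one letter at a time.
# The reject/accept/first/next definitions are illustrative assumptions.
TARGET = "bad"
ALPHABET = "abd"

def root(P):            return ""                   # empty partial candidate
def reject(P, c):       return not P.startswith(c)  # prune dead sub-trees
def accept(P, c):       return c == P               # full solution found
def first(P, c):        return c + ALPHABET[0]      # first extension of c
def output(P, c):       print("found:", c)

def next_ext(P, s):
    """Next alternative extension: bump the last letter, or None if exhausted."""
    i = ALPHABET.index(s[-1])
    return None if i + 1 == len(ALPHABET) else s[:-1] + ALPHABET[i + 1]

def bt(P, c):
    if reject(P, c):
        return
    if accept(P, c):
        output(P, c)
    s = first(P, c)
    while s is not None:        # Λ (no extension) is modeled as None
        bt(P, s)
        s = next_ext(P, s)

bt(TARGET, root(TARGET))        # prints: found: bad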



Fig. 10: Map of intersection delay layers (George Hadley, 1960)

Fig. 11: Compact representation of the network (Ronald A. Howard, 1962)

Dynamic programming

Dynamic programming is an optimization approach that transforms a complex problem into a sequence of simpler problems; its essential characteristic is the multistage nature of the optimization procedure. More so than the optimization techniques described previously, dynamic programming provides a general framework for analyzing many problem types.

Within this framework a variety of optimization techniques can be employed to solve particular aspects of a more general formulation. Usually creativity is required before we can recognize that a particular problem can be cast effectively as a dynamic program; and often subtle insights are necessary to restructure the formulation so that it can be solved effectively.

Our first decision (from right to left) occurs with one stage, or intersection, left to go. If, for example, we are in the intersection corresponding to the highlighted box in Fig. 11, we incur a delay of three microseconds in this intersection and a delay of either eight or two microseconds in the last intersection, depending upon whether we move up or down. Therefore, the smallest possible delay, or optimal solution, in this intersection is 3 + 2 = 5 microseconds. Similarly, we can consider each intersection (box) in this column in turn and compute the smallest total delay as a result of being in each intersection. The solution is given by the bold-faced numbers in Fig. 10. The arrows indicate the optimal

solution, up or down, in any intersection with one stage, or one intersection, to go.
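A compact sketch of this right-to-left dynamic programming recursion over stages of intersection delays; the grid and values are illustrative, with only the 3, 8 and 2 microsecond figures echoing the worked example above:

# Right-to-left DP over stages of intersections: the best delay at an
# intersection is its own delay plus the smaller of the two reachable
# delays in the next stage. Delay values are illustrative assumptions.
from typing import List

def min_delay(stages: List[List[int]]) -> int:
    """stages[k][i] = delay at intersection i of stage k (left to right)."""
    best = stages[-1][:]                 # last stage: delay is just its own
    for stage in reversed(stages[:-1]):  # walk stages from right to left
        best = [
            d + min(best[i], best[min(i + 1, len(best) - 1)])  # up or down
            for i, d in enumerate(stage)
        ]
    return min(best)

# Example: the highlighted intersection (delay 3) can reach delays 8 or 2,
# so its optimal value is 3 + min(8, 2) = 5 microseconds.
print(min_delay([[3], [8, 2]]))  # -> 5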

Algorithm

The algorithm for realising this work is as follows: i. Start ii. Branch and Bound iii. Backtracking iv. Depth First Search (DFS) v. Dynamic Programming vi. Artificial Neural Network vii. Error Check viii. Stop. Note that each line in the algorithm is a sub-algorithm in its own right, too large for the pages of this write-up to contain; however, the flowchart in figure 12 summarises the entire procedure.

Fig. 12: Flowchart

The general constraint satisfaction problem consists in finding a list of integers x = (x[1], x[2], ..., x[n]), each in some range {1, 2, ..., m}, that satisfies some arbitrary constraint (Boolean function) F.

For this class of problems, the instance data P would be the integers m and n, and the predicate F. In a typical backtracking solution to this problem, one could define a partial candidate as a list of integers c = (c[1], c[2], ..., c[k]), for any k between 0 and n, that are to be assigned to the first k variables x[1], x[2], ..., x[k]. The root candidate would then be the empty list (). The first and next procedures would then be

function first(P,c)
  k ← length(c)


  if k = n then return Λ
  else return (c[1], c[2], ..., c[k], 1)

function next(P,s)
  k ← length(s)
  if s[k] = m then return Λ
  else return (s[1], s[2], ..., s[k-1], 1 + s[k])

Here "length(c)" is the number of elements in the list c.
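The same first/next scheme in runnable form, plugged into a bt skeleton like the one given earlier; the instance data and the constraint F (all values distinct) are hypothetical stand-ins:

# first/next extension procedures for lists over {1..m}, plugged into the
# generic backtracking skeleton. The constraint F (all-distinct) is a
# hypothetical example; P bundles the instance data (m, n, F).
from typing import Optional, Tuple

P = {"m": 3, "n": 3, "F": lambda c: len(set(c)) == len(c)}

def first(P, c: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
    return None if len(c) == P["n"] else c + (1,)

def next_ext(P, s: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
    return None if s[-1] == P["m"] else s[:-1] + (s[-1] + 1,)

def bt(P, c: Tuple[int, ...] = ()):
    if not P["F"](c):                    # reject: constraint already broken
        return
    if len(c) == P["n"]:                 # accept: full-length solution
        print(c)
    s = first(P, c)
    while s is not None:                 # Λ is modeled as None
        bt(P, s)
        s = next_ext(P, s)

bt(P)   # prints the 6 permutations of (1, 2, 3)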

Conclusion

Given the failings arising from the inability of existing systems to carry out effective data filtering for event monitoring, the combination of these seven blocks of flowchart sub-algorithms is a way of advancing the use of Big Data to sniff a network and get precise information from a large chunk of data for predictive and analytic purposes. Predictive analytics, which comprises a variety of techniques that predict future outcomes based on historical and current data, will help predict and forecast events. Filtering can be achieved with the aid of the tools presented, which could be used for crime prediction and reduction and much more.

REFERENCES

Chan KCC, 2013. Big Data Analytics for Drug Discovery. IEEE International Conference on Bioinformatics and Biomedicine.

Gandomi A and M Haider, 2015. Beyond the hype: Big data concepts, methods, and analytics.

Big Data, 2015. http://en.wikipedia.org/wiki, retrieved 11 June 2015.

Surdak C, 2014. Data Crush. Retrieved 14 February 2014.

Grobelnik M, 2012. Jozef Stefan Institute, Ljubljana, Slovenia. Presentation, Stavanger.

Hadley G, 1960. Nonlinear and Dynamic Programming. Addison-Wesley. Exercise 10 is based on Section 10.6.

Miller S, S Lucas, L Irakliotis, M Ruppa, T Carlson and B Perlowitz, 2012. Demystifying Big Data: A Practical Guide to Transforming the Business of Government. Washington: TechAmerica Foundation.

Howard RA, 1962. Dynamic Programming and Markov Processes. John Wiley & Sons. Exercise 24.