Journal of Digital Forensics, Security and Law
Volume 11, Number 2, Article 9 (2016)

An Automated Approach for Digital Forensic Analysis of Heterogeneous Big Data

Hussam Mohammed, School of Computing, Electronics and Mathematics, Plymouth
Nathan Clarke, School of Computing, Electronics and Mathematics, Plymouth; Edith Cowan University
Fudong Li, School of Computing, Electronics and Mathematics, Plymouth

Follow this and additional works at: https://commons.erau.edu/jdfsl
Part of the Computer Engineering Commons, Computer Law Commons, Electrical and Computer Engineering Commons, Forensic Science and Technology Commons, and the Information Security Commons

Recommended Citation:
Mohammed, Hussam; Clarke, Nathan; and Li, Fudong (2016) "An Automated Approach for Digital Forensic Analysis of Heterogeneous Big Data," Journal of Digital Forensics, Security and Law: Vol. 11, No. 2, Article 9.
DOI: https://doi.org/10.15394/jdfsl.2016.1384
Available at: https://commons.erau.edu/jdfsl/vol11/iss2/9

This Article is brought to you for free and open access by the Journals at Scholarly Commons. It has been accepted for inclusion in Journal of Digital Forensics, Security and Law by an authorized administrator of Scholarly Commons. For more information, please contact [email protected].

(c) ADFSL




AN AUTOMATED APPROACH FOR DIGITAL FORENSIC ANALYSIS OF HETEROGENEOUS BIG DATA

Hussam Mohammed 1, Nathan Clarke 1,2, Fudong Li 1

1 School of Computing, Electronics and Mathematics, Plymouth, UK
2 Security Research Institute, Edith Cowan University, Western Australia

[email protected], [email protected], [email protected]

ABSTRACT

The major challenges with big data examination and analysis are volume, complex interdependence across content, and heterogeneity. The examination and analysis phases are considered essential to a digital forensics process. However, traditional techniques for the forensic investigation use one or more forensic tools to examine and analyse each resource. In addition, when multiple resources are included in one case, there is an inability to cross-correlate findings, which often leads to inefficiencies in processing and identifying evidence. Furthermore, most current forensics tools cannot cope with large volumes of data. This paper develops a novel framework for digital forensic analysis of heterogeneous big data. The framework mainly focuses upon the investigation of three core issues: data volume, heterogeneous data, and the investigator's cognitive load in understanding the relationships between artefacts. The proposed approach focuses upon the use of metadata to solve the data volume problem, semantic web ontologies to solve the problem of heterogeneous data sources, and artificial intelligence models to support the automated identification and correlation of artefacts, reducing the burden placed upon the investigator to understand the nature and relationship of the artefacts.

Keywords: Big data, Digital forensics, Metadata, Semantic Web

INTRODUCTION

Cloud computing and big databases are increasingly used by governments, companies and users for processing and storing huge amounts of information. Generally, big data can be defined by three features, or three Vs: Volume, Variety, and Velocity (Li and Lu, 2014). Volume refers to the amount of data, Variety refers to the number of types of data, and Velocity refers to the speed of data processing. Big data usually includes many datasets which are too complicated to process using common/standard software tools, particularly within a tolerable period of time. Datasets are continuously increasing in size from a few terabytes to petabytes (Kataria and Mittal, 2014). According to IDC's annual Digital Universe study (2014), the overall created and copied volume of data in the digital world is set to grow 10-fold in the six years to 2020, from around 4.4 zettabytes to 44 zettabytes. This increasing interest in the use of big data and cloud computing services presents both opportunities for cybercriminals (e.g. exploitation and hacking) and challenges for digital forensic investigations (Cheng et al., 2013).


Digital forensics is the science that is concerned with the identification, collection, examination and analysis of data during an investigation (Palmer, 2001). A wide range of tools and techniques, available both commercially and under open source licenses (including EnCase, AccessData FTK, and Autopsy), have been developed to investigate cybercrimes (Ayers, 2009). Unfortunately, the increasing number of digital crime cases and extremely large datasets (e.g. those found in big data sources) are difficult to process using existing software solutions, including conventional databases, statistical software, and visualization tools (Shang et al., 2013). The goal of using traditional forensic tools is to preserve, collect, examine and analyse information on a computing device to find potential evidence. In addition, each source of evidence during an investigation is examined using one or more forensic tools to identify the relevant artefacts, which are then analysed individually. However, it is becoming increasingly common to have cases that contain many sources, and those sources are frequently heterogeneous in nature (Raghavan, 2014). For instance, hard disk drives, system and application logs, memory dumps, network packet captures, and databases might all contain evidence that belongs to a single case. However, the forensic examination and analysis is further complicated by the big data concept because most existing tools were designed to work with a single or small number of devices and a relatively small volume of data (e.g. a workstation or a smartphone). Indeed, tools are already struggling to deal with individual cybercrime cases that have a large volume of evidence (e.g. between 200 gigabytes and 2 terabytes of data) (Casey, 2011); and it is common for the volume of data that needs to be analysed within a big data environment to range from several terabytes up to a couple of petabytes.

The remainder of the paper is structured as follows: section 2 presents a literature review of the existing research in big data forensics, data clustering, data reduction, heterogeneous resources, and data correlation. Section 3 describes the link between metadata in various resources and digital forensics. Section 4 proposes an automated forensic examination and analysis framework for big, multi-sourced and heterogeneous data, followed by a comprehensive discussion in section 5. The conclusion and future work are highlighted in section 6.

LITERATURE REVIEW

Digital forensic investigations have faced many difficulties in overcoming the problems of analysing evidence in large and big datasets. Various solutions and techniques have been suggested for dealing with big data analysis, such as triage, artificial intelligence, data mining, data clustering, and data reduction techniques. Therefore, this section presents a literature review of the existing research in big data forensics and discusses some of the open problems in the domain.

Big Data Acquisition and Analytics

Xu et al. (2013) proposed a big data acquisition engine that merges a rule engine and a finite state automaton. The rule engine was used to maintain big data collection, determine problems and discover the reason for breakdowns, while the finite state automaton was utilised to describe the state of big data acquisition. In order for the acquisition to be successful, several steps are required. Initially, the rule engine needs to be created, including rule setup and refinement. After that, data acquisition is achieved by integrating the rule engine and two data automation processes (i.e. a device interaction module and an acquisition server). The device interaction module is


employed to connect directly to a device, and the acquisition server is responsible for data collection and transmission. Then, the engine executes predefined rules; and finally, the export process generates results. Generally, this combination gives a flexible way to verify the security and correctness of the acquisition of big data.

In attempting to deal with big data analysis, Noel and Peterson (2014) acknowledged the major challenges involved in finding relevant information for digital forensic investigations, due to an increasing volume of data. They proposed the use of Latent Dirichlet Allocation (LDA) to minimize practitioners' overhead in two steps. First, LDA extracts hidden subjects from documents and provides summary details of contents with a minimum of human intervention. Secondly, it offers the possibility of isolating relevant information and documents via a keyword search. The evaluation of the LDA was carried out by using the Real Data Corpus (RDC); the performance of the LDA was also tested in comparison with current regular expression search techniques in three areas: retrieving information from important documents, arranging and subdividing the retrieved information, and analysing overlapping topics. Their results show that the LDA technique can be used to help filter noise, isolate relevant documents, and produce results with a higher relevance. However, the processing speed of the LDA is extremely slow (i.e. around 8 hours) in comparison with existing regular expression techniques (i.e. approximately one minute). Also, only a selection of keywords that were likely contained within the target documents was tested.

In another effort to link deep learning applications and big data analytics, Najafabadi et al. (2015) reported that deep learning algorithms were used to extract high-level abstractions in data. They explained that, due to the nature of big data, deep learning algorithms could be used for analysis and learning from a massive amount of unsupervised data, which helped to solve specific problems in big data analytics. However, deep learning still has problems in learning from streaming data, and has difficulty in dealing with high-dimensional data and with distributed and parallel computing.

Data Clustering

Recently, data clustering has been studied and used in many areas, especially in data analysis. Regarding big data analysis, several data clustering algorithms and techniques have been proposed, and they are discussed below. Nassif and Hruschka (2011) and Gholap and Maral (2013) proposed forensic analysis approaches for computer systems through the application of clustering algorithms to discover useful information in documents. Both approaches consist of two phases: a pre-processing step (which is used for reducing dimensionality) and running clustering algorithms (i.e. K-means, K-medoids, Single Link, Complete Link, and Average Link). Their approaches were evaluated by using five different datasets seized from computers in real-world investigations. According to their results, both the Average Link and Complete Link algorithms gave the best results in determining relevant or irrelevant documents, whilst the K-means and K-medoids algorithms presented good results when there was suitable initialization. However, the scalability of clustering algorithms may be an issue because they assume independent data. From a similar perspective, Beebe and Liu (2014) carried out an examination of clustering digital forensics text string search output. Their study concentrated on realistic data heterogeneity and size. Four clustering techniques were evaluated: K-means, Kohonen Self-Organizing Map (SOM), LDA followed by K-means, and LDA followed by


SOM. Their experimental results show that LDA followed by K-means obtained the best performance: more than 6,000 relevant search hits were retrieved after reviewing less than 0.5% of the search hit results. In addition, when performed individually, both the K-means and SOM algorithms gave a poorer performance than when they were combined with LDA. However, this evaluation was conducted with only one synthetic case, which was small in size compared with real-world cases.
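The idea of clustering search hits so an examiner reviews groups rather than a flat list can be sketched as follows; this is an illustrative toy (invented hit strings, scikit-learn K-means over TF-IDF), not the evaluated systems.

```python
# Illustrative sketch: cluster text search hits so similar hits are reviewed
# together, rather than scanning one long undifferentiated hit list.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

hits = [
    "password sent to offshore account",
    "offshore account password reset",
    "meeting notes quarterly budget",
    "quarterly budget meeting agenda",
]

X = TfidfVectorizer().fit_transform(hits)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # hits 0/1 and 2/3 should fall into the same clusters
```

In the combined LDA+K-means pipeline the study found most effective, the LDA topic mixtures (rather than raw TF-IDF vectors) would be the clustering input.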

Data Reduction with Hash-sets

With the aim of dealing with an ever-growing amount of data in forensic investigations, many researchers have attempted to use hash sets and data reduction techniques to solve the data size problem. Roussev and Quates (2012) attempted to use similarity digests as a practical solution for content-based forensic triage, as the approach has been widely used in the identification of embedded evidence, identification of artefacts, and cross-target correlation. Their experiment was applied to the M57 case study, comprising 1.5 terabytes of raw data, including disk images, RAM snapshots, network captures and USB flash media. They were able to examine and correlate all the components of the dataset in approximately 40 minutes, whereas traditional manual correlation and examination methods may require a day or more to achieve the same result. Ruback et al. (2012) developed a method for determining uninteresting data in a digital investigation by using hash sets within a data-mining application that depends on data collected from a country or geographical region. This method uses three conventional known-hash databases for file filtration. Their experimental results show a reduction in known files of 30.69% in comparison with a conventional hash set, although their set contains only approximately 51.83% of the hash values of a conventional hash set.

Similarly, Rowe (2014) compared nine automated methods for eliminating uninteresting files during digital forensic investigations. By using a combination of file name, size, path, time of creation, and directory, a total of 8.4 million hash values of uninteresting files were created, and the hashes could be reused across different cases. Using an 83.8-million-file international corpus, 54.7% of files were eliminated as they were matched by two of the nine methods. In addition, the false negative and false positive rates of their approach were 0.1% and 19% respectively. In the same context, Dash and Campus (2014) also proposed an approach that uses five methods to eliminate unrelated files for faster processing of large forensic datasets during an investigation. They tested the approach with different volumes of data collected from various operating systems. After applying the signatures within the National Software Reference Library Reference Data Set (NSRL-RDS) database, an additional 2.37% and 3.4% of unrelated files were eliminated from Windows and Linux operating systems respectively by using their proposed five methods.
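The common mechanism behind these hash-set approaches can be sketched in a few lines; the file names, contents and "known" set below are invented stand-ins for an NSRL-style reference database.

```python
# Minimal sketch of NSRL-style known-file filtering: files whose hashes appear
# in a known-benign set are dropped before examination.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# A "known files" hash set, e.g. built from the NSRL RDS (illustrative values).
known = {sha256(b"standard OS binary"), sha256(b"stock wallpaper")}

seized = {
    "system32/kernel.bin": b"standard OS binary",
    "pictures/beach.jpg":  b"stock wallpaper",
    "docs/plan.txt":       b"meet at midnight",
}

# Keep only files whose hash is NOT in the known set.
remaining = {p: d for p, d in seized.items() if sha256(d) not in known}
print(sorted(remaining))  # only docs/plan.txt survives reduction
```

Real deployments hash whole files (or blocks) streamed from the forensic image and match against databases with millions of entries, but the filtering logic is the same.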

Heterogeneous Data and Resources

The development of information technology and the increasing use of sources that run in different environments have led to difficulties in processing and exchanging data across different platforms. However, a number of researchers have suggested potential solutions for the problem of the heterogeneity of data and resources. Zhenyou et al. (2011) studied the nature of heterogeneous databases and the integration between nodes in distributed heterogeneous databases. They suggested the use of Hibernate technology and query


optimization strategy, which together have the capability to link multi-heterogeneous database systems. Furthermore, Liu et al. (2010) proposed a framework based on middleware technology for integrating heterogeneous data resources that come from various bioinformatics databases. They explained that middleware is independent software that works with distributed processing, sitting between different platforms such as heterogeneous source systems and applications. Their system used XML to solve the data structure heterogeneity issues that arise when describing data from different heterogeneous resources, while an ontology was used to solve the semantic heterogeneity problem. The key benefit of this system is that it provides a unified application for users.

Mezghani et al. (2015) proposed a generic architecture for the analysis of heterogeneous big data that comes from different wearable devices, based on the Knowledge as a Service (KaS) approach. This architecture extended the NIST big data model with a semantic method of generating understanding and valuable information by correlating big heterogeneous medical data (Mezghani et al., 2015). This was achieved by using a Wearable Healthcare Ontology, which aids the aggregation of heterogeneous data, supports data sharing, and extracts knowledge for better decision-making. Their approach was presented with a patient-centric prototype in a diabetes scenario, and it demonstrated the ability to handle data heterogeneity. However, the research tended to focus on heterogeneity rather than on security and privacy during data aggregation and transmission. In the context of heterogeneity, Zuech et al. (2015) reviewed the available literature on intrusion detection within big heterogeneous data. Their study sought to address the challenges of heterogeneity within big data and suggested some potential solutions, such as data fusion. Data fusion is a technique for integrating data from different sources that commonly have different structures. Above all, they suggested that big heterogeneous data still presents many challenges in the form of cyber security threats and that data fusion has not been widely used in cyber security analysis.

Data Correlation

Although there has already been some work on data correlation in digital forensics in order to detect the relationships between evidence from multiple sources, there is a need for further research in this direction. Garfinkel (2006) proposed a new approach that uses Forensic Feature Extraction (FFE) and Cross Drive Analysis (CDA) to extract, analyse and correlate data over many disk images. FFE is used to identify and extract certain features from digital media, such as credit card numbers and email message IDs. CDA is utilised for the analysis and correlation of datasets that span multiple drives. Their architecture was used to analyse 750 images of devices containing confidential financial records and interesting emails. However, the practical techniques of multi-drive correlation and multi-drive analysis require improvements to their performance in order to be used with large datasets.
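Feature extraction of the kind FFE performs is essentially pattern matching over raw media; the sketch below illustrates the idea with invented regular expressions and sample text, and is not Garfinkel's implementation.

```python
# Hedged sketch in the spirit of FFE: pull candidate features (email addresses
# and Message-ID headers) out of raw text with regular expressions.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MSG_ID = re.compile(r"Message-ID:\s*<([^>]+)>", re.IGNORECASE)

blob = (
    "From: alice@example.org\n"
    "Message-ID: <12345.abcde@mail.example.org>\n"
    "please wire the funds to bob@example.net today"
)

emails = EMAIL.findall(blob)    # all email-like strings in the blob
msg_ids = MSG_ID.findall(blob)  # RFC 2822-style message identifiers
print(emails, msg_ids)
```

Cross-drive analysis then correlates such extracted features: the same email address or message ID appearing on two different disk images links those images to one another.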

In another experiment that sought to perform forensic analysis and correlation of computer systems, Case et al. (2008) presented two contributions to assist the investigator in "connecting the dots." First, they developed a tool called Ramparser, which is used to perform a deep analysis of Linux memory dumps. The investigator uses this tool to obtain detailed output about all processes that take place in memory. Secondly, they proposed a Forensics Automated Correlation Engine (FACE), which is used to discover evidence automatically and to make a


correlation between them. FACE provides automated parsing over five main objects, namely memory images, network traces, disk images, log files, and user accounting and configuration files. FACE was evaluated with a hypothetical scenario, and the application was successful; however, these approaches only work with small volumes of data and have not been tested on big and heterogeneous data. In addition, the correlation capabilities could be extended beyond the existing methods by adding statistical ones.

Raghavan et al. (2009) also proposed a four-layer Forensic Integration Architecture (FIA) to integrate evidence from multiple sources. The first layer (i.e. the evidence storage and access layer) provides a binary abstraction of all data acquired during the investigation, whilst the second layer (i.e. the representation and interpretation layer) has the capability to support various operating systems, system logs and mobile devices. The third layer (i.e. the meta-information layer) provides interface applications to facilitate metadata extraction from files. The fourth layer (i.e. the evidence composition and visualization layer) is responsible for integrating and correlating information from multiple sources, and these combined sources can serve as comprehensive evidentiary information to be presented to a detective. As the FIA architecture was merely conceptualised via a car theft case study, further investigation would be required to evaluate its practicality.

Summary

As demonstrated above, existing studies have each attempted to cope with only a specific issue within the big data domain, including volume, complex interdependence across content, and heterogeneity. From the perspective of the volume of big data, the current digital forensics tools have failed to keep pace with the increase. For that reason, a number of technologies, such as data clustering and data reduction, which have the potential capacity to save digital investigators time and effort, were examined. Data clustering techniques have been widely used to eliminate uninteresting files and thus speed up the investigation process by determining relevant information more quickly. So far, these techniques can be applied to large volumes of data (in comparison with traditional forensic images) but are not suitable for big data.

Regarding heterogeneity, only a few studies are available, and they were mainly focused on the integration of heterogeneous databases of much smaller sizes (e.g. hard disks, mobile devices, or memory dumps). Integration technology based on ontology techniques offers promising prospects, although it has not been tested within forensic investigations involving big data heterogeneity. Similarly, only limited research on data correlation has been conducted, despite data correlation offering a potential solution to heterogeneous data issues; and these issues have yet to be resolved, particularly those related to big data. As a result, big data analytics in the context of forensics stands in need of a comprehensive framework that can handle the existing issues of volume, variety and heterogeneity of data.

METADATA AND DIGITAL FORENSICS

Metadata describes the attributes of files or applications in most digital resources (Guptill, 1999). It provides accuracy, logical consistency, and coherence for the files or applications it describes. Semantic search by metadata is one of the important functions for reducing noise during information searching (Raghavan and Raghavan, 2014). A number of metadata types exist and provide attributes which are important in forensic processes, as shown in Table 1.


These attributes belong to different types of metadata, such as those from file systems, event logs, applications, and documents. For instance, file system metadata provides file summary information that describes the layout and attributes of regular files and directories, aiding the control and retrieval of those files (Buchholz and Spafford, 2004); event log metadata provides significant information that can be used for event reconstruction (Vaarandi, 2005).
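As a concrete illustration of the file system attributes discussed here (size and MAC timestamps), the Python standard library already exposes them via `os.stat`; this sketch creates a throwaway file purely so the call has something to inspect.

```python
# Sketch: pulling basic file-system metadata (size, modified/accessed times)
# for a single file, as a forensic metadata extractor would per file.
import os
import tempfile
from datetime import datetime, timezone

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"evidence sample")
    path = f.name

st = os.stat(path)
record = {
    "filename": os.path.basename(path),
    "size": st.st_size,  # File Size attribute from Table 1
    "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    "accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
}
print(record["size"])
os.unlink(path)
```

NTFS-specific attributes such as the directory flag or file status require parsing the file system structures themselves (e.g. the MFT) rather than the portable `stat` interface.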

A document type definition has been introduced as email metadata in Extensible Markup Language (XML), which holds content-feature keywords about an email (Sharma et al., 2008). A number of research studies employ email metadata in order to facilitate dealing with email lists, such as filtration, organization, and sorting based upon reading status and senders (Fisher et al., 2007). As a result, some research considers metadata as an evidentiary basis for the forensic investigation process because it describes either physical or electronic resources (Khan, 2008; Raghavan and Raghavan, 2014).

Metadata helps to identify associated artefacts that can be used to investigate and verify fraud, abuse, and many other types of cybercrimes. Indeed, Raghavan and Raghavan (2014) proposed a method to identify the relations between evidence artefacts in a digital investigation by using metadata; their method was applied to find metadata associations across collections of image files and word processing documents.

Rowe and Garfinkel (2012) developed a tool (i.e. Dirim) to automatically determine anomalous or suspicious files in a large corpus by analysing the directory metadata of files (e.g. the filename, extensions, paths and size) via a comparison with predefined semantic groups and a comparison between file clusters. Their experiment was conducted on a corpus consisting of 1,467 drive images with 8,673,012 files. The Dirim approach found 6,983 suspicious files based on their extensions and 3,962 suspicious files according to their paths. However, the main challenge with this approach is its ability to find hidden data. Also, it is not effective at detecting the use of anti-forensics.

Therefore, these studies illustrate that metadata parameters can be used by forensic tools for investigation purposes.


Table 1
Some Input Metadata Parameters for Forensics Tools

Source of Input                           S/No  Input Parameter                    Data Type
Log File Entries (Security Logs)           1    Event ID                           Integer
                                           2    User Name                          String
                                           3    Date Generated                     Date
                                           4    Time Generated                     Time
                                           5    Machine (Computer Name)            String
File System Metadata Structures (NTFS)     6    Modification Time                  Date & Time
                                           7    Access Time                        Date & Time
                                           8    Creation Time                      Date & Time
                                           9    File Size                          Integer
                                          10    Directory Flag (File/Folder)       Boolean
                                          11    Filename                           String
                                          12    File Type                          String
                                          13    Path                               String
                                          14    File Status (Active, Hidden etc.)  Enumeration
                                          15    File Links                         Integer
Registry Information                      16    Key Name                           String
                                          17    Key Path                           String
                                          18    Key Type                           String
                                          19    Key Data / Value                   String or Integer
Application Logs                          20    Name                               String
                                          21    Version No.                        Integer
                                          22    Timestamp                          Date & Time
Network Packet                            23    Packet Length                      Integer
                                          24    Class                              String
                                          25    Priority                           String
                                          26    Source IP                          Integer
                                          27    Destination IP                     Integer

SEMANTIC WEB-BASED FRAMEWORK FOR METADATA FORENSIC EXAMINATION AND ANALYSIS

The proposed system seeks to provide an automated forensic examination and analysis framework for big, multi-sourced, heterogeneous data. An overview of the proposed framework is illustrated in Figure 1, with three layers: data acquisition, data examination, and data analysis.

Data Acquisition and Preservation

This layer is used to image all available suspect resources within the same case. It includes creating a bit-by-bit perfect copy of each digital media resource (e.g. hard


disk and mobile) and storing the images in secure storage. The preservation process is used to ensure that original artefacts will be preserved in a reliable, complete, accurate, and verifiable way. This preservation is achieved by using hash functions to verify the integrity of all evidence.
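The hash-based preservation step amounts to recording a cryptographic digest at acquisition time and re-hashing later; a minimal sketch (with a zero-filled byte string standing in for the disk image) follows.

```python
# Sketch of preservation: record a digest at acquisition time, then re-hash
# later to demonstrate the image is unaltered.
import hashlib

image = b"\x00" * 4096  # stand-in for a bit-by-bit disk image
acquisition_hash = hashlib.sha256(image).hexdigest()

# ... image is stored, copied, examined ...

# Any change to even one bit of the image would change the digest.
assert hashlib.sha256(image).hexdigest() == acquisition_hash
print("integrity verified")
```

In practice the acquisition hash is recorded in the chain-of-custody documentation, and verification is repeated whenever the image is handled.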

Data Examination

The examination phase is the core of the proposed system. In this layer, a number of techniques are employed to achieve the examination goal, including data filtering and reduction, metadata extraction, the creation of XML files to represent the metadata, and semantic web technology to work with heterogeneous data. Details of these techniques are presented below.

Data Reduction

The data reduction step has been used widely with a variety of digital forensic approaches and has provided a significant reduction in data storage and archive requirements (Quick and Choo, 2014). In addition, most forensic tools and investigating agencies use a hashing library to compare the files examined in suspect cases against known files, separating relevant files from benign files; as a result, the uninteresting files will not be examined, and the investigator's time and effort will be saved.

Metadata Extraction

This step provides a metadata extractor appropriate to the source type, including hard disks, log files, network packets, and email. In order to extract metadata, the metadata extractor determines the suitable metadata that should be extracted. For example, it extracts information that carries useful meaning from hard disk files, while attributes of log files and network packets can be used as metadata to reconstruct events. The output is structured information, which makes the data in the digital resources easier to retrieve.

XML File Creation

XML is a method for describing structured information, which makes it more readable. After the metadata are extracted, an XML file is created to hold the metadata for each file. As a result, the output of this step is a metadata forensic image based on an XML file. The use of metadata helps to solve the big data volume issue because the size of the XML file, which represents the metadata of the original file, is much smaller than the original file.

Metadata Repository

The repository has the capability to support data from the multiple sources of digital evidence that relate to one case (i.e. metadata forensic images from various resources). It is used as a warehouse for the next step, feeding the semantic web technology with data.

Ontology-Based Forensic Examination

The major aim of this step is to integrate the heterogeneous resources that are located in the metadata repository by using techniques based upon semantic web technology. The semantic web enables a machine to extract the meaning of web content and retrieve meaningful results with little or no ambiguity. Additionally, semantic web technology has not only contributed to developing web applications, but also provides a typical way to share and reuse data in various sectors (Alzaabi et al., 2013). It also provides a solution for various difficult tasks, such as searching for and locating specific information in big heterogeneous resources (Fensel et al., 2002). However, the strength of the semantic web relies on the ability to interpret the relationships between heterogeneous resources.

JDFSL V11N2 An Automated Approach for Digital Forensic Analysis of…

Page 146 © 2016 ADFSL

The core of the semantic web is a model known as an ontology, which is built to share a common understanding of knowledge and to reuse it by linking many resources together so that they can be retrieved semantically (Allemang and Hendler, 2011). Some modelling languages are introduced by the semantic web standards, such as the Resource Description Framework Schema (RDFS) and the Web Ontology Language (OWL). The first layer of the semantic web is based on the XML schema, ensuring that a common syntax and resource metadata exist in a structured format; this is already achieved in the previous step of the proposed system.

The second layer uses RDF to represent information about resources in graph form. RDF relies on triples that form a graph of data; each triple consists of three elements, namely a subject, a predicate, and an object. The standard syntax for serialising RDF is XML, in the RDF/XML form. The third layer is OWL, which was introduced to overcome some limitations of RDF/RDFS: it is derived from description logics and offers more constructs than RDFS. In other words, the use of RDF/XML and OWL makes it easier to search semantically for related artefacts across multiple resources.
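The triple model can be illustrated with plain tuples (in practice a library such as rdflib and RDF/XML serialisation would be used); the `case:` and `forensic:` terms are invented for this sketch.

```python
# Each RDF statement is a (subject, predicate, object) triple;
# a graph is simply a set of such triples.
triples = {
    ("case:email_42", "rdf:type", "forensic:Email"),
    ("case:email_42", "forensic:sender", "mailto:[email protected]"),
    ("case:email_42", "forensic:attachment", "case:doc_7"),
    ("case:doc_7", "rdf:type", "forensic:Document"),
}

def objects(graph, subject, predicate):
    """Follow one edge of the graph: fixed subject and predicate, free object."""
    return {o for s, p, o in graph if s == subject and p == predicate}
```

Because the same identifier (`case:doc_7`) can appear as the object of one triple and the subject of another, artefacts from different resources become linked nodes in a single graph, which is what makes cross-resource retrieval possible.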


The purpose of this layer is also to extract concepts from different files and to determine, based on the ontology, to which class each concept belongs. Examples of concepts that can be obtained from an email are an email address and a Word attachment, which belong to the contact class and the document class respectively. Each email also holds sender and receiver information and a timestamp; this information can be linked to email contact instances, so the contact class gains two instances with the same timestamp.
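The concept-to-class assignment could look like the following sketch; the mapping table and class names are hypothetical stand-ins for the ontology layer, not definitions from the paper.

```python
# Hypothetical concept-to-class table standing in for the ontology layer.
ONTOLOGY_CLASSES = {
    "email_address": "forensic:Contact",
    "word_attachment": "forensic:Document",
    "ip_address": "forensic:Host",
}

def classify(concept_kind):
    """Assign an extracted concept to its ontology class, with a generic fallback."""
    return ONTOLOGY_CLASSES.get(concept_kind, "forensic:Artefact")
```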

After the ontology has been built and the semantic web technology is ready to use, a search engine will be designed based on the

Figure 1. Semantic Web-Based Framework for Metadata Forensic Examination and Analysis


ontology. The search engine is used to retrieve related artefacts based on the investigator's query.
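In practice such queries would be expressed in SPARQL against the ontology; the following is only a toy pattern-matcher over triples to show the idea, with all identifiers invented for the example.

```python
# Terms beginning with "?" are wildcards, as in a SPARQL triple pattern.
graph = {
    ("case:email_42", "forensic:sender", "mailto:[email protected]"),
    ("case:doc_7", "forensic:createdBy", "mailto:[email protected]"),
    ("case:doc_7", "rdf:type", "forensic:Document"),
}

def query(graph, pattern):
    """Return one variable binding per triple matching the pattern."""
    matches = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break  # fixed term mismatch: this triple does not match
        else:
            matches.append(binding)
    return matches

# Investigator's question: "which artefacts involve this address, and how?"
hits = query(graph, ("?artefact", "?relation", "mailto:[email protected]"))
```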

Data Analysis

The major aim of this layer is to answer the essential questions of an investigation: what, who, why, how, when and where. In addition, it finds the relations between artefacts in order to reconstruct the event.

Automated Artefacts Identification

The output from the previous layer is a set of artefacts related to the incident from heterogeneous resources. In this layer, artificial intelligence techniques will be used to find the associations between these artefacts and to reconstruct the event on that basis. In order to determine the evidential artefacts, all artefacts output by the previous layer that were created, accessed or modified close to the time of the reported incident under investigation should be analysed. However, generating a timeline across multiple resources poses various challenges (e.g. timestamp interpretation); it is therefore necessary to build a unified timeline across the heterogeneous resources, and tools will be used to cope with these issues. The answers to the questions raised during the forensic analysis will then be provided. Preliminary research undertaken by Al Fahdi (2016) has shown that automated artefact identification is an extremely promising approach during the forensic analysis phase.
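The timestamp-interpretation problem mentioned above can be illustrated as follows: each source records time in its own format and offset, so every timestamp is normalised to UTC before sorting. The formats, offsets, and artefact labels are assumptions for this sketch.

```python
from datetime import datetime, timedelta, timezone

def to_utc(raw, fmt, utc_offset_minutes=0):
    """Parse a source-specific timestamp and shift it onto a single UTC timeline."""
    local = datetime.strptime(raw, fmt)
    return (local - timedelta(minutes=utc_offset_minutes)).replace(tzinfo=timezone.utc)

# Two hypothetical artefacts whose sources record time differently:
# a file modified at 14:05 local time (UTC+1) and a web-log hit at 13:05 UTC.
disk_time = to_utc("2016-03-01 14:05:00", "%Y-%m-%d %H:%M:%S", utc_offset_minutes=60)
log_time = to_utc("01/Mar/2016:13:05:00", "%d/%b/%Y:%H:%M:%S")

# Once normalised, events from both resources can share one ordered timeline.
timeline = sorted([("hard_disk", disk_time), ("web_log", log_time)], key=lambda e: e[1])
```

Here the two records turn out to describe the same instant, the kind of correlation that is invisible while each source keeps its own clock convention.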

DISCUSSION

A number of studies in the literature present comprehensive surveys of existing work in forensic analysis, covering different types and sizes of data from various fields and showing significant increases in the volume of data and the amount of digital evidence needing to be

analysed in investigations. There is therefore a need to abandon or modify well-established tenets and processes. A number of solutions have already been suggested to cope with these issues; however, few researchers have proposed technical solutions that mitigate the challenges in a holistic way. Although data clustering and data reduction techniques are reasonable responses to these challenges, there are few studies in this regard, and there is a growing need to combine such solutions in a comprehensive framework so that all the issues can be dealt with together. As a result, the semantic web-based framework for metadata forensic examination and analysis is proposed to identify potential evidence across multiple resources and to support the analysis. To achieve acquisition and preservation, various techniques will be used depending on the resource type. For example, dead acquisition can be used to collect data from certain types of resources (e.g. hard disks and mobile devices) but not from others (e.g. network traffic and online databases), which therefore require other effective techniques to obtain the required information.

The second layer aims to carry out the examination phase properly by applying a number of processes. Digital forensic data reduction is the first process described in this layer; it reduces the volume of data for pre-processing by determining potentially relevant data without significant human interaction. Metadata extraction is an essential process in this framework because all later processes depend upon the extracted metadata, which will also be used to reconstruct past events. Moreover, using XML files to represent metadata helps during the semantic web processing, and the metadata repository is beneficial for gathering all metadata from


heterogeneous sources. Accordingly, semantic web technology will integrate the metadata held in the repository by using the ontology, in order to retrieve the associated metadata. After that, a variety of artificial intelligence and analysis methodologies will be applied to obtain potential evidence in a meaningful way that can be used in court. It has therefore been decided to implement this framework to overcome the three issues of volume, heterogeneity, and cognitive burden.

CONCLUSION

The proposed framework aims to integrate big data from heterogeneous sources and to perform automated analysis of the data, which tackles a number of key challenges that exist today.

Achieving and validating the approach requires future research. This effort will be focused upon the development of experiments to evaluate the feasibility of utilising metadata and semantic web based ontologies to solve the problems of big data and heterogeneous data sources respectively. The use of ontology-based forensics provides a semantically rich environment to facilitate evidence examination. Further experiments will also be conducted to evaluate the use of machine learning to identify and correlate relevant evidence. A number of scenario-based evaluations involving a range of stakeholders will take place to assess the effectiveness of the proposed approach.


REFERENCES

Allemang, D., & Hendler, J. (2011). Semantic web for the working ontologist: Effective modeling in RDFS and OWL. Elsevier.

Alzaabi, M., Jones, A., & Martin, T. A. (2013). An ontology-based forensic analysis tool. Paper presented at the Proceedings of the Conference on Digital Forensics, Security and Law.

Al Fahdi, M. (2016). Automated digital forensics & computer crime profiling. PhD thesis, Plymouth University.

Ayers, D. (2009). A second generation computer forensic analysis system. Digital Investigation, 6, S34-S42.

Benredjem, D. (2007). Contributions to cyber-forensics: Processes and e-mail analysis. Concordia University.

Beebe, N. L., & Liu, L. (2014). Clustering digital forensic string search output. Digital Investigation, 11(4), 314-322.

Buchholz, F., & Spafford, E. (2004). On the role of file system metadata in digital forensics. Digital Investigation, 1(4), 298-309.

Case, A., Cristina, A., Marziale, L., Richard, G. G., & Roussev, V. (2008). FACE: Automated digital evidence discovery and correlation. Digital Investigation, 5, S65-S75.

Casey, E. (2011). Digital evidence and computer crime: Forensic science, computers, and the internet. Academic Press.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.

Cheng, X., Hu, C., Li, Y., Lin, W., & Zuo, H. (2013). Data evolution analysis of virtual DataSpace for managing the big data lifecycle. Paper presented at the Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International.

da Cruz Nassif, L. F., & Hruschka, E. R. (2011). Document clustering for forensic computing: An approach for improving computer inspection. Paper presented at the Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on.

Dash, P., & Campus, C. (2014). Fast processing of large (big) forensics data. Retrieved from http://www.idrbt.ac.in/PDFs/PT%20Reports/2014/Pritam%20Dash_Fast%20Processing%20of%20Large%20(Big)%20Forensics%20Data.pdf

Fensel, D., Bussler, C., Ding, Y., Kartseva, V., Klein, M., Korotkiy, M., . . . Siebes, R. (2002). Semantic web application areas. Paper presented at the NLDB Workshop.

Fisher, D., Brush, A., Hogan, B., Smith, M., & Jacobs, A. (2007). Using social metadata in email triage: Lessons from the field. In Human Interface and the Management of Information: Interacting in Information Environments (pp. 13-22). Springer.

Garfinkel, S. L. (2006). Forensic feature extraction and cross-drive analysis. Digital Investigation, 3, 71-81.


Gholap, P., & Maral, V. (2013). Information retrieval of K-means clustering for forensic analysis. International Journal of Science and Research (IJSR).

Kataria, M., & Mittal, M. P. (2014). Big data: A review. International Journal of Computer Science and Mobile Computing, 3(7), 106-110.

Khan, M. N. A. (2008). Digital forensics using machine learning methods. PhD thesis, University of Sussex, UK.

Li, H., & Lu, X. (2014). Challenges and trends of big data analytics. Paper presented at the P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2014 Ninth International Conference on. IEEE, 566-567.

Liu, Y., Liu, X., & Yang, L. (2010). Analysis and design of heterogeneous bioinformatics database integration system based on middleware. Paper presented at the Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on.

Mezghani, E., Exposito, E., Drira, K., Da Silveira, M., & Pruski, C. (2015). A semantic big data platform for integrating heterogeneous wearable data in healthcare. Journal of Medical Systems, 39(12), 1-8.

Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1-21.

Noel, G. E., & Peterson, G. L. (2014). Applicability of Latent Dirichlet Allocation to multi-disk search. Digital Investigation, 11(1), 43-56.

Palmer, G. (2001). A road map for digital forensic research. Paper presented at the First Digital Forensic Research Workshop, Utica, New York.

Patrascu, A., & Patriciu, V.-V. (2013). Beyond digital forensics: A cloud computing perspective over incident response and reporting. Paper presented at the Applied Computational Intelligence and Informatics (SACI), 2013 IEEE 8th International Symposium on.

Quick, D., & Choo, K.-K. R. (2014). Data reduction and data mining framework for digital forensic evidence: Storage, intelligence, review and archive. Trends & Issues in Crime and Criminal Justice, 480, 1-11.

Raghavan, S. (2014). A framework for identifying associations in digital evidence using metadata.

Raghavan, S., Clark, A., & Mohay, G. (2009). FIA: An open forensic integration architecture for composing digital evidence. In Forensics in Telecommunications, Information and Multimedia (pp. 83-94). Springer.

Raghavan, S., & Raghavan, S. (2014). Eliciting file relationships using metadata based associations for digital forensics. CSI Transactions on ICT, 2(1), 49-64.

Roussev, V., & Quates, C. (2012). Content triage with similarity digests: The M57 case study. Digital Investigation, 9, S60-S68.

Rowe, N. C. (2014). Identifying forensically uninteresting files using a large corpus. In Digital Forensics and Cyber Crime (pp. 86-101). Springer.

Rowe, N. C., & Garfinkel, S. L. (2012). Finding anomalous and suspicious files from directory metadata on a large corpus. In Digital Forensics and Cyber Crime (pp. 115-130). Springer.


Ruback, M., Hoelz, B., & Ralha, C. (2012). A new approach for creating forensic hashsets. In Advances in Digital Forensics VIII (pp. 83-97). Springer.

Shang, W., Jiang, Z. M., Hemmati, H., Adams, B., Hassan, A. E., & Martin, P. (2013). Assisting developers of big data analytics applications when deploying on Hadoop clouds. Paper presented at the Proceedings of the 2013 International Conference on Software Engineering.

Sharma, A., Chaudhary, B., & Gore, M. (2008). Metadata extraction from semi-structured email documents. Paper presented at the Computing in the Global Information Technology, 2008. ICCGI'08. The Third International Multi-Conference on.

Vaarandi, R. (2005). Tools and techniques for event log analysis. Tallinn University of Technology Press.

Xu, X., Yang, Z.-q., Xiu, J.-p., & Chen, L. (2013). A big data acquisition engine based on rule engine. The Journal of China Universities of Posts and Telecommunications, 20, 45-49.

Zhenyou, Z., Jingjing, Z., Shu, L., & Zhi, C. (2011). Research on the integration and query optimization for the distributed heterogeneous database. Paper presented at the Computer Science and Network Technology (ICCSNT), 2011 International Conference on.

Zuech, R., Khoshgoftaar, T. M., & Wald, R. (2015). Intrusion detection and big heterogeneous data: A survey. Journal of Big Data, 2(1), 1-41.