
Chapter 8 Hash functions in digital forensics

In this chapter we describe the role of hash functions in digital forensics. Essentially hash functions are used for two main purposes: first, authenticity and integrity of digital traces are ensured by applying cryptographic hash functions. Second, hash functions identify known objects (e.g., illicit files).

Before we give details on their applications in IT forensics, we introduce the foundations of hash functions in Section 8.1. Then Section 8.2 describes the use case authenticity and integrity of digital traces. Finally, in Section 8.3 we explain the use case data reduction by identification of known digital objects.

8.1 Cryptographic hash functions and approximate matching

In this section we first introduce the general idea of a hash function and then turn to two different concepts: first, Section 8.1.1 discusses cryptographic hash functions, which originally come from cryptography to be used in the context of the security goals authenticity, integrity, and non-repudiation. Cryptographic hash functions are useful to uniquely identify an input bit string by its hash value. The second concept, on the other hand, is a rather new idea. It deals with the identification of similar input bit strings and is called approximate matching. We turn to approximate matching in Section 8.1.3.

A general hash function is simply a function which takes an arbitrarily large bit string as input and outputs a bit string of fixed size. If n ∈ N denotes the bit length of the output and if we denote as usual by {0,1}∗ the set of all bit strings, then a hash function h is a mapping

h : {0,1}∗ → {0,1}^n.    (8.1)

Typically the computation of a hash value is efficient, that is, fast in practice. These two properties are characteristic for a hash function and thus used for its definition (see e.g. [16]).

Definition 8.1: Hash function

Let n ∈ N be given. A hash function is a function which satisfies the following two properties:

1. Compression: h : {0,1}∗ → {0,1}^n.

2. Ease of computation: For all input bit strings bS ∈ {0,1}∗ computation of h(bS) is ‘fast’ in practice.

The output of the function h(bS) is referred to as a hash value, fingerprint, signature or digest.

Example 8.1

We look at two simple hash functions.

1. We set n = 1. For bS ∈ {0,1}∗ we simply define h(bS) by the least significant bit of bS, with the additional definition of h(∅) := 0 for the empty bit string ∅. For instance we have h(10101) = h(11) = h(1) = 1 and h(1000) = h(10) = h(0) = 0. Clearly this function satisfies both requirements from Definition 8.1.

2. We set n = 2. For bS ∈ {0,1}∗ we simply define h(bS) by bS mod 4, where bS is interpreted as a non-negative binary integer. Again, we set h(∅) := 0 for the empty bit string ∅. For instance we have h(10110) = h(10) = 10 = 2 and h(1000) = h(0) = 0. Again this function satisfies both requirements from Definition 8.1.
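Both toy functions translate directly into code. The following minimal Python sketch mirrors Example 8.1; bit strings are represented as ordinary strings of '0' and '1' characters, and the empty string plays the role of ∅.

import_nothing_needed = None  # standard Python only

def h1(bits: str) -> str:
    # n = 1: the least significant bit; the empty bit string maps to '0'.
    return bits[-1] if bits else "0"

def h2(bits: str) -> str:
    # n = 2: interpret the input as a binary integer and reduce it mod 4.
    return format(int(bits, 2) % 4, "02b") if bits else "00"

print(h1("10101"), h1("1000"))   # 1 0
print(h2("10110"), h2("1000"))   # 10 00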

8.1.1 Cryptographic hash functions

Hash functions are well-established in computer science for different purposes. Sample security applications of hash functions comprise storage of passwords (e.g., on Linux systems), electronic signatures (both MACs and asymmetric signatures), and whitelists/blacklists in digital forensics. Depending on the application, we have to impose further requirements.

For instance, in cryptography a hash value serves as a unique identifier for its input, e.g., in the context of a digital signature, where the hash value uniquely represents the input data. Clearly in theory each hash value possesses infinitely many preimages, that is, input bit strings which map to the given hash value. However, in practice it is not possible to compute such a preimage – the run time of the most efficient algorithm to find a preimage is too long. This property is called preimage resistance. Besides preimage resistance a cryptographic hash function satisfies two additional security requirements, which we list in Definition 8.2.

Definition 8.2: Cryptographic hash function

Let h : {0,1}∗ → {0,1}^n be a hash function. h is called a cryptographic hash function if it additionally satisfies the following security requirements:

1. Preimage resistance: Let a hash value H ∈ {0,1}^n be given. Then it is infeasible in practice to find an input (i.e., a bit string bS) with H = h(bS).

2. Second preimage resistance: Let a bit string bS1 ∈ {0,1}∗ be given. Then it is infeasible in practice to find a second bit string bS2 with bS1 ≠ bS2 and h(bS1) = h(bS2).

3. Collision resistance: It is infeasible in practice to find any two bit strings bS1, bS2 ∈ {0,1}∗ with bS1 ≠ bS2 and h(bS1) = h(bS2).

Clearly both hash functions from Example 8.1 are not cryptographic hash functions. For instance, we consider h from item 1 of Example 8.1. It is not preimage resistant, because given b ∈ {0,1} we simply take b as preimage and have h(b) = b, that is, finding preimages is trivial. The same obviously holds for second preimage resistance and collision resistance, respectively.

As we will see in this chapter, the IT forensic community adopted the use of cryptographic hash functions for two main purposes: ensuring authenticity and integrity of a digital trace and automatic file identification. In both cases, second preimage resistance is crucial, because the hash value of the input serves as a unique identifier for its preimage. If such an identifier is given and if we are able to find a preimage which is different from the actual input, both IT forensic use cases are corrupted.


If h is a hash function, then a necessary condition for h to be a cryptographic hash function is that the bit length n of its digest is sufficiently large. For preimage resistance and second preimage resistance we have to impose n ≥ 100, for collision resistance h has to satisfy n ≥ 200. Thus we recommend to make use of the stronger requirement and only apply hash functions with n ≥ 200. Sample cryptographic hash functions, which are used in digital forensics, are MD5 (n = 128), SHA-1 (n = 160) or hash functions from the SHA-2 family (e.g., SHA-256 (n = 256), [21]). For further details we refer to Table 8.1.

Name   MD5   SHA-1   SHA-256   SHA-512   RIPEMD-160
n      128   160     256       512       160

Table 8.1: Sample cryptographic hash functions.

One important implication of the security properties of a cryptographic hash function is the avalanche effect. If we change the input bit string, then every bit of the output is expected to change its value with probability 50%, i.e., we do not have any control over the output if the input changes. According to the avalanche effect, if only one single bit in the original input bit string bS is changed to get a tampered one bS′, the two outputs h(bS) and h(bS′) look ‘very’ different. We demonstrate the avalanche effect on the basis of similar ASCII strings in Example 8.2.

Example 8.2: Avalanche effect

We demonstrate the avalanche effect by applying SHA-256 to a simple ASCII string: in the first string, Wolfgang claims to give Angela 1 million EUR, while the amount changes slightly to 1 billion EUR in the second string. However, the respective SHA-256 hash values look very different.

$ echo 'Dear Angela, I give you 1 million EUR. Wolfgang' | sha256sum
cb10cfd3b6d47af94cd48c096c606ec8d2d836e80c7f87701ff450267efb4787 -
$ echo 'Dear Angela, I give you 1 billion EUR. Wolfgang' | sha256sum
8dc377ef008781d03278982928dc7235aff7ac06e39a523eb7fda9ad547f6c4e -

The Linux command echo prints the given string (including a subsequent new line character) to standard output. The Linux implementation of SHA-256, sha256sum, takes this string as input. The number of output characters of sha256sum is 256/4 = 64, because each group of 4 bits of the hash value is printed as one hexadecimal digit.
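The avalanche effect is also easy to observe programmatically. The following Python sketch recomputes the two digests with the standard library module hashlib and counts the differing output bits; the trailing newline mimics the one appended by echo in the session above.

import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    # Number of differing bits between two equal-length byte strings.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = hashlib.sha256(b"Dear Angela, I give you 1 million EUR. Wolfgang\n").digest()
d2 = hashlib.sha256(b"Dear Angela, I give you 1 billion EUR. Wolfgang\n").digest()

# Around half of the 256 output bits are expected to differ.
print(bit_diff(d1, d2), "of 256 bits differ")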

The avalanche effect is desirable in the context of unique identifiers or integrity of a trace, because it is easy to distinguish different input bit strings by comparing their respective hash values. However, the avalanche effect prevents the detection of similar objects. It is important to keep this property in mind for the two use cases of cryptographic hash functions in IT forensics.

8.1.2 Bloom filter

This section introduces Bloom filters, which are an important concept for approximate matching. Bloom filters are commonly used to represent elements of a finite set S. A Bloom filter is an array of m bits, initially all set to zero. In order to ‘insert’ an element s ∈ S into the filter, k independent hash functions h0, h1, ..., hk−1 are needed, where each hash function outputs a value between 0 and m−1. Next, s is hashed by all k hash functions. The bits of the Bloom filter at the positions h0(s), h1(s), ..., hk−1(s) are set to one.


To answer the question if s′ is in S, we compute h0(s′), h1(s′), ..., hk−1(s′) and analyse if the bits at the corresponding positions in the Bloom filter are set to one. If this holds, s′ is assumed to be in S; however, we may be wrong, as the bits may be set to one by different elements from S. Hence, Bloom filters suffer from a non-trivial false positive rate. Otherwise, if at least one bit is set to zero, we know that s′ ∉ S. It is obvious that the false negative rate is equal to zero.

In case of uniformly distributed data the probability that a certain bit is set to 1 by one hash function during the insertion of an element is 1/m, i.e., the probability that a bit is still 0 is 1 − 1/m. After inserting n elements into the Bloom filter, the probability of a given bit position to be one is 1 − (1 − 1/m)^(k·n). In order to have a false positive, all k array positions need to be set to one. Hence, the probability p for a false positive is

p = (1 − (1 − 1/m)^(k·n))^k ≈ (1 − e^(−kn/m))^k.    (8.2)
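A Bloom filter is only a few lines of code. The following minimal Python sketch derives the k positions from SHA-256 by salting the element with a counter byte; this construction is our own illustrative choice, not a prescribed one.

import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m          # array of m bits, initially all zero

    def _positions(self, item: bytes):
        # k 'independent' hash functions with values in 0 .. m-1.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: bytes) -> bool:
        # True means 'possibly in S' (false positive probability p as in
        # equation (8.2)); False means 'certainly not in S'.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.insert(b"blacklisted chunk")
print(bf.query(b"blacklisted chunk"))   # True
print(bf.query(b"some other chunk"))    # almost certainly False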

8.1.3 Approximate matching: the concept

Often it is useful in computer science to identify similar digital objects. Prominent use cases are spam detection, malware analysis, network-based anomaly detection, biometrics, or digital forensics.

We first remark that although similarity has a natural meaning for us, a formal definition is still missing. The corresponding NIST special publication draft 800-168 [22] only describes approximate matching in terms of use cases, terminology, and requirements. We therefore skip a definition, too.

The basic aim of approximate matching is to extend the yes/no outcome of a cryptographic hash function to a continuous one in the scope of automatic detection of a digital object. As explained in Section 8.1.1 a cryptographic hash function yields a binary decision ‘identical/differing’ for a comparison of two input bit strings: ‘identical’ is encoded for instance as the integer 1, ‘differing’ as the non-matching integer 0. The output of an approximate matching comparison, on the other hand, is a matching score in the interval [0,1], where 1 means a high level of similarity and 0 a low level.

The NIST draft 800-168 [22] mentions two use case classes of similarity with two challenges each. First, approximate matching aims at finding resemblance of two objects. The two challenges within this class are object similarity detection (e.g., different versions of a document) and cross correlation, i.e. finding digital artefacts which share a common object (e.g., two files sharing an identical picture). Second, approximate matching should detect containment. [22] lists the two according challenges fragment detection (e.g., identify a cluster of a deleted blacklisted file or an IP packet transferring a fragment of a classified document) and embedded object detection, i.e. finding an indexed trace within a digital artefact (e.g., a picture within an email).

The concept of approximate matching comprises two core functions: a similarity digest generation function and a similarity comparison function. In the terminology of [22] the first one is called the feature extraction function and the latter one is denoted as similarity function. We prefer our notation because it more obviously describes the goal of the respective function.

Given an input object, the similarity digest generation function identifies characteristic patterns within the given object. As usual these patterns are called features. The specification of an approximate matching algorithm therefore describes how to extract features from the given input. The set of all features is the output of the similarity digest generation function and called the similarity digest.

The similarity comparison function takes as input two similarity digests and outputs a match score in [0,1]. The closer the match score is to 1, the more similar the corresponding two inputs of the similarity digest generation function are considered to be.

As usual with noisy input, the user of approximate matching has to define a threshold to decide about similarity. As a consequence approximate matching suffers from the well-known error rates: the false match rate (FMR) describes the proportion of dissimilar objects falsely declared to match the compared object. On the other side the false non match rate (FNMR) describes the proportion of similar objects falsely declared to not match the compared object.
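The division of labour between the two core functions and the role of the threshold can be summarised in a small sketch. The interface below is hypothetical (the names digest, compare and matches are ours, not from [22]); it merely illustrates how a threshold turns the continuous match score into a decision and thereby trades FMR against FNMR.

from typing import Protocol

class ApproximateMatcher(Protocol):
    def digest(self, data: bytes) -> bytes:
        ...  # similarity digest generation function

    def compare(self, d1: bytes, d2: bytes) -> float:
        ...  # similarity comparison function: match score in [0,1]

def matches(m: ApproximateMatcher, a: bytes, b: bytes, threshold: float = 0.5) -> bool:
    # Raising the threshold lowers the FMR at the price of a higher FNMR.
    return m.compare(m.digest(a), m.digest(b)) >= threshold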

Similarity may be considered on different layers of abstraction. The NIST draft 800-168 [22] distinguishes three layers:

1. First, bytewise approximate matching takes a bit string as input for the similarity digest generation function without any high-layer interpretation of the string, that is, the features are extracted directly from the input bit string. Bytewise approximate matching is therefore a general approach and may be applied to any bit string. However, it assumes that similar artefacts, which are of interest for the digital forensic investigator, are represented by a similar bit string – or it fails within this use case. Bytewise approximate matching is often referred to as fuzzy hashing or similarity hashing.

2. Second, semantic approximate matching takes the interpretation of the application data into account and simulates the human similarity perception procedure. For instance, semantic approximate matching in the scope of pictures extracts the features from the visual perception of the picture rather than from its low-layer representation. Semantic approximate matching is often referred to as perceptual hashing or robust hashing.

3. Third, syntactic approximate matching is based on standardised internal structures of an artefact. For instance, within network packets a syntactic approximate matching algorithm may work on fields like source/destination MAC/IP addresses, ports, or protocols.

As bytewise and semantic approximate matching are useful for data reduction, we give more insights into these approaches in the subsequent sections. Breitinger et al. [5] provide an in-depth overview and we summarise and extend their key aspects in what follows.

8.1.4 Bytewise approximate matching

According to Breitinger et al. [5] there are seven bytewise approximate matching algorithms published by the digital forensic community. In this section we review the three main approaches of feature extraction, which seem to be the most promising ones.

The first feature extraction approach is used by the well-known bytewise approximate matching algorithms ssdeep (due to Kornblum [14]) and mrsh-v2 (due to Breitinger and Baier [3]). The similarity digest generation function subdivides the input byte stream (denoted as m) into chunks m1, m2, ... as depicted in Figure 8.1. The basic idea is that two digital artefacts are similar if they share a sufficient number of chunks.


Figure 8.1: Feature extraction of ssdeep and mrsh-v2

The end of a chunk mi (and thus the beginning of the subsequent chunk mi+1) is called a trigger point. Such a trigger point is found if the final r bytes before the trigger point meet a certain condition (typically r = 7, and these r bytes determine an integer value, which has to match a predefined value for triggering). Each chunk represents a feature of the input and the feature set is the sequence of chunks, i.e. the input byte stream is fully covered by the feature set.

To represent a feature, it is hashed by a hash function h (e.g., h is FNV¹ for ssdeep, h is MD5 for mrsh-v2) and its hash value is either represented by a Base64 character (ssdeep) or a Bloom filter (mrsh-v2). In case of ssdeep the similarity digest is a sequence of Base64 characters, in case of mrsh-v2 it is a sequence of Bloom filters. In Example 8.3 we compute the ssdeep similarity digest of the photo given in Figure 8.2.
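The following Python sketch illustrates this feature extraction under simplifying assumptions: the trigger condition interprets the last r = 7 bytes directly as an integer and recomputes the window value per position, whereas ssdeep and mrsh-v2 use a rolling hash that is updated in constant time per byte. As in mrsh-v2, each chunk is represented by its MD5 hash.

import hashlib

def chunk_features(data: bytes, block_size: int, r: int = 7) -> list[str]:
    features, start = [], 0
    for i in range(r, len(data) + 1):
        window_value = int.from_bytes(data[i - r:i], "big")
        if window_value % block_size == block_size - 1:   # trigger point
            features.append(hashlib.md5(data[start:i]).hexdigest())
            start = i
    if start < len(data):                 # trailing bytes form the last chunk
        features.append(hashlib.md5(data[start:]).hexdigest())
    return features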

Figure 8.2: Sample input hacker-siedlung.jpg of ssdeep

Example 8.3: Similarity digest computation of ssdeep

We compute the ssdeep similarity digest of the photo given in Figure 8.2.

$ ls -l hacker-siedlung.jpg
-rw------- 1 baier baier 78831 2015-05-15 10:16 hacker-siedlung.jpg

$ ssdeep -l hacker-siedlung.jpg
ssdeep,2.13--blocksize:hash:hash,filename
1536:ZfICsORJt2PazD7Z2xqHmqL36uuXtrHTXkkknIKB+W2pDHviF4eYySb:\
ZfICNRf2CD7YwGqL36FXVTXQnIWgDvi2,"hacker-siedlung.jpg"

¹ Fowler/Noll/Vo hash, http://www.isthe.com/chongo/tech/comp/fnv/index.html, retrieved 2015-05-22


We first look at the file size, which is 78831 bytes. Then we invoke ssdeep; its flag -l suppresses the whole path listing in the output of ssdeep. The output lists the block size, two parts of the similarity digest, and the file name, separated by colons.

The block size determines when a trigger point is found. It aims at splitting the input byte stream into approximately 64 chunks. It is always of the form 3·2^k, where k is the smallest value with 3·2^k·64 ≥ file size. In our example we have 78831/(3·64) = 410.6, thus k = 9 and the block size is 1536 = 3·2^9.

After the first colon, we get the first part of the ssdeep similarity digest corresponding to the block size 1536. It consists of Base64 characters, where the character Z represents the hash value h(m1), f the hash value h(m2), and b the hash value of the final chunk h(m55). After the second colon we see the second part of the ssdeep similarity digest corresponding to the block size 2·1536 = 3072. We expect approximately half of the chunks.
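The block size computation can be stated compactly; a minimal Python sketch:

def ssdeep_block_size(file_size: int) -> int:
    # Smallest value of the form 3 * 2**k with 3 * 2**k * 64 >= file size.
    block_size = 3
    while block_size * 64 < file_size:
        block_size *= 2
    return block_size

print(ssdeep_block_size(78831))   # 1536 = 3 * 2**9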

The second feature selection strategy is to extract statistically improbable features. This strategy is implemented by sdhash of Roussev [24]. The basic idea is that uncommon patterns serve as the baseline for similarity. A statistically improbable feature within sdhash is a sequence of 64 bytes with a high Shannon entropy, that is, a sufficiently large number of different bytes. The feature set of sdhash is the sequence of the statistically improbable features, which are represented by Bloom filters. There is a parallelised version available for use in large-scale investigations [25].
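A minimal Python sketch of entropy-based feature selection is given below. It uses non-overlapping 64-byte windows and an illustrative entropy threshold; the actual sdhash slides a window over the input and selects features by an entropy-based precedence ranking, so this is a simplification of the published algorithm.

from collections import Counter
from math import log2

def shannon_entropy(window: bytes) -> float:
    # Entropy in bits per byte; the maximum for byte data is 8.0.
    counts = Counter(window)
    return -sum(c / len(window) * log2(c / len(window)) for c in counts.values())

def improbable_features(data: bytes, size: int = 64, threshold: float = 5.5) -> list[bytes]:
    # Keep the windows whose entropy exceeds the (illustrative) threshold.
    return [data[i:i + size]
            for i in range(0, len(data) - size + 1, size)
            if shannon_entropy(data[i:i + size]) > threshold]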

The third feature selection strategy is based on a majority vote of bit appearance with a subsequent run length encoding. This approach is used by mvhash-B due to Breitinger et al. [4]. The majority vote step replaces each byte of the input byte string by either a 0x00 byte or a 0xFF byte. The mapping depends on the neighbourhood of the respective byte: if the number of 0 bits predominates in its neighbourhood, the byte is mapped to 0x00, otherwise it is mapped to 0xFF. Then run length encoding is used, where each sequence of identical bytes is replaced by its length. The basic idea of similarity is that predominating regions of a certain bit are characteristic for digital objects. The integers of the run length encoding are then inserted into Bloom filters. The similarity digest of mvhash-B is therefore a sequence of Bloom filters.
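The two processing steps of mvhash-B are easy to sketch in Python. The neighbourhood width radius is an illustrative parameter of our own choosing; the resulting run lengths would then be inserted into Bloom filters as described above.

def majority_vote(data: bytes, radius: int = 2) -> bytes:
    # Map each byte to 0x00 or 0xFF, depending on whether 0 bits or 1 bits
    # predominate among the byte and its neighbours.
    out = bytearray()
    for i in range(len(data)):
        nb = data[max(0, i - radius):i + radius + 1]
        ones = sum(bin(b).count("1") for b in nb)
        zeros = 8 * len(nb) - ones
        out.append(0x00 if zeros > ones else 0xFF)
    return bytes(out)

def run_lengths(data: bytes) -> list[int]:
    # Lengths of the maximal runs of identical bytes.
    runs, count = [], 1
    for prev, cur in zip(data, data[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    if data:
        runs.append(count)
    return runs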

8.1.5 Semantic approximate matching

As semantic approximate matching extracts perceptual features, it is bound to a certain area of applications, for instance images, audio streams or videos. Again Breitinger et al. [5] present an overview of semantic approximate matching algorithms in the context of pictures. This branch dates back to the early 1990s, when content-based image retrieval was an emerging research topic.

There are different feature classes which are used for image approximate matching. Breitinger et al. [5] mention histograms, low-frequency coefficients (e.g., from the discrete cosine transform), block bitmaps, or projection-based features. To get an idea of image approximate matching, we shortly explain a block bitmap approach used by the robust hashing algorithm rhash due to Steinebach [29].

The similarity digest generation process of rhash is depicted in Figure 8.3. The bit length of the rhash value is fixed in advance; as usual we denote it by n. In a first step, the input image is converted to greyscale and normalised (e.g., to a preset size, with respect to orientation). Then the normalised and greyscaled picture is subdivided into n disjoint blocks, which cover the image. For instance, if n is a square, then rhash subdivides the image into √n equally sized rows and √n equally sized columns. The sample in Figure 8.3 makes use of n = 256 = 16², that is, the input picture comprises 16 rows and 16 columns. Next, for each block i with 0 ≤ i ≤ n−1, rhash computes the mean of its pixel values. We denote the mean of the i-th block by Mi and the median of the sequence (Mi)0≤i≤n−1 by Md. Finally, block i contributes the bit hi to the rhash similarity digest, where hi = 0 if and only if Mi < Md. A sample rhash similarity digest is given on the right in Figure 8.3.

Figure 8.3: Similarity digest generation of rhash [29]
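A block bitmap digest in the spirit of rhash can be sketched in a few lines of Python. We assume the image is already greyscaled and normalised and is given as a 2D list of pixel values whose dimensions are divisible by the number of rows and columns; everything else follows the description above.

from statistics import median

def block_bitmap_hash(image: list[list[float]], rows: int = 16, cols: int = 16) -> list[int]:
    bh, bw = len(image) // rows, len(image[0]) // cols   # block height/width
    means = []
    for r in range(rows):
        for c in range(cols):
            block = [image[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            means.append(sum(block) / len(block))        # mean Mi of block i
    md = median(means)                                   # median Md
    return [0 if m < md else 1 for m in means]           # bits hi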

8.2 Authenticity and integrity of digital traces

In this section we look at the first use case of hash functions in digital forensics: ensuring authenticity and integrity of digital traces during the IT forensic process (e.g., during data acquisition). Remember that authenticity means that the origin of a digital trace is validated, while integrity describes the property that a digital trace did not change.

The use case ‘authenticity and integrity of digital traces’ is relevant for both dead and live analysis. We will focus on dead analysis in what follows (i.e., the digital forensic expert makes use of his own software), but we keep in mind that traces which are acquired from a live system (e.g., main memory) must be protected by hash values, too.

From Section 8.1 we know that cryptographic hash functions ensure integrity and authenticity by design due to their preimage and second preimage resistance (see Definition 8.2). For this reason the use case ‘authenticity and integrity of digital traces’ assumes the usage of cryptographic hash functions.

An important issue is that we have to protect the hash values against tampering. There are two alternatives to achieve this goal: first, the classical analogue approach is to write down the hash values by hand in the narrative minutes (e.g., in the investigation notebook). Then the hash values are protected by the assumption that it is impossible to forge the handwriting of the investigator. Second, the digital approach is to compute a digital signature over the hash values. This requires a private cryptographic key, which is related to the investigator. In this case the hash values are protected by the assumption that it is impossible to forge a digital signature.

We now discuss the use case ‘authenticity and integrity of digital traces’ by looking at the classical data acquisition process of a dead system. To sum up, the paradigm is to first generate a master copy from the original device (because the original device must be touched as little as possible). Then the master copy is bitwise copied to get the working copy. If we only perform read-only commands on the working copy, we later on must prove that the working copy did not change during the investigation (and hence any trace is directly extractable from the original device). The steps are as follows:

1. Compute hash value h1 over the whole original volume.

2. Write hash value h1 down in physical logbook.

3. Make a 1-to-1 copy of the volume using dd. This is the master copy of the original device.

4. Compute hash value h2 over the master copy.

5. Write hash value h2 down in physical logbook.

6. Compare h1 and h2: if both hash values match, the master copy is identical to the original device. Otherwise, we have to go back to step 3.

7. Generate a 1-to-1 copy of the master copy using dd. This is the working copy.

8. Compute hash value h3 over the working copy.

9. Write hash value h3 down in physical logbook.

10. Compare h2 and h3: if both hash values match, the working copy is identical to the master copy and thus to the original device, too. Otherwise, we have to go back to step 7.

11. Perform the investigation read-only on the working copy and extract digital traces.

12. To finish the investigation and to prove integrity of the working copy, compute the hash value h4 of the working copy after the investigation and check if h1 = h4 holds. If yes, any digital trace is directly related to the original device; otherwise the investigator has to identify the step where he changed the working copy.

We show how to apply this process on the basis of the well-known cryptographic library openssl in Example 8.4.

Example 8.4: Acquire first partition of an HDD

In Linux, storage media are typically identified by a device (that is, a ‘file’ in the directory /dev) starting with the two letters sd (historically for SCSI device) and a subsequent character to distinguish different devices. For instance the first HDD is referred to as /dev/sda, an attached USB stick is then mapped to /dev/sdb, an external SSD is identified as /dev/sdc, and so on.

In our example we assume that our HDD is the device /dev/sda. Then its first partition is identified by a digit following the device name (e.g., /dev/sda1), an extended partition may be the device /dev/sda5.

We apply the general acquisition process and compute the SHA-256 hash value of this partition. In this example, we make use of the openssl tool, because openssl is the most common implementation of cryptographic algorithms like hash functions, encryption or digital signatures. After invoking openssl we have to tell the tool which class of cryptographic algorithms we want to use. Cryptographic hash functions are identified by the digest command dgst. The remaining arguments are the chosen hash function (the flag -sha256) and the input bit string of the hash function (in our example the first partition of the HDD, /dev/sda1).


# openssl dgst -sha256 /dev/sda1
SHA256(/dev/sda1)= b9c028c604b5a1dfaf8acf0098e7f26de32fd4738c581d9b6cbc84c98b28f39b

# dd if=/dev/sda1 of=mastercopy-sda1.dd

# openssl dgst -sha256 mastercopy-sda1.dd
SHA256(mastercopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd4738c581d9b6cbc84c98b28f39b

As both hash values match, we generate the working copy and check the respective hash values.

# dd if=mastercopy-sda1.dd of=workingcopy-sda1.dd

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd4738c581d9b6cbc84c98b28f39b

Again both hash values match, that is, the working copy is bitwise identical to the first partition of our HDD. We next investigate the working copy read-only. In the last step we check that the working copy did not change during the processing, which we prove by applying SHA-256 to the respective image of the working copy after our investigation.

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd4738c581d9b6cbc84c98b28f39b

The hash value of the working copy after the investigation matches therespective hash value of /dev/sda1 and thus any digital trace from theworking copy is extractable from the partition, too.

If for some reason the final hash value does not match, the investigator has to carefully analyse his narrative minutes to find a step where he modified the working copy. An example of destroyed integrity is given in what follows:

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= df69b585b1a1af40b1c71d4fe9792fd1e843f8a2fe0c5c3a39aa205e652aabe4

8.3 Identification of known digital objects

An important issue in contemporary investigations of computer crime is handling the huge amount of data. The reason is that as of today information is stored and distributed in a digital rather than an analogue way. Low costs of storage devices and cheap unlimited access to the Internet support our ubiquitous use of digital devices. As a consequence a digital forensic investigation typically confronts the IT forensic experts with terabytes of data stored on different sorts of physical or virtual devices: a classical personal computer, a laptop, a tablet PC, a smartphone, a mail provider, a cloud service provider, to name only a few.

The terabytes of data can be seen as a big haystack, where the actual evidence of some megabytes has to be found, that is, the investigator’s task is to find the needle in the haystack. In this section we present concepts which automatically preprocess the terabytes of input data to support the investigator in proving or refuting a hypothesis. If we use the metaphor of ‘finding the needle in the haystack’, two concepts are obvious:

1. First, decreasing the haystack means to scale down the actual data which has to be inspected by the digital forensic expert. This concept is known as whitelisting or filtering out. Any object from the suspect’s drive which is indexed by the whitelist is not considered for further inspection. We discuss whitelisting in Section 8.3.1.

2. Second, increasing the needle means to find hints to suspicious data structures which actually support a certain hypothesis. These hints have to be confirmed manually by the investigator. This concept is known as blacklisting or filtering in. We discuss blacklisting in Section 8.3.2.

For both concepts, we need databases of irrelevant data (i.e. a whitelist) or incriminated files (i.e. a blacklist), respectively. The most common whitelist is the Reference Data Set (RDS) from the US-NIST National Software Reference Library (NSRL) [23]. The blacklist is case dependent (e.g., pictures of child abuse, classified documents).

The most common basic technology for indexing files is hash functions. The procedure is quite simple: for each object of the seized device (e.g., a file) calculate the corresponding digest and compare the respective fingerprint against a whitelist or blacklist, respectively. As of today cryptographic hash functions (e.g., SHA-1, SHA-256 [21]) are used. Cryptographic hash functions are very efficient and effective in detecting bitwise identical duplicates, but they fail in revealing similar objects. However, investigators are typically interested in automatic identification of similar objects, for instance to detect the correlation between a blacklisted picture of child abuse and its thumbnail, which was discovered on a seized device.

8.3.1 Whitelisting

A whitelist is an index of known-to-be-good objects, that is, of non-suspicious patterns. The concept of whitelisting is quite simple: any object from the suspect’s drive (typically an object is simply a file) which is indexed by the whitelist is not considered for further inspection. Therefore whitelisting is referred to as filtering out, too. In order to handle a whitelist with respect to memory, a compressed representation of each whitelisted object is used. Additionally, as whitelisted objects are not considered for further investigation, the false match rate (FMR) must be 0. Otherwise it would be possible for an attacker to filter out relevant digital traces. Therefore whitelists are based on cryptographic hash functions.

The most common whitelist is the Reference Data Set (RDS) from the US-NIST National Software Reference Library (NSRL) [23]. The RDS indexes files. Its website states²: The RDS is a collection of digital signatures of known, traceable software applications. There are application hash values in the hash set which may be considered malicious, i.e. steganography tools and hacking scripts. There are no hash values of illicit data, i.e. child abuse images.

² http://www.nsrl.nist.gov/, 2015-05-20


Example 8.5

We enumerate sample entries of the NSRL Reference Data Set.

$ less NSRLFile.txt

"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode""000000206738748EDD92C4E3D2E823896700F849","392126E756571EBF112CB1C1CDEDF926","EBD105A0",\"I05002T2.PFB",98865,3095,"WIN","""0000004DA6391F7F5D2F7FCCF36CEBDA60C6EA02","0E53C14A3E48D94FF596A2824307B492","AA6A7B16",\"00br2026.gif",2226,228,"WIN","""000000A9E47BD385A0A3685AA12C2DB6FD727A20","176308F27DD52890F013A3FD80F92E51","D749B562",\"femvo523.wav",42748,4887,"MacOSX","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\"J0180794.JPG",32768,18266,"358","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\"J0180794.JPG",32768,2322,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\"J0180794.JPG",32768,2575,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\"J0180794.JPG",32768,2583,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\"J0180794.JPG",32768,3271,"WIN","""00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\"J0180794.JPG",32768,3282,"UNK",""

We see that the image J0180794.JPG has a file size of 32768 bytes. It is listed six times, because the product code or the operating system code differ.

The RDS is updated four times a year. As of May 2015, the current release is RDS 2.48, which contains about 21 million unique files. Its size is about 6 GiB. As listed in Example 8.5, each entry of the RDS lists the SHA-1, MD5 and CRC32 checksum together with the file name and file size of the indexed file. The entries are ordered with respect to the numerical value of the SHA-1 hashes. Hence it is easy to decide if an input file is indexed by the RDS.
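Because the RDS is sorted by SHA-1 value, checking a file against the whitelist amounts to a hash computation followed by a binary search. A minimal Python sketch, assuming the SHA-1 column has been extracted from NSRLFile.txt into a sorted list of upper-case hex strings:

import hashlib
from bisect import bisect_left

def sha1_of_file(path: str) -> str:
    # Hash in 64 KiB blocks to keep memory use constant.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest().upper()

def is_whitelisted(path: str, sorted_sha1_values: list[str]) -> bool:
    digest = sha1_of_file(path)
    i = bisect_left(sorted_sha1_values, digest)
    return i < len(sorted_sha1_values) and sorted_sha1_values[i] == digest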

Although filtering out using the RDS is widespread, only few results are available about its effectiveness. Back in 2008 Douglas White from NIST claimed in a presentation at the American Academy of Forensic Sciences (AAFS) that file-based data reduction leaves an average of 30% of disk space for human investigation³. However, the RDS only indexes application hash values; it does not take any personal files into account.

³ http://www.nsrl.nist.gov/Documents/aafs2008/dw-3-AAFS-2008-blocks.pdf, retrieved 2015-05-20

Therefore Baier and Dichtelmüller [2] performed a study on data reduction for different user profiles. The baseline of their research is the data reduction in terms of the number of files rather than disc space (because an investigator has to look at a file rather than at a certain amount of memory). The methodology of Baier and Dichtelmüller [2] is to model different user behaviour and their corresponding file generation characteristics. Their data reduction rates for different profiles are given in Table 8.2.

MG means the number of generated files in the file system of the respective user profile and MRDS the number of files in the system which are indexed by the RDS, too. The data reduction rate is the relation of the number of indexed files to all files, that is R = MRDS / MG. To be effective, R should be as close as possible to 1. For instance, the first row in Table 8.2 shows the result for a Windows XP operating system installation only, that is, there are no user files. However, only 52.45% of the files in the file system are indexed by the RDS. It is obvious that the reduction rate decreases if we insert additional user files. For example, if we model a user who mainly uses his computer for playing games (i.e. the profile gamer), the data reduction rate is below 15%. In this case the investigator has to inspect the remaining 85% of the files manually.

Profile                  Nr. of files: MG   Indexed by RDS: MRDS   Data reduction rate: R
XP, OS only              10,467             5,490                  52.45%
XP, standard software    22,801             9,689                  42.49%
XP gamer                 126,684            18,213                 14.38%
W7, OS only              56,233             18,703                 33.26%
W7 standard software     77,601             23,414                 30.17%
W7 universal             322,128            42,296                 13.13%
Ubuntu 11.04             172,789            26,664                 15.43%

Table 8.2: Data reduction rates for different user profiles using RDS [2]

These results are informally confirmed by practitioners, who are surprised by the ‘high’ data reduction rates of Baier and Dichtelmüller [2] and mention an expected data reduction rate of 5% for their cases. To sum up, the haystack does not decrease significantly using the RDS. As the preprocessing of applying the whitelist takes a lot of effort, our overall assessment is that whitelisting is not effective to automatically preprocess bulk data.

8.3.2 Blacklisting

In contrast to a whitelist, a blacklist indexes known-to-be-bad objects, that is, suspicious patterns. If an object from the suspect’s drive matches an element of the blacklist, the investigator gets a hint to a digital trace, which he inspects manually. Thus blacklisting is also called filtering in. Again, in order to handle a blacklist with respect to memory, a blacklist makes use of a compressed representation of each of its elements.

In this section we assess different aspects of cryptographic hash functions and approximate matching in the scope of blacklisting. The aspects and our assessment are summarised in Table 8.3. To illustrate our rating, we assign categories starting with + for the best rating, followed in descending order by ⊕, ○, ⊖ down to the worst rating −.

Property                        Cryptographic hash function   Bytewise approximate matching   Semantic approximate matching
Run-time efficiency (relative)  very fast: 1 +                fast to medium: 1.5 to 6 ⊕      slow: 20 to 500 −
Compression                     short: ≈ 256 bits +           1% to 3% of input length −      short: 256 to 600 bits ⊕
Object similarity detection     No −                          Yes +                           Yes +
Cross correlation               No −                          Yes +                           No −
Fragment detection              No −                          Yes +                           No −
Embedded object detection       No −                          Yes +                           No −
Domain specific (e.g., images)  No +                          No +                            Yes −
Encoding dependency             Yes −                         Yes −                           No +
FMR / FNMR                      0% +                          Dependent ⊖                     Dependent ⊖
Indexing                        Yes +                         Inefficient ⊖                   Inefficient ⊖

Table 8.3: Assessment of hash functions with respect to blacklisting

We first turn to the aspect efficiency, that is, run-time and memory efficiency. Our rating of run-time efficiency is based on the experiments of Breitinger et al. [5].


We assign a relative speed of 1 to cryptographic hash functions. Then bytewise approximate matching is slower by a factor of 1.5 up to much more, e.g., mrsh-v2 has comparable speed to SHA-1, while sdhash is much slower. However, bytewise approximate matching is typically much faster than semantic approximate matching, because the latter requires more complex computational steps. With respect to compression, both cryptographic hash functions and semantic approximate matching perform well. The hash value is of fixed small size. On the other hand, bytewise approximate matching outputs similarity digests of variable length, which is proportional to the input size (with the exception of ssdeep). For instance, a 1 TiB input requires a size of 10 GiB to 30 GiB for its bytewise approximate matching blacklist. This constitutes a key drawback of bytewise approximate matching.

We next assess the aspect resemblance (see Section 8.1.3). Both bytewise and semantic approximate matching are able to decide about object similarity, which is not the case for cryptographic hash functions. With regard to cross correlation (i.e. finding digital artefacts which share a common object) only bytewise approximate matching is able to successfully conduct it. The same holds for the aspect containment, i.e. fragment detection and embedded object detection: only bytewise approximate matching copes with containment.

The next aspect is dependency with respect to application area and representation, respectively. Both cryptographic hash functions and bytewise approximate matching consider the byte stream of an object, hence they are not bound to a specific domain of applications (e.g., image similarity, audio similarity). However, as semantic approximate matching extracts features to simulate human perception, it is bound to a certain domain of applications. If we examine encoding dependency, the situation is vice versa: the byte-level algorithms are dependent on the actual encoding (e.g., an image encoded as jpg is considered to be different from the same image encoded as png by both cryptographic hash functions and bytewise approximate matching). On the other hand, as semantic approximate matching considers the perceptual level, it does not depend on the encoding representation.

With respect to error rates, both the false non match rate (FNMR) and the false match rate (FMR) are of interest. For convenience the FMR should be small, otherwise the investigator is annoyed by manually checking erroneous traces. On the other hand the FNMR must be as close as possible to 0. Otherwise the blacklist fails in pointing to potential evidence, and the trace must be found in a different way. Cryptographic hash functions do not suffer from error rates due to their security requirements from the cryptographic domain (e.g., preimage resistance, collision resistance). However, as approximate matching processes noisy input, it suffers from both a non-trivial FMR and FNMR. It is therefore the operator’s responsibility to prioritise the error rates.

Our final aspect concerns indexing, that is, a sorting algorithm for digests. As explained in Section 8.3.1 the RDS sorts cryptographic hash values with respect to their numerical value. Hence indexing is easily possible for blacklists based on cryptographic hash functions. With respect to approximate matching, first approaches towards indexing are available. As they suffer from run time or memory inefficiency, we rate approximate matching rather negative with respect to sorting.

8.4 Summary

In this chapter we described the two main use cases of hash functions in digital forensics. The use cases are ‘authenticity and integrity of digital traces’ (ensured by applying cryptographic hash functions) and ‘identification of known objects’ (e.g., illicit files). In the latter case we showed how whitelisting and blacklisting work and how these concepts aim to perform data reduction, respectively. Our conclusion is that whitelisting is not effective and that blacklisting may be performed by cryptographic hash functions or approximate matching.