IEEE 2009 2nd International Conference on Computer, Control and Communication (IC4), Karachi, ...

Mitigating the Degenerations in Microsoft Word Documents: An Improved Steganographic Method

Anand Gupta*, Deepak Kumar Barr**, Deepali Sharma**

*Department of Computer Engineering

**Department of Biotechnology

Netaji Subhas Institute of Technology, Dwarka, New Delhi

Fax: (91)-011-25099022

Abstract—Steganography is the science of writing hidden messages in such a way that no one apart from the sender and the intended recipient realizes that a hidden message exists. Most prior research in this area has focused on images, audio, and video; comparatively little work has been done on MS Word documents, and the existing method has certain shortcomings. A major shortcoming of the previous method is the large number of degenerations it produces to embed a message, which makes it susceptible to active warden attack. This paper proposes to mitigate this shortcoming by decreasing the number of degenerations needed to embed the same message. Experimental results show the feasibility of our approach.

I. INTRODUCTION

Steganography, derived from Greek, literally means "covered writing". Steganography is the art of hiding information in ways that prevent the detection of hidden messages. It includes a vast array of secret communication methods that conceal the message's very existence. These methods include invisible inks, microdots, character arrangement, digital signatures, covert channels and spread spectrum communications. Most prior research in this area has focused on images, audio, and video as cover media [1]-[5]. Here, imperceptibility of data hiding is commonly achieved by exploiting the weaknesses of the human auditory and visual systems, using techniques such as changing the least-significant bits of the pixels of a cover image to embed information [6], or shifting lines, words, or characters by a small amount in an image containing text [7]. Steganography in text documents, however, remains an open area of research. In the method of [8], the text of a document is degenerated so as to embed a secret message in it. The message is first converted into binary bits and then, for each bit, the text is degenerated correspondingly, which implies that 1 to 8 degenerations are made for a single character. This large number of degenerations makes the technique more prone to active warden attack. This paper addresses the above shortcoming with a technique that produces far fewer degenerations even for very large messages, making it less prone to active warden attack and thereby enhancing security.

978-1-4244-3314-8/09/$25.00 ©2009 IEEE

Fig. 1. Data Flow Diagram For The Proposed Approach

The remainder of this paper is organized as follows. Previous work on steganography in text documents is reviewed first, followed by the shortcomings encountered in that approach. An improved technique is then proposed to overcome those shortcomings and so make the method more secure. Subsequent sections explain the proposed method in detail, followed by a description of the prototype implementation. Algorithms and experimental results are shown in Section V and Section VI respectively to demonstrate the feasibility of our method. The last section discusses conclusions and future work.

II. RELATED WORK

Previous work in steganography involves modifying the carrier text so that a message can be embedded at the places of degeneration. A method that produces more plausible and inconspicuous text is less vulnerable to active warden attack, which makes it a better steganographic technique.

Linguistic steganography methods that generate the cover text directly, such as the forecast generator [9] and the SpamMimic spam generator [10], do not have the known cover problem. However, the text produced using these methods is often implausible to a human reader. Translation-based steganography [11], on the other hand, uses the expected errors in the translation process, especially in machine translation, to solve the issue of producing implausible text. That is, information is hidden in the noise that occurs in language translation. In cases where sending imperfect translations to a colleague is reasonable, the stegodocuments resulting from translation-based steganography are inconspicuous. Furthermore, an improved technique proposed in [12] avoids the transmission of the original text and, hence, does not have the known cover problem. The translation-based approach, however, may be vulnerable to active attacks. If information is hidden in HTML files by adding useless spaces and line breaks [13] or by changing the case of letters in the tags [14], a warden can simply remove all redundant spaces or change all the letters in the tags to lowercase in a passing HTML file, because these actions do not affect normal communication, while any hidden information in the file is removed. The proposed method, in contrast, addresses the issue of producing plausible text by disguising the degenerated text as the draft work of an inferior author. The degenerated text and the corresponding revised text are intended for the recipient to see as they are, and are thus unlikely to be tampered with unknowingly.

The work in [8] degenerates the contents of a cover document D to arrive at another document D' by embedding a secret message in D during the transformation process using Huffman coding. The basic idea of that approach is first to convert the message into its equivalent binary bits. The bits are then embedded into the cover text by degenerating its contents to obtain a stegodocument; that is, to hide one bit of data, it may create 1 to 8 degenerations in the original document. The degenerations are then revised with the changes being tracked, so that the embedded message can be retrieved by the recipient. The change tracking facility provided by MS Word plays a key role in that method. The technique has some shortcomings in its implementation, which motivated us to develop an improved technique that enhances the security and efficiency of steganography in MS Word documents.

III. MOTIVATION AND CONTRIBUTION

Earlier approaches to steganography in MS Word documents have a major shortcoming: more than one degeneration is created to embed a single bit of a message. A large number of degenerations is therefore created in the stegodocument, which makes it more susceptible to active warden attack. The research described in this paper is motivated by the larger goal of mitigating this shortcoming of the previous approach. The salient features of the proposed technique relative to the previous work are stated below:

1. Only one degeneration is made to hide a single character, unlike the previous approach, where the number of degenerations may vary from 1 to 8.

2. The proposed method does not involve any binary conversion or Huffman coding, thereby decreasing the running time of the Embedder and Decoder.

The next section presents the architecture of the proposed method, which mitigates the shortcoming of the work done in [8].
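To make the comparison in feature 1 concrete, the per-character figures stated above can be turned into simple bounds for a whole message. The sketch below assumes only those figures (one degeneration per character for the proposed method, 1 to 8 per character for the method in [8]); the function name `degeneration_bounds` is hypothetical and not part of the paper.

```python
def degeneration_bounds(message: str) -> dict:
    """(minimum, maximum) number of degenerations for a message,
    using only the per-character figures stated in this section."""
    n = len(message)
    return {
        "proposed": (n, n),      # exactly one degeneration per character
        "previous": (n, 8 * n),  # between 1 and 8 degenerations per character
    }


bounds = degeneration_bounds("attack at dawn")  # a 14-character message
```

Under these assumptions, the upper bound for the previous approach grows eight times faster with the message length than the count for the proposed method.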

IV. SYSTEM DESIGN

In this section, the architecture of the proposed model is illustrated with the help of block diagrams. The proposed model can be divided into two sub-models: Embedder and Decoder. The Embedder model shows how a message is embedded in a covertext by causing degenerations to produce a stegodocument. Apart from the stegodocument, a Word Index Table (Table 1) is also generated, which is an input for the Decoder. The Decoder model shows the procedure for retrieving the original message from the stegodocument using a symmetric key and the Word Index Table.

sno   WordNo   IndexNo   ErrorCharacter
1     6        2         2
2     9        1         1
3     12       3         1
4     14       2         2

Table 1. Word Index Table
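One way to hold such a table in memory is a list of small records, one per hidden character. The sketch below is only an illustration: the class and field names are hypothetical renderings of the column headers, and the reading of each column follows the description in Section V (word number, index number, and the serial number of the character used for degeneration).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordIndexEntry:
    sno: int              # serial number of the entry (one per hidden character)
    word_no: int          # number of the degenerated word in the stegodocument
    index_no: int         # position of the replaced character inside that word
    error_character: int  # serial number of the error character that was used


# The four rows of Table 1.
WORD_INDEX_TABLE = [
    WordIndexEntry(1, 6, 2, 2),
    WordIndexEntry(2, 9, 1, 1),
    WordIndexEntry(3, 12, 3, 1),
    WordIndexEntry(4, 14, 2, 2),
]
```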

The proposed model is divided into the following two parts:

1) The Embedder takes the message characters one by one and hides them in the cover text to produce the stegodocument and a Word Index Table.

2) The Decoder reproduces the original message, taking the stegodocument, the Word Index Table and the secret key K as input.

Fig. 2. System Architecture Of Embedder

The Embedder model has the following blocks, whose functions are explained below:

1) Tokenizer: It takes the message as input and creates an array of the individual characters of the message, called tokens.

2) Finder: It takes a character from the output of the Tokenizer and looks for it in the cover text. It returns the first word encountered in the covertext that contains that character.

3) Error Character Generator: It takes the secret key as input and gives as output all the possible characters in the vicinity of the original character of the message.

4) Replacer: It replaces the original character of the word in the covertext with a possible error character. It returns the degenerated word.

5) Search Algorithm: It searches for the degenerated word in the dictionary. It returns true if the word exists and false if it does not.

6) Merger: It replaces the original word in the covertext with the degenerated word to produce the stegodocument.
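The interplay of the Replacer, Search Algorithm and Merger blocks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy cover text, the toy dictionary and the candidate list are all hypothetical, positions are 0-based, and falling back to the first candidate stands in for "the most commonly mistyped character".

```python
def replace_at(word, index, new_char):
    """Replacer: swap the character at a 0-based index for new_char."""
    return word[:index] + new_char + word[index + 1:]


def degenerate(word, index, candidates, dictionary):
    """Return (degenerated_word, serial); serial is the 1-based position
    of the candidate that was used.

    The first candidate that yields a dictionary word is preferred (Search
    Algorithm block). If none does, the first candidate is used (assumption).
    """
    for serial, cand in enumerate(candidates, start=1):
        degen = replace_at(word, index, cand)
        if degen.lower() in dictionary:
            return degen, serial
    return replace_at(word, index, candidates[0]), 1


def merge(cover_words, word_no, degen_word):
    """Merger: copy of the cover words with word word_no (0-based) replaced."""
    out = list(cover_words)
    out[word_no] = degen_word
    return out


cover = "the cat sat on the mat".split()
dictionary = {"the", "cat", "sat", "on", "mat", "car"}  # toy dictionary

# Hide 't' in "cat" (index 2); 'y' and 'r' are neighbouring keys of 't'.
degen, serial = degenerate("cat", 2, ["y", "r"], dictionary)
stego = merge(cover, 1, degen)
```

Here "cay" is rejected because it is not in the toy dictionary, so the second candidate is used and its serial number is what would be recorded in the Word Index Table.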

The Decoder model has the following blocks, whose functions are explained below:

1) Fetcher: It takes the stegodocument and the Word Index Table as input. Each entry in the Word Index Table points to a character in the stegodocument, which is given as output.

2) Character Finder: It takes a single character from the Fetcher and returns the actual character of the message using the secret key.

3) Appender: It takes the individual characters from the Character Finder and appends them to produce the original message.

V. MESSAGE EMBEDDING AND EXTRACTION

Fig. 3. System Architecture Of Decoder

In the proposed method, the message is first converted into individual tokens, or characters. For each character, a word containing that character is searched for in the covertext. This word is degenerated by replacing the original character with a new character to produce a degenerated word. The new character is obtained from the secret key, which is based on the fact that characters near each other on the keyboard can be mistyped. Among all the possible replacements, a character is preferred if the resulting degenerated word is also present in the dictionary, thereby avoiding redline marking. When none of the degenerated word combinations is present in the dictionary, a misspelling is created using the character that is most commonly mistyped by the typist, which produces a redline. This process is repeated for all the characters in the message. Alongside, a Word Index Table is maintained which records the location of each degeneration. It stores the word number, the index number and the serial number of the character used for the degeneration. At the end of the embedding process, a stegodocument S and a Word Index Table are obtained, which are transported via an unsecured and a secured channel respectively. The details of the message embedding process are presented in Algorithms 1 to 4.

Algorithm 1 Tokenizer()
Input: a message M of length n to be embedded
Output: array Token containing the individual characters of M

for i = 0 to n - 1 do
    Token[i] ← getCharacter(M, i)
end for
return Token

Algorithm 2 Finder()
Input: Token[i], covertext
Output: word containing Token[i] in the covertext

while TRUE do
    word ← getWord(covertext)
    if word contains Token[i] then
        break
    end if
end while
return word
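A runnable counterpart of the Finder might look like the sketch below. It makes several assumptions that the paper does not state: word numbers and character positions are 0-based, the function returns `None` instead of looping forever when no word contains the character, and a hypothetical `used` set keeps two message characters from being hidden in the same word.

```python
def find_word(cover_words, token, used=frozenset()):
    """Return (word_no, index) of the first unused word containing token,
    or None if there is no such word. Both numbers are 0-based."""
    for word_no, word in enumerate(cover_words):
        if word_no in used:
            continue
        index = word.find(token)
        if index != -1:
            return word_no, index
    return None


cover = "the cat sat on the mat".split()
first = find_word(cover, "a")             # first word containing 'a'
second = find_word(cover, "a", used={1})  # skip the word already used
missing = find_word(cover, "z")           # no word contains 'z'
```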

Algorithm 3 ErrorCharacterGenerator()
Input: Token[i]
Output: array of possible error characters

k ← 0
repeat
    error_char ← getErrorCharacter(Token[i])
    if error_char[k] ≠ null then
        k++
    end if
until error_char[k] == null
return error_char
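The paper takes the error characters from a secret key database. As an illustration of what one entry of such a database might contain, the sketch below derives candidate error characters from a QWERTY layout. The layout table, the function name and the ordering of the candidates are assumptions, and key adjacency across rows is only approximated by comparing column positions.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def error_characters(token):
    """Letters next to token on a QWERTY layout: same-row neighbours first,
    then keys at nearby columns in the rows above and below."""
    token = token.lower()
    for r, row in enumerate(QWERTY_ROWS):
        c = row.find(token)
        if c == -1:
            continue
        neighbours = []
        # Same row: the keys to the left and right.
        for cc in (c - 1, c + 1):
            if 0 <= cc < len(row):
                neighbours.append(row[cc])
        # Adjacent rows: the keys at the same and the neighbouring columns.
        for rr in (r - 1, r + 1):
            if 0 <= rr < len(QWERTY_ROWS):
                other = QWERTY_ROWS[rr]
                for cc in (c - 1, c, c + 1):
                    if 0 <= cc < len(other):
                        neighbours.append(other[cc])
        return neighbours
    return []  # not a letter key: no candidates


candidates_for_t = error_characters("t")
candidates_for_a = error_characters("a")
```

A secret key could then be an agreed ordering or subset of these candidates per character, so that a serial number identifies which one was used.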

Algorithm 4 Replacer()
Input: word, index, token (the character to hide), error_ch[], number of error characters k
Output: degenerated word Degen_word

found ← FALSE
for j = 1 to k do
    Degen_word ← word.Replace(token, error_ch[j])
    if Search(Degen_word) then
        found ← TRUE
        break
    end if
end for
if not found then
    Degen_word ← word.Replace(token, most commonly mistyped character for token)
end if
return Degen_word

Once the stegodocument and the Word Index Table have been transported, the Decoder comes into play. The Decoder takes the stegodocument and the Word Index Table as input and reproduces the original message as output. For each entry in the Word Index Table, a degenerated word is drawn from the stegodocument. In this word, the replaced character is identified, which reveals the actual character of the message with the help of the secret key. This process is repeated until all the entries in the Word Index Table have been processed. All the characters obtained are appended together to reproduce the original message. The details of the message decoding process are presented in Algorithm 5.

Algorithm 5 Decoder()
Input: Stegodocument, Secret Key and Word Index Table
Output: Original Message

i ← 0
while WordIndexTable[i] ≠ null do
    word_no ← getWordNumber(WordIndexTable[i])
    index ← getIndex(WordIndexTable[i])
    ch ← getChar(Stegodocument, word_no, index)
    new_ch ← getRightChar(SecretKey, ch)
    append new_ch to the decoded message
    i ← i + 1
end while
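The decoding loop can be sketched in Python as below. Everything concrete here is hypothetical: the toy secret key, the toy stegodocument and the table entries are invented for illustration, positions are 0-based, and the error-character serial number is used during lookup. Algorithm 5 passes only the secret key and the character to getRightChar; using the serial number as well is an assumption based on the ErrorCharacter column of Table 1.

```python
# Hypothetical secret key: for each message character, an ordered list of
# error characters. Serial number s refers to position s - 1 in the list.
SECRET_KEY = {
    "a": ["s", "q"],
    "e": ["r", "w"],
    "t": ["y", "r"],
    "h": ["j", "g"],
}


def right_char(secret_key, error_char, serial):
    """Find the character whose serial-th error character is error_char."""
    matches = [orig for orig, errs in secret_key.items()
               if len(errs) >= serial and errs[serial - 1] == error_char]
    if len(matches) != 1:
        raise ValueError("key does not identify a unique original character")
    return matches[0]


def decode(stego_words, word_index_table, secret_key):
    """Fetch each marked character, map it back, and append the results."""
    message = []
    for word_no, index, serial in word_index_table:
        ch = stego_words[word_no][index]                          # Fetcher
        message.append(right_char(secret_key, ch, serial))        # Character Finder
    return "".join(message)                                       # Appender


stego = ["tge", "cst", "say", "on", "the", "mat"]  # toy stegodocument
table = [(0, 1, 2), (1, 1, 1), (2, 2, 1)]          # (word_no, index, serial)
decoded = decode(stego, table, SECRET_KEY)
```

In this toy key the character 'r' is an error character for both 'e' (serial 1) and 't' (serial 2), so the serial number is what makes the inverse lookup unambiguous.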

VI. PERFORMANCE ANALYSIS AND RESULTS

The proposed model is implemented on the following computer configuration:

Dell Technology, Intel(R) Pentium 4, 1.70 GHz, 256 MB DDR RAM and Microsoft Windows 2000 Professional.

VII. SECURITY CONSIDERATIONS

With reference to the work done in [8], the experimental results of the proposed method show that the extent of degeneration in the covertext is smaller, thereby reducing the chances of active warden attack. The proposed method degenerates a character by replacing it with a character from a nearby key on the keyboard, as provided by a secret key database. The entries in the secret key database should model realistic errors made by a typist when mistyping a character. Among all the possibilities, a character is preferred if its replacement results in a word that is already present in the Microsoft Word dictionary; this avoids redline marking, makes the steganography harder to spot, and thus enhances security.

Fig. 4. Bar chart of the relative running time of the proposed algorithm, the proposed algorithm with random function, and the algorithm given in [8], for the number of data sets n going from 1 to 10

Fig. 5. Bar chart showing the relative size of the stegodocument for the proposed algorithm and the algorithm given in [8], for the number of data sets n going from 1 to 10

To further enhance security, a random function is employed. All the possible degenerations that can hide a given character in the covertext are evaluated, and a random function selects one of them and causes the corresponding degeneration. In this case, the whole of the covertext is used to hide the message, so the degenerations are distributed throughout the length of the covertext, which enhances security.
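The random variant can be sketched as an enumeration followed by a random choice. The names, the toy cover text and the toy dictionary below are hypothetical, positions are 0-based, and restricting the options to replacements that yield dictionary words is an assumption made to keep the example small.

```python
import random


def all_degenerations(cover_words, token, candidates, dictionary):
    """Every (word_no, index, serial) whose replacement of token by the
    serial-th candidate yields a dictionary word."""
    options = []
    for word_no, word in enumerate(cover_words):
        for index, ch in enumerate(word):
            if ch != token:
                continue
            for serial, cand in enumerate(candidates, start=1):
                degen = word[:index] + cand + word[index + 1:]
                if degen in dictionary:
                    options.append((word_no, index, serial))
    return options


def pick_degeneration(options, rng):
    """Select one possibility at random, or None if there is none."""
    return rng.choice(options) if options else None


cover = "the cat sat on the mat".split()
dictionary = {"the", "cat", "sat", "on", "mat", "car", "say", "may"}
options = all_degenerations(cover, "t", ["y", "r"], dictionary)
choice = pick_degeneration(options, random.Random(0))
```

Because the choice is drawn from positions across the whole cover text rather than from the first match, repeated use spreads the degenerations along the document, which is the effect described above.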

Apart from strengthening security, the proposed method has a shorter running time than the previous method, making it more efficient than the previous approach.

VIII. CONCLUSION AND FUTURE WORK

An improved steganographic method for data hiding in Microsoft Word documents has been presented in this paper to enhance security. With reference to the previous method, the experimental results of the proposed method show that the extent of degeneration in the covertext is smaller, thereby decreasing the chances of active warden attack. Apart from enhancing security, the proposed method takes less time than the previous method in both embedding and extraction, making it more efficient than the work done in [8].

Fig. 6. Bar chart showing the relative number of misspellings for the proposed algorithm, the proposed algorithm with random function, and the algorithm given in [8], for the number of data sets n going from 1 to 10

Future work may extend this method to Microsoft PowerPoint, Microsoft Excel, and Microsoft Access using different types of degenerations. Other data-hiding methodologies based on disguise under collaborative efforts are open topics for future research.

REFERENCES

[1] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Syst. J., vol. 35, no. 3-4, pp. 313-336, 1996.

[2] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, "Information hiding-A survey," Proc. IEEE, vol. 87, no. 7, pp. 1062-1078, Jul. 1999.

[3] M. Wu, H. Yu, and A. Gelman, "Multi-level data hiding for digital image and video," presented at the SPIE Photonics East, Boston, MA, 1999.

[4] N. F. Johnson, Z. Duric, and S. Jajodia, Information Hiding: Steganography and Watermarking-Attacks and Countermeasures. Norwell, MA: Kluwer, 2001.

[5] R. Chandramouli, M. Kharrazi, and N. Memon, "Image steganography and steganalysis concepts and practice," Digital Watermarking, Lecture Notes in Computer Science 2939, pp. 35-49, 2004.

[6] D. C. Wu and W. H. Tsai, "A steganographic method for images by pixel-value differencing," Pattern Recognit. Lett., vol. 24, no. 9-10, pp. 1613-1626, 2003.

[7] J. T. Brassil and N. F. Maxemchuk, "Copyright protection for the electronic distribution of text documents," Proc. IEEE, vol. 87, no. 7, pp. 1181-1196, Jul. 1999.

[8] Tsuang-Yuan Liu and Wen-Hsiang Tsai, "A new steganographic method for data hiding in Microsoft Word documents by a change tracking technique," IEEE Transactions on Information Forensics and Security, vol. 2, no. 1, pp. 24-30, Mar. 2007.

[9] L. Bourbeau, D. Carcagno, E. Goldberg, R. Kittredge, and A. Polguère, "Bilingual generation of weather forecasts in an operations environment," in Proc. 13th Int. Conf. Computational Linguistics, Helsinki, Finland, pp. 318-320, 1990.

[10] SpamMimic, [Online]. Available: http://www.spammimic.com.

[11] C. Grothoff, K. Grothoff, L. Alkhutova, R. Stutsman, and M. Atallah, "Translation-based steganography," in Proc. Information Hiding Workshop, pp. 213-233, 2005.

[12] R. Stutsman, C. Grothoff, M. Atallah, and K. Grothoff, "Lost in just the translation," in Proc. ACM Symp. Applied Computing, pp. 338-345, 2006.

[13] N. F. Johnson and S. Jajodia, "Steganalysis: The investigation of hidden information," in Proc. IEEE Information Technology Conf., Syracuse, NY, pp. 113-116, Sep. 1998.

[14] Xin-Guang Sui and H. Luo, "A new steganography method based on hypertext," in Proc. 2004 Asia Pacific Radio Science Conf., pp. 181-184, Aug. 2004.