[Lecture Notes in Computer Science] Compiler Construction Volume 7791 || Architecture-Independent Dynamic Information Flow Tracking

Download [Lecture Notes in Computer Science] Compiler Construction Volume 7791 || Architecture-Independent Dynamic Information Flow Tracking

Post on 14-Dec-2016

212 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • Architecture-Independent Dynamic Information

    Flow Tracking

    Ryan Whelan1, Tim Leek2, and David Kaeli1

    1 Department of Electrical and Computer EngineeringNortheastern University, Boston, MA USA

    {rwhelan,kaeli}@ece.neu.edu2 Cyber System Assessments Group

    MIT Lincoln Laboratory, Lexington, MA USAtleek@ll.mit.edu

    Abstract. Dynamic information ow tracking is a well-known dynamicsoftware analysis technique with a wide variety of applications that rangefrom making systems more secure, to helping developers and analystsbetter understand the code that systems are executing. Traditionally,the ne-grained analysis capabilities that are desired for the class ofthese systems which operate at the binary level require tight coupling toa specic ISA. This places a heavy burden on developers of these systemssince signicant domain knowledge is required to support each ISA, andthe ability to amortize the eort expended on one ISA implementationcannot be leveraged to support other ISAs. Further, the correctness ofthe system must carefully evaluated for each new ISA.

    In this paper, we present a general approach to information ow track-ing that allows us to support multiple ISAs without mastering the intri-cate details of each ISA we support, and without extensive verication.Our approach leverages binary translation to an intermediate representa-tion where we have developed detailed, architecture-neutral informationow models. To support advanced instructions that are typically im-plemented in C code in binary translators, we also present a combinedstatic/dynamic analysis that allows us to accurately and automaticallysupport these instructions. We demonstrate the utility of our system inthree dierent application settings: enforcing information ow policies,classifying algorithms by information ow properties, and characterizingtypes of programs which may exhibit excessive information ow in aninformation ow tracking system.

    Keywords: Binary translation,binary instrumentation, informationowtracking, dynamic analysis, taint analysis, intermediate representations.

    This work was sponsored by the Assistant Secretary of Defense for Research and En-gineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations,conclusions and recommendations are those of the authors and are not necessarilyendorsed by the United States Government.

    K. De Bosschere and R. Jhala (Eds.): CC 2013, LNCS 7791, pp. 144163, 2013.c Springer-Verlag Berlin Heidelberg 2013

  • Architecture-Independent Dynamic Information Flow Tracking 145

    1 Introduction

    Dynamic information ow tracking (also known as dynamic taint analysis) is awell-known software analysis technique that has been shown to have wide appli-cability in software analysis and security applications. However, since dynamicinformation ow tracking systems that operate at the binary level require ne-grained analysis capabilities to be eective, this means that they are generallytightly coupled with the ISA of code to be analyzed.

    In order to implement a ne-grained analysis capability such as informationow tracking for an ISA of interest, an intimate knowledge of the ISA is re-quired in order to accurately capture information ow for each instruction. Thisis especially cumbersome for ISAs with many hundreds of instructions that havecomplex and subtle semantics (e.g., x86). Additionally, after expending the workrequired to complete such a system, the implementation only supports the sin-gle ISA, and a similar eort is required for each additional ISA. To overcomethis challenge, weve elected to take a compiler-based approach by translatingarchitecture-specic code into an architecture-independent intermediate repre-sentation where we can develop, reuse, and extend a single set of analyses.

    In this work, we present Pirate: a Platform for Intermediate Representation-based Analyses of Tainted Execution. Pirate decouples the tight bond betweenthe ISA of code under analysis and the additional instrumentation code, andprovides a general taint analysis framework that can be applied to a large numberof ISAs. Pirate leverages QEMU [4] for binary translation, and LLVM [14] asan intermediate representation within which we can perform our architecture-independent analyses. We show that our approach is both general enough tobe applied to multiple ISAs, and precise enough to provide the detailed kindof information expected from a ne-grained dynamic information ow trackingsystem. In our approach, we dene detailed byte-level information owmodels for29 instructions in the LLVM intermediate representation which gives us coverageof thousands of instructions that appear in translated guest code. We also applythese models to complex guest instructions that are implemented in C code.To the best of our knowledge, this is the rst implementation of a binary leveldynamic information ow tracking system that is general enough to be appliedto multiple ISAs without requiring source code.

    The contributions of this work are:

    A framework that leverages dynamic binary translation producing LLVMintermediate representation that enables architecture-independent dynamicanalyses, and a language for precisely expressing information ow of this IRat the byte level.

    A combined static/dynamic analysis to be applied to the C code of the bi-nary translator for complex ISA-specic instructions that do not t within theIR, enabling the automated analysis and inclusion of these instructions in ourframework.

  • 146 R. Whelan, T. Leek, and D. Kaeli

    An evaluation of our framework for x86, x86 64, and ARM, highlightingthree security-related applications: 1) enforcing information ow policies, 2)characterizing algorithms by information ow, and 3) diagnosing sources ofstate explosion for each ISA.

    The rest of this paper is organized as follows. In Section 2, we present backgroundon information ow tracking. Sections 3 and 4 present the architectural overviewand implementation of Pirate. Section 5 presents our evaluation with threesecurity-related applications, while Section 6 includes some additional discussion.We review related work in Section 7, and conclude in Section 8.

    2 Background

    Dynamic information ow tracking is a dynamic analysis technique where datais labeled, and subsequently tracked as it ows through a program or system.Generally data is labeled and tracked at the byte level, but this can also hap-pen at the bit, word, or even page level, depending on the desired granularity.The labeling can also occur at varying granularities, where each unit of data isalso accompanied by one bit of data (tracked or not tracked), one byte of data(accompanied by a small number), or a data structure that tracks additionalinformation. Tracking additional information is useful for the cases when labelsets are propagated through the system. In order to propagate the ow of data,major components of the system need a shadow memory to keep track of wheredata ows within the system. This includes CPU registers, memory, and in thecase of whole systems, the hard drive also. When information ow tracking isimplemented for binaries at a ne-grained level, this means that propagationoccurs at the level of the ISA where single instructions that result in a ow ofinformation are instrumented. This instrumentation updates the shadow mem-ory accordingly when tagged information is propagated.

    Information ow tracking can occur at the hardware level [7,26,28], or insoftware through the use of source-level instrumentation [12,15,31], binary in-strumentation [10,13,22], or the use of a whole-system emulator [9,20,24]. Ingeneral, hardware-based approaches are faster, but less exible. Software-basedapproaches tend to have higher overheads, but enable more detailed dynamicanalyses. Source-level approaches tend to be both fast and exible, but aresometimes impractical when source code is not available. These techniques haveproven to be eective in a wide variety of applications, including detection andprevention of exploits, malware analysis, debugging assistance, vulnerability dis-covery, and network protocol reverse engineering.

    Due to the popularity of the x86 ISA, and the tight bond of these binary instru-mentation techniqueswith the ISAunder analysis,manyof these systemshavebeencarefully designed to correctly propagate information ow only for the instructionsthat are included in x86. This imposes a signicant limitation on dynamic informa-tion ow tracking since a signicant eort is required to support additional ISAs.Pirate solves this problem by decoupling this analysis technique from the under-lying ISA, without requiring source code or higher-level semantics.

  • Architecture-Independent Dynamic Information Flow Tracking 147

    Fig. 1. Lowering code to the LLVM IRwith QEMU and Clang

    !"

    #

    Fig. 2. System architecture split be-tween execution and analysis

    3 System Overview

    At a high level, Pirate works as follows. The QEMU binary translator [4] is atthe core of our system. QEMU is a versatile dynamic binary translation platformthat can translate 14 dierent guest architectures to 9 dierent host architec-tures by using its own custom intermediate representation, which is part of theTiny Code Generator (TCG). In our approach, we take advantage of this transla-tion to IR to support information ow tracking for multiple guest architectures.However, the TCG IR consists of very simple RISC-like instructions, making itdicult to represent complex ISA-specic instructions. To overcome this limi-tation, the QEMU authors implement a large number of guest instructions inhelper functions, which are C implementations of these instructions. Each guestISA implements hundreds of instructions in helper functions, so we have deviseda mechanism to automatically track information ow to, from, and within the