Intrusion Detection and Malware Analysis: Malware collection
Pavel Laskov, Wilhelm Schickard Institute for Computer Science




Motivation for malware collection

Understanding vulnerabilities and attack techniques
Development of protection and neutralization tools
Understanding the attacker communities and their "business models"


Malware collection tools

Honeypot: an isolated, unprotected and monitored system, containing resources that appear valuable to an attacker, aimed at collecting examples of malicious activity.
Honeyclient: an automated, vulnerable client-side system executed in a controlled environment.
Honeynet: a distributed collection of honeypots and email filters intended for large-scale collection and observation of malware.


Honeypot taxonomy

Low-interaction: simple daemons simulating network services; no exploitation.
Medium-interaction: emulated vulnerabilities for attracting and executing malware in a controlled environment.
High-interaction: real systems communicating with malware in a controlled environment.


Low-interaction honeypots

(+) Low security risks due to emulation
(+) Simple installation and recovery
(+) Suitable for analysis of automatic attacks
(+) High scalability
(−) Not suitable for detection of interactive attacks due to limited emulated functionality
(−) Hardly suitable for acquisition of malware binaries


High-interaction honeypots

(+) Suitable for detection and acquisition of all kinds of malware
(−) Time- and resource-consuming installation and maintenance
(−) High security risks: additional security mechanisms are required
(−) Virtualization can be detected by malware


Medium-interaction honeypots

(+) Relatively wide exploit coverage
(+) Extensive monitoring and collection functionality
(+) Full virtualization not necessary
(+) Relative ease of deployment and maintenance
(+) Low to moderate security risks (egress outbreak)
(−) Manual emulation of vulnerabilities still necessary
(−) Detection of novel exploits not always reliable


A honeypot example: Nepenthes

Source of the figure and the text below: P. Baecher et al., The Nepenthes platform, RAID 2006, p. 170.

Fig. 1. Concept behind nepenthes platform

– Submission modules take care of the downloaded malware, e.g., by saving the binary to a hard disc, storing it in a database, or sending it to anti-virus vendors.

– Logging modules log information about the emulation process and help in getting an overview of patterns in the collected data.

In addition, several further components are important for the functionality and efficiency of the nepenthes platform: shell emulation, a virtual filesystem for each emulated shell, geolocation modules, sniffing modules to learn more about new activity on specified ports, and asynchronous DNS resolution.

The schematic interaction between the different components is depicted in Figure 1 and we introduce the different building blocks in the next paragraphs.

Vulnerability modules are the main factor of the nepenthes platform. They enable an effective mechanism to collect malware. The main idea behind these modules is the following observation: in order to get infected by autonomous spreading malware, it is sufficient to only emulate the necessary parts of a vulnerable service. So instead of emulating the whole service, we only need to emulate the relevant parts and thus are able to efficiently implement this emulation. Moreover, this concept leads to a scalable architecture and the possibility of large-scale deployment due to only moderate requirements on processing resources and memory. Often the emulation can be very simple: we just need to provide some minimal information at certain offsets in the network flow during the exploitation process. This is enough to fool the autonomous spreading malware and make it believe that it can actually exploit our honeypot. This is an example of the deception techniques used in honeypot-based research. With the help of vulnerability modules we trigger an incoming exploitation attempt and eventually we receive the actual payload, which is then passed to the next type of modules.

Shellcode parsing modules analyze the received payload and extract automatically relevant information about the exploitation attempt. Currently, only one

Vulnerability modules: emulate vulnerable parts of network services.
Shellcode parsers: analyse shellcode to locate its source.
Fetch modules: download binaries from remote locations.
Submission modules: store binaries in a specified location.


Nepenthes vulnerability modules

Poor man's implementation of the original vulnerability:
Send N fixed strings, random junk, exploit "stages"
Dismiss intermediate received stages
Record the final stage and use it as the payload

Example (nepenthes LSASS module; the line sending the reply and the return statement are added here to complete the slide's excerpt, following the nepenthes Dialogue/Responder interface):

ConsumeLevel LSASSDialogue::incomingData(Message *msg)
{
    m_buffer->add(msg->getMsg(), msg->getSize());        // keep what the exploit sent
    char reply[512];                                      // junk answer for the bind stage;
    for (int32_t i = 0; i < 512; i++) {                   // the exploit never checks it
        reply[i] = rand() % 256;
    }
    msg->getResponder()->doRespond(reply, sizeof(reply)); // send the junk back (added)
    return CL_ASSIGN;   // keep the connection: the shellcode arrives in the next stage (added)
}


Nepenthes shellcode analysis

Analyze the incoming payload and extract the malware location.
shellcode-signature module:
Signature-controlled shellcode analyzer
Perl-compatible RE patterns for commonly seen shellcode
Identify parameters of shellcode (ports, URIs, ...)
Shell emulator with arbitrary commands
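Nepenthes itself is written in C++; the Python below is an added illustration (not nepenthes code) of the two preceding slides together: a vulnerability-module dialogue that emulates only the stages an exploit needs to see, dismisses the intermediate stages and records the final one, followed by a signature-driven analyzer that pulls download parameters out of the captured payload. The stage markers ("BIND", "REQ"), the canned replies and the sample payload are constructed for the example. The connectback signature matches the x86 "push imm32 / push imm16" sequence that connectback shellcode uses to build a sockaddr_in; the single-byte XOR sweep stands in for the decoder handling of the real module and is an assumption of this example, not a nepenthes signature.

```python
"""Added example: nepenthes-style emulation of one exploitable service and a
signature-driven analysis of the payload it captures (not nepenthes code)."""
from __future__ import annotations
import os
import re
import struct
from dataclasses import dataclass, field
from enum import Enum, auto


class Verdict(Enum):
    CONSUME = auto()    # more stages expected, keep the connection open
    CAPTURED = auto()   # all scripted stages done, final stage is being recorded
    DROP = auto()       # traffic does not match the emulated exploit


@dataclass
class StageSpec:
    expect: bytes        # prefix the exploit must send at this stage (constructed markers)
    reply: bytes | None  # canned reply; None means "512 bytes of random junk"


@dataclass
class VulnDialogue:
    """Emulate only what the exploit needs to see: N canned replies at the right
    moments; intermediate stages are dismissed, the final stage is recorded."""
    stages: list
    buffer: bytearray = field(default_factory=bytearray)
    stage: int = 0
    payload: bytearray = field(default_factory=bytearray)

    def incoming(self, data: bytes) -> tuple[Verdict, bytes]:
        """Feed one received chunk; return (verdict, bytes to send back)."""
        if self.stage >= len(self.stages):
            self.payload.extend(data)        # final stage: record, answer nothing
            return Verdict.CAPTURED, b""
        self.buffer.extend(data)
        spec = self.stages[self.stage]
        if len(self.buffer) < len(spec.expect):
            return Verdict.CONSUME, b""      # stage not complete yet
        if not self.buffer.startswith(spec.expect):
            return Verdict.DROP, b""
        reply = os.urandom(512) if spec.reply is None else spec.reply
        self.buffer.clear()                  # intermediate stage is thrown away
        self.stage += 1
        return Verdict.CONSUME, reply


# x86 "push imm32 ; push imm16" that connectback shellcode uses to build a
# sockaddr_in: IP and port appear in network byte order inside the instruction bytes
CONNECTBACK_RE = re.compile(rb"\x68(?P<ip>.{4})\x66\x68(?P<port>.{2})", re.DOTALL)
# download location in clear text; tried on every single-byte XOR variant (assumption)
URL_RE = re.compile(rb"(?:https?|t?ftp)://[A-Za-z0-9.\-]+(?::\d{1,5})?/[A-Za-z0-9._/\-]+")


def analyze_payload(payload: bytes) -> dict:
    """Run the signatures over a captured payload and return what a fetch
    module would need: connectback endpoints and download URLs (with XOR key)."""
    found = {"connectback": [], "urls": []}
    for m in CONNECTBACK_RE.finditer(payload):
        ip = ".".join(str(b) for b in m.group("ip"))
        port = struct.unpack(">H", m.group("port"))[0]
        found["connectback"].append((ip, port))
    for key in range(256):                   # key 0 is the clear-text pass
        decoded = bytes(b ^ key for b in payload)
        for m in URL_RE.finditer(decoded):
            found["urls"].append((m.group(0).decode("ascii"), key))
    return found


# constructed sample payload: NOP sled, connectback to 192.0.2.10:4444, then a
# download URL hidden behind single-byte XOR with key 0x41
_URL = b"http://192.0.2.10:8080/sample.exe"
SAMPLE_PAYLOAD = (b"\x90" * 16
                  + b"\x68" + bytes([192, 0, 2, 10]) + b"\x66\x68" + struct.pack(">H", 4444)
                  + bytes(c ^ 0x41 for c in _URL))
# constructed two-stage exploit script: a "bind" whose answer the exploit never
# checks (junk is enough) and a "request" acknowledged with two fixed bytes
SAMPLE_STAGES = [StageSpec(b"BIND", None), StageSpec(b"REQ", b"\x05\x00")]


def run_sample() -> dict:
    """Drive the dialogue with the constructed exploit and analyze what it left."""
    d = VulnDialogue(SAMPLE_STAGES)
    verdicts = [d.incoming(b"BIND\x00\x01")[0], d.incoming(b"REQ")[0],
                d.incoming(SAMPLE_PAYLOAD)[0]]
    return {"verdicts": verdicts, "payload": bytes(d.payload),
            "analysis": analyze_payload(bytes(d.payload))}


if __name__ == "__main__":
    print(run_sample()["analysis"])
```

The dialogue never interprets the protocol beyond the prefix check, which is exactly the "poor man's implementation" of the slide: anything that does not follow the scripted stages is dropped, and only the bytes of the last stage ever reach the analyzer.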


Nepenthes download module

Download the actual malware from the previously generated URL.
Several modules for various protocols:
HTTP(S), (T)FTP, RCP, ...
"Proprietary" malware protocols:
  CSend and CReceive from AgoBot
  LinkBind and LinkConnectback from linkbot
RFC-incompliant implementations of HTTP and FTP


Honeyclients

Objective: detection of attacks directed at client-side software, mostly web browsers:
browser exploits
"drive-by downloads"
typo-squatting

Applications:
security analysis of web sites
finding malicious content distribution sites
detection of new browser exploits
malware collection


Browser exploitation via redirection

Source: Yi-Min Wang, Microsoft Research


Honeyclient example: HoneyMonkey

A VM-based high-interaction honeyclient, running a vulnerable browser.
Automatic detection of redirection relationships between content distribution sites.
Detection of zero-day attacks.


HoneyMonkey architecture

Stage 1 (run on several VMs in parallel): N URLs per VM, unpatched WinXP, no redirection analysis
   -> "interesting" URLs ->
Stage 2: 1 URL per VM, unpatched WinXP SP2, redirection analysis
   -> exploit URLs ->
Stage 3: 1 URL per VM, patched WinXP SP2, redirection analysis
   -> zero-day exploits

Outputs of the pipeline: exploit URL topology graph, analysis of exploit URL density, access blocking, vulnerability patching.


HoneyMonkey deployment results

Results were obtained in May-June 2005 on a list of 16,190 URLs with known bad content (pornography, adware distribution, some shopping and freeware screensaver sites).

HoneyMonkey configuration          Number of exploit URLs (frequency)
Stage 1, fully unpatched           207 (1.3%)
Stage 2, fully unpatched (SP1)     688 (4.2%)
Stage 2, fully unpatched (SP2)     204 (1.3%)
Stage 3, SP2 partially patched      17 (0.1%)
Stage 3, SP2 fully patched           0 (0%)

In July 2005, 27 URLs were discovered that distributed a zero-day exploit.


From honeypots to honeynets

Goals:
Wide coverage of the up-to-date "malware landscape"
Fast discovery of new malware strains

Challenges:
Maintenance: deployment by less qualified administrators
Security: avoid potential infection of host systems
Automation: adjust to potentially unknown vulnerabilities
Scalability: infrastructure for storing massive amounts of malware
Utility: interface for analysis tools
Stealth: should not be detectable by malware


Honeynet example: SGNET

Message extraction from TCP flows
Generation and refinement of a finite state machine model for a communication protocol used by malware
Generation of a honeyd-compatible script implementing the finite state machine
Communication interface for interaction with the repository and analysis components


What do we want to know about malware?

Is it recognized by existing antivirus products?
What is its functionality?
How is malware distributed?
What other harmful functions does malware carry out?
What are the relationships between various classes of malware?
Do they share common techniques? How do they evolve?


How much of malware is unknown?

Experiment:
Current instances of malware were collected from a Nepenthes honeypot.
Files were scanned with Avira AntiVir.

Results:
First scan: detected 76%, undetected 24%
Second scan (four weeks later): detected 85%, undetected 15%

After four weeks 15% of malware instances were still not recognized!



CWSandbox: system architecture

[...] virtual memory and executing code in a different process's context.

Windows kernel32.dll offers the API functions ReadProcessMemory and WriteProcessMemory, which let the CWSandbox read and write to an arbitrary process's virtual memory, allocating new memory regions or changing an already allocated memory region's protection using the VirtualAllocEx and VirtualProtectEx functions.

It's possible to execute code in another process's context in at least two ways:

• suspend one of the target application's running threads, copy the to-be-executed code into the target's address space, set the resumed thread's instruction pointer to the copied code's location, and then resume the thread; or

• copy the to-be-executed code into the target's address space and create a new thread in the target process with the code location as the start address.

With these building blocks in place, it's now possible to inject and execute code into another process.

The most popular technique is DLL injection, in which the CWSandbox puts all custom code into a DLL and the hook function directs the target process to load this DLL into its memory. Thus, DLL injection fulfills both requirements for API hooking: the custom hook functions are loaded into the target's address space, and the API hooks are installed in the DLL's initialization routine, which the Windows loader calls automatically.

The API functions LoadLibrary or LoadLibraryEx perform the explicit DLL linking; the latter allows more options, whereas the first function's signature is very simple: the only parameter it needs is a pointer to the DLL name.

The trick is to create a new thread in the target process's context using the CreateRemoteThread function and then setting the code address of the API function LoadLibrary as the newly created thread's starting address. When the to-be-analyzed application executes the new thread, the LoadLibrary function is called automatically inside the target's context. Because we know kernel32.dll's location from our starter application (it is loaded at the same address in every process; on the Windows versions of the time that address was also fixed across reboots, whereas with ASLR, introduced in Windows Vista, it is randomized once per boot but still shared by all processes, so reading it in the injector keeps working), and know the LoadLibrary function's code location, we can also use these values for the target application.

CWSandbox architecture
With the three techniques we described earlier set up, we can now build the CWSandbox system that's capable of automatically analyzing a malware sample. This system outputs a behavior-based analysis; that is, it executes the malware binary in a controlled environment so that we can observe all relevant function calls to the Windows API, and generates a high-level summarized report from the monitored API calls. The report provides data for each process and its associated actions: one subsection for all accesses to the file system and another for all network operations, for example. One of our focuses is on bot analysis, so we spent considerable effort on extracting and evaluating the network connection data.

After it analyzes the API calls' parameters, the sandbox routes them back to their original API functions. Therefore, it doesn't block the malware from integrating itself into the target operating system (copying itself to the Windows system directory, for example, or adding new registry keys). To enable fast automated analysis, we execute the CWSandbox in a virtual environment so that the system can easily return to a clean state after completing the analysis process. This approach has some drawbacks, namely detectability issues and slower execution, but using CWSandbox in a native environment such as a normal commercial off-the-shelf system with an automated procedure that restores the system to a clean state can help circumvent these drawbacks.

The CWSandbox has three phases: initialization, execution, and analysis. We discuss each phase in more detail in the following sections.

Initialization phase
In the initialization phase, the sandbox, which consists of the cwsandbox.exe application and the cwmonitor.dll DLL, sets up the malware process. This DLL installs the API hooks, realizes the hook functions, and exchanges runtime information with the sandbox.

The DLL's life cycle is also divided into three phases: initialization, execution, and finishing. The DLL's main function is to handle the first and last phases; the hook functions handle the execution phase. DLL operations are executed during the initialization and finishing

(Excerpt from Willems, Holz and Freiling, CWSandbox, IEEE Security & Privacy 5(2), 2007, p. 35; www.computer.org/security/)

Figure 2. CWSandbox overview. CWSandbox.exe creates a new process image for the to-be-analyzed malware binary and then injects the cwmonitor.dll into the target application's address space. With the help of the DLL, we perform API hooking and send all observed behavior via the communication channel back to cwsandbox.exe. We use the same procedure for child or infected processes.

[Figure 2 layout: cwsandbox.exe executes the malware application, which carries an injected cwmonitor.dll; the malware application executes a child process, which again carries an injected cwmonitor.dll; each cwmonitor.dll has a communication channel back to cwsandbox.exe]

During the initialization of a malware binary, cwmonitor.dll is injected into its memory to carry out API hooking.
The DLL intercepts all API calls and reports them to CWSandbox.
The same procedure is repeated for any child or infected process.


Case study: Malware classification

Goals:
Detect variations of known malware families.
Detect previously unknown families.
Determine essential features of specific families.

Main ideas:
Use AV scanners to assign labels to malware binaries.
Use machine learning to classify binaries among known malware families.

[Pipeline figure: wild -> (exploits) -> collection -> (binary) -> monitoring -> (report) -> classification -> label; in parallel, AV scanner -> label -> learning -> model -> classification]


Malware classification: data acquisition

Binaries are collected from Nepenthes and from spam-traps, and are analyzed by CWSandbox.
Labels are assigned by running Avira AntiVir. Unrecognized binaries are discarded.
14 classes are considered for training: 1 backdoor, 2 trojans and 11 worms.

[Pipeline figure: report corpus -> acquisition of training data -> feature extraction -> training -> classification -> label]


Malware classification: feature extraction

Operational features: the set of all strings contained between the delimiters "<" and ">".
"Wildcarding": removal of potentially random attributes.

Example report excerpt (remote IP address masked):

<copy_file filetype="File" srcfile="c:\1ae8b19ecea1b65705595b245f2971ee.exe" dstfile="C:\WINDOWS\system32\urdvxc.exe" creationdistribution="CREATE_ALWAYS" desiredaccess="FILE_ANY_ACCESS" flags="SECURITY_ANONYMOUS" fileinformationclass="FileBasicInformation"/>
<set_value key="HKEY_CLASSES_ROOT\CLSID\{3534943...2312F5C0&}" data="lsslwhxtettntbkr"/>
<create_process commandline="C:\WINDOWS\system32\urdvxc.exe /start" targetpid="1396" showwindow="SW_HIDE" apifunction="CreateProcessA" successful="1"/>
<create_mutex name="GhostBOT0.58b" owned="1"/>
<connection transportprotocol="TCP" remoteaddr="XXX.XXX.XXX.XXX" remoteport="27555" protocol="IRC" connectionestablished="1" socket="1780"/>
<irc_data username="XP-2398" hostname="XP-2398" servername="0" realname="ADMINISTRATOR" password="r0flc0mz" nick="[P33-DEU-51371]"/>



Malware classification: training

For each known malware family, train a Support Vector Machine separating this family from the others:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{M}\xi_i
\quad\text{s.t.}\quad y_i\big((w\cdot x_i)+b\big)\ge 1-\xi_i,\quad \xi_i\ge 0,\quad i=1,\dots,M.$$

Determine the optimal parameter C by cross-validation.


Classification of unknown malware binaries

Maximum distance: a label is assigned to a new report based on the highest score among the 14 classifiers, each scoring with its own

$$d(x) = (w\cdot x) + b$$

Maximum "likelihood": estimate the conditional probability of class "+1" as

$$P(y=+1\mid d(x)) = \frac{1}{1+\exp\big(A\,d(x)+B\big)},$$

where the parameters A and B are estimated by a logistic regression fit on an independent training data set.

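The following Python is an added example (not the implementation used in the paper) that covers the training slide and the classification slide together: one linear SVM per family is trained for the primal problem above with the Pegasos sub-gradient solver, and new reports are labelled both by maximum distance over d(x) and by maximum "likelihood" with A and B fitted by logistic regression on a separate calibration set. The two-dimensional feature counts, the three families and the calibration points are constructed; the bias b is folded into w as a constant feature, so it is regularized too, which the slide's formulation does not do.

```python
"""Added example: one-vs-rest linear SVMs for malware families, trained with the
Pegasos sub-gradient solver, with max-distance and Platt-style labelling
(not the implementation used in the DIMVA paper)."""
from __future__ import annotations
import math
import random


def train_svm(X: list, y: list, C: float = 10.0, epochs: int = 1000, seed: int = 0) -> list:
    """Pegasos for  min 1/2 ||w||^2 + C sum xi_i  s.t. y_i((w.x_i)+b) >= 1 - xi_i,
    rewritten as lambda/2 ||w||^2 + mean hinge loss with lambda = 1/(C M).
    The bias b is folded in as a constant feature x_0 = 1 (so it is regularized too)."""
    M = len(X)
    lam = 1.0 / (C * M)
    w = [0.0] * (len(X[0]) + 1)
    rng = random.Random(seed)
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(M), M):
            t += 1
            eta = 1.0 / (lam * t)
            xi = [1.0] + list(X[i])
            violated = y[i] * sum(a * b for a, b in zip(w, xi)) < 1   # inside the margin
            w = [(1.0 - eta * lam) * a for a in w]                    # regularizer step
            if violated:
                w = [a + eta * y[i] * b for a, b in zip(w, xi)]       # hinge sub-gradient
    return w


def decision(w: list, x: list) -> float:
    """d(x) = (w . x) + b, with b kept in w[0]."""
    return w[0] + sum(a * b for a, b in zip(w[1:], x))


def platt_prob(d: float, A: float, B: float) -> float:
    """P(y = +1 | d(x)) = 1 / (1 + exp(A d(x) + B)), exponent clamped to stay finite."""
    z = max(-500.0, min(500.0, A * d + B))
    return 1.0 / (1.0 + math.exp(z))


def fit_platt(d_values: list, labels: list, lr: float = 0.1, steps: int = 3000) -> tuple:
    """Logistic-regression fit of A and B by gradient descent on the log-loss;
    as on the slide, use data that the SVM was not trained on."""
    A, B = 0.0, 0.0
    targets = [1.0 if lbl == 1 else 0.0 for lbl in labels]
    for _ in range(steps):
        g_A = g_B = 0.0
        for d, tgt in zip(d_values, targets):
            err = platt_prob(d, A, B) - tgt     # dL/dz for z = -(A d + B)
            g_A += err * d
            g_B += err
        A += lr * g_A                           # dL/dA = -(p - t) d, so descent adds it
        B += lr * g_B
    return A, B


def train_one_vs_rest(X: list, y: list, n_classes: int) -> list:
    """One SVM per family: that family is +1, every other family is -1."""
    return [train_svm(X, [1 if yi == k else -1 for yi in y]) for k in range(n_classes)]


def predict_max_distance(models: list, x: list) -> int:
    return max(range(len(models)), key=lambda k: decision(models[k], x))


def predict_max_likelihood(models: list, platt: list, x: list) -> int:
    return max(range(len(models)),
               key=lambda k: platt_prob(decision(models[k], x), *platt[k]))


# constructed toy data: two wildcarded-feature counts per report, three families
TRAIN_X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 0], [6, 0], [5, 1], [6, 1],
           [0, 5], [1, 5], [0, 6], [1, 6]]
TRAIN_Y = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
CALIB_X = [[0.5, 0.5], [1, 0.5], [5.5, 0.5], [6, 0.5], [0.5, 5.5], [1, 6]]  # separate set for A, B
CALIB_Y = [0, 0, 1, 1, 2, 2]


def run_demo() -> dict:
    """Train the three classifiers, calibrate them, label three new reports both ways."""
    models = train_one_vs_rest(TRAIN_X, TRAIN_Y, 3)
    platt = [fit_platt([decision(models[k], x) for x in CALIB_X],
                       [1 if yk == k else -1 for yk in CALIB_Y]) for k in range(3)]
    new = [[0.5, 0.5], [7, 1], [1, 7]]
    return {"models": models, "platt": platt,
            "max_distance": [predict_max_distance(models, x) for x in new],
            "max_likelihood": [predict_max_likelihood(models, platt, x) for x in new]}


if __name__ == "__main__":
    print(run_demo())
```

On the toy data both rules agree; they differ in practice when the per-family classifiers produce scores on different scales, which is what the sigmoid calibration with A and B corrects before the argmax.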


Results: known malware instances

Test binaries are drawn from the same 14 familiesrecognized by AntiVir.

[Figure: (a) accuracy of classification per malware family (families 1-14, accuracy 0-1); (b) confusion matrix of classification, true vs. predicted malware families (colour scale 0.1-0.9)]

Average accuracy: 88%.


Results: unknown malware instances

Test binaries are drawn from the same 14 familiesrecognized by AntiVir four weeks later.

[Figure: (c) accuracy of prediction per malware family (only families 1-4, 7-11, 13 and 14 had unknown instances; accuracy 0-1); (d) confusion matrix of prediction, true vs. predicted malware families (colour scale 0.1-0.9)]

Average accuracy: 69%.


Lessons learned

Malware collection is a crucial prerequisite for understanding new malware threats and the development of appropriate protection tools.
The main difficulty of malware collection lies in having to deal with highly dynamic and heterogeneous exploitation techniques.
Malware analysis significantly facilitates malware understanding and the development of protection mechanisms.
The main technique of malware analysis is execution of malware in a specially instrumented environment.


Recommended reading

Paul Baecher, Markus Koetter, Thorsten Holz, Maximillian Dornseif, and Felix C. Freiling. The Nepenthes platform: An efficient approach to collect malware. In Recent Advances in Intrusion Detection (RAID), pages 165–184, 2006.

Corrado Leita, Marc Dacier, and Frédéric Massicotte. Automatic handling of protocol dependencies and reaction to 0-day attacks with ScriptGen based honeypots. In Recent Advances in Intrusion Detection (RAID), pages 185–205, 2006.

Niels Provos and Thorsten Holz. Virtual Honeypots: From Botnet Tracking to Intrusion Detection. Addison Wesley, 2007.

Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel Laskov. Learning and classification of malware behavior. In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc. of 5th DIMVA Conference, pages 108–125, 2008.

Yi-Min Wang, Doug Beck, Xuxian Jiang, and Roussi Roussev. Automated web patrol with Strider HoneyMonkeys: Finding web sites that exploit browser vulnerabilities. In Proc. of Network and Distributed System Security Symposium (NDSS), 2006.

Carsten Willems, Thorsten Holz, and Felix Freiling. CWSandbox: Towards automated dynamic binary analysis. IEEE Security and Privacy, 5(2):32–39, 2007.