intrusion detection and malware analysis - malware collection...honeypot taxonomy...
TRANSCRIPT
Intrusion Detection and Malware AnalysisMalware collection
Pavel LaskovWilhelm Schickard Institute for Computer Science
Motivation for malware collection
Understanding vulnerabilities and attack techniquesDevelopment of protection and neutralization toolsUnderstanding the attacker communities and their “businessmodels”.
Malware collection tools
Honeypot: an isolated, unprotected and monitored system,containing seemingly valuable for attacker resources, aimedat collecting examples of malicious activity.Honeyclient: an automated client-side vulnerable systemexecuted in a controlled environment.Honeynet: a distributed collection of honeypots and emailfilters intended for a large-scale collection and observationofmalware
Honeypot taxonomy
Low-interaction: simple daemonssimulating network services; noexploitation.Medium-interaction: emulatedvulnerabilities for attracting andexecuting malware in a controlledenvironment.High-interaction: real systemscommunicating with malware in acontrolled environment.
Low-interaction honeypots
(+) Low security risks due to emulation(+) Simple installation and recovery(+) Suitable for analysis of automatic attacks(+) High scalability(−) Not suitable for detection of interactive attacks due to limited
emulated functionality(−) Hardly suitable for acquisition of malware binaries
High-interaction honeypots
(+) Suitable for detection and acquision of any malware kinds(−) Time and resource consuming installation and maintenance(−) High security risks: additional security mechanisms are
required(−) Virtualization can be detected by malware
Medium-interaction honeypots
(+) A relatively wide exploit coverage(+) Extensive monitoring and collection functionality(+) Full virtualization not necessary(+) Relative ease of deployment and maintenance(+) Low to moderate security risks (egress outbreak)(−) Manual emulation of vulnerabiliies still necessary(−) Detection of novel exploits not always reliable
A honeypot example: Nepenthes170 P. Baecher et al.
Fig. 1. Concept behind nepenthes platform
– Submission modules take care of the downloaded malware, e.g., by savingthe binary to a hard disc, storing it in a database, or sending it to anti-virusvendors.
– Logging modules log information about the emulation process and help ingetting an overview of patterns in the collected data.
In addition, several further components are important for the functionalityand efficiency of the nepenthes platform: shell emulation, a virtual filesystem foreach emulated shell, geolocation modules, sniffing modules to learn more aboutnew activity on specified ports, and asynchronous DNS resolution.
The schematic interaction between the different components is depicted inFigure 1 and we introduce the different building blocks in the next paragraphs.
Vulnerability modules are the main factor of the nepenthes platform. They en-able an effective mechanism to collect malware. The main idea behind these mod-ules is the following observation: in order to get infected by autonomous spreadingmalware, it is sufficient to only emulate the necessary parts of a vulnerable ser-vice. So instead of emulating thewhole service,we only need to emulate the relevantparts and thus are able to efficiently implement this emulation. Moreover, this con-cepts leads to a scalable architecture and the possibility of large-scale deploymentdue to only moderate requirements on processing resources and memory. Often theemulation can be very simple: we just need to provide some minimal information atcertain offsets in the network flow during the exploitation process. This is enoughto fool the autonomous spreading malware and make it believe that it can actu-ally exploit our honeypot. This is an example of the deception techniques used inhoneypot-based research. With the help of vulnerability modules we trigger an in-coming exploitation attempt and eventually we receive the actual payload, whichis then passed to the next type of modules.
Shellcode parsing modules analyze the received payload and extract automat-ically relevant information about the exploitation attempt. Currently, only one
Vulnerability modules: emulate vulnerable parts of networkservices.Shellcode parsers: analyse shellcode to locate its source.Fetch modules: download binaries from remote locations.Submission modules: store binaries in a specified location.
Nepenthes vulnerability modules
Poor man’s implementation of the original vulnerabilitySend N fixed strings, random junks, exploit “stages”Dismiss intermediate received stagesRecord final stage and use in payload
Example:
ConsumeLevel LSASSDialogue::incomingData(Message *msg) {
m_buffer->add(msg->getMsg(), msg->getSize());
char reply[512];
for (int32_t i = 0; i < 512; i++) {
reply[i] = rand() % 256;
}
}
Nepenthes shellcode analysis
Analyze the incoming payload and extract malware locationshellcode-signature module
Signature-controlled shellcode analyzerPerl-compatible RE patterns for commonly seen shellcodeIdentify parameters of shellcode (ports, URIs, . . .)
Shell emulator with arbitrary commands
Nepenthes download module
Download the actual malware from previously generatedURLSeveral modules for various protocol:
HTTP(S), (T)FTP, RCP, . . .“Proprietary” malware protocols:
I CSend and CReceive from AgoBotI LinkBind and LinkConnectback from linkbot
RFC-incompliant implementations of HTTP and FTP
Honeyclients
Objective: detection of attacks directed at client-sidesoftware, mostly web browsers:
browser exploits“drive-by downloads”typo-squatting
Applications:security analysis of web sitesfinding malicious content distribution sitesdetection of new browser exploitsmalware collection
Browser exploitation via redirection
Source: Yi-Min Wang, Microsoft Research
Honeyclient example: HoneyMonkey
A VM-based high interactionhoneyclient, running a vulnerablebrowser.Automatic detection of redirectionrelationships between contentdistribution sitesDetection of zero-day attacks
HoneyMonkey architecture
Stage 1: N URLs per VM,unpatched WinXP,
no redirection analysis
Stage 1: N URLs per VM,unpatched WinXP,
no redirection analysis
Stage 1: N URLs per VM,unpatched WinXP,
no redirection analysis
Stage 2: 1 URL per VM,unpatched WinXP SP2,
redirection analysis
Stage 3: 1 URL per VM,patched WinXP SP2,redirection analysis
“Interesting” URLs
Zero-day exploits
Exploit URLs
Exploit URL topologygraph
Analysis of exploitURL density
Access blocking
Vulnerability patching
HoneyMonkey deployment results
Results were obtained in May-June 2005 on a list of 16,190 URLswith known bad content (pornography, adware distribution, someshopping and freeware screensaver sites).
HoneyMonkey configuration Exploit num./freq.Stage 1, fully unpatched 207 (1.3%)Stage 2, fully unpatched (SP1) 688 (4.2%)Stage 2, fully unpatched (SP2) 204 (1.3%)Stage 3, SP2 partially patched 17 (0.1%)Stage 3, SP2 fully patched 0 (0%)
In July 2005, 27 URLs were discovered that distributed azero-day exploit.
From honeypots to honeynets
Goals:Wide coverage of up-to-date “malware landscape”Fast discovery new malware strains
Challenges:Maintenance: deployment by less qualified administratorsSecurity: avoid potential infection of host systemsAutomation: adjust to potentially unknown vulnerabilitiesScalability: infrastructure for storing massive amounts ofmalwareUtility: interface for analysis toolsStealth: should not be detectable by malware
Honeynet example: SGNET
Message extraction from TCP flowsGeneration and refinement of a finite state machine modelfor a communication protocol used by malwareGeneration of a honeyd-compatible script for implementationof a finite-state machine.Communication interface for interaction with the repositoryand analysis components.
What do we want to know about malware?
Is it recognized by existing antivirus products?What is its functionality?
How is malware distributed?What other harmful functions does malware carry out?
What are relationships between various classes of malware?Do they share common techniques? How do they evolve?
How much of malware is unknown?
Experiment:Current instances of malware were collected from a Nepentheshoneypot.Files were scanned with Avira AntiVir.
Results:First scan:
detected76%
undetected 24%
Second scan:
detected85%
undetected15%
After four weeks 15% of malware instances were stillnot recognized!
How much of malware is unknown?
Experiment:Current instances of malware were collected from a Nepentheshoneypot.Files were scanned with Avira AntiVir.
Results:First scan:
detected76%
undetected 24%
Second scan:
detected85%
undetected15%
After four weeks 15% of malware instances were stillnot recognized!
How much of malware is unknown?
Experiment:Current instances of malware were collected from a Nepentheshoneypot.Files were scanned with Avira AntiVir.
Results:First scan:
detected76%
undetected 24%
Second scan:
detected85%
undetected15%
After four weeks 15% of malware instances were stillnot recognized!
CWSandbox: system architectureMalware
virtual memory and executing code in a differentprocess’s context.
Windows kernel32.dll offers the API functionsReadProcessMemory and WriteProcessMemory,which lets the CWSandbox read and write to an arbi-trary process’s virtual memory, allocating new memoryregions or changing an already allocated memory re-gion’s using the VirtualAllocEx and Virtual-ProtectEx functions.
It’s possible to execute code in another process’s con-text in at least two ways:
• suspend one of the target application’s running threads,copy the to-be-executed code into the target’s addressspace, set the resumed thread’s instruction pointer to thecopied code’s location, and then resume the thread; or
• copy the to-be-executed code into the target’s addressspace and create a new thread in the target process withthe code location as the start address.
With these building blocks in place, it’s now possible toinject and execute code into another process.
The most popular technique is DLL injection, inwhich the CWSandbox puts all custom code into a DLLand the hook function directs the target process to loadthis DLL into its memory. Thus, DLL injection fulfillsboth requirements for API hooking: the custom hookfunctions are loaded into the target’s address space, andthe API hooks are installed in the DLL’s initialization rou-tine, which the Windows loader calls automatically.
The API functions LoadLibraryor LoadLibrary-Ex perform the explicit DLL linking; the latter allowsmore options, whereas the first function’s signature isvery simple—the only parameter it needs is a pointer tothe DLL name.
The trick is to create a new thread in the targetprocess’s context using the CreateRemoteThreadfunction and then setting the code address of the APIfunction LoadLibrary as the newly created thread’sstarting address. When the to-be-analyzed applicationexecutes the new thread, the LoadLibrary function iscalled automatically inside the target’s context. Becausewe know kernel32.dll’s location (always loaded atthe same memory address) from our starter application,and know the LoadLibrary function’s code location,we can also use these values for the target application.
CWSandbox architectureWith the three techniques we described earlier set up, wecan now build the CWSandbox system that’s capable ofautomatically analyzing a malware sample. This systemoutputs a behavior-based analysis; that is, it executes themalware binary in a controlled environment so that wecan observe all relevant function calls to the WindowsAPI, and generates a high-level summarized report from
the monitored API calls. The report provides data foreach process and its associated actions—one subsectionfor all accesses to the file system and another for all net-work operations, for example. One of our focuses is onbot analysis, so we spent considerable effort on extractingand evaluating the network connection data.
After it analyzes the API calls’ parameters, the sand-box routes them back to their original API functions.Therefore, it doesn’t block the malware from integratingitself into the target operating system—copying itself tothe Windows system directory, for example, or addingnew registry keys. To enable fast automated analysis, weexecute the CWSandbox in a virtual environment so thatthe system can easily return to a clean state after complet-ing the analysis process. This approach has some draw-backs—namely, detectability issues and slowerexecution—but using CWSandbox in a native environ-ment such as a normal commercial off-the-shelf systemwith an automated procedure that restores the system to aclean state can help circumvent these drawbacks.
The CWSandbox has three phases: initialization, exe-cution, and analysis. We discuss each phase in more detailin the following sections.
Initialization phaseIn the initialization phase, the sandbox, which consists ofthe cwsandbox.exe application and the cwmoni-tor.dll DLL, sets up the malware process. This DLLinstalls the API hooks, realizes the hook functions, andexchanges runtime information with the sandbox.
The DLL’s life cycle is also divided into three phases:initialization, execution, and finishing. The DLL’s mainfunction is to handle the first and last phases; the hookfunctions handle the execution phase. DLL operationsare executed during the initialization and finishing
www.computer.org/security/ ■ IEEE SECURITY & PRIVACY 35
Figure 2. CWSandbox overview. CWSandbox.exe creates a newprocess image for the to-be-analyzed malware binary and then injectsthe cwmonitor.dll into the target application’s address space. Withthe help of the DLL, we perform API hooking and send all observedbehavior via the communication channel back to cwsandbox.exe. Weuse the same procedure for child or infected processes.
Malware application child
cwmonitor.dll
Executes
Communication
Executes
Malware application
cwmonitor.dll
cwsandbox.exe
Communication
During the initialization of a malware binary cwmonitor.dll
is injected into its memory to carry out API hooking.DLL intercepts all API calls and reports them toCWSandbox.The same procedure is repeated for any child or infectedprocess.
Case study: Malware classification
Goals:Detect variations of known malware families.Detect previusly unknown families.Determine essential features of specific families.
Main ideas:Use AV scanners to assign labels to malware binaries.Use machine learning to classify binaries among knownmalware families.
Wild Collectionexploits
Monitoringbinary report
Classification
AV scanner Learninglabel
model
label
Malware classification: data acquisision
Binaries are collected from Nepenthesand from spam-traps, and are analyzedby CWSandbox.Labels are assigned by running AviraAntiVir. Unrecognized binaries arediscarded.14 classes are considered for training: 1backdoor, 2 trojans and 11 worms.
Report corpus
Acquisition oftraining data
Feature extraction
Training
Classification
label
Malware classification: feature extraction
Operational features: a set of all stringscontained between delimeters “<” and “>”.“Wildcarding”: removal of potentiallyrandom attributes.
<copy_file filetype="File" srcfile="c:\1ae8b19ecea1b65705595b245f2971ee.exe"
dstfile="C:\WINDOWS\system32\urdvxc.exe"
creationdistribution="CREATE_ALWAYS" desiredaccess="FILE_ANY_ACCESS"
flags="SECURITY_ANONYMOUS" fileinformationclass="FileBasicInformation"/>
<set_value key="HKEY_CLASSES_ROOT\CLSID\{3534943...2312F5C0&}"
data="lsslwhxtettntbkr"/>
<create_process commandline="C:\WINDOWS\system32\urdvxc.exe /start"
targetpid="1396" showwindow="SW_HIDE"
apifunction="CreateProcessA" successful="1"/>
<create_mutex name="GhostBOT0.58b" owned="1"/>
<connection transportprotocol="TCP" remoteaddr="XXX.XXX.XXX.XXX"
remoteport="27555" protocol="IRC" connectionestablished="1" socket="1780"/>
<irc_data username="XP-2398" hostname="XP-2398" servername="0"
realname="ADMINISTRATOR" password="r0flc0mz" nick="[P33-DEU-51371]"/>
Report corpus
Acquisition oftraining data
Feature extraction
Training
Classification
label
Malware classification: training
For each known malware family, train aSupport Vector Machine for separatingthis family for the others:
minw,b
12||w||2+C
M
∑i=0
ξi
s.t. yi((w·xi) + b) ≥ 1−ξi, i = 1, . . . , M.ξi ≥ 0
Determine the optimal parameter C bycross-validation
Classification of unknown malware binaries
Maximum distance: a label is assignedto a new report based on the highestscore among the 14 classfiers:
d(x) = (w · x) + b
Maximum “likelihood”: estimateconditional probability of class “+1” as:
P(y = +1 | d(x)) =1
1 + exp(Ad(x) + B)
where parameters A and B areestimated by a logistic regression fit onan independent training data set.
Report corpus
Acquisition oftraining data
Feature extraction
Training
Classification
label
Results: known malware instances
Test binaries are drawn from the same 14 familiesrecognized by AntiVir.
1 2 3 4 5 6 7 8 9 10 11 12 13 140
0.2
0.4
0.6
0.8
1
Malware families
Acc
urac
y pe
r fam
ily
Accuracy of classification
(a) Accuracy per malware familyTr
ue m
alw
are
fam
ilies
Predicted malware families
Confusion matrix for classification
1 2 3 4 5 6 7 8 9 10 11 12 13 14
123456789
1011121314
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
(b) Confusion of malware families
Average accuracy: 88%.
Results: unknown malware instances
Test binaries are drawn from the same 14 familiesrecognized by AntiVir four weeks later.
1 2 3 4 7 8 9 10 11 13 140
0.2
0.4
0.6
0.8
1
Malware families
Acc
urac
y pe
r fam
ily
Accuracy of prediction
(c) Accuracy per malware familyTr
ue m
alw
are
fam
ilies
Predicted malware families
Confusion matrix for prediction
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1
2
3
4
7
8
9
10
11
13
140.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
(d) Confusion of malware families
Average accuracy: 69%.
Lessons learned
Malware collection is a crucial prerequisite for understandingnew malware threats and development of appropriateprotection tools.The main difficulty of malware collection lies in having todeal with highly dynamic and heterogeneous exploitationtechniques.Malware analysis significantly facilitates malwareunderstanding and the development of protectionmechanisms.The main technique of malware analysis is execution ofmalware in a specially instrumented environment.
Recommended reading
Paul Baecher, Markus Koetter, Thorsten Holz, Maximillian Dornseif, and Felix C.Freiling.The Nepenthes platform: An efficient approach to collect malware.In Recent Adances in Intrusion Detection (RAID), pages 165–184, 2006.
Corrado Leita, Marc Dacier, and Frédéric Massicotte.Automatic handling of protocol dependencies and reaction to 0-day attacks withScriptGen based honeypots.In Recent Adances in Intrusion Detection (RAID), pages 185–205, 2006.
Niels Provos and Thorsten Holz.Virtual Honeypots: From Botnet Tracking to Intrusion Detection.Addison Wesley, 2007.
Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel Laskov.Learning and classification of malware behavior.In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc. of 5thDIMVA Conference, pages 108–125, 2008.
Yi-Ming Wang, Doug Beck, Xuxuan Jiang, and Roussi Roussev.Automated web patrol with Strider HoneyMonkeys: Finding web sites that exploitbrowser vulnerabilities.In Proc. of Network and Distributed System Security Symposium (NDSS), 2006.
Carsten Willems, Thorsten Holz, and Felix Freiling.CWSandbox: Towards automated dynamic binary analysis.IEEE Security and Privacy, 5(2):32–39, 2007.