Detecting Malicious Activity and Malware on a Large Network

Brandon Enright – Cisco Computer Security Response Team

About me:
• Hacker
• Problem seeker and solver
• Linux user
• Extreme twisty puzzle enthusiast
• Nmap evangelist
• Armchair physicist
• Mad scientist
• Crypto nerd



About Cisco:

400 Sites

In 100 Countries



About Cisco:

2010 numbers -- Doesn’t include or



About Cisco:

40,000 routers and switches on the network



About Cisco:

… wait... WTF? 40,000 routers on the network?

Yeah. It’s the Cisco way:

“Is there any chance a router or switch will kinda sorta almost maybe solve part of my problem?”

The answer is yes. Install a router.



Big Networks are Hard



Big networks are hard:

• Every version of software under the sun
• BYOD (Bring Your Own Device)
• Every version of smart (and dumb) phone ever made
• Thousands of VPN users at all times
• The sun never sets on the network – no down time
• Network logs exceed the size manageable by single-system solutions (> 1Tb / day)

How do you know if you have a big network:
Can you memorize all of your public IP prefixes?
Cisco’s Primary AS announces 74 IPv4 prefixes (1.17M IPs)



If you’re going to do security right you need a LOT of data:

Network layer:
• NetFlow
• Transparent web proxy logs
• IDS alerts

Host level:
• HIPS logs
• AV logs
• IR agent

IT Infrastructure:
• DHCP logs
• DNS logs
• RPZ / Sinkhole logs
• VPN logs
• AAA logs
• Syslog



… and you’re going to need a place to store and search that data

Data

• If you don’t have easy access to almost all of your data in one place you won’t use your data to its fullest



And “Big Data” will solve all of your problems…



And SIEM vendors correlate!

Correlation



WTF is correlation?

If you’re dumb you think:

If you’re smart you think:

If you’re a marketer you think:



This is what correlation actually is:

Web Proxy fields:
• timestamp (date)
• source IP
• source port
• destination IP
• destination port(s)
• URL
• IP reputation
• request type
• referer
• User Agent

HIPS fields:
• timestamp (date)
• source IP
• source port
• destination IP
• destination port(s)
• hostname
• nbtname
• sourcetype
• eventsource
• alerttype

Correlation is just a union, join, intersection, or other basic relation between common fields in different data sets
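A minimal sketch of correlation as a plain join, assuming two small in-memory lists of log records. The field names loosely follow the two lists above; the sample values, the `correlate` helper, and the 60-second window are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records (not real log data).
proxy_logs = [
    {"timestamp": 1000, "source_ip": "10.0.0.5", "url": "http://example.test/a.exe"},
    {"timestamp": 1005, "source_ip": "10.0.0.9", "url": "http://example.test/index.html"},
]
hips_logs = [
    {"timestamp": 1002, "source_ip": "10.0.0.5", "alerttype": "suspicious-process"},
]

def correlate(left, right, key="source_ip", window=60):
    """Inner join on `key`, keeping pairs whose timestamps are within `window` seconds."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            if abs(row["timestamp"] - match["timestamp"]) <= window:
                joined.append((row, match))
    return joined

matches = correlate(proxy_logs, hips_logs)
for proxy_row, hips_row in matches:
    print(proxy_row["source_ip"], proxy_row["url"], hips_row["alerttype"])
```

Nothing here is more exotic than a dictionary lookup on a shared field plus a time window.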

Your will beat anyday.



Fortunately not all hope is lost:

• SIEM “solutions” are almost entirely marketing hype but they are a reasonable way to get at your data

• “Big Data” doesn’t mean anything concrete but big data systems do help you get at your data quickly and easily

This presentation is about going beyond the marketing and canned reports to find

malicious activity on your network.



Gold mining your logs



Investigative versus High Fidelity reports:

• High Fidelity reports are ones that have no realistic chance of producing a false positive and can be fully automated by a computer. No human being needs to “spot-check” the results.

• Investigative reports are pretty much everything else. The goal is always for maximum fidelity but it’s generally not feasible to build a report with perfect results.



The High Fidelity intuition trap:

Be careful labeling a report “High Fidelity”. Bayes’ theorem is an unforgiving mistress. Presumably you have tons of logs, and at that volume the seemingly unlikely happens frequently.

Wikipedia on Bayes’ theorem:

You have a drug test that produces 99% true positive results for drug users and 99% true negative results for non-drug users.

Suppose that 0.5% of people are users of the drug.

If a randomly selected individual tests positive, what is the probability he or she is a user?

33.2% (the other 66.8% of positive results are false positives)
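The arithmetic behind that answer can be reproduced in a few lines. This is a sketch; the function and parameter names are illustrative, and the numbers are the ones from the example above.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                   # users who test positive
    false_pos = (1.0 - specificity) * (1.0 - prior)  # non-users who test positive
    return true_pos / (true_pos + false_pos)

p_user = posterior(prior=0.005, sensitivity=0.99, specificity=0.99)
print(f"P(user | positive) = {p_user:.1%}")
```

Swap "drug user" for "infected host" and "test" for "report": a 99%-accurate report run over a mostly clean population still produces mostly false alarms.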



Talk scope:

This talk is not about “100% effective” ways of finding malicious activity.

Instead it’s about giving you the investigative ideas that should get you started.



HTTP is the Internet

Ask any user and they’ll tell you…



HTTP as a data source:

• To most users, if HTTP is broken then the Internet is useless
• Organizations pretty much universally allow HTTP out
• Even hosts with a RFC1918 address often use HTTP proxies
• The browser and all of its plugins is one of the biggest attack surfaces used by everyone
• HTTP is so ubiquitous it’s practically a transport protocol now

All of these factors (and others) have come together to make the web the most common malware delivery mechanism and HTTP the most common command and control mechanism.

And that makes your HTTP logs one of your most valuable data sources for finding malicious activity!



Web Browsers vs Everything Else:

There are certain things web browsers always do:
• Set a User-Agent: header
• Set a Referer: header when appropriate
• Use HTTP 1.1
• Lots of other idiosyncrasies like “Accept:” and “Connection:”

Start by querying for things that don’t match web browser behavior.
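A rough sketch of that first-pass filter, assuming each proxy log record has already been parsed into a dict. The keys `user_agent` and `http_version` are hypothetical names, not fields from any particular proxy.

```python
def non_browser_reasons(record):
    """Return the reasons a proxy log record does not look browser-generated."""
    reasons = []
    if not record.get("user_agent"):
        reasons.append("missing User-Agent")
    if record.get("http_version") != "HTTP/1.1":
        reasons.append("not HTTP/1.1")
    return reasons

# Made-up records for illustration.
records = [
    {"user_agent": "Mozilla/5.0 (Windows NT 6.1)", "http_version": "HTTP/1.1"},
    {"user_agent": "", "http_version": "HTTP/1.0"},
]
suspects = [r for r in records if non_browser_reasons(r)]
```

A missing Referer is deliberately not flagged on its own, since browsers only send it "when appropriate".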



Web Browsers vs Everything Else (continued):

This activity did not come from browsers:

pwned (click fraud)



Web Browsers are quirky but consistent:

Within a browser version (and often a whole browser family) the quirks stay the same:
• Header order is consistent
• Parameter lists for headers like Accept-Encoding: are generally static
• Header capitalization is consistent

Fake Chrome request (header order is wrong):

GET / HTTP/1.1
Host: www.google.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like […]
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: keep-alive

Quirks are very hard for malware to emulate!
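One way to check header order, sketched under an assumption: `EXPECTED_ORDER` below is an assumed profile for the claimed browser, not an authoritative one. In practice it would be learned from traffic known to come from that browser version.

```python
# Assumed header order for the claimed browser (illustrative, not authoritative).
EXPECTED_ORDER = ["Host", "Connection", "Accept", "User-Agent",
                  "Accept-Encoding", "Accept-Language"]

def order_matches(observed, expected=EXPECTED_ORDER):
    """True if the headers both lists share appear in the same relative order."""
    rank = {name: i for i, name in enumerate(expected)}
    ranks = [rank[name] for name in observed if name in rank]
    return ranks == sorted(ranks)

# Header order taken from the fake Chrome request on this slide.
fake = ["Host", "User-Agent", "Accept-Encoding", "Accept-Language",
        "Accept", "Connection"]
print(order_matches(fake))
```

The comparison is case-sensitive, so a header with unexpected capitalization is simply ignored here; a stricter check could treat it as a mismatch.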



If the browser tells you something, check its story out:

Nice try but that isn’t anywhere close to IE’s User-Agent string.



Sometimes it’s worthwhile to dig even deeper with fact-checking:

Legitimate IE User-Agent strings:

• Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)

• Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)

• Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

• Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C)

Is there any sort of consistency between Mozilla version, IE version, Windows version, and Trident version?



Fact-checking User-Agent strings (continued):

First extract the (sub)fields for processing:

• | rex field=cs_useragent "^Mozilla/(?<mozver>[\d.]+)"

• | rex field=cs_useragent "MSIE (?<iefullver>(?<iever>\d+(\.\d)?)[\d.]*)"

• | rex field=cs_useragent "Trident/(?<triver>[\d.]+)"

• | rex field=cs_useragent "Windows NT (?<ntfullver>(?<ntver>\d+(\.\d)?)[\d.]*)"

In machine learning parlance this is feature extraction.
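The same extraction can be sketched outside Splunk with Python's `re` module. The patterns mirror the `rex` commands above, minus the full-version subfields; the `extract_features` helper is an illustrative name.

```python
import re

# One pattern per feature, mirroring the rex commands.
PATTERNS = {
    "mozver": re.compile(r"^Mozilla/([\d.]+)"),
    "iever": re.compile(r"MSIE (\d+(?:\.\d)?)"),
    "triver": re.compile(r"Trident/([\d.]+)"),
    "ntver": re.compile(r"Windows NT (\d+(?:\.\d)?)"),
}

def extract_features(user_agent):
    """Return a dict of version features; None where a token is absent."""
    features = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(user_agent)
        features[name] = m.group(1) if m else None
    return features

ua = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
features = extract_features(ua)
print(features)
```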




Building a contingency table between the Mozilla and IE version:

index=wsa msie cs_useragent="Mozilla/*msie*" | dedup host | rex field=cs_useragent "^Mozilla/(?<mozver>[\d.]+)" | rex field=cs_useragent "MSIE (?<iefullver>(?<iever>\d+(\.\d+)?)[\d.]*)" | rex field=cs_useragent "Trident/(?<triver>[\d.]+)" | contingency mozver iever

Machine learning models automate this sort of analysis.

Rows are Mozilla version (mozver), columns are IE version (iever):

         7      9      10     6      8      5.5    5      4      11     1
4        16912  520    24     1052   902    504    30     0      1      0
5        0      8510   2842   0      7      0      0      0      0      1
3        0      0      0      0      0      0      0      27     0      0
TOTAL    16912  9030   2866   1052   909    504    30     27     1      1
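For readers without Splunk, a contingency table is just a count of value pairs. A minimal sketch, using made-up (mozver, iever) pairs rather than the real data:

```python
from collections import Counter

def contingency(pairs):
    """Count (row, column) pairs and lay them out as a table."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})  # string sort, so "10.0" precedes "7.0"
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Made-up (mozver, iever) observations for illustration.
observed = [("4.0", "8.0"), ("4.0", "8.0"), ("5.0", "9.0"),
            ("5.0", "10.0"), ("4.0", "7.0")]
rows, cols, table = contingency(observed)
for label, counts_row in zip(rows, table):
    print(label, counts_row)
```

Cells that are empty or nearly empty next to heavily populated ones are where the fake User-Agent strings tend to hide.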




Building a contingency table between the IE and Trident version:

index=wsa msie cs_useragent="Mozilla/*msie*" | dedup host | rex field=cs_useragent "^Mozilla/(?<mozver>[\d.]+)" | rex field=cs_useragent "MSIE (?<iefullver>(?<iever>\d+(\.\d+)?)[\d.]*)" | rex field=cs_useragent "Trident/(?<triver>[\d.]+)" | contingency iever triver

Rows are IE version (iever), columns are Trident version (triver):

         5      6      4      7      3.1
7        6075   3891   574    55     2
9        4784   22     0      1      0
10       1      1727   0      0      0
6        0      0      3      0      0
8        5      1      453    0      0
5.5      0      0      0      0      0
5        0      0      0      0      0
4        0      0      0      0      0
11       0      0      0      0      0
TOTAL    10865  5641   1034   56     2



Fact-checking User-Agent strings (continued):Build other contingency tables and then put the logic together:

index=wsa msie cs_useragent="Mozilla/*msie*" (NOT cs_useragent="*iemobile*")
| dedup host
| rex field=cs_useragent "^Mozilla/(?<mozver>[\d.]+)"
| rex field=cs_useragent "MSIE (?<iefullver>(?<iever>\d+(\.\d)?)[\d.]*)"
| rex field=cs_useragent "Trident/(?<triver>[\d.]+)"
| rex field=cs_useragent "Windows NT (?<ntfullver>(?<ntver>\d+(\.\d)?)[\d.]*)"
| search (NOT cs_useragent="Mozilla/*(compatible;*")
    OR ((mozver < 4) OR (mozver > 5))
    OR ((iever < 6) OR (iever > 10))
    OR ((ntver < 5) OR (ntver > 6.3))
    OR ((mozver="4.0" AND (iever > 8)) OR (mozver="5.0" AND (iever < 9)))
    OR ((triver < 4) OR (triver > 7))
    OR ((iever < 7 AND (triver="*")) OR (iever="8.0" AND (NOT triver="4.0")) OR (iever="9.0" AND (NOT triver="5.0")) OR (iever="10.0" AND (NOT triver="6.0")))
    OR (iefullver="*.*.*" OR ntfullver="*.*.*")
    OR (NOT ntver="*")
    OR ((iever="6.0" AND ntver > 5.1) OR (iever="7.0" AND ntver < 5.1) OR (iever="8.0" AND ntver < 5.1) OR (iever="9.0" AND ntver < 6) OR (iever="10.0" AND ntver < 6))

Logic similar to this is built automatically with machine learning.
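A small subset of those rules, sketched in Python to show the shape of the logic. Only the Mozilla/IE and IE/Trident checks are included; the helper names are illustrative and this is not a full port of the query.

```python
import re

# Expected Trident version per IE version, taken from the query above.
TRIDENT_FOR_IE = {"8.0": "4.0", "9.0": "5.0", "10.0": "6.0"}

def first_group(pattern, text):
    m = re.search(pattern, text)
    return m.group(1) if m else None

def inconsistencies(ua):
    """Return a list of reasons the version tokens in `ua` disagree."""
    moz = first_group(r"^Mozilla/([\d.]+)", ua)
    ie = first_group(r"MSIE (\d+(?:\.\d)?)", ua)
    tri = first_group(r"Trident/([\d.]+)", ua)
    if moz is None or ie is None:
        return ["not an MSIE-style User-Agent"]
    problems = []
    if moz == "4.0" and float(ie) > 8:
        problems.append("Mozilla/4.0 with IE newer than 8")
    if moz == "5.0" and float(ie) < 9:
        problems.append("Mozilla/5.0 with IE older than 9")
    expected = TRIDENT_FOR_IE.get(ie)
    if expected is not None and tri != expected:
        problems.append(f"IE {ie} should report Trident/{expected}")
    return problems

good = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
fake = "Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/4.0)"
print(inconsistencies(good), inconsistencies(fake))
```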



When you see it, you know it’s bad:



So ask yourself, what would “bad” look like?

index=wsa dosexec java (NOT cs_url="*.exe") | dedup host

http://pacsd.melinert.org/r0vTmK-0OfJB07ey/20hdj/80XDJH0/PJd-A0xNrk/15DH1/0zz-gb1/2TWd/0LNuV/0iaBa_12TNk0-UlY_n08rz-T0Uay/90xxmM0B-r880PHRM_0m3TB0_9ZzP0fO_JA0zwxW-0Hh-e50BKiA0mcHu/0Y_jmM0iN-jt02XM_00oD4f0H_mOM0QZTp_17BW30YfWI-0IWU9_0p-FkN0_kqeh_0mNey0MN-go0/QoTO0p_rWJ0/xhoB_0q4/Vy/0XouZ-02op-F0l8b/S0g2_NE15_dkL0QAB_50VvS_d15L0_20nD5k/14Jra-0w1/Rs_0yn7/H0J-Lts07-GmE0s7M_d0_zkD00_qEd/Y0u5ER/ZTVyJa0mSV.exe?IeLtBYmZ4cZC=73b6a&h=11

http://www2.nq8x6r92.4pu.com/?90xcqmmo=XZPlx67S5dSU5qHcc6NqZHBnntfl2arRn6RuqGaja5Rpkp7X5KqeoqalabFnq2xoX5zroKKdnaeU0praqK%2BLhoRW6MzVqqCV4ZU%3D&h=16



If it looks bad… turn it into a specific query:

index=wsa dosexec java (NOT cs_url="*.exe") | regex cs_url="^http://[^/]+(/[a-zA-Z0-9_-]+){8,32}[^\?]+\?[a-zA-Z0-9]+=[a-zA-Z0-9]{4,8}&h=\d+$"
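Before deploying a pattern like this it helps to sanity-check it against a URL that should match and one that should not. A sketch in Python using the same regular expression; both test URLs are synthetic, with the first shaped like the long sample URL above (many short path segments, then `.exe?key=value&h=N`).

```python
import re

EXPLOIT_URL = re.compile(
    r"^http://[^/]+(/[a-zA-Z0-9_-]+){8,32}[^\?]+\?[a-zA-Z0-9]+=[a-zA-Z0-9]{4,8}&h=\d+$"
)

# Synthetic URLs for illustration only.
suspicious = ("http://bad.example/" + "/".join(["aB1-_x"] * 10)
              + "/payload.exe?key=73b6a&h=11")
benign = "http://www.example.com/index.html?id=1234&h=5"

print(bool(EXPLOIT_URL.search(suspicious)), bool(EXPLOIT_URL.search(benign)))
```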

[Figure: a chain of linked Pattern, IP, and Domain nodes, labeled “Connect the dots”]



So what else looks bad?

How about a POST to an IP address running a PHP script that takes a parameter with no Referer?

index=wsa post php cs_url="*.php?*" "ip address" (NOT cs_referer="*") cs_method="POST" | regex cs_url="^http://(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/" | dedup s_ip | dedup host



Here is what turns up:

http://184.72.43.99/index.php?broswer=7020449211b5ce1d1

http://64.62.146.102/showthread.php?t=256534570

[Figure: “Build Patterns”: a chain of Pattern, Domain, and IP nodes, labeled “Connect the dots”]
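The "POST to a bare IP, PHP script with a parameter, no Referer" heuristic can also be written as a predicate over parsed log fields. A sketch using only the standard library; the function name and argument layout are assumptions for illustration.

```python
import ipaddress
from urllib.parse import urlsplit

def is_suspicious_post(method, url, referer):
    """True for a POST with no Referer to http://<IPv4>/...php?<params>."""
    if method != "POST" or referer:
        return False
    parts = urlsplit(url)
    if not parts.path.endswith(".php") or not parts.query:
        return False
    try:
        ipaddress.IPv4Address(parts.hostname or "")
    except ValueError:
        return False  # host is a name (or empty), not a dotted-quad address
    return True

print(is_suspicious_post("POST", "http://184.72.43.99/index.php?broswer=7020449211b5ce1d1", ""))
```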



Humans are not precision machines



Do X every Y seconds:

Human: Uh okay…?

Computer: No problem.

This is not a human



Time deltas and statistics are your friend:

Find the time gaps, do statistics, profit.

index=wsa POST (NOT (cs_referer="*")) (NOT "TCP_MEM_HIT") (x_wbrs_score < 0.0) | rex field=cs_url "http:\/\/(?<domain>[^\/]+)" | strcat host "_to_" domain sd | streamstats current=f last(_time) as next_time by host | eval gap = next_time -_time | stats count avg(gap) as avgg var(gap) as varg values(domain) as domain by sd | eval varavg = (varg / avgg) | search (count >= 10) (avgg > 10) (varavg < 0.05) | table domain count avgg varg varavg | sort varavg

To be honest, Splunk is not the right tool for the job here.
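Outside Splunk the same statistic takes a few lines. A sketch of the gap analysis in Python, assuming the timestamps for one host-to-domain pair have already been collected; the sample timestamps are made up, and the 0.05 threshold is the one from the query above.

```python
from statistics import mean, variance

def periodicity_score(timestamps):
    """Variance-to-mean ratio of gaps between consecutive events (lower = more regular)."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None  # not enough events to say anything
    avg = mean(gaps)
    if avg == 0:
        return None
    return variance(gaps) / avg

# Made-up event times in seconds.
beacon = [i * 300 for i in range(12)]  # every 5 minutes exactly
human = [0, 4, 95, 110, 600, 640, 2000, 2010, 2900, 5000, 5003, 7000]

for name, ts in (("beacon", beacon), ("human", human)):
    score = periodicity_score(ts)
    print(name, "periodic" if score is not None and score < 0.05 else "irregular")
```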



The top periodic activities table:

Periodic activity by itself only says non-human, not malicious. Must be coupled with additional analysis.



What does machine-generated activity look like?

[Figure: activity plotted by Second of Minute against Minute of Hour]

Check out “Detecting and Analyzing Automated Activity on Twitter” by Chao Michael Zhang and Vern Paxson



DNS is the Lifeblood of Everything



You should capture DNS queries:

• Humans use names

• Domain names are very inexpensive

• Provides a layer of indirection which increases resiliency

• Makes simple blocking a bit harder

• Allows things like Fast Flux



If you don’t capture answers you can use DNSDB:

https://www.dnsdb.info/



Set operations can give you the context you need:

DNS is a good starting-point for detection but often is just the tip of the iceberg of data contained in your other logs.

[Venn diagram of three sets: Bad Machine 1, Bad Machine 2, and Known-Good Machine, with a region labeled “Probably Bad Stuff”]
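One plausible reading of that diagram, sketched with Python sets: domains looked up by both bad machines but not by the known-good one. The domain names are made up, and the choice of region is an assumption about what the figure showed.

```python
# Hypothetical sets of domains queried by each machine.
bad_machine_1 = {"update.example.com", "cdn.example.net", "xkqzv.example.org", "c2.example.biz"}
bad_machine_2 = {"update.example.com", "mail.example.com", "xkqzv.example.org", "c2.example.biz"}
known_good = {"update.example.com", "cdn.example.net", "mail.example.com"}

# Intersection of the bad machines, minus whatever a clean machine also does.
probably_bad = (bad_machine_1 & bad_machine_2) - known_good
print(sorted(probably_bad))
```

Subtracting the known-good set removes the shared background noise (updates, CDNs) that every machine generates.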



Follow the DNS graph:

[Graph: a Bad Domain node linked to several Bad IP nodes, which link in turn to further Bad Domain nodes]
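Following the graph amounts to a breadth-first walk that alternates between "what IPs did this domain resolve to" and "what other domains resolved to this IP". A sketch over a hypothetical in-memory passive-DNS mapping; the `RESOLUTIONS` data and `expand` helper are illustrative and do not represent the DNSDB API.

```python
from collections import deque

# Hypothetical passive-DNS data: domain -> IPs it has resolved to.
RESOLUTIONS = {
    "bad-seed.example": {"203.0.113.10", "203.0.113.11"},
    "sibling-a.example": {"203.0.113.10"},
    "sibling-b.example": {"203.0.113.11", "203.0.113.12"},
    "unrelated.example": {"198.51.100.7"},
}

def expand(seed_domain, resolutions, max_hops=4):
    """Return every domain and IP reachable from the seed within max_hops."""
    hosted = {}  # reverse index: IP -> domains seen on it
    for domain, ips in resolutions.items():
        for ip in ips:
            hosted.setdefault(ip, set()).add(domain)
    seen = {seed_domain}
    queue = deque([(seed_domain, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops >= max_hops:
            continue
        neighbours = resolutions.get(node, set()) | hosted.get(node, set())
        for nxt in neighbours:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen

related = expand("bad-seed.example", RESOLUTIONS)
print(sorted(related))
```

The hop limit matters: a single shared-hosting IP can pull thousands of unrelated domains into the result, so treat anything found this way as a lead to investigate, not a verdict.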



The Moral:

If you have a of data

it,you should

you will find



Questions?
