bsides lisbon - data science, machine learning and cybersecurity

Upload: tiago-henriques

Post on 21-Jan-2018

By Tiago Henriques, Filipa Rodrigues Florentino Bexiga, Ana Barbosa

I, for one, welcome our new Cyber Overlords!

An introduction to the use of data science in cybersecurity

WHO ARE WE?

MACHINE LEARNING AND CYBERSECURITY

IMAGE WORKFLOW

IMAGE ANALYSIS IN DETAIL

DATA VISUALISATION

Agenda

Tiago is the CEO and Data Necromancer at BinaryEdge. He gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis.

Tiago Henriques

Presenter

Florentino is the Data MacGyver at BinaryEdge. On a daily basis he needs to deploy infrastructure used to analyse big and real-time data. When not doing that, he can be found creating models to analyse data. Give him an orange, he’ll give you a Skynet. Why an orange, you ask? He’s hungry and likes oranges, there!

Florentino Bexiga

Presenter

Filipa is the Data Diva at BinaryEdge. She dances the macarena with numbers to get them to tell her all their dirty secrets.

Filipa Rodrigues

Presenter

Ana is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ASCII code to see and show data from that unique perspective of someone who can’t reach the box of cookies stored on top of the capital 'I'.

Ana Barbosa

Presenter

HACKING SKILLS

SECURITY DOMAIN EXPERTISE

STATISTICS KNOWLEDGE

MACHINE LEARNING

TRADITIONAL RESEARCH

DANGER ZONE!

DATA SCIENCE

Source: Data-Driven Security: Analysis, visualisation and Dashboards (adapted)

BinaryEdge

200 port scan of the entire internet / month
1,400,000,000 scanning events / month *
746,000 torrents monitored and increasing
1,362,225,600 torrent events / month

* at a minimum

How we got here....

Number of IPs found (map legend): <= 100; 100 < #found <= 1,000; 1,000 < #found <= 10,000; 10,000 < #found <= 100,000; 100,000 < #found < 1,000,000; >= 1,000,000

Worldwide distribution of IPs running services

% of coverage (colour scale): 100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, 0%

Map IPv4 addresses to Hilbert curves
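
The slide does not show how the mapping is done, so the following is only a minimal sketch of one common approach: treat each /24 block as a distance along a Hilbert curve and convert that distance to grid coordinates. The function names `hilbert_d2xy` and `ip_to_cell`, and the choice of a 4096 x 4096 grid (order 12), are assumptions made for illustration.

```python
import ipaddress


def hilbert_d2xy(order, d):
    """Convert a distance d along a Hilbert curve of the given order
    into (x, y) coordinates on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ip_to_cell(ip, order=12):
    """Map an IPv4 address to a grid cell. With order=12 every cell is
    one /24 block (an assumption chosen for this sketch)."""
    addr = int(ipaddress.IPv4Address(ip))
    block = addr >> (32 - 2 * order)
    return hilbert_d2xy(order, block)


if __name__ == "__main__":
    for ip in ("10.0.0.1", "10.0.1.1", "192.0.2.55"):
        print(ip, ip_to_cell(ip))
```

Neighbouring address blocks end up in neighbouring cells, which is why this layout is popular for plotting IPv4 coverage.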

Data Science & Machine Learning

How many IP addresses did job X have vs. job Y?
What is the average duration of the scans?
Can we extract more from all the screenshots we get?
Can we have a more optimized job distribution?

We can only identify X% of services because we’re using static signatures, can we do better?

Can we find similar images?

MULTIPLE WILD QUESTIONS APPEAR...

...ONE COMMON ANSWER

DATA SCIENCE & MACHINE LEARNING

Data Science & Machine Learning

DATA SCIENCE MACHINE LEARNING

INITIAL ANALYSIS AND CLEAN UP

EXPLORATORY DATA ANALYSIS

DATA VISUALISATION

KNOWLEDGE DISCOVERY

CLASSIFICATION

CLUSTERING

SIMILARITY MATCHING

REGRESSION

IDENTIFICATION

Problems and Limitations of Machine Learning in CyberSecurity

Lots of adversarial scenarios – attacks on the classifiers go against the foundation of machine learning

Prediction – Scenarios and data too volatile, not enough proper sources of data

Lack of data in quantity and quality to train models

Good use cases

PATTERN DETECTION/OUTLIER DETECTION (IDS/IPS): if a computer is hacked, certain behaviours will change; if data is constantly monitored and fed into a system, the hack could be detected

ANTIVIRUS: further work needs to be done, but this will allow antivirus to move from a static, signature-based system to a much improved dynamic, learning-based system

ANTI-SPAM

SMARTER FUZZERS

SOURCE CODE ANALYSIS: detection of vulnerable patterns during development

INTERNAL ATTACKERS: sentiment analysis applied to emails, tweets and social networks of employees

data points (mind map labels, binaryedge.io 2016):

metadata, files, people, photos, family & friends, behaviour, social, search, company registration, ip address, url address, news, forums, sub-reddits, internal, external, phone, email, linked urls, likes, topics

BGP, AS, whois, AS membership, AS peer, list of IPs, shared infrastructure, co-hosted sites, contact, geolocation, office locations, social networks, phone

portscan, dns, torrents, domains, AXFR, MX records, screenshots, web, services, http, https, webserver, framework, headers, cookies, certificate, configuration, authorities, entities, SMB, VNC, RDP, users, apps, files, peers, torrent name, OCR, SW banners, image classifier, vulnerabilities

Torrent Correlation

China or Military

Data correlation

Turkish IP

DEMO

At PixelsCamp

Microservices (REST API)

PORT, WORD, TAG, FACE, COUNTRY, LOGO, IP

Scan

DOES IT GENERATE A SCREENSHOT?

STORE THE IMAGE FILE ON THE CLOUD

YES

NO

GENERATE A NOTIFICATION THAT NEW IMAGE WAS UPLOADED

FINISH SCAN

GENERATES EVENTS

{
  "origin": { "type": "vnc",... },
  "target": { "ip": "XX.XXX.XX.XXX", "port": 5900 },
  "result": {
    "data": {
      "version": "3.7",
      "width": "1366",
      "height": "768",
      "auth_enabled": false,
      "link": "https://5723981752938cbafeefbcfab42342342.jpg"
    }
  },
  "@timestamp": "2016-04-22T14:53:02.377Z"
}
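
The scan flow above branches on whether an event carries a screenshot. A minimal sketch of that decision, assuming (as in the VNC event above) that the screenshot URL lives in `result.data.link`; the function names and the step labels returned are hypothetical, as is the example URL.

```python
import json


def screenshot_link(event):
    """Return the screenshot URL carried by a scan event, or None."""
    return event.get("result", {}).get("data", {}).get("link")


def next_step(event):
    """Mirror the flow chart: store and notify when there is a
    screenshot, otherwise finish the scan (labels are illustrative)."""
    if screenshot_link(event):
        return "store_image_and_notify"
    return "finish_scan"


# Placeholder event shaped like the one on the slide.
RAW_EVENT = """
{
  "origin": {"type": "vnc"},
  "target": {"ip": "192.0.2.10", "port": 5900},
  "result": {"data": {"version": "3.7", "auth_enabled": false,
                      "link": "https://example.invalid/shot.jpg"}},
  "@timestamp": "2016-04-22T14:53:02.377Z"
}
"""

if __name__ == "__main__":
    event = json.loads(RAW_EVENT)
    print(next_step(event))
```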

Image Workflow

GET IMAGE

EXTRACT TARGET METADATA

DOES IT CONTAIN ANY CONTENT?

YES

CREATE IMAGE SIGNATURE

STORE DATA

NO

FINISH

ENHANCE IMAGE FOR LOGO AND FACE DETECTION AND OCR EXTRACTION

PERFORM LOGO AND FACE DETECTION AND OCR EXTRACTION

STORE RESULTS

PERFORM ADDITIONAL ACTIONS
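
The slides do not say which algorithm sits behind "CREATE IMAGE SIGNATURE". As an illustration only, the sketch below uses an average hash, a simple perceptual hash that is one possible way to find similar images; it is a swapped-in technique, not necessarily the one used in the talk. It works on a plain 2D list of grayscale values so that no imaging library is assumed.

```python
def average_hash(pixels, size=8):
    """Compute a size*size-bit average hash of a grayscale image given
    as a list of rows of integers (0-255)."""
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // size, width // size
    if bh == 0 or bw == 0:
        raise ValueError("image is smaller than the hash grid")
    # Shrink the image to size x size by averaging each block.
    cells = []
    for r in range(size):
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: brighter than the mean or not.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    half = [[0] * 8 + [255] * 8 for _ in range(16)]
    brighter = [[min(255, p + 3) for p in row] for row in half]
    print(hamming_distance(average_hash(half), average_hash(brighter)))
```

A small Hamming distance between two signatures suggests visually similar screenshots; the cut-off to use would need tuning on real data.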

Shannon’s Entropy

Example images: Entropy = 0.00 bits; Entropy ~ 0.03 bits; Entropy ~ 2.13 bits

Filter
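
Shannon entropy is H = sum over i of p_i * log2(1 / p_i), where p_i is the share of pixels with value i. A blank screenshot has a single pixel value and therefore 0 bits. The sketch below shows how such a filter might look; the threshold value and the function names are assumptions, not values taken from the talk.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values
    (for example the grayscale pixels of a screenshot)."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def has_content(pixels, threshold=0.5):
    """Keep an image only if its entropy exceeds a threshold.
    The default threshold is a hypothetical value."""
    return shannon_entropy(pixels) > threshold


if __name__ == "__main__":
    blank = [0] * 1000
    mostly_blank = [0] * 997 + [255] * 3
    print(shannon_entropy(blank), round(shannon_entropy(mostly_blank), 2))
```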

DEMO

Data Visualization

EXPLORATION REPRESENTATION DETAILS TOOLS FINISHING UP

“a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients.” Andy Kirk, “Data Visualization: a successful design process”

Experimentation is important

design can be used in the future

How many open ports does an IP have?
Number of IPs with X open ports (x-axis: port; y-axis: Number of IPs)

Values, in the order shown on the chart: 69,543,915; 25,436,974; 7,008,108; 3,475,472; 1,287,446; 1,043,331; 951,629; 854,817; 789,515; 759,115; 490,290; 288,885; 266,827; 257,105; 219,025; 198,898; 186,286; 141,474
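
A histogram like this one can be derived from scan events by collecting the distinct ports seen per IP and then counting how many IPs share each port count. This is a minimal sketch that assumes every event has the `target.ip` and `target.port` fields seen in the events elsewhere in this deck; the function name is hypothetical.

```python
from collections import Counter


def open_port_histogram(events):
    """Return a Counter mapping 'number of open ports' to
    'number of IPs with that many open ports'."""
    ports_by_ip = {}
    for event in events:
        target = event["target"]
        ports_by_ip.setdefault(target["ip"], set()).add(target["port"])
    return Counter(len(ports) for ports in ports_by_ip.values())


if __name__ == "__main__":
    sample = [
        {"target": {"ip": "192.0.2.1", "port": 22}},
        {"target": {"ip": "192.0.2.1", "port": 80}},
        {"target": {"ip": "192.0.2.2", "port": 80}},
        {"target": {"ip": "192.0.2.2", "port": 80}},  # duplicate event
    ]
    print(dict(open_port_histogram(sample)))
```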

Distribution of IP addresses running encrypted and unencrypted services

HTTP (on port 80): 51,467,779 IPs running HTTP services

HTTPS (on port 443): 28,671,263 IPs running HTTPS services

HTTP & HTTPS: 16,519,503 IPs running both HTTP and HTTPS services

{
  "origin": { "type": "service-simple",... },
  "target": { "ip": "XX.XX.XXX.XXX", "port": 80, "protocol": "tcp" },
  "result": {
    ...
    "service": {
      "product": "Microsoft HTTPAPI httpd",
      "name": "http",
      "extrainfo": "SSDP/UPnP",
      "cpe": [ "cpe:/o:microsoft:windows" ]
    }
  },
  "@timestamp": "2016-04-22T04:07:18.161Z"
}
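
The overlap between the two groups is a set intersection. A minimal sketch, assuming, as the slide's labels suggest, that port 80 stands for HTTP and port 443 for HTTPS; the function name and the sample addresses are made up for illustration.

```python
def http_https_overlap(events):
    """Count IPs seen on port 80, on port 443, and on both."""
    http, https = set(), set()
    for event in events:
        target = event["target"]
        if target["port"] == 80:
            http.add(target["ip"])
        elif target["port"] == 443:
            https.add(target["ip"])
    return {
        "http": len(http),
        "https": len(https),
        "both": len(http & https),
    }


if __name__ == "__main__":
    sample = [
        {"target": {"ip": "192.0.2.1", "port": 80}},
        {"target": {"ip": "192.0.2.1", "port": 443}},
        {"target": {"ip": "192.0.2.2", "port": 80}},
        {"target": {"ip": "192.0.2.3", "port": 443}},
    ]
    print(http_https_overlap(sample))
```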

Top 10 Web Servers for the Web
Most common web servers found on port 80 (chart axis in millions)

Apache httpd: 11,493,552
AkamaiGHost: 8,361,080
Microsoft IIS httpd: 4,843,769
nginx: 3,860,883
lighttpd: 2,031,741
Huawei HG532e ADSL modem http admin: 1,539,629
Microsoft HTTPAPI httpd: 952,300
Technicolor DSL modem http admin: 699,202
Mbedthis-Appweb: 694,393
micro_httpd: 678,657

{
  ...
  "result": {
    "data": {
      "apps": [
        {
          "name": "Apache",
          "confidence": 100,
          "version": "2.2.26",
          "categories": [ "web-servers" ]
          ...
        }
      ]
    }
  }
}

Email Protocols
Overview of protocols used for email, according to encryption used

SERVICE: COUNT
POP3: 4,572,161
POP3S: 3,742,289
SMTP: 3,531,071
SMTPS: 2,971,159
IMAP: 4,131,737
IMAPS: 3,703,364

ENCRYPTED total: 10,416,812
UNENCRYPTED total: 12,234,969
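
The two totals are the sums of the per-protocol counts, which a few lines of Python can confirm. The grouping of POP3S, SMTPS and IMAPS as the encrypted variants follows the slide; the names `EMAIL_COUNTS`, `ENCRYPTED` and `split_totals` are illustrative.

```python
# Per-protocol counts as shown on the slide.
EMAIL_COUNTS = {
    "POP3": 4_572_161, "POP3S": 3_742_289,
    "SMTP": 3_531_071, "SMTPS": 2_971_159,
    "IMAP": 4_131_737, "IMAPS": 3_703_364,
}

ENCRYPTED = {"POP3S", "SMTPS", "IMAPS"}


def split_totals(counts):
    """Return (encrypted_total, unencrypted_total)."""
    encrypted = sum(n for name, n in counts.items() if name in ENCRYPTED)
    unencrypted = sum(n for name, n in counts.items() if name not in ENCRYPTED)
    return encrypted, unencrypted


if __name__ == "__main__":
    enc, unenc = split_totals(EMAIL_COUNTS)
    print(f"encrypted={enc:,} unencrypted={unenc:,}")
```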

{
  "origin": { "type": "service-simple",... },
  "target": { "ip": "XX.XXX.XXX.XX", "port": 143, "protocol": "tcp" },
  "result": {
    ...
    "service": {
      "method": "probe_matching",
      "product": "Dovecot imapd",
      "name": "imap",
      "cpe": [ "cpe:/a:dovecot:dovecot" ]
      ...
    }
  },
  "@timestamp": "2016-04-22T01:56:54.583Z"
}

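Big Data Technologies
Changes in amount of data exposed without security

Technologies: MongoDB, Memcached, Redis (chart scale marker: 2 TB)

Aug 2015: 644.3 TB in total (chart values: 619.8 TB, 11.3 TB, 13.2 TB)
Jan 2016: 724.7 TB in total (chart values: 710.9 TB, 12.0 TB, 1.8 TB)
July 2016: 627.7 TB in total (chart values: 598.7 TB, 27.5 TB, 1.5 TB)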

{
  "origin": { "type": "redis",... },
  "target": { "ip": "XXX.XX.XX.XXX", "port": 6379 },
  "result": {
    "data": {
      "redis_version": "3.0.6",
      ...
      "used_memory": 1374760,
      "used_memory_human": "1.31M",
      "used_memory_rss": 1839104,
      "used_memory_peak": 25195656,
      "used_memory_peak_human": "24.03M",
      "used_memory_lua": 36864,
      "mem_fragmentation_ratio": 1.34,
      ...
    }
  },
  "@timestamp": "2016-04-22T15:37:10.913Z"
}
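
One way to estimate how much data a set of exposed instances holds is to add up a memory field across events. The sketch below sums `used_memory` from Redis events; whether the talk's totals were computed from this field is an assumption, and the function names are hypothetical.

```python
def exposed_memory_bytes(events, service="redis"):
    """Sum result.data.used_memory over all events of one service type."""
    total = 0
    for event in events:
        if event.get("origin", {}).get("type") != service:
            continue
        data = event.get("result", {}).get("data", {})
        total += data.get("used_memory", 0)
    return total


def to_terabytes(num_bytes):
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return num_bytes / 1e12


if __name__ == "__main__":
    sample = [
        {"origin": {"type": "redis"},
         "result": {"data": {"used_memory": 1374760}}},
        {"origin": {"type": "redis"},
         "result": {"data": {"used_memory": 1000}}},
        {"origin": {"type": "ssh"}, "result": {"data": {}}},
    ]
    print(exposed_memory_bytes(sample))
```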

Heartbleed
Countries with the highest number of IPs vulnerable to Heartbleed

United States: 23,649
China: 6,790
Germany: 6,382
France: 5,622
Russia: 5,264
Republic of Korea: 4,564
United Kingdom: 3,459
Netherlands: 2,779
Italy: 2,508
Japan: 2,484

{
  "origin": { "type": "ssl" },
  "target": { "ip": "XXX.XX.X.XXX", "port": 443 },
  "result": {
    "data": {
      "vulnerabilities": {
        "heartbleed": { "is_vulnerable_to_heartbleed": true },
        "openssl_ccs": { "is_vulnerable_to_ccs_injection": false }
      }
    }
  }
}

VNC wordcloud

login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu

SSH Banners
Most common SSH Banners found

banner: count
SSH-2.0-OpenSSH_5.3: 2,632,270
SSH-2.0-OpenSSH_6.6.1p1: 1,501,749
SSH-2.0-OpenSSH_6.6.1: 604,579
SSH-2.0-OpenSSH_4.3: 555,779
SSH-2.0-OpenSSH_6.0p1: 537,667
SSH-2.0-OpenSSH_6.7p1: 462,616
SSH-2.0-dropbear_2014.63: 449,570
SSH-2.0-OpenSSH_5.5p1: 436,700
SSH-2.0-ROSSSH: 352,978
SSH-2.0-OpenSSH_5.9p1: 202,361

{
  "origin": {
    "type": "ssh",
    "job_id": "client-816f1185-4bc1-4b5f-9a7d-61a2df315a6b",
    "client_id": "client",
    "country": "uk",
    "module": "grabber",
    "ts": 1453385574412
  },
  "target": { "ip": "X.X.X.X", "port": 22, "protocol": "tcp" },
  "result": {
    "data": {
      ...
      "banner": "SSH-2.0-OpenSSH_6.6.1p1"
    }
  }
}
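
An SSH identification string has the form `SSH-protoversion-softwareversion`, optionally followed by a space and comments. A minimal sketch of how banners like the one above could be split and tallied; the function names are hypothetical and the sample banners are taken from the list on the slide.

```python
from collections import Counter


def parse_ssh_banner(banner):
    """Split an SSH identification string into
    (protocol_version, software_version, comments)."""
    parts = banner.strip().split("-", 2)
    if len(parts) != 3 or parts[0] != "SSH":
        raise ValueError(f"not an SSH banner: {banner!r}")
    software, _, comments = parts[2].partition(" ")
    return parts[1], software, comments


def count_software(banners):
    """Tally banners by their software version field."""
    return Counter(parse_ssh_banner(b)[1] for b in banners)


if __name__ == "__main__":
    sample = [
        "SSH-2.0-OpenSSH_6.6.1p1",
        "SSH-2.0-OpenSSH_5.3",
        "SSH-2.0-OpenSSH_5.3",
        "SSH-2.0-dropbear_2014.63",
    ]
    print(count_software(sample).most_common())
```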

SSH Key Lengths
Most common key lengths found

Key length: count
1040: 641,719
1032: 186,070
4096: 13,845
1024: 5,068,711
2048: 3,740,593
512: 9,064
2056: 7,830
2064: 6,265
1016: 6,212
768: 4,755

{
  "origin": {... },
  "target": { "ip": "X.X.X.X", "port": 22, "protocol": "tcp" },
  "result": {
    ...
        {
          "cypher": "ssh-rsa",
          "key": "AAAAB3NzaC1yc2EAAAABIwAAAQEAudfUFJtWp8R5qPxXB0acGHctH0Yyx-VrZZfvnG37osNc32kX35aXVm8Ulk49zl/jMIIQnzP7zeOUJeJJsyXsG6Cu3qjLvD5qlc0tRjoVmV08aDgAsfeq7qQFEzzDqyoL8kV9akj8WyP+aN3QHvM4a/+3Y+UTVqrw5jSUiIIW5JOd+UWzSz6SCGalFbop1wGELUTY6MDTHwwn+qXYgltQG6hP5tI9tl3gAVajIHg2IxM8IXz4SYH33ZeOPypzrcr1/DvFx1s0773eGSArIi83BeYyxvN/T68RxIqAieLxVy8zJgyevpqHpUX7/+kDuvVZdfKkmFoNzBTEiIvR5eMrjTw==",
          "fingerprint": "5b:71:c9:85:6a:ea:40:dc:62:95:4c:25:40:b7:97:55",
          "length": 2048
        }
      ],
      ...
    }
  }
}

Tools

BALANCE

Automation

Programming Language to create plots

Fine-tuning in Illustrator (make it better for the audience)

Hand-editing process

Human error

Originality

Automated Analysis

Illustrator (or other tool) to create visualization solution

Human error

DOCUMENT EVERY STEP OF THE PROCESS
Calculations
Choices of visualisations
Choices of data points

REVIEW EVERYTHING
What could have been done differently?
What could be better?

TAKE CONSTRUCTIVE FEEDBACK
Even if it means starting over
A visualization can be used in the future

INTERNET SECURITY EXPOSURE 2016

BinaryEdge.io
Be Ready. Be Safe. Be Secure.

ise.binaryedge.io

THE SCIENCE BEHIND THE DATA

CREATED BY BINARYEDGE