9 - papadimitriou


  • 7/28/2019 9 - Papadimitriou

    1/20

    Detecting Data Leakage

    Panagiotis Papadimitriou

    Hector Garcia-Molina


    Leakage Problem

    Stanford Infolab 2

    [Figure: Facebook applications U1 and U2; profiles of Kathryn, Jeremy, Sarah and Mark (e.g. Name: Mark, Sex: Male; Name: Sarah, Sex: Female); other sources of the same data, e.g. Sarah's network]


    Outline

    Problem Description

    Guilt Models

    Pr{U1 leaked data} = 0.7

    Pr{U2 leaked data} = 0.2

    Distribution Strategies




    Problem Entities

    Entity                               Dataset

    Distributor: Facebook                T: set of all Facebook profiles

    Agents: Facebook apps U1, ..., Un    R1, ..., Rn, where Ri is the set of profiles of people who have added the application Ui

    Leaker                               S: set of leaked profiles
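The entities above map naturally onto sets. The following is a minimal sketch, assuming toy data: the profile names come from the slides, but which agent holds which profile, the leaked set, and the helper name `holders` are hypothetical and chosen only for illustration.

```python
# Distributor's dataset T (profile names taken from the slides).
T = {"Kathryn", "Jeremy", "Sarah", "Mark"}

# R[i] is the set of profiles given to agent Ui (assignment assumed here).
R = {
    "U1": {"Jeremy", "Sarah"},
    "U2": {"Sarah", "Mark"},
}

# S is the set of leaked profiles (assumed here).
S = {"Sarah", "Mark"}


def holders(profile, R):
    """Return the agents whose set R_i contains the given profile."""
    return {agent for agent, profiles in R.items() if profile in profiles}


for profile in sorted(S):
    print(profile, "was given to", sorted(holders(profile, R)))
```

A leaked profile that only one agent holds points at that agent far more strongly than a profile shared by several agents, which is the intuition the later slides build on.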


    Agents' Data Requests

    Sample: e.g. 100 profiles of Stanford people

    Explicit: e.g. all people who added the application (the example used so far), or all Stanford profiles




    Guilt Models (1/3)

    [Figure: a leaked profile may come from other sources, e.g. Sarah's network, with probability p]

    p: posterior probability that a leaked profile comes from other sources

    Guilty agent: an agent who leaks at least one profile

    Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S
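The slide defines Pr{Gi|S} but the transcript carries no formula for it. The sketch below shows one way to compute it, under assumptions that are not stated on the slide: each leaked profile comes from other sources with probability p; otherwise exactly one of the agents holding it leaked it, each with equal probability; and profiles are leaked independently of one another. The function name and the toy data are hypothetical.

```python
def guilt_probability(agent, R, S, p):
    """Pr{G_i | S} under the assumed independent-leak model.

    The agent is innocent only if it leaked none of the leaked profiles
    it holds, so the result is 1 minus the product of the per-profile
    innocence probabilities.
    """
    prob_innocent = 1.0
    for t in S & R[agent]:
        # Number of agents that were given profile t (at least 1: this agent).
        n_holders = sum(1 for Ri in R.values() if t in Ri)
        # Chance that this particular agent leaked t: (1 - p) / n_holders.
        prob_innocent *= 1.0 - (1.0 - p) / n_holders
    return 1.0 - prob_innocent


# Assumed toy data: U1 holds both leaked profiles, U2 holds only one.
R = {"U1": {"Sarah", "Mark"}, "U2": {"Sarah"}}
S = {"Sarah", "Mark"}
p = 0.2

for agent in sorted(R):
    print(agent, round(guilt_probability(agent, R, S, p), 2))
```

In this toy case U1 scores higher than U2 because it is the only holder of one of the leaked profiles.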


    Guilt Models (2/3)

    Two alternative assumptions about agent behaviour:

    Agents leak each of their data items independently

    or

    Agents leak all of their data items or nothing

    [Figure: four scenarios with probabilities $(1-p)^2$, $(1-p)p$, $p(1-p)$ and $p^2$]
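The figure that carried the four probabilities did not survive extraction. One plausible reading, offered here only as an assumption, is that they cover two leaked items, each of which independently comes either from an agent (probability 1-p) or from other sources (probability p). The sketch below enumerates those scenarios; the function name and labels are hypothetical.

```python
from itertools import product


def source_scenarios(p, n_items=2):
    """Map each assignment of sources ('agent' or 'other') to its probability."""
    scenarios = {}
    for sources in product(("agent", "other"), repeat=n_items):
        prob = 1.0
        for source in sources:
            # 'other' has probability p, 'agent' has probability 1 - p.
            prob *= p if source == "other" else 1.0 - p
        scenarios[sources] = prob
    return scenarios


for sources, prob in source_scenarios(0.2).items():
    print(sources, round(prob, 2))
```

Under this reading the four values always sum to 1, since they partition all possible source assignments for the two items.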


    Guilt Models (3/3)

    [Figure: two plots with axes Pr{G1} and Pr{G2}, one for agents that leak items independently and one for agents that do NOT leak independently]




    The Distributor's Objective (1/2)

    [Figure: agents U1, U2, U3 and U4 receive sets R1, R2, R3 and R4; S is the leaked set]

    Pr{G1|S} >> Pr{G2|S}

    Pr{G1|S} >> Pr{G4|S}


    The Distributor's Objective (2/2)

    To achieve his objective, the distributor has to distribute sets R1, ..., Rn that minimize

    $\sum_{i=1}^{n} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \quad i, j = 1, \ldots, n$

    Intuition: minimizing data sharing among agents makes the leaked data reveal the guilty agents
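The objective on this slide sums, over all agents, the overlap of an agent's set with every other agent's set, divided by the size of the agent's own set. A minimal sketch of that computation follows; the function name and the example allocations are assumptions for illustration, and every set is assumed to be non-empty.

```python
def overlap_objective(R):
    """Sum over agents i of (1 / |R_i|) * sum over j != i of |R_i & R_j|."""
    total = 0.0
    for i, Ri in R.items():
        # Total number of items agent i shares with each other agent.
        shared = sum(len(Ri & Rj) for j, Rj in R.items() if j != i)
        total += shared / len(Ri)
    return total


# Assumed example allocations.
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}
sharing = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"}}

print(overlap_objective(disjoint))
print(overlap_objective(sharing))
```

Disjoint sets give the smallest possible value, zero, which matches the stated intuition that less sharing makes a leak easier to attribute.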


    Distribution Strategies Sample (1/4)

    Set T has four profiles:

    Kathryn, Jeremy, Sarah and Mark

    There are 4 agents:

    U1, U2, U3 and U4

    Each agent requests a sample of any 2 profiles

    of T for a market survey


    Distribution Strategies Sample (2/4)

    Poor: minimize $\sum_{i \neq j} |R_i \cap R_j|$

    [Figure: profiles distributed to agents U1, U2, U3 and U4]


    Distribution Strategies Sample (3/4)

    Optimal Distribution

    Avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$

    [Figure: profiles distributed to agents U1, U2, U3 and U4]
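The allocations drawn on these slides are lost, so the sketch below uses two hypothetical allocations for the running example (four profiles, four agents, two profiles each) to show why avoiding full overlaps matters even when the total overlap is the same. The function names and both allocations are assumptions.

```python
from itertools import combinations


def pairwise_overlap(R):
    """Total number of shared items, summed over unordered agent pairs."""
    return sum(len(R[a] & R[b]) for a, b in combinations(sorted(R), 2))


def full_overlaps(R):
    """Agent pairs that received exactly the same set."""
    return [(a, b) for a, b in combinations(sorted(R), 2) if R[a] == R[b]]


# Hypothetical allocation in which pairs of agents get identical sets.
duplicated = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"},
    "U3": {"Sarah", "Mark"}, "U4": {"Sarah", "Mark"},
}
# Hypothetical allocation in which no two agents get the same set.
spread = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"},
    "U3": {"Kathryn", "Sarah"}, "U4": {"Jeremy", "Mark"},
}

for name, R in (("duplicated", duplicated), ("spread", spread)):
    print(name, pairwise_overlap(R), full_overlaps(R))
```

Both allocations have the same total overlap, but in the first one a leak of Kathryn and Jeremy cannot distinguish U1 from U2, whereas in the second every agent holds a unique pair.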


    Distribution Strategies Sample (4/4)


    Distribution Strategies

    Sample Data Requests

    The distributor has the freedom to select the data items to provide the agents with

    General idea: provide agents with sets of data that are as disjoint as possible

    Problem: there are cases where the distributed data must overlap, e.g. when |R1| + ... + |Rn| > |T|

    Explicit Data Requests

    The distributor must provide agents with the data they request

    General idea: add fake data to the distributed data to minimize the overlap of the distributed data

    Problem: agents can collude and identify fake data

    NOT COVERED in this talk
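For sample data requests, the general idea of handing out sets that are as disjoint as possible can be sketched with a simple greedy procedure. This is an illustrative assumption, not the algorithm from the talk: each agent in turn receives the subset that first avoids duplicating an already distributed set and then overlaps least with what has been given out. It enumerates all subsets, so it only suits toy sizes. The name `allocate_samples` and the request format are hypothetical.

```python
from itertools import combinations


def allocate_samples(T, requests):
    """Greedily give each agent a subset of T of the requested size."""
    items = sorted(T)
    allocation = {}
    for agent, m in requests.items():
        def cost(candidate):
            chosen = set(candidate)
            # First priority: do not repeat a set another agent already has.
            duplicate = any(chosen == given for given in allocation.values())
            # Second priority: share as few items as possible.
            overlap = sum(len(chosen & given) for given in allocation.values())
            return (duplicate, overlap)

        allocation[agent] = set(min(combinations(items, m), key=cost))
    return allocation


T = {"Kathryn", "Jeremy", "Sarah", "Mark"}
requests = {"U1": 2, "U2": 2, "U3": 2, "U4": 2}

# The requests total 8 items but |T| = 4, so some overlap is unavoidable.
allocation = allocate_samples(T, requests)
for agent in sorted(allocation):
    print(agent, sorted(allocation[agent]))
```

On the running example the first two agents receive disjoint pairs, and the remaining two receive pairs that overlap with earlier ones without repeating any of them.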


    Conclusions

    Data Leakage

    Modeled as maximum likelihood problem

    Data distribution strategies that help identify

    the guilty agents


    Thank You!