
Page 1: Detecting image splicing in the wild Web

Detecting image splicing in the wild (Web)
Markos Zampoglou, Symeon Papadopoulos, Yiannis Kompatsiaris

Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)

WeMuV2015 workshop, ICME, June 29, 2015, Turin, Italy

Page 2: Detecting image splicing in the wild Web

A new journalistic paradigm


…and its pitfalls

Page 3: Detecting image splicing in the wild Web

Blind image splicing detection

• Assume the splice differs in some aspect from the rest of the image

– Capture invisible “traces”: DCT coefficient distribution, PRNU, CFA interpolation patterns…

• But traces degrade with subsequent image alterations

• Social media journalism establishes a different paradigm from typical image forensics

– We don’t have the luxury of demanding to see the originals


Page 4: Detecting image splicing in the wild Web

Image tampering lifecycle


Page 5: Detecting image splicing in the wild Web

Images in the wild


• Twitter:

– Images larger than 2048×1024 are scaled down

– Large PNG files (> 3 MB) are converted to JPEG

– JPEG files are resaved at quality 75

• Facebook:

– Images larger than 2048×2048 are scaled down

– Large PNG files are converted to JPEG

– JPEG files are resaved at varying quality (~70–90)

• Both media platforms also erase metadata from images
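As a rough sketch, the Twitter-side pipeline above can be emulated with Pillow. The size and quality numbers come from the slide; the actual platform internals are not public, and the helper name is illustrative:

```python
# Rough emulation of the Twitter pipeline described above: scale down
# images larger than 2048x1024 and resave as JPEG at quality 75.
# Illustrative only; the real platform internals are not public.
from PIL import Image

def emulate_twitter_resave(src_path, dst_path, max_size=(2048, 1024), quality=75):
    img = Image.open(src_path).convert("RGB")
    img.thumbnail(max_size)  # shrinks only if larger; preserves aspect ratio
    # Saving without copying exif mirrors the metadata stripping both platforms do.
    img.save(dst_path, "JPEG", quality=quality)
```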

Page 6: Detecting image splicing in the wild Web

Existing image splicing datasets


Name | Format | Masks | #images
Columbia [1] | BMP grayscale | No | 933/912
Columbia Unc. [2] | TIFF Unc. | Yes | 183/180
CASIA TIDE v2.0 [3] | TIFF Unc., JPEG, BMP | No | 7491/5123
VIPP Synthetic [4] | JPEG | Yes | 4800/4800
VIPP Realistic [4] | JPEG | Manual | 63/68

[1] http://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
[2] http://www.ee.columbia.edu/ln/dvmm/downloads/authsplcuncmp/
[3] http://forensics.idealtest.org:8080/indexopt_v2.php
[4] http://clem.dii.unisi.it/~vipp/index.php/imagerepository/129-a-framework-for-decision-fusion-in-image-forensics-based-on-dempster-shafer-theory-of-evidence

Page 7: Detecting image splicing in the wild Web

Issues with existing datasets


• Ground-truth masks: only Columbia Uncompressed and VIPP offer binary masks

• Quality of splices: only CASIA and VIPP Realistic contain realistic forgeries

• Image format: only VIPP and CASIA offer JPEG images

– At least 87% of the Common Crawl corpus (http://commoncrawl.org/) images are JPEG

– Out of 13,577 forged images collected in our investigations, ~95% were in JPEG format

• Neatness: all datasets contain first-level forgeries with no further alterations

Page 8: Detecting image splicing in the wild Web

Collecting a dataset of Web forgeries

• Aim: build an evaluation framework with the web-based case in mind

– Evaluate existing and future algorithms against the real-world, web-based application scenario

– Assess the status of the web: how many versions of each forgery, how close to the original

• Methodology: identify verified forgeries, and exhaustively download as many instances as possible for analysis


Page 9: Detecting image splicing in the wild Web

The Wild Web Dataset (1/5)

• Identified 82 cases of confirmed forgeries


Page 10: Detecting image splicing in the wild Web

The Wild Web Dataset (2/5)

• Collected all detectable instances of each case

• Removed exact file duplicates

• 13,577 images in total

• Identified and removed heavily altered variants of each case
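The exact-duplicate removal step above can be sketched with a simple byte-level hash: keep only the first file seen per digest. This is a stdlib-only illustration with hypothetical helper names, not the collection pipeline itself:

```python
# Minimal sketch of exact-file-duplicate removal: hash the raw file bytes
# and keep only the first file per digest (illustrative helper).
import hashlib
from pathlib import Path

def remove_exact_duplicates(paths):
    seen, unique = set(), []
    for p in map(Path, paths):
        digest = hashlib.sha1(p.read_bytes()).hexdigest()
        if digest not in seen:       # identical bytes -> identical digest
            seen.add(digest)
            unique.append(p)
    return unique
```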


Page 11: Detecting image splicing in the wild Web

The Wild Web Dataset (3/5)

• By removing crops and post-splices, we were left with 9,751 images

• Variants within cases were separated, and the sources were gathered where possible


Page 12: Detecting image splicing in the wild Web

The Wild Web Dataset (4/5)

• Designed ground-truth binary masks for each sub-case corresponding to each possible forgery step (for complex forgeries)


Page 13: Detecting image splicing in the wild Web

The Wild Web Dataset (5/5)


• The final dataset by the numbers:

– 82 cases of forgeries

– 92 forgery variants

– 101 unique masks

– 13,577 images total

– 9,751 images resembling the original forgery

• For each of the 82 cases, a match on any mask of any variant should be considered an overall success

Page 14: Detecting image splicing in the wild Web

Experimental evaluations


• Emulated real-world conditions: we applied the minimum typical transformations (JPEG resave & rescaling) to the datasets compatible with the task:

– Columbia Uncompressed

– VIPP Synthetic

– VIPP Realistic

– Set 1: JPEG recompression at Quality 75

– Set 2: rescaling to 75% size followed by JPEG recompression at Quality 75

Page 15: Detecting image splicing in the wild Web

Reconsidering evaluation protocols (1/3)


• Forgery localization algorithms typically produce a value map

• Ground truth takes the form of a binary mask signifying the tampered area

• Past approaches compare values under the mask to the rest of the image:

– Kolmogorov–Smirnov (KS) statistic (Farid et al, 2009)

– Median value (Fontani et al, 2013)
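The mask-based scoring above can be sketched with numpy/scipy: compare the output map's values under the ground-truth mask against the rest of the image, via the K-S statistic and the two medians. Function and variable names are illustrative:

```python
# Sketch of past evaluation practice: compare output-map values inside the
# ground-truth mask against those outside it.
import numpy as np
from scipy.stats import ks_2samp

def mask_scores(value_map, mask):
    """value_map: float array of detector output; mask: bool array, True over the splice."""
    inside, outside = value_map[mask], value_map[~mask]
    ks_stat = ks_2samp(inside, outside).statistic   # K-S statistic between the two samples
    return ks_stat, float(np.median(inside)), float(np.median(outside))
```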

Page 16: Detecting image splicing in the wild Web

Reconsidering evaluation protocols (2/3)


• A recompressed image from VIPP Realistic, analyzed using (Lin et al, 2009)

Page 17: Detecting image splicing in the wild Web

Reconsidering evaluation protocols (3/3)

• This would be considered a good detection under typical methodologies

– Median under mask: ~0.93

– Median outside mask: ~0.02

– K-S statistic: ~0.41

• Any human evaluator would disagree

Page 18: Detecting image splicing in the wild Web

Proposed evaluation protocol (1/2)


1. Take the output value map

2. Binarize according to some method-appropriate threshold

– e.g. 0.5 for probabilistic methods

3. Compare the binary map to the ground truth mask:

E(A, M) = |A ∩ M|² / (|A| × |M|)

4. Values above an experimental threshold (0.65) suggest a strong match
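A numpy sketch of the comparison step: the overlap score E(A, M) = |A ∩ M|² / (|A| × |M|) between the binarized map A and the ground-truth mask M, which works out to precision × recall. Function names are illustrative:

```python
# Overlap score between binarized output A and ground-truth mask M:
# E(A, M) = |A ∩ M|^2 / (|A| * |M|), i.e. precision times recall.
import numpy as np

def e_measure(A, M):
    inter = np.logical_and(A, M).sum()
    denom = A.sum() * M.sum()
    return float(inter) ** 2 / denom if denom else 0.0

def is_strong_match(A, M, threshold=0.65):
    # Experimental threshold from the protocol above.
    return e_measure(A, M) >= threshold
```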

Page 19: Detecting image splicing in the wild Web

Proposed evaluation protocol (2/2)


• Adapt to mimic a human’s perspective:

1. Apply multiple morphological processing operations

2. Try multiple (method-appropriate) thresholds

3. Keep the best-fitting result (bias towards success)

• For non-spliced images (true negative/false positive detection), apply the same methodology and declare a success for a blank binary map

– Main disadvantage: binary outcome, no parameters to tweak for ROC curve generation.
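The threshold-and-morphology search above can be sketched as follows. The thresholds, the choice of morphological operations, and the names are all illustrative; the overlap score is the E measure from the protocol, inlined here so the snippet is self-contained:

```python
# Best-fitting-result search: binarize at several thresholds, try a few
# morphological clean-ups, keep the highest overlap score (deliberately
# biased towards success, as the protocol intends).
import numpy as np
from scipy import ndimage

def overlap(A, M):
    inter = np.logical_and(A, M).sum()
    denom = A.sum() * M.sum()
    return float(inter) ** 2 / denom if denom else 0.0

def best_match_score(value_map, mask, thresholds=(0.3, 0.5, 0.7)):
    best = 0.0
    for t in thresholds:
        binary = value_map >= t
        variants = (binary,
                    ndimage.binary_opening(binary, iterations=2),   # drop speckle
                    ndimage.binary_closing(binary, iterations=2))   # fill small holes
        best = max(best, *(overlap(v, mask) for v in variants))
    return best
```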

Page 20: Detecting image splicing in the wild Web

Evaluations


• Evaluated seven algorithms:

– Double JPEG quantization (Lin et al, 2009), (Bianchi et al, 2011), (Bianchi et al, 2012a)

– Non-aligned double JPEG quantization (Bianchi et al, 2012b)

– CFA artifacts (Ferrara et al, 2012)

– High-frequency DWT noise (Mahdian et al, 2009)

– JPEG ghosts (Farid, 2009)
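The JPEG-ghost idea can be illustrated briefly: resave the image at a range of qualities and record per-pixel differences from the original; a region previously compressed at quality q shows a local dip (a "ghost") around that quality. The Pillow/numpy sketch below shows only the core idea, not Farid's full normalized, block-averaged formulation:

```python
# Minimal JPEG-ghost sketch (after Farid): resave at several qualities and
# record per-pixel squared differences. Regions once saved at quality q show
# a dip near q. The full method also normalizes and averages over blocks.
import io
import numpy as np
from PIL import Image

def jpeg_ghost_maps(img, qualities=range(50, 100, 5)):
    orig = np.asarray(img.convert("RGB"), dtype=np.float64)
    maps = {}
    for q in qualities:
        buf = io.BytesIO()
        img.save(buf, "JPEG", quality=q)
        resaved = np.asarray(Image.open(buf).convert("RGB"), dtype=np.float64)
        maps[q] = ((orig - resaved) ** 2).mean(axis=2)  # per-pixel difference map
    return maps
```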

Page 21: Detecting image splicing in the wild Web

Evaluation results: Emulated datasets (1/2)

• Comparing median values:

Dataset | Variant | (Lin et al, 2009) | (Bianchi et al, 2011) | (Ferrara et al, 2012) | (Bianchi et al, 2012a) | (Bianchi et al, 2012b) | (Mahdian et al, 2009)
Columbia Uncomp. | Orig. | - | - | 0.89 (0.05) | - | - | 0.39 (0.04)
Columbia Uncomp. | JPEG | - | - | 0.05 (0.05) | - | - | 0.09 (0.05)
Columbia Uncomp. | Resized | - | - | 0.03 (0.04) | - | - | 0.11 (0.05)
VIPP Synthetic | Orig. | 0.47 (0.05) | 0.51 (0.05) | 0.15 (0.05) | 0.57 (0.01) | 0.28 (0.05) | 0.13 (0.05)
VIPP Synthetic | JPEG | 0.30 (0.04) | 0.43 (0.04) | 0.16 (0.05) | 0.39 (0.05) | 0.16 (0.05) | 0.10 (0.05)
VIPP Synthetic | Resized | 0.05 (0.05) | 0.05 (0.05) | 0.05 (0.04) | 0.05 (0.05) | 0.05 (0.05) | 0.06 (0.05)
VIPP Realistic | Orig. | 0.54 (0.04) | 0.58 (0.04) | 0.04 (0.04) | 0.70 (0.04) | 0.28 (0.04) | 0.20 (0.04)
VIPP Realistic | JPEG | 0.32 (0.04) | 0.36 (0.04) | 0.04 (0.04) | 0.51 (0.04) | 0.17 (0.04) | 0.20 (0.04)
VIPP Realistic | Resized | 0.13 (0.04) | 0.12 (0.06) | 0.03 (0.04) | 0.23 (0.04) | 0.17 (0.04) | 0.18 (0.04)

Page 22: Detecting image splicing in the wild Web

Evaluation results: Emulated datasets (2/2)

• Proposed evaluation framework:

Dataset | Variant | (Lin et al, 2009) | (Bianchi et al, 2011) | (Ferrara et al, 2012) | (Bianchi et al, 2012a) | (Bianchi et al, 2012b) | (Mahdian et al, 2009)
Columbia Uncomp. | Orig. | - | - | 0.66 (0.16) | - | - | 0.12 (0.57)
Columbia Uncomp. | JPEG | - | - | 0.00 (0.20) | - | - | 0.02 (0.86)
Columbia Uncomp. | Resized | - | - | 0.00 (0.24) | - | - | 0.04 (0.79)
VIPP Synthetic | Orig. | 0.44 (0.27) | 0.52 (0.00) | 0.01 (0.23) | 0.58 (0.09) | 0.04 (0.25) | 0.04 (0.74)
VIPP Synthetic | JPEG | 0.26 (0.30) | 0.30 (0.10) | 0.01 (0.28) | 0.23 (0.27) | 0.01 (0.29) | 0.04 (0.74)
VIPP Synthetic | Resized | 0.00 (0.23) | 0.00 (0.00) | 0.00 (0.23) | 0.00 (0.15) | 0.00 (0.29) | 0.00 (0.84)
VIPP Realistic | Orig. | 0.41 (0.46) | 0.38 (0.09) | 0.09 (0.22) | 0.23 (0.30) | 0.03 (0.39) | 0.04 (0.90)
VIPP Realistic | JPEG | 0.13 (0.44) | 0.17 (0.29) | 0.00 (0.25) | 0.14 (0.46) | 0.01 (0.43) | 0.02 (0.90)
VIPP Realistic | Resized | 0.00 (0.47) | 0.00 (0.00) | 0.00 (0.28) | 0.03 (0.25) | 0.01 (0.47) | 0.01 (0.47)

Page 23: Detecting image splicing in the wild Web

Evaluation results: Emulated datasets (4/4)


• Methods behave generally as expected

– CFA patterns destroyed by the first JPEG compression

• (Mahdian et al, 2009) is not particularly effective, but shows little vulnerability to alterations

• DQ methods show some degree of robustness to recompression only

• Rescaling is extremely disruptive, as expected

Page 24: Detecting image splicing in the wild Web

Evaluation results: Wild Web dataset (1/2)


• 36 out of 82 cases were successfully detected by at least one method

– Not a single image gave good results for the other 46 cases, for any algorithm

 | (Lin et al, 2009) | (Bianchi et al, 2011) | (Ferrara et al, 2012) | (Bianchi et al, 2012a) | (Bianchi et al, 2012b) | (Mahdian et al, 2009) | (Farid, 2009)
Detections | 13 | 12 | 1 | 8 | 5 | 15 | 29
Unique | 4 | 1 | 0 | 1 | 2 | 6 | 10

Page 25: Detecting image splicing in the wild Web

Evaluation results: Wild Web dataset (2/2)


• The noise-based method of (Mahdian et al, 2009) proved disproportionately successful

– We should not forget how prone to false positives it is

• JPEG ghosts are very robust, if we can manage the amount of output they produce

• Even in the cases where detection succeeded, only a few images were correctly detected

– 1,386 images in the entire dataset (~14.3%)

– Excluding the three easiest classes, only 333 out of 8,580 images were detected (~3.9%)

Page 26: Detecting image splicing in the wild Web

Forgery detection in the Wild (1/4)


Page 27: Detecting image splicing in the wild Web

Forgery detection in the Wild (2/4)


Page 28: Detecting image splicing in the wild Web

Forgery detection in the Wild (3/4)


Page 29: Detecting image splicing in the wild Web

Forgery detection in the Wild (4/4)


Page 30: Detecting image splicing in the wild Web

Conclusions

• On the Web, very few images retain traces that are detectable with today’s state-of-the-art forensic approaches

• It is difficult to estimate the relative age of each instance of a viral image

• DQ-based methods give results with the highest confidence, but are not particularly robust

• JPEG ghosts demonstrate significantly higher robustness than other methods, but produce large amounts of noisy output

• High-frequency DWT noise also appears to give good results, but seems extremely prone to false positives


Page 31: Detecting image splicing in the wild Web

Future steps

• For the Web journalism case, robustness ought to be a central consideration in future algorithm evaluations

• The Wild Web dataset is freely distributed for research purposes

– Due to copyright considerations, this is currently only feasible through direct contact

– The dataset should be maintained to incorporate new cases of forgeries as they come out

• Advance the state of the art by focusing on more robust traces of splicing

• Following the life-cycle of images on the Web can help locate their earliest versions and build an account of the alterations that have taken place (Kennedy & Chang, 2008)

• The question remains: to what extent is the task feasible? When can we be certain that all traces have been lost?


Page 32: Detecting image splicing in the wild Web

References


• Bianchi, Tiziano, Alessia De Rosa, and Alessandro Piva. "Improved DCT coefficient analysis for forgery localization in JPEG images." In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 2444-2447. IEEE, 2011.

• Bianchi, Tiziano and Alessandro Piva, “Image forgery localization via block-grained analysis of JPEG artifacts,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 1003–1017, 2012.

• Ferrara, Pasquale, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. "Image forgery localization via fine-grained analysis of CFA artifacts." Information Forensics and Security, IEEE Transactions on 7, no. 5 (2012): 1566-1577.

• Farid, Hany. "Exposing digital forgeries from JPEG ghosts." Information Forensics and Security, IEEE Transactions on 4, no. 1 (2009): 154-160.

• Fontani, Marco, Tiziano Bianchi, Alessia De Rosa, Alessandro Piva, and Mauro Barni. "A framework for decision fusion in image forensics based on Dempster-Shafer theory of evidence." Information Forensics and Security, IEEE Transactions on 8, no. 4 (2013): 593-607.

• Kennedy, Lyndon, and Shih-Fu Chang. "Internet image archaeology: automatically tracing the manipulation history of photographs on the web." In Proceedings of the 16th ACM international conference on Multimedia, pp. 349-358. ACM, 2008.

• Lin, Zhouchen, Junfeng He, Xiaoou Tang, and Chi-Keung Tang. "Fast, automatic and fine-grained tampered JPEG image detection via DCT coefficient analysis." Pattern Recognition 42, no. 11 (2009): 2492-2501.

• Mahdian, Babak and Stanislav Saic, “Using noise inconsistencies for blind image forensics,” Image and Vision Computing, vol. 27, no. 10, pp. 1497–1503, 2009.

Page 33: Detecting image splicing in the wild Web

Thank you!

• Slides: http://www.slideshare.net/sympapadopoulos/detecting-image-splicing-in-the-wild-web

• Get in touch:

@markzampoglou / [email protected]

@sympapadopoulos / [email protected]
