Intellectual Property and Mining Software Repositories Ninja Research Lab, University of Victoria Daniel M German Mining Software Archives, Ascona, 2010 18 March 2010 Daniel M German Ninja Research Lab, University of Victoria


Page 1: Ninja Research Lab, University of Victoria

Intellectual Property and Mining Software Repositories

Ninja Research Lab, University of Victoria

Daniel M German

Mining Software Archives, Ascona, 2010

18 March 2010

Page 2: Ninja Research Lab, University of Victoria

1 Intellectual Property and Mining Software Repositories

Page 3: Ninja Research Lab, University of Victoria

FOSS has fulfilled the goals of COTS

1. FOSS is a thriving ecosystem
2. Widely used in industry
3. But comes with a price

Page 4: Ninja Research Lab, University of Victoria

Use cases

1. Is my system properly honouring the licenses of all of its components?
2. Given my intentions, can I use this component?
3. Is any of this code derived from FOSS?

Page 5: Ninja Research Lab, University of Victoria

Software is complex; auditing its IP is challenging

Page 6: Ninja Research Lab, University of Victoria

Auditing IP:

1. Is my system properly honouring the licenses of all of its components?
   1. What components is it using? Not trivial!
   2. What is the license of each component?
   3. What is the license of each file in each component?
   4. How do the licenses of the files of a system interact with the license of the system?
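
The questions about per-component and per-file licenses amount to collecting one license per file and rolling the result up per component. A minimal sketch, assuming the per-file licenses have already been identified by some tool; the file paths, the `summarize` and `mixed_components` helpers, and the convention that the first path segment names the component are all hypothetical:

```python
from collections import Counter, defaultdict

def summarize(file_licenses):
    """Group per-file licenses by component (assumed: first path segment)."""
    per_component = defaultdict(Counter)
    for path, license_name in file_licenses.items():
        component = path.split("/", 1)[0]
        per_component[component][license_name] += 1
    return per_component

def mixed_components(per_component):
    """Components whose files carry more than one distinct license."""
    return sorted(c for c, counts in per_component.items() if len(counts) > 1)

# Hypothetical input; license names follow the naming used in these slides.
files = {
    "libfoo/src/a.c": "GPLv2+",
    "libfoo/src/b.c": "BSD3",
    "libfoo/README": "NONE",
    "bar/main.c": "MITX11",
}
summary = summarize(files)
print(mixed_components(summary))
```

Components reported by `mixed_components` are the ones where the fourth question (how the file licenses interact with the license of the system) needs a closer look.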

Page 7: Ninja Research Lab, University of Victoria

License identification challenges

Challenges, grouped by type:

Finding the license statement
   F1. License statements are usually mixed with other text
   F2. Files might reference another file where the license is located
   F3. Files might contain multiple licenses

Language related
   L1. Licensing statements contain spelling errors
   L2. A given license is referred to in different ways
   L3. Licensors change the spelling/grammar of the license statement

License customization
   C1. Several licenses must be customized when used
   C2. Licensors modify, add or remove conditions to well-known licenses
   C3. Licensors modify licenses for various intents
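
Challenges such as F1, F3 and L1 suggest normalizing the text before matching it against known license wording. The following is a minimal sketch of that idea and not Ninka's actual algorithm: the `KNOWN` rule set is a tiny hypothetical sample, and `normalize` and `identify` are names invented for illustration.

```python
import re

# Hypothetical rule set: license name -> regex over normalized text.
# "licen[sc]e" tolerates one common spelling variant (challenge L1).
KNOWN = {
    "GPLv2+": re.compile(
        r"gnu general public licen[sc]e .*version 2.* or .*any later version"),
    "MITX11": re.compile(r"permission is hereby granted, free of charge"),
}

def normalize(text):
    """Strip leading comment markers, collapse whitespace, lowercase."""
    text = re.sub(r"^[ \t]*(/\*+|\*+/|\*|//|#)", " ", text, flags=re.MULTILINE)
    return re.sub(r"\s+", " ", text).strip().lower()

def identify(text):
    """Return every known license found in the text (F3), else UNKNOWN."""
    norm = normalize(text)
    found = sorted(name for name, rx in KNOWN.items() if rx.search(norm))
    return found or ["UNKNOWN"]

header = """/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public Licence as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 */"""
print(identify(header))
```

Returning `UNKNOWN` instead of the closest guess mirrors the trade-off of preferring precision over recall.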

Page 8: Ninja Research Lab, University of Victoria

Current developments: Ninka

1. License identification system
   - Capable of identifying more than 100 FOSS licenses
   - Designed to avoid making mistakes (at the cost of recall)
   - Faster than the competition

                 Ninka    FOSSo.   ohcount  OSLC
Correct          200      137      83       57
Incorrect        7        112      167      193
Unknown          43       1        0        0
Recall           82.3%    99.2%    100.0%   100.0%
Precision        96.6%    55.0%    33.2%    29.5%
F-measure        0.889    0.708    0.498    0.371
Execution time   22 s     923 s    27 s     372 s
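
The recall, precision and F-measure rows appear to follow from the three count rows if precision is correct / (correct + incorrect) and recall is correct / (correct + unknown). That reading is inferred from the numbers and is not stated on the slide. A short sketch that recomputes the scores under this assumption (the `scores` helper is hypothetical):

```python
def scores(correct, incorrect, unknown):
    """Precision, recall and F-measure under the definitions assumed above."""
    precision = correct / (correct + incorrect)
    recall = correct / (correct + unknown)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# (correct, incorrect, unknown) per tool, copied from the table.
counts = {
    "Ninka": (200, 7, 43),
    "FOSSo.": (137, 112, 1),
    "ohcount": (83, 167, 0),
    "OSLC": (57, 193, 0),
}
for tool, row in counts.items():
    p, r, f = scores(*row)
    print(f"{tool:8s} precision={p:.1%} recall={r:.1%} F={f:.3f}")
```

Under this reading the recomputed values agree with the table to within rounding, with one exception: OSLC's precision comes out at 22.8% rather than the 29.5% shown, while its F-measure of 0.371 does match.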

Page 9: Ninja Research Lab, University of Victoria

Debian 5.0.2 licenses: Most common licenses per number of applications in which they appear

License          Apps.   Prop.
NONE             8241    74.2%
GPLv2+           5486    49.4%
SeeFile          1252    11.3%
LibraryGPLv2+    1150    10.4%
SameAsPerl       791     7.1%
LesserGPLv2.1+   767     6.9%
MITX11           601     5.4%
BSD3             646     5.8%
GPLv2            582     5.2%
LesserGPLv2+     470     4.2%
GPLnoVersion     334     3.0%
BSD2             255     2.3%
publicDomain     244     2.2%

Page 10: Ninja Research Lab, University of Victoria

Fedora 12: Most common licenses found by file

License          No. Files   Prop.
NONE             61475       19%
EPLv1            40310       12%
GPLv2+           31392       10%
UNKNOWN          23202       7%
Apachev2         18059       6%
GPLv2            15173       5%
LesserGPLv3      12616       4%
LesserGPLv2.1+   9342        3%
LibraryGPLv2+    9320        3%
GPLv3+           7475        2%
SeeFile          6163        2%
boostV1          4802        1%
BSD3             4460        1%
MITX11noNotice   4219        1%
CDDLv1orGPLv2    3651        1%

Page 11: Ninja Research Lab, University of Victoria

Fedora 12: Most common declared licenses

Declared License   Source License   # Src Pkgs   # Bin Pkgs
gplv2+             GPLv2+           118          145
asl 2.0            Apachev2         28           48
lgplv2+            LesserGPLv2.1+   27           36
mit                MITX11noNotice   21           30
mit                MITold           18           23
lgplv2+            LibraryGPLv2+    16           23
gpl+ or artistic   SameAsPerl       14           14
gplv2              GPLv2            11           12
bsd                BSD3             11           11
gplv2              GPLv2+           10           14

Page 12: Ninja Research Lab, University of Victoria

Applications that had Errors in their Licensing

- Files without a license that should have one
- Cutting-and-pasting the wrong license statement
- Inconsistent license clauses
- Incorrect name of the license
- License statements can only be edited by their copyright owners

Page 13: Ninja Research Lab, University of Victoria

License Maintenance: Requirements

- Editing of the license statements.
- Verifying the validity of the license statements.
- Summarizing licenses in source code files.
- Tracking of copyright owners.

Page 14: Ninja Research Lab, University of Victoria

Current Development: Auditing Fedora 12

1. Determining the license of a component is sometimes easy:
   - All files share the same license
2. But sometimes it is extremely difficult:
   - Same source package splits into different binary packages, each with a different license
   - Sometimes licenses are in documentation
   - Errors in licenses!
     - Sometimes by developers
     - Sometimes by packagers

Page 15: Ninja Research Lab, University of Victoria

Fedora 12: Licenses for source packages having code with one license

Declared License   Source License   # Src Pkgs   # Bin Pkgs
gplv2+             GPLv2+           118          145
asl 2.0            Apachev2         28           48
lgplv2+            LesserGPLv2.1+   27           36
mit                MITX11noNotice   21           30
mit                MITold           18           23
lgplv2+            LibraryGPLv2+    16           23
gpl+ or artistic   SameAsPerl       14           14
gplv2              GPLv2            11           12
bsd                BSD3             11           11
gplv2              GPLv2+           10           14
lgplv2+            LesserGPLv2+     8            9
gplv3+             GPLv3+           8            9
mit                X11mit           7            12
epl                EPLv1            6            6
mit                X11              5            6
lgplv2+            SeeFile          5            6
mit                SeeFile          4            5
bsd                BSD2             4            6
bsd                BSD4             4            4
asl 1.1            Apachev1.1       4            4

Page 16: Ninja Research Lab, University of Victoria

Example 2: packages with one license that is inconsistent with the declared license

Warning Level  Issue                             Source Package        Declared License                             Source License
OK             Incorrect license identification  mysql-connector-java  gplv2 with exceptions                        GPLv2
                                                 glade3                gplv2+ and (gplv2+ and lgplv2+) and lgplv2   GPLv2+
                                                 imagemagick           imagemagick                                  LesserGPLv2+
                                                 gzip                  gplv2 and gfdl                               GPLv2+
                                                 mpfr                  lgplv2+ and gplv2+ and gfdl                  LesserGPLv2.1+
                                                 libpng                zlib                                         GPLv2+
OK             Optional component                libpng                zlib                                         GPLv2+
OK             Used as a component               opensp                mit                                          LibraryGPLv2+
OK             Inconsistent declared license     automake              gplv2+ and gfdl and mit                      GPLv2+

Page 17: Ninja Research Lab, University of Victoria

Example 2: packages with one license that is inconsistent with the declared license (continued)

Warning Level  Issue                  Source Package  Declared License  Source License
Suspicious     Fedora false positive  eclipse-cdt     epl and cpl       EPLv1
Suspicious     License change         bsf             asl 1.1           Apachev2
                                      mtools          gplv2+            GPLv3+
Unknown        License was not found  ortp            lgplv2+ and vsl   LesserGPLv2.1+
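
A check of this kind can be approximated by comparing the declared tag of a package with the licenses found in its sources. The sketch below is an illustration and not the procedure used in the study: the `EXPECTED` mapping is hand-picked from the declared/source pairs in the one-license table (leaving out the `SeeFile` rows and the `gplv2`/`GPLv2+` pair), and the `audit` helper and its messages are hypothetical.

```python
# Declared (Fedora) tag -> source license names assumed to back it.
EXPECTED = {
    "gplv2+": {"GPLv2+"},
    "gplv2": {"GPLv2"},
    "gplv3+": {"GPLv3+"},
    "lgplv2+": {"LesserGPLv2+", "LesserGPLv2.1+", "LibraryGPLv2+"},
    "asl 2.0": {"Apachev2"},
    "asl 1.1": {"Apachev1.1"},
    "epl": {"EPLv1"},
    "bsd": {"BSD2", "BSD3", "BSD4"},
    "mit": {"MITX11noNotice", "MITold", "X11mit", "X11"},
    "gpl+ or artistic": {"SameAsPerl"},
}

def audit(declared, source_licenses):
    """List mismatches between a declared tag string and the source licenses."""
    cleaned = declared.replace("(", "").replace(")", "")
    # Split only on " and "; "gpl+ or artistic" stays a single tag.
    tags = list(dict.fromkeys(t.strip() for t in cleaned.split(" and ")))
    covered = set().union(*(EXPECTED.get(t, set()) for t in tags))
    issues = [f"undeclared source license: {lic}"
              for lic in sorted(source_licenses - covered)]
    issues += [f"declared but not found: {t}"
               for t in tags if not (EXPECTED.get(t, set()) & source_licenses)]
    return issues

for pkg, declared, source in [("mtools", "gplv2+", "GPLv3+"),
                              ("ortp", "lgplv2+ and vsl", "LesserGPLv2.1+")]:
    print(pkg, audit(declared, {source}))
```

An empty list means the declaration and the sources agree under the assumed mapping; anything else still needs manual review, as the warning levels in the tables indicate.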

Page 18: Ninja Research Lab, University of Victoria

Example 3: Packages under the GPL with code under the BSD-4

Warning Level  Issue                Packages
Ok             Copyright by UofC    ftp, guile, kernel, nmap, rpm, squid
               Copyright by NetBSD  exiv2, rpcbind
               Sample code          bash
Suspicious     Files using BSD-4    cups, isdn4k-utils, xen

Page 19: Ninja Research Lab, University of Victoria

Example 4: Source packages that contain files distributed with inconsistent GPL versions

Warning Level  Issue                        Package    Declared License  Source License
Suspicious     License evolution            fetchmail  gplv1+            GPLv2+
                                            iptables   gplv1+            GPLv2+
                                            cvs        gplv1+            GPLv2+
                                            bash       gplv2+            GPLv3+
                                            bison,     gplv2+            GPLv3+
               Some inconsistent files      mtools     gplv2+            GPLv3+
                                            vinagre    gplv2+            GPLv3+
               Contradictory documentation  vinagre    gplv2+            GPLv3+
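
Every row in this table pairs a declared GPL version with a newer "or later" version in the sources. One way to flag such rows automatically is sketched below, under the assumption that a declaration is acceptable only if each GPL version it allows is also allowed by the source license. That rule, the `parse` and `declaration_fits` helpers, and the name pattern are illustrative choices and do not come from the slides.

```python
import re

GPL_NAME = re.compile(r"^gplv(\d+)(\+?)$", re.IGNORECASE)

def parse(name):
    """'GPLv2+' -> (2, True); 'gplv2' -> (2, False)."""
    match = GPL_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a plain GPL name: {name!r}")
    return int(match.group(1)), match.group(2) == "+"

def declaration_fits(declared, source):
    """True if every version allowed by `declared` is also allowed by `source`."""
    d_ver, d_later = parse(declared)
    s_ver, s_later = parse(source)
    if s_later:
        # Source allows s_ver and anything newer.
        return d_ver >= s_ver
    # Source is pinned to exactly one version.
    return not d_later and d_ver == s_ver

# (package, declared, source) rows taken from the table.
rows = [
    ("fetchmail", "gplv1+", "GPLv2+"),
    ("iptables", "gplv1+", "GPLv2+"),
    ("cvs", "gplv1+", "GPLv2+"),
    ("bash", "gplv2+", "GPLv3+"),
    ("mtools", "gplv2+", "GPLv3+"),
    ("vinagre", "gplv2+", "GPLv3+"),
]
flagged = [pkg for pkg, declared, source in rows
           if not declaration_fits(declared, source)]
print(flagged)
```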

Page 20: Ninja Research Lab, University of Victoria

Results of interactions with Fedora and Upstream

Status                  Issue                         Source Package
Resolved Upstream       Incorrect license in sources  enchant, kdesdk, wireshark
Resolved Independently  Incorrect license in sources  xen
Resolved by Fedora      Incorrect declared license    abrt
                        Dynamic linking with GPL      php
Acknowledged by Fedora  Dynamic linking with GPL      lvm2, pilot-link
Reported Upstream       Incorrect license in sources  cups, isdn4k-utils
Reported to Fedora      Incorrect declared license    alsa-utils, bison, eclipse-cdt,
                                                      fetchmail, firstboot, iproute,
                                                      iptables, kdebindings, mtools,
                                                      ortp, rpcbind, vinagre, vino, yum

Page 21: Ninja Research Lab, University of Victoria

Current Development: Ultra fast 1-to-n clone detection

1. Windows 7 USB/DVD Download Tool contains GPL code but it is distributed with a proprietary license!
2. How can I know if my source code contains FOSS source code?
3. Running ccfinder on 0.5 million files of Debian 5.0.2 took 35 days!
   - Other tools simply run out of memory
   - We hit a worst-case: it took 1.5 days to analyze 1 file for clones (in itself only!)

Page 22: Ninja Research Lab, University of Victoria

Current Development: Yocca

1. Yocca is a system for verifying the existence of clones between a file and a large corpus of code (potentially millions of files)
   1. Based on n-grams
   2. Performs syntactic clone detection
   3. Runs in time O(n log n)

Corpus Size   1st Qu.   Median   Mean    3rd Qu.   Max
100           0.270     0.375    0.519   0.495     2.880
1,000         0.260     0.370    0.557   0.530     3.490
10,000        0.260     0.540    0.822   0.865     5.560
100,000       0.308     3.700    9.420   9.535     100.510

Times in seconds.
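
The general idea of n-gram-based 1-to-n lookup can be sketched as an inverted index from token n-grams to the corpus files that contain them. This is an illustration only and not Yocca's implementation: the tokenizer, the default `n=5`, the `NgramIndex` class and the overlap score are all assumptions, and the sketch makes no attempt to reproduce the O(n log n) bound.

```python
import re
from collections import defaultdict

# Crude language-agnostic tokenizer: identifiers, numbers, single symbols.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def ngrams(source, n=5):
    """Set of all runs of n consecutive tokens in the source text."""
    tokens = TOKEN.findall(source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

class NgramIndex:
    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> names of corpus files

    def add(self, name, source):
        for gram in ngrams(source, self.n):
            self.postings[gram].add(name)

    def query(self, source):
        """Fraction of the query's n-grams that occur in each corpus file."""
        grams = ngrams(source, self.n)
        hits = defaultdict(int)
        for gram in grams:
            for name in self.postings.get(gram, ()):
                hits[name] += 1
        return {name: count / len(grams) for name, count in hits.items()}

index = NgramIndex(n=3)
index.add("a.c", "int add(int a, int b) { return a + b; }")
index.add("b.c", 'void hello(void) { puts("hi"); }')
result = index.query("int add(int a, int b) { return a + b; }")
print(result)
```

Because the corpus is indexed once, checking one file costs time proportional to its own n-grams and their postings rather than to the size of the corpus, which is the property a 1-to-n detector needs.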

Page 23: Ninja Research Lab, University of Victoria

Future work

- Legal issues will not go away
- We are scratching the surface
- Several areas of future work:
  - Architecture recovery
  - Origin analysis
    - particularly at the assembly level
  - Dependency analysis
    - What does my application really need?

Page 24: Ninja Research Lab, University of Victoria

Acknowledgements

This work is being done in collaboration with:

- Ahmed Hassan
- Giulio Antoniol
- Julius Davies
- Katsuro Inoue
- Massimiliano Di Penta
- Simone Livieri
- Yann-Gaël Guéhéneuc
- Yuki Manabe
