Licensing is Software Too: Achievements and Challenges (and how this relates to code provenance)


Page 1: 1 Licensing is Software Too: Achievements and Challenges (and how this relates to code provenance) Massimiliano Di Penta University of Sannio, Italy dipenta@unisannio.it

Licensing is Software Too: Achievements and Challenges (and how this relates to code provenance)

Massimiliano Di Penta
University of Sannio, Italy

dipenta@unisannio.it

http://www.rcost.unisannio.it/mdipenta


Acknowledgements

Daniel M. Germán, Univ. Victoria, Canada

Julius Davies, Univ. Victoria, Canada

Giuliano Antoniol, Ecole Polyt. Montréal, Canada

Yann-Gaël Guéhéneuc, Ecole Polyt. Montréal, Canada


Reusing Open Source Software

When developing a software system, we try (if possible) not to reinvent the wheel
– Components, libraries, and source code snippets are out there, ready to be reused
– Code search engines are becoming popular

Open source code modification and redistribution are governed by
– Software licenses
– Copyright statements

Everything contained in a licensing block…


What does a licensing contain?

/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 ….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Brian Ryner <[email protected]>
 ….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
#include <windows.h>

Elements labeled on the slide: license (MPL + GPL + LGPL), copyright statement, copyright year, contributor


Restrictive vs. permissive licenses

Restrictive (aka copyleft or reciprocal)
– Changed software must be made available under similar terms wrt. the original
– Example: GPL

Permissive
– Modifications/enhancements may remain proprietary
– Distribution of source code or binary permitted
  – Provided copyright notice and/or liability disclaimers
  – Contributor names do not imply endorsement
– Examples: Berkeley Software Distribution (BSD), Apache Software License, MIT


FOSS development teams care! (source: Debian)

I am in the process of trying to prepare 0.8.0 for Debian GNU/Linux. I have started going over the copyright/license headers. In src/celeste many files are missing copyright information. Most of these are files imported with minimal changes from Gabor API http://www.kung-foo.tv/gaborapi.php or libsvm http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

The attached patch adds copyright and license statements to these files.[1]

Please apply and update the headers (adding copyright holders) if you make substantial changes.

thanks, cu andreas

[1] I have doublechecked with Gabor API's upstream author Adriaan Tijsseling that files like ContrastFilter.cpp are Copyright (c) Adriaan Tijsseling and licensed under GPLv2+, although the original headers just say:

Original Author: Yasunobu Honma

Modifications by: Adriaan Tijsseling (AGT)


Conjectures

Since licenses determine the way software can be composed and re-distributed
– They may change/evolve as any other part of the software
– They might be subject to bugs too
  – See our ICPC 2010 paper about how to identify licensing incompatibilities
– They might determine the success/failure of a software project

Code provenance and licenses:
– Licenses constrain source code migration between projects
– Code provenance might be useful to determine the licensing of closed components


Licenses influence the software lifetime

OpenBSD founder and project leader Theo de Raadt removed a security software package called IP-Filter [written by Darren Reed] after its author changed its license.
Stephen Shankland, CNET News, 2001/05/30.

Licenses evolve as software does
– Failing to account for that would cause copyright infringements
– Decisions on license changes have an impact, just like other decisions on software evolution
– Little attention so far from the scientific community

Need for methods and tools to audit licensing and their changes


Example: Java

Until November 2006, the license of Java JDK v1.2 said:
“Except as specifically authorized in any Supplemental License Terms, you may not make copies of Software, other than a single copy of Software for archival purposes”
– This disallowed the inclusion of Java in Linux distributions

Java 5.0 released under the GPL v2 with the CLASSPATH exception:
– Java could be modified/updated under the GPL v2
– Java programs could be released under any license as long as they satisfy the conditions stated in the CLASSPATH exception

Changing the license of a system can promote and ease the distribution and reuse of a software system


Example: QT

First released under a non-open source but free license, called the FreeQT License, and a commercial license
QT became the basis for KDE
QT v2.0 was released under a new license, the Q Public License
– incompatible with the GPL
GNOME project started as a QT-free alternative to KDE
Harmony project started as a GPL replacement of QT
Trolltech changed the license of QT v3 to the GPL v2
– The Harmony project was abandoned

Changing the license of a FOSS system towards a more permissive one might cause the abandonment of a competing system


Empirical Study

Goal: analyze licensing evolution
Purpose: investigating how developers change licensing statements
Context: CVS/SVN repositories of ArgoUML, Eclipse-JDT, the FreeBSD and the OpenBSD kernels, Mozilla, Samba


Research Questions

RQ1: To what extent are files changing their licenses?

RQ2: How are copyright years changed in licensing statements?

RQ3: Who are the contributors of a software project and how do they change?


Licensing Analysis Method – Extracting Licensing statements

[Slide figure: the Mozilla source file header shown under "What does a licensing contain?", with the licensing statement delimited by the BEGIN LICENSE BLOCK and END LICENSE BLOCK markers.]
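The extraction step could be sketched roughly as follows. This is a minimal illustration, not the tooling used in the study: it assumes the Mozilla-style BEGIN/END LICENSE BLOCK markers seen in the example header, and `extract_license_block` is a hypothetical helper name. Files that carry a license in an unmarked leading comment would need a different heuristic.

```python
import re

# Assumption: the licensing statement is delimited by Mozilla-style markers.
BEGIN = re.compile(r"\*{5} BEGIN LICENSE BLOCK \*{5}")
END = re.compile(r"\*{5} END LICENSE BLOCK \*{5}")


def extract_license_block(source: str) -> list[str]:
    """Return the text lines inside the first license block, or [] if none."""
    lines = []
    inside = False
    for raw in source.splitlines():
        if not inside:
            if BEGIN.search(raw):
                inside = True
            continue
        if END.search(raw):
            return lines
        # Drop comment decoration (leading whitespace, '*' and '/').
        lines.append(raw.strip().lstrip("*/").strip())
    return []  # no block, or an unterminated one


if __name__ == "__main__":
    sample = (
        "/* ***** BEGIN LICENSE BLOCK *****\n"
        " * Version: MPL 1.1/GPL 2.0/LGPL 2.1\n"
        " *\n"
        " * ***** END LICENSE BLOCK ***** */\n"
        "#include <windows.h>\n"
    )
    print(extract_license_block(sample))
```

Running the extraction on every revision of a file yields one licensing statement per revision, which is the input the later classification and mining steps would need.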


Licensing Analysis Method – Classifying licenses

FoSSology [Gobeille, MSR 2008]: detects licenses using the Binary Symbolic Alignment Matrix (bSAM)
Ninka [German et al., ASE 2010]: uses a pattern-matching approach

[Slide figure: the license block from the Mozilla example is classified as MPL 1.1/GPL 2.0/LGPL 2.1]
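To make the idea of pattern-based classification concrete, here is a deliberately small sketch. It is not Ninka or FoSSology: the `PATTERNS` table and the `classify` function are illustrative assumptions, and real tools rely on far larger, curated rule sets.

```python
import re

# Illustrative patterns only (assumption); each maps a label to a regex.
PATTERNS = [
    ("MPL 1.1", re.compile(r"Mozilla Public License\s+Version\s+1\.1", re.I)),
    ("GPL 2.0", re.compile(r"\bGPL 2\.0\b", re.I)),
    ("LGPL 2.1", re.compile(r"\bLGPL 2\.1\b", re.I)),
]


def classify(license_text: str) -> str:
    """Return the matched license labels joined by '/', or 'None'."""
    # Collapse line breaks so a phrase split across comment lines still matches.
    flat = " ".join(license_text.split())
    found = [name for name, pattern in PATTERNS if pattern.search(flat)]
    return "/".join(found) if found else "None"


if __name__ == "__main__":
    text = (
        "Version: MPL 1.1/GPL 2.0/LGPL 2.1\n"
        "The contents of this file are subject to the Mozilla Public License\n"
        "Version 1.1 (the \"License\")"
    )
    print(classify(text))
```

The word-boundary anchors keep the GPL pattern from firing inside "LGPL", which is the kind of near-miss that makes GPL/LGPL confusion a common classification error.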


Licensing Analysis Method – Identifying changes in copyright years

Mining references to years in licensing…

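Mining year references and comparing them across revisions might look like the sketch below. The regular expressions and the function names `copyright_years` and `year_changes` are assumptions made for illustration; the study's actual patterns are not given in the slides.

```python
import re

# Assumption: plausible copyright years are four digits from 1970 to 2099.
YEAR = re.compile(r"\b(19[7-9]\d|20\d\d)\b")
COPYRIGHT_LINE = re.compile(r"copyright|\(c\)", re.I)


def copyright_years(license_text: str) -> set[int]:
    """Collect years mentioned on lines that look like copyright statements."""
    years = set()
    for line in license_text.splitlines():
        if COPYRIGHT_LINE.search(line):
            years.update(int(y) for y in YEAR.findall(line))
    return years


def year_changes(old_text: str, new_text: str) -> tuple[list[int], list[int]]:
    """Return (years added, years removed) between two revisions."""
    old, new = copyright_years(old_text), copyright_years(new_text)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    before = " * Portions created by the Initial Developer are Copyright (C) 2002"
    after = " * Portions created by the Initial Developer are Copyright (C) 2002-2004"
    print(year_changes(before, after))
```

Note that a range such as 2002-2004 contributes only its endpoints here; expanding ranges into every covered year would be a design choice left open by this sketch.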


Licensing Analysis Method – Identifying contributor names

Mining emails, plus various patterns
– Copyright … year name
– Contributor(s) …

And mapped to committers, whenever possible

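A rough sketch of contributor mining along these lines is shown below. The regular expressions, the helper names, and the `committers` lookup table are all assumptions for illustration; the names and addresses in the usage example are made up.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# "Copyright ... year name": one or more years followed by a name (assumption).
COPYRIGHT_NAME = re.compile(
    r"copyright\s+(?:\(c\)\s+)?\d{4}(?:\s*[-,]\s*\d{4})*\s+(.+)", re.I
)


def _clean(text: str) -> str:
    """Strip a trailing <email> from a name."""
    return re.sub(r"\s*<[^>]*>", "", text).strip()


def mine_contributors(lines: list[str]) -> tuple[list[str], list[str]]:
    """Return (names, emails) found in the comment lines of a license block."""
    names, emails = [], []
    in_list = False  # True while reading the entries under "Contributor(s):"
    for raw in lines:
        text = raw.strip().lstrip("*").strip()
        emails.extend(EMAIL.findall(text))
        if text.lower().startswith("contributor(s):"):
            in_list = True
        elif in_list and text:
            names.append(_clean(text))
        elif in_list:
            in_list = False  # a blank comment line ends the list
        else:
            match = COPYRIGHT_NAME.search(text)
            if match:
                names.append(_clean(match.group(1)))
    return names, emails


def map_to_committers(emails: list[str], committers: dict[str, str]) -> dict:
    """Map each mined email to a committer id, or None when unknown."""
    return {email: committers.get(email.lower()) for email in emails}


if __name__ == "__main__":
    block = [
        " * Copyright (c) 2001, 2003 John Roe <jr@example.com>",
        " * Contributor(s):",
        " *   Jane Doe <jane@example.org>",
        " *",
    ]
    names, emails = mine_contributors(block)
    print(names, map_to_committers(emails, {"jane@example.org": "jdoe"}))
```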


RQ1: Most relevant license changes

Eclipse-JDT
From | To | Type | Count
Common Public License v1.0 | Eclipse Public License v1.0 | CHANGE | 2394
Common Public License v0.5 | Common Public License v1.0 | UPDATE | 808

Mozilla
From | To | Type | Count
NPL | 'NPL v1.1'-style+GPL v2+LGPL v2.1 | DUAL | 2914
NPL | 'Dual MPL GPL'-style+MPL | DUAL | 1274
'Dual MPL GPL'-style+MPL | NPL | BUG | 1194

Licensing updated as new licenses were developed

Eclipse JDT: CPL 0.5 → CPL 1.0 → EPL 1.0
– IBM has relinquished control of licenses to the Eclipse Foundation

Mozilla: NPL → MPL + GPL (+ LGPL)
– NPL allowed to release Netscape 6 as a proprietary system
– MPL only allows to re-distribute the source code under the MPL
– Multiple licenses to deal with incompatibilities
– Files wrongly changed to NPL (bug #98089)


RQ1: Most relevant license changes

FreeBSD
From | To | Type | Count
BSD UCRegents (4-cl BSD) | 'BSD UCRegents'-style (4-cl BSD) | UPDATE | 491
'BSD UCRegents'-style (4-cl BSD) | 'INRIA-OSL'-style (3-cl BSD) | UPDATE | 300

OpenBSD
From | To | Type | Count
'BSD UCRegents'-style (4-cl BSD) | 'INRIA-OSL'-style (3-cl BSD) | UPDATE | 964
BSD UCRegents (4-cl BSD) | 'BSD UCRegents'-style (4-cl BSD) | UPDATE | 414

FreeBSD and OpenBSD are more eclectic than other projects
– Moving from BSD-4 clauses to the more permissive BSD-3 and BSD-2


RQ1: Most relevant license changes

ArgoUML
From | To | Type | Count
None | 'Free with copyright clause'-style + 'UC Regents free with copyright clause'-style | ADD | 127

Samba
From | To | Type | Count
None | GPL v2 | ADD | 15

ArgoUML and Samba kept the same licenses over the analyzed time span
– Change is from None to a simple license
– Authors realized the importance of including a license
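Counts of license transitions like those reported for RQ1 could be tallied from per-revision classifications as in the sketch below. The input layout (a dictionary from file path to a list of license labels, oldest revision first) and the function name are assumptions; the change types shown in the slides (CHANGE, UPDATE, DUAL, BUG, ADD) are not derived here and would presumably require manual inspection.

```python
from collections import Counter


def license_transitions(history: dict[str, list[str]]) -> Counter:
    """Count (from, to) license changes between consecutive file revisions.

    history maps a file path to its license labels, oldest revision first.
    """
    counts = Counter()
    for labels in history.values():
        for before, after in zip(labels, labels[1:]):
            if before != after:
                counts[(before, after)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical histories, for illustration only.
    history = {
        "a.c": ["None", "GPL v2", "GPL v2"],
        "b.c": ["None", "GPL v2"],
        "C.java": ["CPL v0.5", "CPL v1.0", "EPL v1.0"],
    }
    for (src, dst), n in license_transitions(history).most_common():
        print(f"{src} -> {dst}: {n}")
```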


RQ2: How and why were copyright years changed?

Files for which the copyright years were updated underwent a significantly higher number of changes than others
– When developers perform substantial changes to a file, they also update copyright years

Required by copyright regulations
– Lack of updates with substantial changes would allow an infringer to claim “innocent infringement”

Commits explicitly targeted to copyright years
– “Updated copyrights”
– “Updated copyrights to 2004”


RQ3: When do contributors change?

Changes where contributor names are added are significantly bigger than other changes
– Contributors are often added when they make substantial changes

Contributor names are important assets in source code
– Like the signature on a picture

However…
– contributors can change over time
– no standard way of reporting them
– no clear rule on when one should become a contributor
– Their presence can have legal implications


Licenses Influence Code Migration


Free (software) as a bird…

As birds migrate differently during different seasons…
– Code might have a preferential migration direction

Given two systems
– e.g. FreeBSD and Linux
We find the same code in both systems

Three scenarios:
– Migration FreeBSD → Linux
– Migration Linux → FreeBSD
– Migration third-party → FreeBSD, Linux


Sibling(s) Origin

Identify siblings between systems using clone detection
– CCFinderX, with >100 tokens as threshold, plus other heuristics

Trace siblings back into the past, following their code fragments in the same files
– Again clone detection, matching the sibling fragment against previous file revisions
– When the fragments disappear, we have found their origins

Take the oldest of the two as the true origin

[Slide diagram: Sys 1 – File i and Sys 2 – File j are siblings sharing cloned fragments; an arrow marks the migration direction]
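The back-tracing idea can be sketched as follows. This is only an outline under strong simplifications: `naive_contains` is a token-sequence match standing in for a real clone detector such as CCFinderX, and the function names, system names, and revision data are hypothetical.

```python
from datetime import date


def naive_contains(fragment: str):
    """Build a predicate that checks for the fragment's token sequence.

    Stand-in for clone detection (assumption); tokens are whitespace-split.
    """
    needle = fragment.split()

    def check(text: str) -> bool:
        hay = text.split()
        span = len(needle)
        return any(hay[i:i + span] == needle for i in range(len(hay) - span + 1))

    return check


def first_appearance(revisions, contains_fragment):
    """Walk back from the newest revision until the fragment disappears.

    revisions is a list of (date, file text), oldest first.
    Returns the date of the earliest contiguous revision holding the fragment.
    """
    origin = None
    for when, text in reversed(revisions):
        if not contains_fragment(text):
            break
        origin = when
    return origin


def migration_direction(origin_1, origin_2, name_1="Sys 1", name_2="Sys 2"):
    """Return (source, target), taking the older origin as the true origin."""
    if origin_1 is None or origin_2 is None or origin_1 == origin_2:
        return None
    return (name_1, name_2) if origin_1 < origin_2 else (name_2, name_1)


if __name__ == "__main__":
    has = naive_contains("foo(x); bar(y);")
    sys_1 = [(date(1999, 1, 1), "int a;"), (date(2000, 5, 1), "int a; foo(x); bar(y);")]
    sys_2 = [(date(2001, 1, 1), "void f();"), (date(2002, 3, 1), "void f(); foo(x); bar(y);")]
    print(migration_direction(first_appearance(sys_1, has), first_appearance(sys_2, has)))
```

Equal origin dates return no direction, which loosely corresponds to the third scenario where both systems may have imported the code from a third party.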


Code Migration and Licenses

FreeBSD | Linux | Files
BSD | GPL | 8
BSD | MIT | 2
BSD | None | 2
Corporate | BSD+GPL | 89
GPL | None | 1
Phrase | BSD+GPL | 1
X.Net+BSD | MIT | 1

Linux | FreeBSD | Files
BSD+GPL | Corporate | 8
GPL | BSD | 17
GPL | BSD+GPL | 1
GPL | CPL+BSD+GPL | 1
MIT | BSD | 1
MIT+GPL | None | 2
None | BSD | 1
Phrase+GPL | MIT | 2

OpenBSD | Linux | Files
BSD | BSD+GPL | 1
BSD | MIT | 2
BSD | Unknown | 1
BSD+GPL | GPL | 1
BSD+Phrase | Phrase+GPL | 1
MIT | GPL | 23

Slide callouts: "After Jan 1, 2002 (nothing before)"; "Before Jan 1, 2002 (almost nothing after)"


Discussion

Siblings have a preferential flow
– Initially from BSD(s) to Linux – frequent
– Today from Linux to FreeBSD – less frequent
– Thus, due to licenses but also to the system level of development

Companies directly contribute to code in different kernels – see Intel drivers with dual licenses
– In this case, code migrates from a third party towards Linux and FreeBSD


Identifying licenses of jar archives


Motivations

Very often, Java open source software is distributed in jar archives
– See http://mvnrepository.com/

Problem: the jar might not contain licensing info
– Under what conditions can we integrate the component?
– The jar might not be legally used
– Even if it comes from open source code, we might not find exactly the same jar


Search-driven approach

Extracting info from the class bytecode
– Class and package names.. or a fingerprint..
– We use the ASM library (http://asm.ow2.org/)

Querying Google Code Search
– Using the fully qualified class name
– Using the package only
– Query performed using the Google Code API (http://code.google.com/apis/gdata/)

If the same class is not found, its license is inferred from those of classes belonging to the same package
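The package-level fallback can be illustrated with the sketch below. It departs from the slides in two labeled ways: class names are read from jar entry paths with Python's `zipfile` rather than from bytecode via ASM, and the code search is replaced by a plain dictionary of already-found licenses. Using a majority vote within the package is also an assumption; the slides do not say how conflicting licenses in one package are resolved.

```python
import zipfile
from collections import Counter


def class_names(jar_path: str) -> list[str]:
    """List fully qualified class names from the .class entries of a jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name[: -len(".class")].replace("/", ".")
            for name in jar.namelist()
            if name.endswith(".class")
        )


def package_of(class_name: str) -> str:
    """Return the package part of a fully qualified class name."""
    return class_name.rpartition(".")[0]


def infer_licenses(names: list[str], found: dict[str, str]) -> dict:
    """Assign a license to every class.

    found maps class names located by the search to their license.
    Classes that were not found take the most common license among the
    found classes of the same package (assumption), or None.
    """
    by_package: dict[str, Counter] = {}
    for name, lic in found.items():
        by_package.setdefault(package_of(name), Counter())[lic] += 1

    result = {}
    for name in names:
        if name in found:
            result[name] = found[name]
        else:
            votes = by_package.get(package_of(name))
            result[name] = votes.most_common(1)[0][0] if votes else None
    return result
```

A typical call would be `infer_licenses(class_names("some.jar"), found)`, where `found` holds whatever a code search returned for the classes it could locate.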


Google Code Search Output


% of correct classifications

Found license:
– Min. 29% (commons-codec), avg. 82%, median 89.5%

Inferred licenses:
– Min. 62% (JLayer 1.0), avg. 95%, median 100%

The inferring heuristic is significantly better both in terms of completeness and of precision


Incorrect classifications

Most of them are between LGPL and GPL and between BSD and Apache.

commons-codec: mismatching between Apache and BSD
– files licensed under the Apache v 1.1, derived from the BSD

JLayer: mismatching between GPL and LGPL
– same inferred licenses in both releases (0.4 and 1.0)
– however, JLayer moved from GPL to LGPL from release 0.4 to release 1.0


Conclusions

We proposed a code analysis method as support for lawyers as well as for software engineers

We studied how licensing statements are used and evolve
– License type, copyright year, contributors

Main findings:
– Licenses influence project outcomes
– Licenses influence code migration
– Projects are moving towards more permissive licenses
– Copyright years and contributor names are updated to preserve rights on new code


Licensing and code provenance

Licensing influences the direction in which code flows from one system towards another
– Often code flows in the direction of more permissive licenses…
– ..but there are many other factors influencing how code flows

Search-driven approaches can be adopted to determine what code a closed component comes from
– And thus its licensing…
– Issues related to the capabilities of the code search tools


Thank you!


References

Daniel M. Germán, Jens H. Weber-Jahnke, Massimiliano Di Penta: Lawful Software Engineering. Proceedings of FoSER: Working Conference on the Future of Software Engineering Research, November 2010, Santa Fe, USA, ACM

Daniel M. Germán, Massimiliano Di Penta, Julius Davies: Understanding and Auditing the Licensing of Open Source Software Distributions. ICPC 2010: 84-93

Massimiliano Di Penta, Daniel M. Germán, Yann-Gaël Guéhéneuc, Giuliano Antoniol: An exploratory study of the evolution of software licensing. ICSE 2010: 145-154

Massimiliano Di Penta, Daniel M. Germán, Giuliano Antoniol: Identifying licensing of jar archives using a code-search approach. MSR 2010: 151-160

Massimiliano Di Penta, Daniel M. Germán: Who are Source Code Contributors and How do they Change? WCRE 2009: 11-20

Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano Antoniol: Code siblings: Technical and legal implications of copying code between applications. MSR 2009: 81-90

Daniel M. Germán, Yuki Manabe, Katsuro Inoue: A sentence-matching method for automatic license identification of source code files. ASE 2010: 437-446

Daniel M. Germán, Ahmed E. Hassan: License integration patterns: Addressing license mismatches in component-based development. ICSE 2009: 188-198

Robert Gobeille: The FOSSology project. MSR 2008: 47-50