code is not text! how graph technologies can help us to understand our code better

Code Is Not Text! How graph technologies can help us to understand our code better Andreas Dewes (@japh44) [email protected] 21.07.2015 EuroPython 2015 – Bilbao

Upload: andreas-dewes

Post on 14-Aug-2015


Page 1: Code is not text! How graph technologies can help us to understand our code better

Code Is Not Text!

How graph technologies can help us to understand our code better

Andreas Dewes (@japh44)

[email protected]

21.07.2015

EuroPython 2015 – Bilbao

Page 2

About

Physicist and Python enthusiast

We are a spin-off of the University of Munich (LMU).

We develop software for data-driven code analysis.

Page 3

How we usually think about code

Page 4

But code can also look like this...

Page 5

Our Journey

1. Why graphs are interesting

2. How we can store code in a graph

3. What we can learn from the graph

4. How programmers can profit from this

Page 6

Graphs explained in 30 seconds

node / vertex

edge

node_type: classdef, name: Foo

label: classdef, data: {...}

node_type: functiondef, name: foo

Old idea, many new solutions: Neo4j, OrientDB, ArangoDB, TitanDB, ... (+SQL, key/value stores)

Page 7

Graphs in Programming

Used mostly within the interpreter/compiler.

Use cases

• Code Optimization

• Code Annotation

• Rewriting of Code

• As Intermediate Language

Page 8

Building the Code Graph

    def encode(obj):
        """
        Encode a (possibly nested) dictionary containing complex values
        into a form that can be serialized using JSON.
        """
        e = {}
        for key, value in obj.items():
            if isinstance(value, dict):
                e[key] = encode(value)
            elif isinstance(value, complex):
                e[key] = {'type': 'complex', 'r': value.real, 'i': value.imag}
        return e

[Figure: the AST of encode() drawn as a graph. Nodes: functiondef, for, assign, name, dict. Edge labels: body, targets, iterator, value.]

    import ast
    tree = ast.parse(" ")
    ...
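The step from ast.parse to a graph can be sketched as follows. This is only an illustration under stated assumptions, not the storage format used in the talk: the helper build_graph and its (nodes, edges) layout are hypothetical, while ast.parse and ast.iter_fields are standard library calls.

```python
import ast

def build_graph(source):
    """Return (nodes, edges) for the AST of `source` (hypothetical layout).

    nodes: {node_id: {'node_type': ..., plus simple fields such as name/id}}
    edges: list of (parent_id, edge_label, child_id)
    """
    nodes, edges = {}, []

    def visit(node):
        node_id = len(nodes)
        data = {'node_type': type(node).__name__.lower()}
        nodes[node_id] = data
        for field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    # Child AST nodes become vertices linked by a labelled edge.
                    edges.append((node_id, field, visit(child)))
                elif not isinstance(value, list) and isinstance(child, (str, int, float)):
                    # Simple scalar fields become properties of the vertex.
                    data[field] = child
        return node_id

    visit(ast.parse(source))
    return nodes, edges

nodes, edges = build_graph("def encode(obj):\n    e = {}\n    return e\n")
```

Here nodes[1] holds the functiondef vertex with name 'encode', and the edge (0, 'body', 1) links it to the enclosing module.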

Page 9

Storing the Graph: Merkle Trees

https://en.wikipedia.org/wiki/Merkle_tree

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

https://en.bitcoin.it/wiki/Protocol_documentation#Merkle_Trees

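Example: git (also Bitcoin)

[Figure: Merkle tree of a git repository. Directories are tree objects, files are blobs, and each object is identified by its hash.]

    /              4a7ef...   (tree)
    /flask         79fe4...   (tree)
    /docs          a77be...   (tree)
    /docs/conf.py  9fa5a..    (blob)
    /flask/app.py  7fa2a..... (blob)
    ...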
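A minimal sketch of the Merkle idea, assuming a repository is given as nested dictionaries: the hash of a directory depends on the hashes of its entries, so an unchanged subtree keeps its hash. The functions hash_blob and hash_tree are hypothetical, and the byte layout is deliberately simplified; it is not git's actual object format.

```python
import hashlib

def hash_blob(content):
    """Hash a file's bytes (simplified, not git's real blob encoding)."""
    return hashlib.sha1(b"blob\0" + content).hexdigest()

def hash_tree(entries):
    """Hash a directory given as {name: bytes (file) or dict (subdirectory)}."""
    lines = []
    for name in sorted(entries):
        value = entries[name]
        digest = hash_tree(value) if isinstance(value, dict) else hash_blob(value)
        lines.append(f"{name} {digest}")
    return hashlib.sha1(("tree\0" + "\n".join(lines)).encode("utf-8")).hexdigest()

docs = {"conf.py": b"project = 'Flask'\n"}
repo_v1 = {"flask": {"app.py": b"app = 1\n"}, "docs": docs}
repo_v2 = {"flask": {"app.py": b"app = 2\n"}, "docs": docs}

root_v1, root_v2 = hash_tree(repo_v1), hash_tree(repo_v2)
docs_v1, docs_v2 = hash_tree(repo_v1["docs"]), hash_tree(repo_v2["docs"])
```

Changing app.py changes the hash of /flask and of the root, while /docs keeps its hash and does not need to be stored again.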

Page 10

AST Example

[Figure: the ASTs of two functions, {name: 'encode', args : [...]} and {name: 'decode', args : [...]}, stored as Merkle trees. Nodes (functiondef, for, assign, name, dict) carry data such as {id : 'e'}, {id : 'f'}, {i : 0} and {i : 1}; edges are labelled body, targets, iterator and value. Each node is identified by a hash (e4fa76b..., a76fbc41..., c51fa291..., 5afacc..., ba4ffac..., 7faec44..., 74af219...).]
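The same hashing scheme can be applied to AST nodes. The sketch below is an assumption about how such a store could work, not the implementation from the talk: merkle_hash and the dictionary store are hypothetical names, and the payload format is arbitrary.

```python
import ast
import hashlib
import json

def merkle_hash(node, store):
    """Hash `node` from its own data plus the hashes of its children."""
    data, children = {}, []
    for field, value in ast.iter_fields(node):
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, ast.AST):
                children.append((field, merkle_hash(item, store)))
            elif item is not None:
                data.setdefault(field, []).append(repr(item))
    payload = json.dumps([type(node).__name__, data, children], sort_keys=True)
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    store[digest] = payload  # identical subtrees map to the same key
    return digest

store = {}
h_encode = merkle_hash(ast.parse("def encode(obj):\n    e = {}\n    return e\n"), store)
size_after_first = len(store)
h_decode = merkle_hash(ast.parse("def decode(obj):\n    e = {}\n    return e\n"), store)
new_entries = len(store) - size_after_first
```

Because the two functions differ only in their name, the second one adds far fewer entries to the store than the first: everything below the functiondef node is shared.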

Page 11

Efficiency of this Approach

Page 12

What this enables

• Store everything, not just condensed metadata (as IDEs do, for example)

• Store multiple projects together, to reveal connections and similarities

• Store the whole git commit history of a given project, to see changes across time.

Page 13

The Flask project (30,000 vertices)

[Figure: graph of the Flask code base. Legend: Modules, Classes, Functions.]

Page 14

Working with Graphs

Page 15

Querying & Navigation

1. Perform a query over some indexed field(s) to retrieve an initial set of nodes or edges.

graph.filter({'node_type' : 'functiondef',...})

2. Traverse the resulting graph along its edges.

    for child in node.outV('body'):
        if child['node_type'] == ...
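The two steps above, query then traverse, can be tried out on a tiny in-memory graph. The Node and Graph classes below are hypothetical stand-ins that only borrow the names filter and outV from the slide; the real API may differ.

```python
class Node(dict):
    """A vertex: a dict of properties plus labelled outgoing edges."""

    def __init__(self, **props):
        super().__init__(props)
        self.edges = []  # list of (label, target_node)

    def link(self, label, target):
        self.edges.append((label, target))
        return target

    def outV(self, label=None):
        """Return the targets of outgoing edges, optionally filtered by label."""
        return [target for lbl, target in self.edges if label is None or lbl == label]


class Graph:
    def __init__(self):
        self.nodes = []

    def add(self, **props):
        node = Node(**props)
        self.nodes.append(node)
        return node

    def filter(self, query):
        """Return the nodes whose properties equal every key/value in `query`."""
        return [n for n in self.nodes
                if all(n.get(key) == value for key, value in query.items())]


graph = Graph()
func = graph.add(node_type='functiondef', name='encode')
loop = func.link('body', graph.add(node_type='for'))
func.link('body', graph.add(node_type='return'))
loop.link('body', graph.add(node_type='if'))

# Step 1: query an indexed field. Step 2: traverse along 'body' edges.
found = graph.filter({'node_type': 'functiondef'})
body_types = [child['node_type'] for child in found[0].outV('body')]
```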

Page 16

Examples

Show all symbol names, sorted by usage.

graph.filter({'node_type' : {$in : ['functiondef','...']}})

.groupby('name',as = 'cnt').orderby('-cnt')

    index  79
    ...
    foo     7
    ...
    bar     5
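Without a graph database, a comparable count can be approximated directly on the AST. This sketch uses the standard ast module and collections.Counter instead of the graph query shown above; symbol_usage is a hypothetical helper, and it counts definitions and name references only.

```python
import ast
from collections import Counter

def symbol_usage(source):
    """Count function/class definitions and name references, most used first."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            counts[node.name] += 1
        elif isinstance(node, ast.Name):
            counts[node.id] += 1
    return counts.most_common()

usage = symbol_usage("def foo(index):\n    return index * index + index\n\nfoo(1)\n")
```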

Page 17

Examples (contd.)

Show all versions of a given function.

graph.get_by_path('flask.helpers.url_for')

    def url_for(endpoint, **values):
        """Generates a URL to the given endpoint with the method provided.

        Variable arguments that are unknown to the target endpoint are appended
        to the generated URL as query arguments. If the value of a query argument
        is ``None``, the whole pair is skipped. In case blueprints are active
        you can shortcut references to the same blueprint by prefixing the
        local endpoint with a dot (``.``).

        This will reference the index function local to the current blueprint::

            url_for('.index')

[Figure: several versions of url_for shown side by side, each identified by its hash, e.g. fa7fca... and 3cdaf...]

Page 18

Visualizing Code

Page 19

Example: Code Complexity

Graph Algorithm for Calculating the Cyclomatic Complexity (the Python variety)

    node = root

    def walk(node, anchor=None):
        if node['node_type'] == 'functiondef':
            anchor = node
            anchor['cc'] = 1  # there is always one path
        elif node['node_type'] in ('for', 'if', 'ifexp', 'while', ...):
            if anchor:
                anchor['cc'] += 1
        for subnode in node.outV:
            walk(subnode, anchor=anchor)

    # aggregate by function path to visualize

The cyclomatic complexity is a quantitative measure of the number of linearly independent paths through a program's source code. It was developed by Thomas J. McCabe, Sr. in 1976.
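The walk above can be made runnable on Python's own ast module. This is a sketch under simplifying assumptions: it counts only the four node types named on the slide (for, if, ifexp, while), keys results by plain function name, and ignores boolean operators and exception handlers, so the numbers are a lower bound on the usual McCabe metric. The function cyclomatic_complexity is a hypothetical helper.

```python
import ast

BRANCH_NODES = (ast.For, ast.While, ast.If, ast.IfExp)

def cyclomatic_complexity(source):
    """Map each function name to 1 + the number of branching nodes inside it."""
    result = {}

    def walk(node, anchor=None):
        if isinstance(node, ast.FunctionDef):
            anchor = node.name
            result[anchor] = 1  # there is always one path
        elif isinstance(node, BRANCH_NODES) and anchor is not None:
            result[anchor] += 1
        for child in ast.iter_child_nodes(node):
            walk(child, anchor)

    walk(ast.parse(source))
    return result

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imag}
    return e
'''

complexity = cyclomatic_complexity(SOURCE)
```

For the encode example this yields a complexity of 4: one base path, one for loop, and two if nodes (the elif is a nested if in the AST).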

Page 20

Example: Flask

flask.helpers.send_file (complexity: 22)

flask.helpers.url_for (complexity: 14)

area: AST weight (lines of code)

height: complexity

color: complexity/weight

https://quantifiedcode.github.io/code-is-beautiful

Page 21

Exploring Dependencies in a Code Base

Page 22

Finding Patterns & Problems

Page 23

Pattern Matching: Text vs. Graphs

Many other standards: XQuery/XPath, Cypher (Neo4j), Gremlin (e.g. TitanDB), ...

Text: Hello, world!

Regular expression: /(hello|hallo),*\s*(world|welt)/i

Token graph: word(hello) -> punctuation(,) -> word(world)

Graph pattern:

    node_type: word
    content: {$or : [hello, hallo]}
    #...
    >followed_by:
      node_type: word
      content: {$or : [world, welt]}
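The difference between the two styles of matching can be illustrated with a short sketch. The regular expression is the one from the slide; tokenize and followed_by are hypothetical helpers that mimic the graph pattern on a flat token list rather than a real graph.

```python
import re

TEXT_PATTERN = re.compile(r"(hello|hallo),*\s*(world|welt)", re.IGNORECASE)

def tokenize(text):
    """Split text into ('word', w) and ('punctuation', p) tokens."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        value = match.group()
        kind = 'word' if value[0].isalnum() else 'punctuation'
        tokens.append((kind, value.lower()))
    return tokens

def followed_by(tokens, first, second):
    """True if a word from `first` is directly followed by a word from `second`,
    ignoring any punctuation tokens in between."""
    words = [value for kind, value in tokens if kind == 'word']
    return any(a in first and b in second for a, b in zip(words, words[1:]))

GREETINGS, PLACES = {'hello', 'hallo'}, {'world', 'welt'}

text_hit = TEXT_PATTERN.search("Hallo - Welt") is not None
graph_hit = followed_by(tokenize("Hallo - Welt"), GREETINGS, PLACES)
```

The regular expression has to anticipate every separator explicitly and misses "Hallo - Welt", whereas the structural match only asks which word follows which.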

Page 24

Example: Building a Code Checker

    node_type: tryexcept
    >handlers:
      $contains:
        node_type: excepthandler
        type: null
        >body:
          node_type: pass

    try:
        customer.credit_card.debit(-100)
    except:
        pass  # to-do: implement this!
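The same rule can be approximated with the standard ast module. This is a sketch, not the pattern engine from the talk: find_silent_excepts is a hypothetical helper, and it uses ast.Try, the Python 3 name for what the slide calls tryexcept.

```python
import ast

def find_silent_excepts(source):
    """Return line numbers of bare `except:` handlers whose body is only `pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Try):
            for handler in node.handlers:
                only_pass = all(isinstance(stmt, ast.Pass) for stmt in handler.body)
                if handler.type is None and only_pass:
                    hits.append(handler.lineno)
    return hits

BAD = "try:\n    customer.credit_card.debit(-100)\nexcept:\n    pass  # to-do\n"
OK = "try:\n    customer.credit_card.debit(-100)\nexcept ValueError:\n    pass\n"

bad_hits = find_silent_excepts(BAD)
ok_hits = find_silent_excepts(OK)
```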

Page 25

Adding an exception to the rule

    node_type: tryexcept
    >handlers:
      $contains:
        node_type: excepthandler
        type: null
        >body:
          $not:
            $anywhere:
              node_type: raise
              exclude:  # we exclude nested try's
                node_type:
                  $or: [tryexcept]

    try:
        customer.credit_card.debit(-100)
    except:
        logger.error("This can't be good.")
        raise  # let someone else deal with this
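A sketch of the refined rule on Python's ast module: a bare except is only reported if its body contains no raise, and nested try blocks are not searched. The helpers contains_raise and find_swallowing_excepts are hypothetical, and reading "exclude" as "do not descend into nested try statements" is an assumption.

```python
import ast

def contains_raise(statements):
    """True if a `raise` occurs in `statements`, ignoring nested try blocks."""
    stack = list(statements)
    while stack:
        node = stack.pop()
        if isinstance(node, ast.Raise):
            return True
        if isinstance(node, ast.Try):
            continue  # we exclude nested try's
        stack.extend(ast.iter_child_nodes(node))
    return False

def find_swallowing_excepts(source):
    """Return line numbers of bare `except:` handlers that never re-raise."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Try):
            for handler in node.handlers:
                if handler.type is None and not contains_raise(handler.body):
                    hits.append(handler.lineno)
    return hits

RERAISES = (
    "try:\n"
    "    customer.credit_card.debit(-100)\n"
    "except:\n"
    "    logger.error(\"This can't be good.\")\n"
    "    raise\n"
)
SWALLOWS = "try:\n    customer.credit_card.debit(-100)\nexcept:\n    pass\n"

reraise_hits = find_swallowing_excepts(RERAISES)
swallow_hits = find_swallowing_excepts(SWALLOWS)
```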

Page 26

Bonus Chapter: Analyzing Changes

Page 27

Example: Diff from Django Project

Page 28

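Basic Problem: Tree Matching (exact tree isomorphism is cheap to check, but the edit distance between unordered trees is NP-hard!)

[Figure: two almost identical function ASTs side by side. Both share the same structure (functiondef, body, for, iterator, assign, targets, name, value, dict) and the same data {i : 0} and {i : 1}; they differ only in the function name ({name: 'encode', args : [...]} vs. {name: '_encode', args : [...]}) and in a variable name ({id : 'e'} vs. {id : 'ee'}).]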

Page 29

Similar Problem: Chemical Similarity

https://en.wikipedia.org/wiki/Epigallocatechin_gallate

Epigallocatechin gallate

Solution(s):

Jaccard Fingerprints

Bloom Filters

...

Benzene
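A fingerprint-based similarity for code can be sketched in a few lines. The choice of fingerprint here, the set of (parent type, child type) pairs in the AST, is an assumption made for illustration and not necessarily what the talk used; fingerprint and jaccard are hypothetical helpers.

```python
import ast

def fingerprint(source):
    """Set of (parent type, child type) pairs occurring in the AST."""
    pairs = set()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            pairs.add((type(parent).__name__, type(child).__name__))
    return pairs

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = fingerprint("def encode(obj):\n    e = {}\n    return e\n")
renamed = fingerprint("def _encode(obj):\n    ee = {}\n    return ee\n")
different = fingerprint("def count(x):\n    while x:\n        x -= 1\n")

sim_renamed = jaccard(original, renamed)
sim_different = jaccard(original, different)
```

Because names are not part of this fingerprint, the renamed function scores 1.0, while the structurally different function scores lower. Comparing small sets like these avoids running an expensive tree-matching algorithm on every pair of functions.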

Page 30

Applications

Detect duplicated code
e.g. "Duplicate code detection using anti-unification", P. Bulychev et al. (CloneDigger)

Generate semantic diffs
e.g. "Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction", B. Fluri et al.

Detect plagiarism / copyrighted code
e.g. "PDE4Java: Plagiarism Detection Engine For Java Source Code: A Clustering Approach", A. Jadalla et al.

Page 31

Example: Semantic Diff

    @mock.patch('django.db.migrations.questioner.MigrationQuestioner.ask_not_null_alteration',
                return_value='Some Name')
    def test_alter_field_to_not_null_oneoff_default(self, mocked_ask_method):
        """
        #23609 - Tests autodetection of nullable to non-nullable alterations.
        """
        class CustomQuestioner(...)
        # Make state
        before = self.make_project_state([self.author_name_null])
        after = self.make_project_state([self.author_name])
        autodetector = MigrationAutodetector(before, after, CustomQuestioner())
        changes = autodetector._detect_changes()
        self.assertEqual(mocked_ask_method.call_count, 1)
        # Right number/type of migrations?
        self.assertNumberMigrations(changes, 'testapp', 1)
        self.assertOperationTypes(changes, 'testapp', 0, ["AlterField"])
        self.assertOperationAttributes(changes, "testapp", 0, 0, name="name", preserve_default=False)
        self.assertOperationFieldAttributes(changes, "testapp", 0, 0, default="Some Name")

Page 32

Summary: Text vs. Graphs

Text
+ Easy to write
+ Easy to display
+ Universal format
+ Interoperable
- Not normalized
- Hard to analyze

Graphs
+ Easy to analyze
+ Normalized
+ Easy to transform
- Hard to generate
- Not (yet) interoperable

The Future(?): Use text for small-scale manipulation of code, graphs for large-scale visualization, analysis and transformation.

Page 33

Thanks!

Andreas Dewes (@japh44)

[email protected]

www.quantifiedcode.com

https://github.com/quantifiedcode

@quantifiedcode