Slides: keywords4bytecodes.org/RALI-OLST2012.pdf

Keywords for Bytecodes

Dr. Duboue

Introduction
  The Speaker
  Bytecodes as Semantics
  Reverse Engineering
Details
  Corpus Assembly
  Main Pipeline
  Results
  Applications
Other Topics
  GRIUM/RALI
  Other Academic
  Focus on Technology
Summary

Predicting English Keywords from Java Bytecodes

Pablo Ariel Duboue, PhD

Les Laboratoires Foulab
Montreal, Quebec

Séminaires RALI-OLST, Université de Montréal

Outline

Introduction
  About the Speaker
  Bytecodes as Weak Semantics
  Reverse Engineering
Details
  Corpus Assembly
  Main Pipeline
  Results
  Applications
Other Topics
  GRIUM/RALI
  Other Academic
  Focus on Technology

Before Montreal

- Columbia University
  - WSD in biology texts (GENIES)
  - Natural Language Generation in medical and intelligence domains (MAGIC, AQUAINT)
  - Thesis: “Indirect Supervised Learning of Strategic Generation Logic”, defended Jan. 2005.
  - Advisor: Kathy McKeown
  - Committee: Hirschberg/Jurafsky/Rambow/Jebara
- IBM Research Watson
  - AQUAINT: Question Answering (PIQuAnT)
  - Enterprise Search - Expert Search (TREC)
  - Connections between events (GALE)
  - Deep QA - Watson

In Montreal

I am passionate about improving society through language technology and split my time between teaching, doing research, and contributing to free software projects.

- Working with Prof. Nie at GRIUM
- Taught a graduate class in NLG in Argentina
- Contributed to Free Software projects, including some of my own
- Doing some consulting focusing on startups

Semantics, Java Bytecodes, Javadocs

- Motivation: Machine Learning for Natural Language Generation
- Finding good semantic representations “in the wild” is very rare
- Level of detail of semantic representations vs. natural language
- Similarities with binary code and code comments
- Reverse Engineering practitioners could tolerate noisy text
- As discussed in the INLG panel last summer

Java Bytecodes

- The JVM is a stack machine
- The set of opcodes (~200) is small to simplify porting to new architectures.
- The opcodes fall into seven categories:
  - Load/store (e.g. aaload, bastore)
  - Arithmetic/logic (e.g. iadd, fcmpg)
  - Type conversion (e.g. i2b, f2d)
  - Object construction and manipulation (e.g. new, putfield)
  - Operand stack manipulation (e.g. swap, dup2_x1)
  - Control flow (e.g. if_icmpgt, goto)
  - Method invocation and return (e.g. invokedynamic, lreturn)
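
To make the "stack machine" point concrete, here is a minimal sketch of an interpreter for a handful of integer opcodes. It is an illustration only, not part of the talk's tooling: the function name `run`, the textual instruction format, and the choice of opcodes are assumptions, and it ignores control flow, typing, and 32-bit overflow.

```python
def run(program, local_vars):
    """Interpret a tiny subset of JVM-style integer opcodes (illustrative only)."""
    stack = []
    for instruction in program:
        opcode, *operands = instruction.split()
        if opcode == "bipush":        # push a small integer constant
            stack.append(int(operands[0]))
        elif opcode == "iload":       # push local variable #n
            stack.append(local_vars[int(operands[0])])
        elif opcode == "istore":      # pop into local variable #n
            local_vars[int(operands[0])] = stack.pop()
        elif opcode in ("iadd", "imul"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if opcode == "iadd" else left * right)
        elif opcode == "dup":         # duplicate the top of the stack
            stack.append(stack[-1])
        elif opcode == "swap":        # exchange the two topmost values
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif opcode == "ireturn":     # return the top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return None


# (a + b) * 2 with a = 3 and b = 4
result = run(["iload 0", "iload 1", "iadd", "bipush 2", "imul", "ireturn"], [3, 4])
print(result)  # 14
```

Every operation takes its inputs from the operand stack and pushes its result back, which is why the opcodes themselves carry so few operands.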

LDC and CALL

- While bytecodes represent a reduced vocabulary, they can incorporate names of classes or methods and string constants

ldc            pushes a constant onto the operand stack (number or string)
getfield       instance and field name
getstatic      classname and field name
invokedynamic  invokes a dynamic method

Javadocs

- Javadocs are standardized Java comments
- Include special mark-up in the form of ’@’ constructions
  - @param, @throws, @return among others
- In my work, I focus on the comments associated with each method
- Examples:
  - Creates a CacheRandom instance with a given cache capacity. @param capacity The capacity of the cache.
  - Adjusts the relative offset where the match begins to an absolute value. Only used by AwkMatcher to adjust the offset for stream matches. @return The length of the match.

What is Reverse Engineering

- From Wikipedia:
  Reverse engineering is the process of discovering the technological principles of a device, object, or system through analysis of its structure, function, and operation. (...) The same techniques are subsequently being researched for application to legacy software systems (...) to replace incorrect, incomplete, or otherwise unavailable documentation.
- REcon: the premier reverse engineering conference, held yearly in Montreal

Reverse Engineering Example

private final int c(int) {
    0 aload_0
    1 getfield org.jpc.emulator.f.v
    4 invokeinterface org.jpc.support.j.e()
    9 aload_0
    10 getfield org.jpc.emulator.f.i
    13 invokevirtual org.jpc.emulator.motherboard.q.e()
    16 aload_0
    17 getfield org.jpc.emulator.f.j
    20 invokevirtual org.jpc.emulator.motherboard.q.e()
    23 iconst_0
    24 istore_2
    25 iload_1
    26 ifle 128
    29 aload_0
    30 getfield org.jpc.emulator.f.b
    33 invokevirtual org.jpc.emulator.processor.t.w()

Debian

- Using the Debian archive
  - apt-file search --package-only .jar
    - 1,400+ packages
  - dpkg-query -p package name
    - Look for Source field
  - dpkg-source -x source.dsc
    - Search for Java source files.
  - dpkg -x binary.deb
    - Search for jars, disassemble the methods.
- Assembling the Bytecodes / Javadoc Corpus
  - Disassemble using jclassinfo --disasm
  - Dump Javadoc comments using qdox.
    - A lightweight Java source parsing library.
  - Heuristically match source methods to compiled methods.
    - Normalize source code signatures to binary signatures.
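
The slides do not say how signatures were normalized. As one plausible reading, the sketch below maps source-level Java types to the JVM's binary type descriptors, so that a method parsed from source can be compared with its compiled counterpart. The function names are hypothetical, and the sketch assumes fully qualified class names and ignores generics, varargs, and nested classes.

```python
# JVM descriptor letters for primitive types and void.
PRIMITIVE_DESCRIPTORS = {
    "byte": "B", "char": "C", "double": "D", "float": "F",
    "int": "I", "long": "J", "short": "S", "boolean": "Z", "void": "V",
}


def type_descriptor(java_type):
    """Convert a source-level type such as 'java.lang.String[]' to a JVM descriptor."""
    java_type = java_type.strip()
    dimensions = 0
    while java_type.endswith("[]"):          # count array dimensions
        java_type = java_type[:-2].strip()
        dimensions += 1
    if java_type in PRIMITIVE_DESCRIPTORS:
        base = PRIMITIVE_DESCRIPTORS[java_type]
    else:                                    # assumed to be a fully qualified class name
        base = "L" + java_type.replace(".", "/") + ";"
    return "[" * dimensions + base


def method_descriptor(parameter_types, return_type):
    """Build the binary method descriptor from source-level parameter and return types."""
    params = "".join(type_descriptor(t) for t in parameter_types)
    return "(" + params + ")" + type_descriptor(return_type)


print(method_descriptor(["int", "java.lang.String[]"], "void"))  # (I[Ljava/lang/String;)V
```

In practice the source usually gives unqualified names (String rather than java.lang.String), so resolving imports is where the heuristic part of the matching would come in.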

Numbers

- Final corpus:
  - 1M methods.
  - 35M words.
  - 24M JVM instructions.
- This corpus is 3x bigger than the one discussed in the REcon talk

Pipeline

1. HTML detagging
2. PTB tokenizer
3. Morfessor
4. cclparser
5. Naive Bayes

Morfessor

- Unsupervised morpheme detection
  - http://www.cis.hut.fi/projects/morpho/
- CacheRandom → Cache + Random
- GenericCache.DEFAULT_CAPACITY → Generic + Cache. + DEFAULT_CAPACITY
- someFileName → some + FileName
- PatternStreamInput → Pattern + Stream + Input

CCL Parser

- CCL Parser is an unsupervised parser that does not require POS tags
  - Unsupervised POS induction, incremental (can deal with long sentences)
  - Yoav Seginer (2007), Fast Unsupervised Incremental Parsing. ACL.
  - http://www.seggu.net/ccl/
  - GPLv2 – but current codebase does not save trained models
- (((((((((((creates a) cache random) instance (with a)) given) cache) capacity. (@param)) capacity the) capacity) of the) cache))
- As chunker
  - [creates a] [cache random] [instance] [with a] [given] [cache] [capacity.] [@ param] [capacity the] [capacity] [of the] [cache] [same] [as cache random] [generic] [cache.] [default_capacity]
  - [default] [constructor] [given] [default] [access] [to prevent instantiation] [outside the] [package]

Naive Bayes

- P(term|bytecodes)
- In case of complex opcodes (e.g., ldc “This is a very long string”), the count for the opcode is split between:
  - 0.5 for the full opcode, as a whole
  - 0.5 / #parts for each subpart ({ldc, This, is, a, very, long, string})
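
The count-splitting rule above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function name `split_counts` is hypothetical, and splitting subparts on whitespace and quotes is an assumption (the slides do not say how class names or other operands are tokenized).

```python
from collections import Counter


def split_counts(instruction):
    """Return fractional feature counts for one disassembled instruction."""
    instruction = instruction.strip()
    # Assumed tokenization: drop quotes, split on whitespace.
    parts = instruction.replace('"', " ").split()
    counts = Counter()
    if len(parts) <= 1:
        counts[instruction] += 1.0      # simple opcode: one full count
        return counts
    counts[instruction] += 0.5          # half the mass for the opcode as a whole
    share = 0.5 / len(parts)            # the other half is shared by the subparts
    for part in parts:
        counts[part] += share
    return counts


counts = split_counts('ldc "This is a very long string"')
for feature, value in counts.items():
    print(f"{value:.4f}  {feature}")
```

Whatever the number of subparts, each instruction still contributes a total count of 1, so long string constants do not dominate the statistics.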

Overall Results

- Top scoring terms, using one count per opcode

Term            P     R     F
@ param         0.73  0.64  0.685
                0.73  0.63  0.679
object          0.97  0.06  0.114
@ throws        0.72  0.05  0.099
text            0.64  0.02  0.038
property        0.69  0.01  0.031
description     0.72  0.01  0.029
@ return the    0.78  0.01  0.028
                0.80  0.01  0.026
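
The F column appears consistent with the usual balanced F-measure, the harmonic mean of precision (P) and recall (R). The slides do not state the formula, so treat this as an assumption; values recomputed from the rounded P and R shown in the table only match the reported F approximately.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F for the "@ param" row from its rounded P and R.
print(round(f_measure(0.73, 0.64), 3))
```

The rows with high precision but very low recall (for example "object") show why F stays small: the harmonic mean is dominated by the weaker of the two scores.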

Without Per-opcode Normalization

Term            P     R     F
@ generated     0.76  0.80  0.783
replaced        0.93  0.60  0.734
@ param         0.64  0.74  0.690
icu             0.75  0.49  0.600
o the           0.47  0.75  0.582
@ stable        0.72  0.45  0.561
@ inheritdoc    0.42  0.60  0.495
@ return the    0.41  0.52  0.463
receiver        0.72  0.31  0.440

Where to go from here

- The meaning in the bytecodes is not in the presence of individual opcodes but in their sequencing
  - MOTIF analysis in bioinformatics
- Comparable SMT
  - Most systems (e.g., Munteanu and Marcu (2006)) use either an aligned corpus or a bilingual dictionary
  - I can try to obtain that by asking developers to write descriptions for segments of the code
  - Alternatively, I can try to adapt TextTiling to bytecodes
    - Suggested by another Foulaber (Danukeru)

Applications in Reverse Engineering

- Hinting Subroutines
  - The motivating example at the beginning.
  - “Beacon identification” in Software Engineering.
- Custom (malware) VMs
  - Identifying which methods correspond to different VM operations (addition, jump, etc).
- Dalvik Word Clouds.
  - Use dex2jar, obtain word clouds for the whole executable.
  - Maybe the user can tell if anything looks fishy there?
- Flagging Suspicious Methods.
  - Finding methods that can be described with keywords very different from the rest of the existing methods.
  - Can be done with dynamically generated bytecodes.

Applications Outside Reverse Engineering

- Semantic Search
  - Searching for methods related to certain English terms
  - Query expansion using bytecodes
- Software Engineering Documentation
  - Generating documentation from bytecodes
  - Long term goal

Snippets and Sentence Compression

- Improving Information Retrieval user experience and engine performance by having better snippets
  - Working closely with Dr. Jing He.
- Summarization snippets seem better than regular snippets but are much longer ⇒ sentence compression
  - Query: wine rome
  - Page: http://penelope.uchicago.edu/%7Egrout/encyclopaedia_romana/wine/wine.html
  - Bing snippet: Return to Notae. Wine and Rome. Now nearly extinct in the wild, grapes (vitis vinifera) grew throughout the ancient Mediterranean, the juice readily fermenting as the enzymes ...
  - Summarization: Wine almost always was mixed with water for drinking; undiluted wine merum was considered the habit of provincials and barbarians. The earliest work on wine and agriculture was written in Punic. Indeed, by 154 BC, says Pliny, wine production in Italy was unsurpassed.

Taught Graduate Class in Argentina

- My alma mater
  - Universidad Nacional de Cordoba
- Natural Language Generation
  - http://wiki.duboue.net/index.php/2011_FaMAF_Intro_to_NLG
  - Touched on NLG from DBs, Summarization and decoding in SMT
  - 12 students, about a fourth of the total PhD students in the dept
- Large NLP Group
  - http://pln.famaf.unc.edu.ar/
  - Possibilities for visiting people from Montreal

Student Projects

- Natural Language Generation for Software Patches
  - http://nlg4patch.com.ar/
- Natural Language Generation for UML diagrams
  - ongoing
- Referring Expression Evaluation using DBpedia
  - HLT-NAACL 2012 Short Paper “On The Feasibility of Open Domain Referring Expression Generation Using Large Scale Folksonomies”
- Surface Realization of Spanish using the SemPar Corpus

Free Software

- Debian science
  - apertium, transfer-based machine translation for related language-pairs
- NLTK
- Personal Projects
  - Farmer text support
  - php-nlgen
  - NLG in Puredata
- http://www.ohloh.net/accounts/DrDub

Tech Scene Montreal

- Foulab
  - Montreal's oldest and most prestigious hackerspace
  - http://foulab.org
  - Hackerspaces are community-operated physical places, where people can meet and work on their projects.
  - http://hackerspaces.org for the full list
  - Open House every Tuesday night, everybody is welcome
- Hack-a-thons
  - Upcoming: http://quebecouvert.org/events/hackonslacorruption/
- Notman house
  - The “House of the Web” in Montreal
  - http://notman.org/

Consulting

- R&D for start-ups
  - Focusing on companies with positive contributions
  - Quick turnaround from ideas to users
  - http://honeypot.matchfwd.com
- Own ventures
  - 4opiniones.com

Summary

- I have presented a work-in-progress targeting the automated generation of documentation from compiled code
- Most recent progress is in unsupervised terminology identification
- Currently working on improved ML

Acknowledgements

- GRIUM
  - Prof. Nie and Dr. Jing He
- Foulab
  - Danukeru
- REcon organizers
  - Subgraph.
- Annie Ying

Contacting the Speaker

- Email: [email protected]
- Website: http://duboue.net
- Twitter: @pabloduboue
- LinkedIn: http://linkedin.com/in/pabloduboue
- IRC: DrDub

http://keywords4bytecodes.org

- Always looking for new collaboration opportunities
- Very interested in teaching a class either in Montreal or on-line