
LUDWIG: A TYPE-BASED DECLARATIVE DEEP LEARNING TOOLBOX

Piero Molino 1 Yaroslav Dudin 1 Sai Sumanth Miryala 1

ABSTRACT

In this work we present Ludwig, a flexible, extensible and easy to use toolbox which allows users to train deep learning models and use them for obtaining predictions without writing code. Ludwig implements a novel approach to deep learning model building based on two main abstractions: data types and declarative configuration files. The data type abstraction allows for easier code and sub-model reuse, and the standardized interfaces imposed by this abstraction allow for encapsulation and make the code easy to extend. Declarative model definition configuration files enable inexperienced users to obtain effective models and increase the productivity of expert users. Alongside these two innovations, Ludwig introduces a general modularized deep learning architecture called Encoder-Combiner-Decoder that can be instantiated to perform a vast number of machine learning tasks. These innovations make it possible for engineers, scientists from other fields and, in general, a much broader audience to adopt deep learning models for their tasks, concretely helping to democratize deep learning.

1 INTRODUCTION AND MOTIVATION

Over the course of the last ten years, deep learning models have proven to be highly effective in almost every machine learning task in different domains, including (but not limited to) computer vision, natural language, speech, and recommendation. Their wide adoption in both research and industry has been greatly facilitated by increasingly sophisticated software libraries like Theano (Theano Development Team, 2016), TensorFlow (Abadi et al., 2015), Keras (Chollet et al., 2015), PyTorch (Paszke et al., 2017), Caffe (Jia et al., 2014), Chainer (Tokui et al., 2015), CNTK (Seide & Agarwal, 2016) and MXNet (Chen et al., 2015). Their main value has been to provide tensor algebra primitives with efficient implementations which, together with the massively parallel computation available on GPUs, enabled researchers to scale training to bigger datasets. These packages, moreover, provided standardized implementations of automatic differentiation, which greatly simplified model implementation. Researchers, freed from re-implementing these basic building blocks from scratch and equipped with fast and reliable implementations of them, were able to focus on models and architectures, which led to the explosion of new *Net model architectures of the last five years.

With artificial neural network architectures being applied to a wide variety of tasks, common practices regarding how to

1Uber AI, San Francisco, USA. Correspondence to: Piero Molino <[email protected]>.

Preliminary work. Under review by the Systems and Machine Learning (SysML) Conference. Do not distribute.

handle certain types of input information emerged. When faced with a computer vision problem, a practitioner pre-processes data using the same pipeline that resizes images, augments them with some transformation and maps them into 3D tensors. Something similar happens for text data, where text is tokenized into a list of words, characters or word pieces, a vocabulary with associated numerical IDs is collected, and sentences are transformed into vectors of integers. Specific architectures are adopted to encode different types of data into latent representations: convolutional neural networks are used to encode images and recurrent neural networks are adopted for sequential data and text (more recently, self-attention architectures are replacing them). Most practitioners working on a multi-class classification task would project latent representations into vectors of the size of the number of classes to obtain logits and apply a softmax operation to obtain probabilities for each class, while for regression tasks they would map latent representations into a single dimension with a linear layer, and the single score is the predicted value.

Observing these emerging patterns led us to define abstract functions that identify equivalence classes of model architectures. For instance, most of the different architectures for encoding images can be seen as different implementations of the abstract encoding function T'_{h'×w'×c'} = e_θ(T_{h×w×c}), where T_dims denotes a tensor with dimensions dims and e_θ is an encoding function parametrized by parameters θ that maps from tensor to tensor. In tasks like image classification, T' is pooled and flattened (i.e., a reduce function is applied spatially and the output tensor is reshaped as a vector) before being provided to, again, an abstract function that computes T'_c = d_θ(T_h), where T_h is a one-dimensional



{ input_features: [ { name: img, type: image, encoder: vgg } ], output_features: [ { name: class, type: category } ]}

{ input_features: [ { name: img, type: image, encoder: resnet } ], output_features: [ { name: class, type: category } ]}

{ input_features: [ { name: utterance, type: text, encoder: rnn } ], output_features: [ { name: tags, type: set } ]}

Figure 1. Examples of declarative model definitions. The first two show two models for image classification using two different encoders, while the third shows a multi-label text classification system. Note that, apart from the names of the input and output features, which are just identifiers, all that needs to be changed to encode with a different image encoder is just the name of the encoder, while to change tasks all that needs to be changed is the types of the inputs and outputs.

tensor of hidden size h, T'_c is a one-dimensional tensor of size c equal to the number of classes, and d_θ is a decoding function parametrized by parameters θ that maps a hidden representation into logits and is usually implemented as a stack of fully connected layers. Similar abstract encoding and decoding functions that generalize many different architectures can be defined for different types of input data and different types of expected output predictions (which in turn define different types of tasks).
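As a concrete example of how these abstract functions compose, the image classification case just described can be written, restating the composition above with the pooling and flattening steps made explicit, as

    \hat{p} = \mathrm{softmax}\big( d_\theta\big( \mathrm{flatten}(\mathrm{pool}(e_\theta(T_{h \times w \times c}))) \big) \big)

where \hat{p} denotes the vector of predicted class probabilities.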

We introduce Ludwig, a deep learning toolbox based on the above-mentioned level of abstraction, with the aim of encapsulating best practices and taking advantage of inheritance and code modularity. Ludwig makes it much easier for practitioners to compose deep learning models by simply declaring their data and task, and it makes code reusable and extensible while favoring best practices. These equivalence classes are named after the data type of the inputs encoded by the encoding functions (image, text, series, category, etc.) and the data type of the outputs predicted by the decoding functions. This type-based abstraction provides a higher-level interface than what is available in current deep learning frameworks, which abstract at the level of single tensor operations or at the layer level. This is achieved by defining abstract interfaces for each data type, which allows for extensibility, as any new implementation of an interface is a drop-in replacement for all the other implementations already available.

Concretely, this allows for defining, for instance, a model that includes an image encoder and a category decoder and being able to swap in and out VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016) or DenseNet (Huang et al., 2017) as different interchangeable implementations of the image encoder. The natural consequence of this level of abstraction is associating a name with each encoder for a specific type and enabling the user to declare which model to employ rather than requiring them to implement it imperatively, while at the same time letting the user add new and custom encoders. The same also applies to decoders and to data types other than images.

With such type-based interfaces in place and implementations of such interfaces readily available, it becomes possible to construct a deep learning model simply by specifying the types of the features in the data and selecting the implementation to employ for each data type involved. Consequently, Ludwig has been designed around the idea of a declarative specification of the model, to allow a much wider audience (including people who do not code) to be able to adopt deep learning models, effectively democratizing them. Three such model definitions are shown in Figure 1.

The main contribution of this work is that, thanks to this higher level of abstraction and its declarative nature, Ludwig allows inexperienced users to easily build deep learning models, while allowing experts to decide the specific modules to employ with their hyper-parameters and to add additional custom modules. Ludwig's other main contribution is the general modular architecture defined through the type-based abstraction, which allows for code reuse, flexibility, and the performance of a wide array of machine learning tasks within a cohesive framework.

The remainder of this work describes Ludwig's architecture in detail, explains its implementation, compares Ludwig with other deep learning frameworks and discusses its advantages and disadvantages.


[Figure: a data point flows through the pre-processor and the encoder; the decoder output flows through the post-processor into a prediction; metrics compute scores.]

Figure 2. Data type functions flow.

2 ARCHITECTURE

The notation used in this section is defined as follows. Let d ~ D be a data point sampled from a dataset D. Each data point is a tuple of typed values called features. They are divided into two sets: d_I is the set of input features and d_O is the set of output features. d_i will refer to a specific input feature, while d_o will refer to a specific output feature. Model predictions given input features d_I are denoted as d_P, so that there is a specific prediction d_p for each output feature d_o ∈ d_O. The types of the features can be either atomic (scalar) types like binary, numerical or category, or complex ones like discrete sequences or sets. Each data type is associated with abstract function types, as explained in the following section, to perform type-specific operations on features and tensors. Tensors are a generalization of scalars, vectors, and matrices with n ranks of different dimensions. Tensors are referred to as T_dims, where dims indicates the dimensions for each rank, as in T_{l×m×n} for a rank-3 tensor of dimensions l, m and n respectively.

2.1 Type-based Abstraction

Type-based abstraction is one of the main concepts that define Ludwig's architecture. Currently, Ludwig supports the following types: binary, numerical (floating point values), category (unique strings), set of categorical elements, bag of categorical elements, sequence of categorical elements, time series (sequence of numerical elements), text, image, audio (which doubles as speech when using different pre-processing parameters), date, H3 (Brodsky et al., 2018) (a geo-spatial indexing system), and vector (a one-dimensional tensor of numerical values). The type-based abstraction makes it easy to add more types.

The motivation behind this abstraction stems from the observation of recurring patterns in deep learning projects: pre-processing code is more or less the same given certain types of inputs and specific tasks, as is the code implementing models and training loops. Small differences make models hard to compare and their code difficult to reuse. By modularizing it on a data type basis, our aim is to improve code reuse, the adoption of best practices, and extensibility.

Each data type has five abstract function types associated with it, and there can be multiple implementations of each of them (a minimal code sketch of these interfaces follows the list):

• Pre-processor: a pre-processing function T_dims = pre(d_i) maps a raw data point input feature d_i into a tensor T with dimensions dims. Different data types may have different pre-processing functions and different dimensions of T. A specific type may, moreover, have different implementations of pre. A concrete example is text: d_i in this case is a string of text, there could be different tokenizers that implement pre by splitting on space or using byte-pair encoding and mapping tokens to integers, and dims is s, the length of the sequence of tokens.

• Encoder: an encoding function T'_{dims'} = e_θ(T_dims) maps an input tensor T into an output tensor T' using parameters θ. The dimensions dims and dims' may be different from each other and depend on the specific data type. The input tensor is the output of a pre function. Concretely, encoding functions for text, for instance, take as input T_s and produce T_h, where h is a hidden dimension, if the output is required to be pooled, or T_{s×h} if the output is not pooled. Examples of possible implementations of e_θ are CNNs, bidirectional LSTMs or Transformers.

• Decoder: a decoding function T_{dims''} = d_θ(T'_{dims'}) maps an input tensor T' into an output tensor T using parameters θ. The dimensions dims'' and dims' may be different from each other and depend on the specific data type. T' is the output of an encoding function or of a combiner (explained in the next section). Concretely, a decoder function for the category type would map a T_h input tensor into a T_c tensor, where c is the number of classes.

• Post-processor: a post-processing function d_p = post(T_{dims''}) maps a tensor T with dimensions dims'' into a raw data point prediction d_p. T is the output of a decoding function. Different data types may have different post-processing functions and different dimensions of T. A specific type may, moreover, have different implementations of post. A concrete example is text: d_p in this case is a string of text, and there could be different functions that implement post by first mapping integer predictions into tokens and then concatenating on space or using byte-pair concatenation to obtain a single string of text.

• Metrics: a metric function s = m(d_o, d_p) produces a score s given a ground truth output feature d_o and


[Figure: input features of various types (text, category, numerical, binary, sequence, set, image, time series, audio, ...) each pass through an encoder; a combiner merges the encoder outputs; decoders produce output features of the same range of types.]

Figure 3. Encoder-Combiner-Decoder Architecture

predicted output d_p of the same dimension. d_p is the output of a post-processing function. In this context, for simplicity, loss functions are considered to belong to the metric class of functions. Many different metrics may be associated with the same data type. Concrete examples of metrics for the category data type are accuracy, precision, recall, F1, and cross entropy loss, while for the numerical data type they could be mean squared error, mean absolute error, and R2.
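As a rough, illustrative sketch of the interfaces implied by these five function types (the class and method names below are hypothetical and do not correspond to Ludwig's actual internal API), a category-like feature could implement them as follows:

from typing import Any, Dict
import numpy as np

class CategoryFeatureSketch:
    """Hypothetical illustration of the five per-type abstract functions."""

    def preprocess(self, raw_value: str, vocab: Dict[str, int]) -> np.ndarray:
        # pre: map a raw data point feature into a tensor (here, a category index)
        return np.array(vocab.get(raw_value, 0))

    def encode(self, tensor: np.ndarray, params: Dict[str, Any]) -> np.ndarray:
        # e_theta: map the input tensor into a hidden representation (an embedding lookup)
        embeddings = params["embeddings"]  # shape: [vocab_size, hidden_size]
        return embeddings[int(tensor)]

    def decode(self, hidden: np.ndarray, params: Dict[str, Any]) -> np.ndarray:
        # d_theta: map the hidden representation into logits over the classes
        return hidden @ params["weights"] + params["bias"]

    def postprocess(self, logits: np.ndarray, idx2label: Dict[int, str]) -> str:
        # post: map the output tensor back into data space (a label string)
        return idx2label[int(np.argmax(logits))]

    def metric(self, ground_truth: str, prediction: str) -> float:
        # m: score a prediction against the ground truth (here, 0/1 accuracy)
        return float(ground_truth == prediction)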

A depiction of how the functions associated with a data type are connected to each other is provided in Figure 2.

2.2 Encoders-Combiner-Decoders

In Ludwig, every model is defined in terms of encoders that encode different features of an input data point, a combiner which combines information coming from the different encoders, and decoders that decode the information from the combiner into one or more output features. This generic architecture is referred to as Encoders-Combiner-Decoders (ECD). A depiction is provided in Figure 3.

This architecture is introduced because it naturally maps to most deep learning model architectures and allows for modular composition. This characteristic, enabled by the data type abstraction, allows for defining models by just declaring the data types of the input and output features involved in the task and assembling standard sub-modules accordingly, rather than writing a full model from scratch.

A specific instantiation of an ECD architecture can have multiple input features of the same or different types, and the same is true for output features. For each feature in the input part, pre-processing and encoding functions are computed depending on the type of the feature, while for each feature in the output part, decoding, metric and post-processing functions are computed, again depending on the type of each output feature.

When multiple input features are provided, a combiner function {T''} = c_θ({T'}) that maps a set of input tensors {T'} into a set of output tensors {T''} is computed. c has an abstract interface and many different functions can implement it. One concrete example is what in Ludwig is called the concat combiner: it flattens all the tensors in the input set, concatenates them and passes them to a stack of fully connected layers, the output of which is provided as the output, a set containing only one tensor. Note that a possible implementation of a combiner function is the identity function.
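A minimal sketch of the concat combiner's behavior (plain NumPy for illustration; Ludwig's actual combiner is a configurable TensorFlow module) could look like:

from typing import List
import numpy as np

def concat_combiner(encoder_outputs: List[np.ndarray],
                    fc_weights: List[np.ndarray],
                    fc_biases: List[np.ndarray]) -> List[np.ndarray]:
    # Flatten every encoder output to a vector and concatenate them
    flat = np.concatenate([t.reshape(-1) for t in encoder_outputs])
    hidden = flat
    # Pass the concatenation through a stack of fully connected layers (ReLU activations)
    for w, b in zip(fc_weights, fc_biases):
        hidden = np.maximum(hidden @ w + b, 0.0)
    # The combiner returns a set of output tensors; here, a single combined representation
    return [hidden]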

This definition of a combiner function allows for implementations where subsets of the inputs are provided to different sub-modules which return subsets of the output tensors, or even for a recursive definition where the combiner function is an ECD model itself, albeit without pre-processors and post-processors, since inputs and outputs are already tensors and do not need to be pre-processed and post-processed. Although the combiner definition in the ECD architecture is theoretically flexible, the current implementations of combiner functions in Ludwig are monolithic (without sub-modules), non-recursive, and return a single tensor as output instead of a set of tensors. However, more elaborate combiners can be added easily.

The ECD architecture allows for many instantiations by


[Figure: eight example ECD instantiations (text classification, image captioning, generic regression, visual question answering, speaker verification, multi-label event detection, text part-of-speech tagging and named entity recognition, speech transcription and translation), each assembling type-specific encoders, a combiner and decoders.]

Figure 4. Different instantiations of the ECD architecture for different machine learning tasks.

combining different input features of different data types with different output features of different data types, as depicted in Figure 4. An ECD with an input text feature and an output categorical feature can be trained to perform text classification or sentiment analysis, and an ECD with an input image feature and a text output feature can be trained to perform image captioning, while an ECD with categorical, binary and numerical input features and a numerical output feature can be trained to perform regression tasks like predicting house prices, and an ECD with numerical, binary and categorical input features and a binary output feature can be trained to perform tasks like fraud detection. This architecture is thus highly flexible and is limited only by the availability of data types and the implementations of their functions.

An additional advantage of this architecture is its ability to perform multi-task learning (Caruana, 1993). If more than one output feature is specified, an ECD architecture can be trained to minimize the weighted sum of the losses of each output feature in an end-to-end fashion. This approach has been shown to be highly effective in both vision and natural language tasks, achieving state of the art performance (Ratner et al., 2019b). Moreover, multiple outputs can be correlated or have logical or statistical dependencies on each other. For example, if the task is to predict both parts of speech and named entity tags from a sentence, the named entity tagger will most likely achieve higher performance if it is provided with the predicted parts of speech (assuming the predictions are better than chance, and there is correlation between part of speech and named entity tag). In Ludwig, dependencies between outputs can be specified in the model definition, a directed acyclic graph among them is constructed at model building time, and either the last hidden representation or the predictions of the origin output feature are provided as inputs to the decoder of the destination output feature. This process is depicted in Figure 5. When non-differentiable operations are performed to obtain the predictions, for instance argmax in the case of category features performing multi-class classification, the logits or the probabilities are provided instead, keeping the multi-task training process end-to-end differentiable.
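In symbols, using the notation from Section 2 and writing w_o for the weight assigned to the loss of each output feature, the training objective described above is

    \mathcal{L}_{\mathrm{combined}} = \sum_{d_o \in d_O} w_o \, \mathcal{L}_o(d_o, d_p)

where \mathcal{L}_o is the loss function associated with output feature d_o and d_p is the corresponding prediction (or, when the prediction operation is non-differentiable, the logits or probabilities, as noted above).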

This generic formulation of multi-task learning as a directed acyclic graph of task dependencies is related to the hierarchical multi-task learning in Snorkel MeTaL proposed by Ratner et al. (2018) and its adoption for improving training from weak supervision by exploiting task agreements and disagreements of different labeling functions (Ratner et al., 2019a). The main difference is that Ludwig can automatically handle heterogeneous tasks, i.e., tasks that predict different data types, with support for different decoders, while in Snorkel MeTaL each task head is a linear layer. On the other hand, Snorkel MeTaL's focus on weak supervision is currently absent in Ludwig. An interesting avenue of further research to close the gap between the two approaches could be to infer dependencies and loss weights automatically given fully supervised multi-task data and to combine weak supervision with heterogeneous tasks.

3 IMPLEMENTATION

3.1 Declarative Model Definition

Ludwig adopts a declarative model definition schema that allows users to define an instantiation of the ECD architecture to train on their data.


output_features: [
  {name: OF1, type: <any>, dependencies: [OF2]},
  {name: OF2, type: <any>, dependencies: []},
  {name: OF3, type: <any>, dependencies: [OF2, OF1]}
]

[Figure: the combiner output feeds the decoders of OF2, OF1 and OF3 according to the declared dependencies; each decoder has its own loss, and the losses are summed into a combined loss.]

Figure 5. An example of output feature dependencies: the declarative definition and the resulting decoder and loss graph.

The higher level of abstraction provided by the type-based ECD architecture allows for a separation between what a model is expected to learn to do and how it actually does it. This convinced us to provide a declarative way of defining models in Ludwig, as the number of potential users who can define a model by declaring the inputs they are providing and the predictions they are expecting, without specifying the implementation of how the predictions are obtained, is substantially larger than the number of developers who can code a full deep learning model on their own. An additional motivation for the adoption of declarative model definitions stems from the separation of interests between the authors of the model implementations and the final users, analogous to the separation of interests between the authors of the query planning and indexing strategies of a database and the users who query the database, which allows the former to provide improved strategies without impacting the way the latter interact with the system.

The model definition is divided into five sections:

• Input Features: in this section of the model definition, a list of input features is specified. The minimum amount of information that needs to be provided for each feature is the name of the feature, which corresponds to the name of a column in the tabular data provided by the user, and the type of the feature. Some features have multiple encoders, but if one is not specified, the default one is used. Each encoder can have its own hyper-parameters, and if they are not specified, the default hyper-parameters of the specified encoder are used.

• Combiner: in this section of the model definition, the type of combiner can be specified; if none is specified, the default concat combiner is used. Each combiner can have its own hyper-parameters, but if they are not specified, the default ones of the specified combiner are used.

• Output Features: in this section of the model definition, a list of output features is specified. The minimum amount of information that needs to be provided for each feature is the name of the feature, which corresponds to the name of a column in the tabular data provided by the user, and the type of the feature. The data in the column is the ground truth the model is trained to predict. Some features have multiple decoders that calculate the predictions, but if one is not specified, the default one is used. Each decoder can have its own hyper-parameters, and if they are not specified, the default hyper-parameters of the specified decoder are used. Moreover, each decoder can have different losses with different parameters to compare the ground truth values and the values predicted by the decoder and, also in this case, if they are not specified, defaults are used.

• Pre-processing: pre-processing and post-processing functions of each data type can have parameters that change their behavior. They can be specified in this section of the model definition and are applied to all input and output features of the specified type; if they are not provided, defaults are used. Note that for some use cases it would be useful to have different processing parameters for different features of the same type. Consider a news classifier where the title and the body of a piece of news are provided as two input text features. In this case, the user may be inclined to set a smaller value for the maximum length of words and the maximum size of the vocabulary for the title input feature. Ludwig allows users to specify processing parameters on a per-feature basis by providing them inside each input and output feature definition. If both type-level parameters and single-feature-level parameters are provided, the single-feature-level ones override the type-level ones.

• Training: the training process itself has parameters that can be changed, like the number of epochs, the batch size, the learning rate and its scheduling, and so on. Those parameters can be provided by the user, but if they are not provided, defaults are used.


{ input_features: [ { name: utterance, type: text } ], output_features: [ { name: class, type: category } ] }

{
  input_features: [
    { name: title, type: text, encoder: rnn, cell_type: lstm, bidirectional: true, state_size: 128, num_layers: 2, preprocessing: { length_limit: 20 } },
    { name: body, type: text, encoder: stacked_cnn, num_filters: 128, num_layers: 6, preprocessing: { length_limit: 1024 } }
  ],
  combiner: { type: concat, num_fc_layers: 2 },
  output_features: [
    { name: class, type: category },
    { name: tags, type: set }
  ],
  training: { epochs: 100, learning_rate: 0.01, batch_size: 64, early_stop: 10, gradient_clipping: 1, decay_rate: 0.95, optimizer: { type: rmsprop, beta: 0.99 } }
}

Figure 6. On the left side, a minimal model definition for text classification. On the right side, a more complex model definition including input and output features and more model and training hyper-parameters.

The wide adoption of defaults allows for very concise model definitions, like the one shown on the left side of Figure 6, as well as a high degree of control over both the architecture of the model and the training parameters, as shown on the right side of Figure 6.

Ludwig uses YAML for model definitions because of its human readability, but as long as its nested structure is representable, other similar formats could be adopted.

For the ever-growing list of available encoders, combiners, and decoders, their hyper-parameters, and the pre-processing and training parameters available, please consult Ludwig's user guide (http://ludwig.ai/user_guide/). For additional examples, refer to the examples section (http://ludwig.ai/examples).

In order to allow for flexibility and ease of extensibility, two well-known design patterns are adopted in Ludwig: the strategy pattern (Gamma et al., 1994) and the registry pattern. The strategy pattern is adopted at different levels to allow different behaviors to be performed by different instantiations of the same abstract components. It is used both to make the different data types interchangeable from the point of view of model building, training, and inference, and to make different encoders and decoders for the same type interchangeable. The registry pattern, on the other hand, is implemented in Ludwig by assigning names to code constructs (variables, functions, objects, or modules) and storing them in a dictionary. They can then be referenced by their name, allowing for straightforward extensibility: adding an additional behavior is as simple as adding a new entry to the registry.

In Ludwig, the combination of these two patterns allows users to add new behaviors by simply implementing the abstract function interface of the encoder of a specific type and adding that function implementation to the registry of


implementations available. The same applies to adding new decoders, new combiners, and additional data types. The problem with this approach is that different implementations of the same abstract functions have to conform to the same interface, but in our case some parameters of the functions may be different. As a concrete example, consider two text encoders: a recurrent neural network (RNN) and a convolutional neural network (CNN). Although they both conform to the same abstract encoding function in terms of the rank of the input and output tensors, their hyper-parameters are different, with the RNN requiring a boolean parameter indicating whether to use bidirectional processing or not, and the CNN requiring the size of the filters. Ludwig solves this problem by exploiting **kwargs, a Python feature that allows passing additional named parameters to functions and collecting them into a dictionary. This allows different functions implementing the same abstract interface to have the same signature and then retrieve the specific additional parameters from the dictionary using their names. This also greatly simplifies the implementation of default parameters, because if the dictionary does not contain the keyword of a required parameter, the default value for that parameter is used automatically.
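A compact sketch of how the registry and strategy patterns interact with **kwargs (the module and registry names here are illustrative, not Ludwig's actual identifiers) might look like:

from typing import Any, Callable, Dict
import numpy as np

# Registry pattern: encoders for a type are looked up by name in a dictionary,
# so adding a new encoder only requires adding a new entry.
text_encoder_registry: Dict[str, Callable[..., np.ndarray]] = {}

def register_text_encoder(name: str):
    def wrap(fn: Callable[..., np.ndarray]) -> Callable[..., np.ndarray]:
        text_encoder_registry[name] = fn
        return fn
    return wrap

# Strategy pattern + **kwargs: both encoders expose the same abstract signature
# (tensor in, tensor out) and read their specific hyper-parameters, with defaults,
# from keyword arguments.
@register_text_encoder("rnn")
def rnn_encoder(inputs: np.ndarray, bidirectional: bool = False,
                state_size: int = 256, **kwargs) -> np.ndarray:
    # Placeholder body: return a pooled hidden representation of the right size
    return np.zeros(state_size * (2 if bidirectional else 1))

@register_text_encoder("cnn")
def cnn_encoder(inputs: np.ndarray, filter_size: int = 3,
                num_filters: int = 256, **kwargs) -> np.ndarray:
    # Placeholder body: return a pooled hidden representation of the right size
    return np.zeros(num_filters)

def encode_text(encoder_name: str, inputs: np.ndarray, params: Dict[str, Any]) -> np.ndarray:
    # The model builder only needs the encoder name and the parameter dictionary taken
    # from the declarative model definition; missing keys fall back to the defaults above.
    return text_encoder_registry[encoder_name](inputs, **params)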

3.2 Training Pipeline

Given a model definition, Ludwig builds a training pipeline as shown in the top of Figure 7. The process is not particularly different from many other machine learning tools and consists of a metadata collection phase, a data pre-processing phase, and a model training phase. The metadata mappings in particular are needed in order to apply exactly the same pre-processing to input data at prediction time, while model weights and hyper-parameters are saved in order to load the exact same model obtained during training. The main notable innovation is the fact that every single component, from the pre-processing to the model to the training loop, is dynamically built according to the declarative model definition.


[Figure: the training pipeline takes raw data (CSV) through metadata collection and pre-processing into processed data (optionally cached as HDF5/NumPy), which is used for model training, producing weights, hyper-parameters and a metadata dictionary file (JSON). The prediction pipeline pre-processes raw data using the saved metadata, runs model prediction, and post-processes the predicted data into predictions and scores (CSV).]

Figure 7. A depiction of the training and prediction pipeline.

One of the main use cases of Ludwig is the quick exploration of different model alternatives for the same data, so, after pre-processing, the pre-processed data is optionally cached into an HDF5 file. The next time the same data is accessed, the HDF5 file is used instead, saving the time needed to pre-process it.
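For illustration, the cache-or-load behavior described here can be sketched as follows (a generic example using h5py, not Ludwig's actual caching code):

import os
from typing import Callable, Dict
import h5py
import numpy as np

def load_or_preprocess(csv_path: str,
                       preprocess: Callable[[str], Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Illustrative cache-or-load logic: reuse an HDF5 cache of pre-processed tensors if present."""
    cache_path = csv_path + ".hdf5"
    if os.path.exists(cache_path):
        # Cache hit: read the already pre-processed tensors
        with h5py.File(cache_path, "r") as f:
            return {name: f[name][()] for name in f.keys()}
    # Cache miss: pre-process the raw data and store the result for next time
    tensors = preprocess(csv_path)
    with h5py.File(cache_path, "w") as f:
        for name, array in tensors.items():
            f.create_dataset(name, data=array)
    return tensors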

3.3 Prediction Pipeline

The prediction pipeline is depicted in the bottom of Figure 7. It uses the metadata obtained during the training phase to pre-process the new input data, loads the model by reading its hyper-parameters and weights, and uses it to obtain predictions that are mapped back into data space by a post-processor using the same mappings obtained at training time.

4 EVALUATION

One of the positive effects of the ECD architecture and its implementation in Ludwig is the ability to specify a potentially complex model architecture with a concise declarative model definition. To analyze how much of an impact this has on the amount of code needed to implement a model (including pre-processing, the model itself, the training loop, and the metrics calculation), the number of lines of code required to implement four reference architectures using different libraries is compared: WordCNN (Kim, 2014) and Bi-LSTM (Tai et al., 2015), both models for text classification and sentiment analysis; Tagger (Lample et al., 2016), a sequence tagging model with an RNN encoder and per-token classification; and ResNet (He et al., 2016), an image classification model. Although this evaluation is imprecise in nature (the same model can be implemented in a more or less concise way, and writing a parameter in a configuration file is substantially simpler than writing a line of code), it could

            TensorFlow   Keras      PyTorch    Ludwig
            (mean)       (mean)     (mean)     w/o     w
WordCNN     406.17       201.50     458.75     8       66
Bi-LSTM     416.75       439.75     323.40     10      68
Tagger      1067.00      1039.25    1968.00    10      68
ResNet      1252.75      779.60     479.43     9       61

Table 1. Number of lines of code for implementing different models. The mean columns are the mean lines of code needed to write a program from scratch for the task. w and w/o in the Ludwig columns refer to the number of lines for writing a model definition specifying every single model hyper-parameter and pre-processing parameter, and without specifying any hyper-parameter, respectively.

provide intuition about the amount of effort needed to implement a model with different tools. To calculate the mean for the different libraries, openly available implementations on GitHub are collected and the number of lines of code of each of them is counted (the list of repositories is available in the appendix). For Ludwig, the number of lines in the configuration file needed to obtain the same models is reported both in the case where no hyper-parameter is specified and in the case where all hyper-parameters are specified.

The results in Table 1 show that, even when specifying all of its hyper-parameters, a Ludwig declarative model definition is an order of magnitude smaller than even the most concise alternative. This supports the claim that Ludwig can be useful as a tool to reduce the effort needed for training and using deep learning models.

5 LIMITATIONS AND FUTURE WORK

Although Ludwig’s ECD architecture is particularly well-suited for supervised and self-supervised tasks, how suitable


it is for other machine learning tasks is not immediatelyevident.

One notable example of such tasks is Generative Adversarial Networks (GANs) (Goodfellow et al., 2014): their architecture contains two models that learn to generate data and to discriminate synthetic from real data, and that are trained with inverted losses. In order to replicate a GAN within the boundaries of the ECD architecture, the inputs to both models would have to be defined at the encoder level, the discriminator output would have to be defined as a decoder, and the remaining parts of both models would have to be defined as one big combiner, which is inelegant; for instance, changing just the generator would result in an entirely new implementation. An elegant solution would allow for disentangling the two models and changing them independently. The recursive graph extension of the combiner described in Section 2.2 allows a more elegant solution by providing a mechanism for defining the generator and discriminator as two independent sub-graphs, improving modularity and extensibility. An extension of the toolbox in this direction is planned for the future.

Another example is reinforcement learning. Although ECD can be used to build the vast majority of deep architectures currently adopted in reinforcement learning, some of the techniques they employ are relatively hard to represent, such as, for instance, double inference with fixed weights in Deep Q-Networks (Mnih et al., 2015), which can currently be implemented only with a highly customized and inelegant combiner. Moreover, supporting the dynamic interaction with an environment for data collection, and more clever ways to collect it like Go-Explore's (Ecoffet et al., 2019) archive or prioritized experience replay (Schaul et al., 2016), is currently out of the scope of the toolbox: a user would have to build these capabilities on their own and call Ludwig functions only inside the inner loop of the interaction with the environment. Extending the toolbox to allow for easy adoption in reinforcement learning scenarios, for example by allowing training through policy gradient methods like REINFORCE (Williams, 1992) or off-policy methods, is a potential direction of improvement.

Although these two cases highlight current limitations of Ludwig, it is worth noting that most current industrial applications of machine learning are based on supervised learning, and that is where the proposed architecture fits best and the toolbox provides most of its value.

Although the declarative nature of Ludwig's model definition allows for easier model development, as the number of encoders and their hyper-parameters increases, the need for automatic hyper-parameter optimization arises. In Ludwig, however, different encoders and decoders, i.e., sub-modules of the whole architecture, are themselves hyper-parameters. For this reason, Ludwig is well-suited for performing both hyper-parameter search and architecture search, and blurs the line between the two.

A future addition to the model definition file will be a hyper-parameter search section that will allow users to define which of the available strategies to adopt to perform the optimization and, if the optimization process itself contains parameters, to provide them in this section as well. Currently, a Bayesian optimization over combinatorial structures approach (Baptista & Poloczek, 2018) is in development, but more can be added.

Finally, more feature types will be added in the future, in particular videos and graphs, together with a number of pre-trained encoders and decoders, which will allow training a full model in a few iterations.

6 RELATED WORK

TensorFlow (Abadi et al., 2015), Caffe (Jia et al., 2014), Theano (Theano Development Team, 2016) and other similar libraries are tensor computation frameworks that allow for automatic differentiation and declarative model specification through the definition of a computation graph. They all provide similar underlying primitives and support computation graph optimizations that allow for training of large-scale deep neural networks. PyTorch (Paszke et al., 2017), on the other hand, provides the same level of abstraction, but allows users to define models imperatively: this has the advantage of making a PyTorch program easier to debug and inspect. By adding eager execution, TensorFlow 2.0 allows for both declarative and imperative programming styles. In contrast, Ludwig, which is built on top of TensorFlow, provides a higher level of abstraction for the user. Users declare full model architectures rather than underlying tensor operations, which allows for more concise model definitions, while flexibility is ensured by allowing users to change each parameter of each component of the architecture if they wish to.

Sonnet (Reynolds et al., 2017), Keras (Chollet et al., 2015), and AllenNLP (Gardner et al., 2017) are similar to Ludwig in that they provide a higher level of abstraction over TensorFlow and PyTorch primitives respectively. However, while they provide modules which can be used to build a desired network architecture, what distinguishes Ludwig from them is its declarative nature and the fact that it is built around the data type abstraction. This allows for the flexible ECD architecture, which can cover many use cases beyond the natural language processing covered by AllenNLP, and also does not require writing code for both model implementation and pre-processing as in Sonnet and Keras.

Scikit-learn (Buitinck et al., 2013), Weka (Hall et al., 2009), and MLLib (Meng et al., 2016) are popular machine learning libraries among researchers and industry practitioners. They contain implementations of several traditional machine learning algorithms and provide common interfaces for using them, so that algorithms become in most cases interchangeable and users can easily compare them. Ludwig follows this API design philosophy in its programmatic interface, but focuses on deep learning models that are not available in those tools.

7 CONCLUSIONS

This work presented Ludwig, a deep learning toolbox built around type-based abstraction and a flexible ECD architecture that allows model definition through a declarative language.

The proposed tool has many advantages in terms of flexibility, extensibility, and ease of use, which allow both experts and novices to train deep learning models, employ them for obtaining predictions, and experiment with different architectures without the need to write code, while still allowing users to easily add custom sub-modules.

In conclusion, Ludwig's general and flexible architecture and its ease of use make it a good option for democratizing deep learning by making it more accessible, streamlining and speeding up experimentation, and unlocking many new applications.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Baptista, R. and Poloczek, M. Bayesian optimization of combinatorial structures. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmassan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 471-480. PMLR, 2018. URL http://proceedings.mlr.press/v80/baptista18a.html.

Brodsky, I. et al. H3. https://github.com/uber/h3, 2018.

Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108-122, 2013.

Caruana, R. Multitask learning: A knowledge-based source of inductive bias. In Utgoff, P. E. (ed.), Machine Learning, Proceedings of the Tenth International Conference, University of Massachusetts, Amherst, MA, USA, June 27-29, 1993, pp. 41-48. Morgan Kaufmann, 1993. doi: 10.1016/b978-1-55860-307-3.50012-5. URL https://doi.org/10.1016/b978-1-55860-307-3.50012-5.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Chollet, F. et al. Keras. https://keras.io, 2015.

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-Explore: a new approach for hard-exploration problems. CoRR, abs/1901.10995, 2019. URL http://arxiv.org/abs/1901.10995.

Gamma, E., Helm, R., Johnson, R., and Vlissides, J. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series. Pearson Education, 1994. ISBN 9780321700698. URL https://books.google.com/books?id=6oHuKQe3TjQC.

Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. S. AllenNLP: A deep semantic natural language processing platform. 2017.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Hall, M. A., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18, 2009. doi: 10.1145/1656274.1656278. URL https://doi.org/10.1145/1656274.1656278.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.


Huang, G., Liu, Z., van der Maaten, L., and Weinberger,K. Q. Densely connected convolutional networks. InThe IEEE Conference on Computer Vision and PatternRecognition (CVPR), July 2017.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,Girshick, R., Guadarrama, S., and Darrell, T. Caffe:Convolutional architecture for fast feature embedding. InProceedings of the 22nd ACM international conferenceon Multimedia, pp. 675–678. ACM, 2014.

Kim, Y. Convolutional neural networks for sentence classi-fication. In Moschitti, A., Pang, B., and Daelemans, W.(eds.), Proceedings of the 2014 Conference on EmpiricalMethods in Natural Language Processing, EMNLP 2014,October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT,a Special Interest Group of the ACL, pp. 1746–1751. ACL,2014. URL http://aclweb.org/anthology/D/D14/D14-1181.pdf.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami,K., and Dyer, C. Neural architectures for named entityrecognition. In Knight, K., Nenkova, A., and Rambow,O. (eds.), NAACL HLT 2016, The 2016 Conference of theNorth American Chapter of the Association for Computa-tional Linguistics: Human Language Technologies, SanDiego California, USA, June 12-17, 2016, pp. 260–270.The Association for Computational Linguistics, 2016.URL http://aclweb.org/anthology/N/N16/N16-1030.pdf.

Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman,S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen,S., et al. Mllib: Machine learning in apache spark. TheJournal of Machine Learning Research, 17(1):1235–1241,2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-ness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A.,Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C.,Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wier-stra, D., Legg, S., and Hassabis, D. Human-level con-trol through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236. URLhttps://doi.org/10.1038/nature14236.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer,A. Automatic differentiation in pytorch. 2017.

Ratner, A., Hancock, B., Dunnmon, J., Goldman, R. E.,and Re, C. Snorkel metal: Weak supervision formulti-task learning. In Schelter, S., Seufert, S., andKumar, A. (eds.), Proceedings of the Second Work-shop on Data Management for End-To-End MachineLearning, DEEM@SIGMOD 2018, Houston, TX, USA,June 15, 2018, pp. 3:1–3:4. ACM, 2018. doi: 10.

1145/3209889.3209898. URL https://doi.org/10.1145/3209889.3209898.

Ratner, A., Hancock, B., Dunnmon, J., Sala, F., Pandey,S., and Re, C. Training complex models with multi-task weak supervision. In The Thirty-Third AAAIConference on Artificial Intelligence, AAAI 2019, TheThirty-First Innovative Applications of Artificial Intel-ligence Conference, IAAI 2019, The Ninth AAAI Sym-posium on Educational Advances in Artificial Intelli-gence, EAAI 2019, Honolulu, Hawaii, USA, January27 - February 1, 2019., pp. 4763–4771. AAAI Press,2019a. URL https://aaai.org/ojs/index.php/AAAI/article/view/4403.

Ratner, A. J., Hancock, B., and Re, C. The role of mas-sively multi-task and weak supervision in software 2.0.In CIDR 2019, 9th Biennial Conference on InnovativeData Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org, 2019b.URL http://cidrdb.org/cidr2019/papers/p58-ratner-cidr19.pdf.

Reynolds, M., Barth-Maron, G., Besse, F., de Las Casas,D., Fidjeland, A., Green, T., Puigdomenech, A.,Racaniere, S., Rae, J., and Viola, F. Open sourc-ing Sonnet - a new library for constructing neu-ral networks. https://deepmind.com/blog/open-sourcing-sonnet/, 2017.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prior-itized experience replay. In Bengio, Y. and LeCun, Y.(eds.), 4th International Conference on Learning Rep-resentations, ICLR 2016, San Juan, Puerto Rico, May2-4, 2016, Conference Track Proceedings, 2016. URLhttp://arxiv.org/abs/1511.05952.

Seide, F. and Agarwal, A. Cntk: Microsoft’s open-sourcedeep-learning toolkit. In Proceedings of the 22nd ACMSIGKDD International Conference on Knowledge Dis-covery and Data Mining, pp. 2135–2135. ACM, 2016.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.

Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1556–1566. The Association for Computer Linguistics, 2015. URL https://www.aclweb.org/anthology/P15-1150/.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), volume 5, pp. 1–6, 2015.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696.


A FULL LIST OF GITHUB REPOSITORIES

repository | loc | model | notes
https://github.com/dennybritz/cnn-text-classification-tf | 308 | WordCNN |
https://github.com/randomrandom/deep-atrous-cnn-sentiment | 621 | WordCNN |
https://github.com/jiegzhan/multi-class-text-classification-cnn | 284 | WordCNN |
https://github.com/TobiasLee/Text-Classification | 335 | WordCNN | cnn.py + files in utils directory
https://github.com/zackhy/TextClassification | 405 | WordCNN | cnn_classifier.py + train.py + test.py
https://github.com/YCG09/tf-text-classification | 484 | WordCNN | all files minus the rnn related ones
https://github.com/roomylee/rnn-text-classification-tf | 305 | Bi-LSTM |
https://github.com/dongjun-Lee/rnn-text-classification-tf | 271 | Bi-LSTM |
https://github.com/TobiasLee/Text-Classification | 397 | Bi-LSTM | attn_bi_lstm.py + files in utils directory
https://github.com/zackhy/TextClassification | 459 | Bi-LSTM | rnn_classifier.py + train.py + test.py
https://github.com/YCG09/tf-text-classification | 506 | Bi-LSTM | all files minus the cnn related ones
https://github.com/ry/tensorflow-resnet | 2243 | ResNet |
https://github.com/wenxinxu/resnet-in-tensorflow | 635 | ResNet |
https://github.com/taki0112/ResNet-Tensorflow | 472 | ResNet |
https://github.com/ShHsLin/resnet-tensorflow | 1661 | ResNet |
https://github.com/guillaumegenthial/sequence_tagging | 959 | Tagger |
https://github.com/guillaumegenthial/tf_ner | 1877 | Tagger |
https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs | 365 | Tagger |

Table 2. List of TensorFlow repositories used for the evaluation.


repository | loc | model | notes
https://github.com/Jverma/cnn-text-classification-keras | 228 | WordCNN |
https://github.com/bhaveshoswal/CNN-text-classification-keras | 117 | WordCNN |
https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras | 258 | WordCNN |
https://github.com/junwang4/CNN-sentence-classification-keras-2018 | 295 | WordCNN |
https://github.com/cmasch/cnn-text-classification | 122 | WordCNN |
https://github.com/diegoschapira/CNN-Text-Classifier-using-Keras | 189 | WordCNN |
https://github.com/shashank-bhatt-07/Keras-LSTM-Sentiment-Classification | 425 | Bi-LSTM |
https://github.com/AlexGidiotis/Document-Classifier-LSTM | 678 | Bi-LSTM |
https://github.com/pinae/LSTM-Classification | 547 | Bi-LSTM |
https://github.com/susanli2016/NLP-with-Python/blob/master/Multi-Class%20Text%20Classification%20LSTM%20Consumer%20complaints.ipynb | 109 | Bi-LSTM |
https://github.com/raghakot/keras-resnet | 292 | ResNet |
https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py | 297 | ResNet | Only model, no preprocessing
https://github.com/broadinstitute/keras-resnet | 2285 | ResNet |
https://github.com/yuyang-huang/keras-inception-resnet-v2 | 560 | ResNet |
https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/applications/resnet.py | 464 | ResNet | Only model, no preprocessing
https://github.com/Hironsan/anago | 2057 | Tagger |
https://github.com/floydhub/named-entity-recognition-template | 150 | Tagger |
https://github.com/digitalprk/KoreaNER | 501 | Tagger |
https://github.com/vunb/anago-tagger | 1449 | Tagger |

Table 3. List of Keras repositories used for the evaluation.


repository | loc | model | notes
https://github.com/Shawn1993/cnn-text-classification-pytorch | 311 | WordCNN |
https://github.com/yongjincho/cnn-text-classification-pytorch | 247 | WordCNN |
https://github.com/srviest/char-cnn-text-classification-pytorch | 778 | WordCNN | ignored model CharCNN2d.py
https://github.com/threelittlemonkeys/cnn-text-classification-pytorch | 499 | WordCNN |
https://github.com/keishinkickback/Pytorch-RNN-text-classification | 414 | Bi-LSTM |
https://github.com/Jarvx/text-classification-pytorch | 421 | Bi-LSTM |
https://github.com/jiangqy/LSTM-Classification-Pytorch | 324 | Bi-LSTM |
https://github.com/a7b23/text-classification-in-pytorch-using-lstm | 188 | Bi-LSTM |
https://github.com/claravania/lstm-pytorch | 270 | Bi-LSTM |
https://github.com/hysts/pytorch_resnet | 447 | ResNet |
https://github.com/a-martyn/resnet | 286 | ResNet |
https://github.com/hysts/pytorch_resnet_preact | 535 | ResNet |
https://github.com/ppwwyyxx/GroupNorm-reproduce/tree/master/ImageNet-ResNet-PyTorch | 1095 | ResNet |
https://github.com/KellerJordan/ResNet-PyTorch-CIFAR10 | 199 | ResNet |
https://github.com/mbsariyildiz/resnet-pytorch | 450 | ResNet |
https://github.com/akamaster/pytorch_resnet_cifar10 | 344 | ResNet |
https://github.com/ZhixiuYe/NER-pytorch | 1184 | Tagger |
https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Sequence-Labeling | 840 | Tagger |
https://github.com/epwalsh/pytorch-crf | 3243 | Tagger |
https://github.com/LiyuanLucasLiu/LM-LSTM-CRF | 2605 | Tagger |

Table 4. List of PyTorch repositories used for the evaluation.