
AUTOMATIC DETECTION OF GEOMETRICAL ANOMALIES IN COMPOSITES

MANUFACTURING:

A DEEP LEARNING-BASED COMPUTER VISION APPROACH

by

Abtin Djavadifar

B.A.Sc., Amirkabir University of Technology, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

THE COLLEGE OF GRADUATE STUDIES

(Mechanical Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

(Okanagan)

April 2020

© Abtin Djavadifar, 2020


The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation entitled:

Automatic Detection of Geometrical Anomalies in Composites Manufacturing: A Deep Learning-Based Computer Vision Approach

submitted by Abtin Djavadifar in partial fulfillment of the requirements for

the degree of Master of Applied Science

in Mechanical Engineering

Examining Committee:

Homayoun Najjaran, School of Engineering Supervisor

Abbas Milani, School of Engineering Supervisory Committee Member

Zheng Liu, School of Engineering Supervisory Committee Member

Additional Examiner


Abstract

This thesis focuses on the development of a machine learning-based vision system for

quality control of composite manufacturing processes. Deep Convolutional Neural Networks

(DCNNs) are used to build a real-time end-to-end solution for the complex process of draping of

the fiber-reinforced cut-pieces by conducting online visual inspection. The visual inspection will

ultimately help with the manufacturing of double-curved composite parts such as the aircraft's rear pressure bulkhead. The developed solution provides accurate and robust measurement without the need for expensive coordinate measuring machines (CMMs) on the shop floor. The development of

inspection software is completed in the following two stages.

In stage I, after creating a hand-labeled visual dataset acquired from a fabric layup robotic system at the German Aerospace Center (DLR), a DCNN was designed, trained, and tested for

image classification. Then, the idea of combining images from multiple cameras for generalization

of the designed model to different wrinkle properties and environments was evaluated. The

proposed method employs computer vision techniques and Dempster-Shafer Theory (DST) to

enhance wrinkle detection accuracy without the need for any additional hand-labeling or re-training of the model. By applying the DST rule of combination, the overall wrinkle

detection accuracy was greatly improved.

In stage II, four state-of-the-art image segmentation DCNN models (DeepLab V3+, U-Net,

Mask-RCNN, IC-Net) were evaluated to accurately identify the gripper, fabric, and any probable

wrinkle on a dry fiber product. The results show that using a DCNN model and transfer learning can lead to acceptable results while training on a small and inaccurately annotated dataset. Also, the impact of human annotation quality on the performance of DCNN models was evaluated by comparing two human-annotated datasets. Then, an approach for detection of wrinkles at the early

stages of formation was developed and evaluated. Finally, the challenges of using synthetically

generated data for training the models were assessed by conducting complementary experiments.

The developed solution can be practically used for visual inspection of the draping process

in composite manufacturing facilities. The presented method can be readily adopted to train DCNN

models using other datasets and perform visual inspection tasks in different automated

manufacturing processes.


Lay Summary

The overall goal of this study is to develop a solution for visual quality inspection of certain

steps of automated composites manufacturing processes using robots. To do so, multiple cameras

are installed at suitable positions in front of a gripper robot that grabs the fiber-reinforced cut-pieces and places them inside a mold. The images captured by the cameras are then processed by a neural network model to detect wrinkles, misalignment, or other defects in the cut-piece which may occur

during the draping process.

This research is focused on creating the automatic visual inspection system by generating

the required datasets and training the neural networks. Different types of neural network models

are designed and implemented to perform the wrinkle detection task. To enhance the performance

of these models, novel and practical machine learning techniques, such as combining multi-view inputs from different cameras, were developed and successfully evaluated.


Preface

This thesis presents the result of research collaboration between the Advanced Control and

Intelligent Systems (ACIS) laboratory at the School of Engineering, the University of British

Columbia, and the Center for Lightweight Production Technologies (ZLP) of the German Aerospace Center

(DLR) in Augsburg, Germany.

A version of Stage II (described in Chapters 3 and 4) has been submitted to Robotics and

Computer-Integrated Manufacturing Journal, and is currently under review (A. Djavadifar, JB.

Graham-Knight, M. Körber, and H. Najjaran, “Automated Visual Detection of Geometrical

Defects in Composite Manufacturing Processes Using Deep Convolutional Neural Networks”).

A version of Chapter 4 has been published in the International Journal of Assembly

Technology and Management, 2019 [1] (K. Gupta, M. Körber, A. Djavadifar, F. Krebs, and H.

Najjaran, “Wrinkle and boundary detection of fiber products in robotic composites

manufacturing”).

A version of Stage I (described in Chapters 3 and 4) has been published and presented at the International Conference on Smart Multimedia (ICSM), 2019 [2] (A. Djavadifar, JB. Graham-Knight, K. Gupta, M. Körber, P. Lasserre, and H. Najjaran, “Robot-assisted composite manufacturing based on machine learning applied to multi-view computer vision”).

A version of Stage I (described in Chapters 3 and 4) has been presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019 as a poster (A. Djavadifar, JB. Graham-Knight, M. Körber, and H. Najjaran, “Robot-assisted composite

manufacturing using deep learning and multi-view computer vision”).


In all of the papers except [1], I, Abtin Djavadifar, was responsible for leading the research, programming, algorithm development, and writing the papers. John Brandon Graham-Knight was involved in the concept formation stage, programming, implementation of the developed algorithms, and writing and editing of the manuscripts. Marian Körber, who works as a researcher at DLR, collaborated in preparing the experimental setup and conducting the tests using DLR facilities. Kashish Gupta was the first author of [1] and helped with concept formation for the other papers. Patricia Lasserre helped provide the computational resources needed for training the models at UBC. Homayoun Najjaran, the last and supervisory author, was involved throughout the project in concept formation and manuscript editing.


Table of Contents

Abstract ......................................................................................................................................... iii

Lay Summary .................................................................................................................................v

Preface ........................................................................................................................................... vi

Table of Contents ....................................................................................................................... viii

List of Tables ................................................................................................................................ xi

List of Figures .............................................................................................................................. xii

List of Abbreviations ................................................................................................................. xiv

Acknowledgments ...................................................................................................................... xvi

Dedication .................................................................................................................................. xvii

Chapter 1 : Introduction ...............................................................................................................1

1.1 Motivations ..................................................................................................................... 1

1.2 Objectives ....................................................................................................................... 2

1.3 Contributions................................................................................................................... 3

1.4 Thesis Outline ................................................................................................................. 4

Chapter 2 : Literature Review ......................................................................................................8

2.1 Composite Manufacturing .............................................................................................. 8

2.2 Quality Control in Composites Manufacturing............................................................... 9

2.3 Automated Handling of Carbon-Fiber Reinforced Plastics at DLR ............................. 11

2.4 Limitations of Traditional Image Processing Techniques ............................................ 18

2.5 Visual Sensing Systems ................................................................................................ 21

2.6 Object Recognition Algorithms .................................................................................... 23


2.7 Transfer Learning.......................................................................................................... 33

2.8 Stereo Vision Image Processing ................................................................................... 34

2.9 Multi-View Computer Vision ....................................................................................... 36

2.10 Depth Estimation .......................................................................................................... 37

Chapter 3 : Infrastructure for Dataset Development ...............................................................48

3.1 Hardware Setup ............................................................................................................. 48

3.1.1 Modular Gripper ....................................................................................................... 48

3.1.2 IDS XS Camera......................................................................................................... 49

3.2 Software Setup .............................................................................................................. 49

3.2.1 Data Generation for Training Image Classification Models ..................................... 49

3.2.2 Data Generation for Training Image Segmentation Models ..................................... 51

3.2.3 Synthetic Dataset Generation .................................................................................... 54

3.3 Computational Resources ............................................................................................. 58

Chapter 4 : Deep Learning for Wrinkle and Fabric Boundary Detection .............................59

4.1 Stage I – Wrinkle Detection Using an Image Classification Model ............................. 60

4.1.1 Phase 1 - Model Development .................................................................................. 61

4.1.2 Phase 2 - Model Generalization ................................................................................ 63

4.1.3 Phase 3 - Multi-view Inferencing ............................................................................. 64

4.2 Stage II – Wrinkle and Boundary Detection Using Image Segmentation Models ....... 69

4.2.1 Training Image Segmentation Models ...................................................................... 70

4.2.2 Assessing Human Annotation Quality ...................................................................... 72

4.2.3 Evaluation of Wrinkle Detection Performance ......................................................... 73

4.2.4 An Approach for Early Detection of Wrinkles ......................................................... 74


4.2.5 Training on Synthetic Data ....................................................................................... 75

Chapter 5 : Experimental Results ..............................................................................................76

5.1 Stage I - Wrinkle Detection Using an Image Classification Model .............................. 76

5.2 Stage II - Wrinkle and Boundary Detection Using Image Segmentation Models ........ 79

Chapter 6 : Conclusions ..............................................................................................................88

6.1 Summary ....................................................................................................................... 88

6.2 Future Work .................................................................................................................. 91

Bibliography .................................................................................................................................94


List of Tables

Table 3.1 - Technical specifications of computational resources ................................................. 58

Table 4.1 - Training parameters for different models ................................................................... 71

Table 5.1 - Accuracy of detection on the initial dataset ............................................................... 76

Table 5.2 - Accuracy of detection against the new dataset ........................................................... 77

Table 5.3 - Accuracy of detection after multi-view inferencing ................................................... 79

Table 5.4 - Inferencing results for image segmentation models ................................................... 80

Table 5.5 - Performance of trained models on the test dataset ..................................................... 81

Table 5.6 - IoU between two human-annotated datasets .............................................................. 82

Table 5.7 - Wrinkle detection scores for DeepLab V3+ ............................................................... 83

Table 5.8 – Inferencing results for DeepLab V3+ trained on single class datasets (IoU scores) . 84

Table 5.9 - Inferencing results on the synthetic test dataset ......................................................... 86

Table 5.10 - Inferencing results on the real test dataset ................................................................ 87


List of Figures

Figure 1.1 - Organization of the thesis ............................................................................................ 7

Figure 2.1 - Manufacturing of CFRP components in aerospace manufacturing .......................... 10

Figure 2.2 - Preform process of aircraft's rear pressure bulkhead ................................................ 12

Figure 2.3 – Draping of cut-pieces at DLR................................................................................... 12

Figure 2.4 – Comparison between the boundary of the fabric in the simulation and the actual

process........................................................................................................................................... 15

Figure 2.5 - Validation of the draping process simulations .......................................................... 16

Figure 2.6 - Simulation of the draping process inside the mold ................................................... 17

Figure 2.7 - Fabric on the gripper ................................................................................................. 18

Figure 2.8 - Fabric with contrast enhanced using histogram equalization (left) and local

histogram equalization (right) methods ........................................................................................ 19

Figure 2.9 - Filtered contrast-enhanced fabric without first using [12] for frequency-domain

removal of periodic noise (left) and with noise removal first (right) ........................................... 20

Figure 2.10 - Overview of object recognition tasks in computer vision area ............................... 23

Figure 2.11 - Transfer learning diagram for visual inspection of the draping process using a pre-

trained model ................................................................................................................................ 34

Figure 3.1 - Main components of the modular gripper ................................................................. 49

Figure 3.2 - Frames before and after the hand-labeling process, with the colored grids in the

second image coded to represent the different classification categories: wrinkle (cyan), gripper

(green), fabric (yellow) and background (black) .......................................................................... 51

Figure 3.3 - Arrangement of cameras in front of the gripper setup .............................................. 52


Figure 3.4 - A sample image of the dataset (left) and its annotation mask created in Amazon

SageMaker (right) ......................................................................................................................... 53

Figure 3.5 - Conversion of an RGB mask (left) to its gray-scale equivalent (right) .................... 53

Figure 3.6 - Data augmentation workflow chart ........................................................................... 54

Figure 3.7 - Simulation of the draping process in Blender ........................................................... 56

Figure 3.8 - Samples of generated images in Blender .................................................................. 56

Figure 3.9 - Synthetic data annotation process: main image (top left); image with marked

wrinkles (top right); RGB mask (bottom left); gray-scale mask (bottom right) ........................... 57

Figure 4.1 - Examples of wrinkles in the images used for training (left); wrinkles in the new

dataset (right) ................................................................................................................................ 64

Figure 4.2 - Five co-temporal masks of the same fabric ............................................................... 68

Figure 4.3 - Five overlaid, co-temporal fabric masks before correction (left) and after correction

(right) ............................................................................................................................................ 68

Figure 4.4 - Calculation of IoU metric for evaluation of object localization accuracy ................ 72

Figure 5.1 - A sample image of the dataset (top image) and its two annotation masks ............... 82

Figure 5.2 - Effect of overlapping threshold on wrinkle detection scores .................................... 84

Figure 6.1 - Wrinkle detection accuracy (%) by phase ................................................................. 89


List of Abbreviations

Abbreviation Definition

2D Two Dimensional

3D Three Dimensional

ACIS Advanced Control and Intelligent Systems

ASPP Atrous Spatial Pyramid Pooling

CFRP Carbon-Fiber Reinforced Plastic

CNN Convolutional Neural Network

DCNN Deep Convolutional Neural Network

DST Dempster-Shafer Theory

FCN Fully Convolutional Networks

FOV Field of View

FRP Fiber-Reinforced Plastic/Polymer

GPU Graphical Processing Unit

HDR High Dynamic Range

HOG Histogram of Oriented Gradients

IBR Image-Based Rendering

ILSVRC ImageNet Large Scale Visual Recognition Challenge

IoU Intersection Over Union

LIDAR Light Imaging, Detection, and Ranging

mAP Mean Average Precision


mIoU Mean Intersection Over Union

ML Machine Learning

RGB Red, Green, Blue

RGB-D Red, Green, Blue - Depth

RoI Region of Interest

SIFT Scale-Invariant Feature Transform

SLAM Simultaneous Localization and Mapping

UAV Unmanned Aerial Vehicle


Acknowledgments

First, I would like to thank my research advisor, Dr. Homayoun Najjaran of the School of Engineering at the University of British Columbia. Prof. Najjaran was always kindly available whenever I faced a problem in my research or had a question regarding my writing. He allowed me to make this thesis my own work while guiding me in the right direction with his constant advice.

I would also like to thank the professors who helped me in the validation process of this research: Dr. Abbas Milani and Dr. Zheng Liu. This work could not have been successfully conducted without their valuable involvement and input.

Finally, I am deeply grateful to my parents for their immeasurable support and persistent

encouragement during my studies. This achievement would not have been possible without them.


Dedication

I dedicate this accomplishment to my parents who were the shining lights of my life,

constantly showing me the right direction and supporting me to overcome the difficulties of this

journey and reach my goals.

I also dedicate this work to my great friends who always supported me and helped me with

writing this dissertation. I highly appreciate all they have done, especially John Brandon Graham-

Knight, Marian Körber, Kyle Low, and Kashish Gupta for assisting me during my research.


Chapter 1 : Introduction

1.1 Motivations

The applications of Fiber-Reinforced Plastics or Polymers (FRPs) have steadily diversified, from racing cars and sports equipment to helicopters and aircraft, over the last four decades. In each composite structure, there are two or more dissimilar materials that are used together to either combine their best properties or impart a new set of characteristics which neither of

the components could achieve on their own. The most important aspect of manufacturing advanced

fiber-reinforced composites is that the material properties and the structure are created at the same

time. Consequently, any defect introduced during the manufacturing process will directly influence the stiffness and strength of both the material and the structure. This increases the importance of adopting an efficient quality control strategy that finds defects as quickly as possible, so they can either be fixed or the defective part can be scrapped to avoid further waste. Defect detection on dry fiber fabrics in the aviation industry is one of the complex issues slowing down the manufacturing flow of various products such as the aircraft's rear pressure bulkhead. Therefore,

developing an automated quality control solution that visually finds these defects and removes

them has been of much interest in recent years.

Computer vision applications in manufacturing processes have increased in importance

with the development of powerful computer systems and high-resolution and fast camera systems.

Vision applications can be used for process control, quality assurance and documentation

purposes. To process the gathered visual data, classical algorithms and, more recently, Machine Learning (ML) methods are used. In many applications, components or products are processed at high rates, which leads to a high volume of data. This data volume enables efficient algorithm development and serves as a training data source for the development of ML models. A special challenge arises when only a small amount of data is or can become available. The reason for this can be small manufacturing batches, which are common in the aerospace industry. Particularly in the

development of ML applications, special methods must be used to compensate for the lack of

extensive training data. Thus, finding appropriate tools and algorithms for addressing a new problem such as geometrical defect detection in a composite manufacturing process is challenging. This motivates us to research a novel method that takes advantage

of deep learning techniques and modern computer vision systems to detect geometrical defects

happening on fiber fabrics during composite manufacturing processes using only a small dataset.

1.2 Objectives

This thesis aims to develop a data-driven, vision-based automated wrinkle detection setup that provides an end-to-end solution to the issues related to the multi-variable process of draping the fiber-reinforced cut-pieces by conducting online quality control. This setup can be used for quality inspection of the manufacturing process of the aircraft's rear pressure bulkhead. The

developed vision system will be an alternative to expensive and complicated technologies like

high-resolution cameras, RGB-D (Red, Green, Blue - Depth) sensors, LiDAR (Light Imaging, Detection, and Ranging) sensors, etc., which may not be affordable for companies or adaptable to different industrial environments.

Specifically, the proposed technique employs deep learning tools along with Convolutional Neural Networks (CNNs) to comprehend the process through images and understand its interactions with the setup under different conditions and scenarios. For this purpose:

1. A practical setup of visual sensors needs to be designed and assembled.

2. A data acquisition, registration and preprocessing pipeline must be defined to convert the

captured visual data to a perceptible format for the CNN.

3. A suitable object recognition model must be adopted and optimized, or developed, to detect the gripper and fabric, as well as any wrinkles forming during the process, quickly, accurately, and robustly.

1.3 Contributions

This thesis aims to develop a visual quality assessment tool for composite manufacturing

processes, particularly wrinkle detection on fiber fabric cut-pieces grasped by a modular gripper during the draping process, using simple optical sensors. A deep learning vision-based method for

detecting the fabric, gripper and probable wrinkles is developed as an end-to-end solution that can

run in real-time with acceptable accuracy. For this purpose:

1. A practical setup of visual sensors was designed, assembled and mounted in front of the

modular gripper to provide visual data during the draping process.

2. A data generation and preparation pipeline was defined for 1) gathering the captured visual

data from the sensors, 2) annotating, augmenting, and cleaning the data, 3) projecting

images from multiple views onto one anchor image (for the multi-view setup), and 4) converting the processed data to an appropriate format for training deep convolutional neural networks (DCNNs). Finally, two datasets were generated for training the DCNN models and evaluating their performance. A virtual simulation method for synthetic data generation

was also developed and implemented.

3. Two practical object recognition models were designed and optimized to detect the fabric

boundary and wrinkles happening during the process quickly, accurately, and robustly.

a. First, an object classification neural network was designed and implemented. To

generalize the developed model, a technique for combining co-temporal views based on the Dempster–Shafer theory of evidence was used to fuse data captured from different views. It is shown that this technique significantly increases accuracy.

b. Second, four state-of-the-art image segmentation networks were trained on the

custom-generated dataset and tested to label the gripper, fabric, and wrinkles at the pixel level. The results achieved by the best-performing model show remarkable performance on the gripper and fabric detection tasks. The limitations of human annotations were also evaluated to explain the reason behind the low score obtained for the wrinkle class. Further evaluations showed that the model is able to provide acceptable results for wrinkle detection if an appropriate evaluation metric is used. A method for detecting wrinkles at their early stages of formation was also developed and evaluated. Finally, the challenges of using a synthetic dataset, which eliminates the highly time-consuming data annotation stage, for training the DCNN models were assessed by training and testing the models on such a dataset.

1.4 Thesis Outline

The thesis is organized as follows.


Chapter 2 reviews the background research conducted on quality control of composite

manufacturing processes, as well as studies concerning the use of computer vision systems

for performing object recognition tasks.

Chapter 3 describes the dataset development infrastructure used for this project by

explaining the modular gripper configuration, the camera setup used for capturing the images, the

data processing pipeline including the data annotation and augmentation stages, the developed

virtual simulation method for synthetic data generation, and the computational resources used for

training and testing of the neural networks.

Chapter 4 first explains the limitations of traditional image processing techniques and then

demonstrates two different methods used to develop a deep learning-based vision system for

boundary and wrinkle detection, in two separate sections.

Section 4.1 explains the steps taken to design and implement an image classification neural

network. It also describes how multiple views of the scene were combined and the Dempster-Shafer

theory of evidence was used for fusing the votes coming from different views to generalize the

developed model to other datasets that are visually different from the custom-created dataset.

Section 4.2 describes the training procedure of four state-of-the-art DCNNs for performing

instance segmentation and semantic segmentation tasks on the custom generated dataset. Then, the

limitations of human annotations and their effects on the evaluation process are discussed. The

wrinkle detection performance is evaluated again by using more appropriate metrics. A method

for detection of wrinkles at their early stages of formation is also developed. At the end of this

section, a method for training the models on synthetic images is presented, and the challenges encountered and the reasons behind them are explained.


Chapter 5 uses figures and tables to present the results of each stage explained in Chapter 4, in separate phases. It then discusses the results obtained in each phase and explains how the findings of one phase determined the next steps to be taken in the later phases of the

project.

Chapter 6 briefly summarizes the work done in this thesis and highlights the contributions and achievements. Finally, it discusses future directions in the research path and suggests

further improvements. Figure 1.1 illustrates the organizational framework of the thesis.


Figure 1.1 - Organization of the thesis


Chapter 2 : Literature Review

2.1 Composite Manufacturing

Composites manufacturing has been distinguished as a key manufacturing technology with the potential to impact a broad range of industries over the last four decades because of the light weight, high stiffness,

and superior strength of composite materials. A composite is a structure composed of two or more

materials that form a new material with improved properties while preserving the micro-structure

of each constituent [3]. Production of FRP composites starts by combining strong, reinforcing

fibers with polymer resin. The use of lightweight materials helps reduce carbon emissions and save energy in many applications; for example, more efficient operation of wind turbines, which are an alternative source of energy, and greater fuel savings due to lighter vehicles and compressed gas tanks.

Typically, a composite material is made of two main components: 1) reinforcement that

transfers load in the composite and provides the mechanical strength, and 2) the matrix, which keeps the reinforcement material bonded and aligned and protects it from environmental effects and abrasion. This combination creates products lighter than monolithic materials (e.g. metals) with

the same or better properties. The utilization of FRP materials is increasing in a variety of applications such as automotive components, compressed gas storage, industrial equipment like pipelines and heat exchangers, structural materials for buildings, wind turbine blades, hydrokinetic energy generation, shipping containers, and support structures for any kind of system that can take advantage of the lower cost, higher strength, lighter weight, increased stiffness, and better corrosion resistance of composite

materials.


Carbon Fiber-Reinforced Plastic (CFRP) composites have even higher stiffness-to-weight

and strength-to-weight ratios that lead to significant energy savings during production and improve

their performance. These characteristics make CFRPs a suitable candidate to be used in more

specific applications like aircraft manufacturing in the aviation industry.

2.2 Quality Control in Composites Manufacturing

A major issue with the production of composite materials is the detection of manufacturing

defects in the structures and composite components. The fact that the constituents of a composite material maintain their primary properties after forming a new material makes it difficult to discern defects in an inhomogeneous composite material. Undesired defects in composites can considerably reduce part quality, cause a fatal failure, or lead to a significant waste of money and energy in the case of late detection. Thus, the development of non-destructive evaluation methods and in-situ sensors for process control is highly required to understand as-manufactured part performance and hinder defect formation. Although some technologies are already being used to evaluate the quality of composites in a non-destructive manner, the development of novel methods

and improving current technologies is essential to increase the production speed and facilitate the

production of larger components. Figure 2.1 shows a typical workflow used in the aviation

manufacturing industry.


Figure 2.1 - Manufacturing of CFRP components in aerospace manufacturing

Figure 2.1 shows how quality control at intermediate stages, e.g. the preforming stage, can prevent further costs by rejecting a defective part before it enters subsequent stages, e.g. the vacuum bagging and infiltration, curing, and machining stages.

As the volume of advanced composite materials used in the aviation industry is rising

constantly and main structural parts are increasingly made from composites, it is extremely

important to develop practical means for quality control of design practices, materials, and

production processes employed to build composite structures in this industry. As production rates

have been expanding markedly in recent years, the quality target must also be adjusted to zero

rework and repair, zero defects, and zero scraps.

Composites manufacturing procedures are typically manual, so defects or unwanted

features in moldings are usually the result of human error that can be prevented by more rigorous

quality control at every step of the process. In practice, the situation is more complicated because

of many interactions between the process design, part design decisions, and the variabilities in

processes and materials. Thus, it would be challenging to find a precise approach for identifying

the sources of variability and the probable defects that can arise from them [4]. To decrease the

probability of generating poor quality and costly parts, the mentioned defects must be properly

found and effectively removed during the manufacturing process. Also, further evaluations can be


done to understand the variabilities in development and design stages which can result in defect

formation.

The use of a visual inspection approach offers a framework within which these defects can

be identified quickly and properly. So, immediate actions can be taken to remove the defective

parts as early in the process as possible to avoid more costs. Further analysis will help to make

rational decisions about potential routes to develop zero-defect manufacturing processes, design,

and materials.

2.3 Automated Handling of Carbon-Fiber Reinforced Plastics at DLR

With the increasing importance of fiber-reinforced materials in the modern aviation

industry, many composite manufacturing facilities including the German Aerospace Center (DLR)

desire to develop a full-scale closed production chain from the raw materials to the finished

component. Manual handling of composite plies is capable of producing high-performance, complex parts; however, manual handling is also an expensive, time-consuming process. Delays in handling may lead to delays in the manufacturing flow. In addition, manually handling the material on large-scale composite panels introduces the possibility of human error. Hence, large-scale part manufacturing can greatly benefit from automated solutions to inform and control the process [5]. However, the robot-assisted handling of these materials in the manufacturing process of CFRP components is challenging due to probable defects, such as wrinkles, that may form on the fabric, or misalignment between the actual and targeted fabric boundary [6].

DLR conducts a large amount of research involving the use of robotic arms for smart

automation of composites manufacturing tasks. One of the key challenges is handling and draping dry fiber fabrics to form large double-curved components, which are essential parts of

various products such as the aircraft's rear pressure bulkhead (Figure 2.2).

Figure 2.2 - Preform process of aircraft's rear pressure bulkhead

A significant part of this task is the automated production of a preform created from dry

carbon fiber cut-pieces. The complexity of this handling process lies in the double-curved target

geometry. The cut-pieces are gripped in the flat state and transferred to the target geometry during

the deformation process, also known as the draping process (Figure 2.3).

Figure 2.3 – Draping of cut-pieces at DLR


To drape the cut-pieces, an end-effector system called "Modular Gripper" was developed

in the AZIMUT project [7], which can deform its gripper surface to a double-curved geometry with the aid of a rib-spine design. The modular gripper is capable of picking up a flat fiber cut-piece, transferring it to a double-curved geometry, and depositing it at the intended position in the

mold [7]. The spine consists of two glass fiber rods connected to the gripper structure by three

linear actuators. The rods are bent by shortening the linear axes, causing an inhomogeneous

curvature of the suction surface. The 15 ribs are able to create independent curvatures; together,

ribs and spine produce a double curvature. The suction surface consists of 127 suction units that

are individually adjustable in their suction intensity. The ribs and spine are deformed to selectively

manipulate the carbon fiber fabric so that the cut-pieces match the predefined boundary edge

geometry as well as the predefined fiber orientation [7].

During deformation, undesired wrinkles can form, which need to be identified and

removed. The aerospace industry, having little tolerance for error, requires a high product quality.

This makes it imperative to establish an automated method to identify wrinkles if they form on the

fabric throughout the draping process.

The automated draping process by the modular gripper has proven to be very flexible but

also very complex. A total of 145 process parameters must be selected for draping a cut-piece: 3

parameters for the spine motor position, 15 parameters for the rib motor position, and 127

parameters for the intensity of each suction unit, all of which have a significant impact on the quality of

the draping. This is due to two effects that occur during the deformation of the gripper surface.

The stresses arising in the material are either relieved in the fiber direction by buckling of the

material or by shearing the textile transverse to the fiber direction. Both effects can be handled

with the aid of a targeted selection of suction intensities by less firmly gripping the area of the


textile with expected relative movements. The suction intensity must be increased in areas where

the shearing force is applied, to make the grip tighter. The quality of the drapery is characterized

by checking two main items: 1) whether any fold forms in the fiber cut-piece, and 2) how accurately the edge contour of the cut-piece corresponds to the predefined contour.
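To make the size of this parameter space concrete, the sketch below groups the 145 values into a single configuration object. The field names, default values, and Python representation are illustrative assumptions based on the description above, not the actual DLR control interface.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DrapingParameters:
    """Illustrative grouping of the 145 draping process parameters."""
    spine_motor_positions: List[float] = field(default_factory=lambda: [0.0] * 3)
    rib_motor_positions: List[float] = field(default_factory=lambda: [0.0] * 15)
    suction_intensities: List[float] = field(default_factory=lambda: [1.0] * 127)

    def as_vector(self) -> List[float]:
        # 3 + 15 + 127 = 145 parameters in total
        return (self.spine_motor_positions
                + self.rib_motor_positions
                + self.suction_intensities)

assert len(DrapingParameters().as_vector()) == 145
```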

Misalignment of actual and predefined fabric boundary can result in cut-pieces overlapping

each other inside the mold and decrease the final product quality. Moreover, wrinkles can affect

the mechanical properties of a carbon-reinforced component negatively by letting the fiber fold

when the vacuum is applied. Folded fibers cannot resist tension, and the strength of the final component in the fold area is reduced.

Development of an automated solution that visually finds these misalignments and defects

early in the process has been of much interest recently. Such a solution can be used for online

quality inspection of the draping process either when the fabric is grasped by the gripper and is

being transferred to the mold or later when it is placed inside the mold (middle and right images

in Figure 2.3). Also, finding the exact location of the fabric on the gripper can help to adjust the

gripper motion planning and increase the accuracy of fabric placement inside the mold.

Irrespective of how these 127 suction intensities are optimized, whether manually or

automatically, a process is required that validates the drapery according to the selected intensities.

The first step towards such validation is the detection of the cutting-edge curve and any wrinkles

that may have formed on the fabric surface in the draping process. The user can use this data to

monitor the composite production process and apply the necessary adjustments online.

Due to the high number of parameters affecting the process, identifying the underlying

relationship between 145 process parameters (i.e., suction unit intensities and the gripper surface

geometry) and the final mechanical and physical properties of the dry fabric after the draping


process is a challenging task. To tackle this issue, a finite element-based simulation was done by

Montazerian et al. in [8]. To validate the simulation results, the boundary edges of the fabric in

both the simulation and the actual process must be compared. The simulation provides the

boundary edge of the textile over each suction surface by dividing each suction unit into 100

smaller portions along both x and y directions (see Figure 2.4). The fabric contour is then

documented according to this coordinate system. Although the actual contour can be documented

by manually inspecting the images captured from the setup, an automated evaluation method is

highly required as the manual process is extremely time-consuming and costly. The developed

method in this thesis can be used to find the actual contour of the fabric.

Figure 2.4 – Comparison between the boundary of the fabric in the simulation and the actual process

As shown in Figure 2.5, the actual boundary can be identified by the DCNN model using the real images and then compared with the simulated boundary. The validation result lets the

user tune the suction intensities according to the observed deviations. Increasing the suction

intensity in a selected area will prevent the fabric from sliding on the suction surface. Reducing

the suction intensity allows the textile to move across the surface. Using this approach, the user can ensure the desired boundary edge is achieved. The formation of wrinkles can also be controlled by adjusting the suction configuration, as folds always indicate an excessive accumulation of material in an area.

Figure 2.5 - Validation of the draping process simulations
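A minimal sketch of how such a boundary comparison could be automated is shown below. The function name, array layout, and use of a nearest-boundary distance per suction unit are hypothetical simplifications that only illustrate the idea of reporting, for each suction unit, how far the detected contour deviates from the simulated one; the sign of the deviation then indicates in which direction that unit's suction intensity should be tuned.

```python
import numpy as np

def contour_deviation_per_unit(actual_contour, simulated_contour, unit_centers):
    """Hypothetical helper: for each suction unit, report how far the detected
    fabric boundary deviates from the simulated one in that unit's vicinity.

    actual_contour, simulated_contour: (N, 2) and (M, 2) arrays of boundary
    points in the gripper coordinate system; unit_centers: (127, 2) array.
    """
    deviations = np.empty(len(unit_centers))
    for i, center in enumerate(unit_centers):
        # Distance from this suction unit to the nearest point on each contour
        d_actual = np.linalg.norm(actual_contour - center, axis=1).min()
        d_simulated = np.linalg.norm(simulated_contour - center, axis=1).min()
        deviations[i] = d_actual - d_simulated
    return deviations
```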

To validate the draping quality inside the mold, Körber et al. in [9] designed a method to

calculate the target contours with the aid of CAD-based optimization tools. They have shown how

the gripper’s deformation and position inside the mold can be iteratively calculated and optimized

by simulating the process in the CAD environment. For this purpose, the mold model and a

simplified gripper model were integrated into a CAD model using CATIA. With the help of several

CATScript macros and a Python framework, the gripper model was moved onto the location of the

targeted cut-piece. Afterward, the ribs and spine of the gripper model were deformed and the

distances between the suction surfaces and the mold surface were measured after every

deformation step. The optimization was terminated when a termination criterion was met.

Figure 2.6 shows the mold and the gripper model, the targeted cut-piece location before

the simulation run (left image) and the optimized location (right image). The colored surfaces


indicate the distance between the suction surfaces and the mold surface. Green areas are above the

mold surface, yellow areas are in contact with it and red areas penetrate it. This simulation can

provide the desired position and orientation of the gripper, the proper deformation parameters, and

the target contour of the cut-piece on the gripper. The position and deformation parameters are

needed for the automation of the process while the target contour is required for the validation

purposes. To validate the simulation, the actual draping result needs to be captured using a visual

system. The developed method in this thesis can be useful at the validation step by providing the

boundary of the fabric inside the mold.

Figure 2.6 - Simulation of the draping process inside the mold

The Center for Lightweight Production Technologies (ZLP) in Augsburg, Germany,

developed an optical sensor system which provides information about relative movements between

the suction surfaces and the fabric during the draping process [10]. However, this setup was limited

in its ability to detect the fabric boundary geometry and probable wrinkles. The complicated shear

stresses happening while draping a flexible fabric material on a curvature can lead to wrinkling of

the fabric. Wrinkles greatly affect the manufacturing process and reduce the quality of the final

product. Figure 2.7 is an example showing the fabric, gripper, and linear wrinkles that possibly

appear within the draping process (one of the wrinkles is shown with a red bounding box).

Gupta et al. in [1], [11] aimed to address this issue by automatically finding the 3D

geometry and boundary edges of the cut-piece while it was mounted on a yoga ball. Then, they


used this data to predict the material behavior and the draping quality within the deformation. They

gathered both RGB and infrared images to take advantage of both color features and depth

measurements in the detection of the boundary of composite products and the wrinkles that may occur.

Figure 2.7 - Fabric on the gripper

Their experimental results proved their solution's robustness to changes in parameters which were

estimated according to the experimental setup condition; however, it was highly dependent on

tuned parameters that must be re-calibrated for every new situation.

2.4 Limitations of Traditional Image Processing Techniques

Wrinkle detection is based on lighting effects, where one side of the wrinkle will be

significantly lighter than the other. In the original image, subtle wrinkles are difficult to identify,

even to the human eye. To enhance these lighting effects, contrast enhancement through histogram

equalization is performed on the masked area of the fabric. Because the fabric is not oriented in a


fixed position relative to the lighting, one portion of the fabric can have an overall higher light

intensity than another. To help distinguish large-scale from small-scale lighting effects (i.e.

wrinkles), a localized histogram equalization technique is used. The results of this process can be

seen in Figure 2.8.

Figure 2.8 - Fabric with contrast enhanced using histogram equalization (left) and local histogram

equalization (right) methods

Aggressive contrast enhancement succeeds in making the luminosity differences of

wrinkles more visible; however, it also makes the texture of the fabric apparent. The texture can

be seen in the top-right to bottom-left diagonal stripes in Figure 2.8. These stripes are highly

periodic, and as such are a good candidate for filtering in the frequency domain. The method

presented in [12] was utilized, with one change: it was found that tuning the normalizing divisor

(δ) for that method was difficult and image-specific. As such, rather than using a constant divisor,

the magnitude of any identified peak is set to the average of the kernel area. A low-pass Gaussian

filter and several passes with a median filter are then applied to remove high-frequency

components. Figure 2.9 shows the results of these filtering operations with and without the

frequency-domain filtering.


Figure 2.9 - Filtered contrast-enhanced fabric without first using [12] for frequency-domain removal of

periodic noise (left) and with noise removal first (right)
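To make the sequence of operations concrete, the rough OpenCV/NumPy sketch below mirrors the pipeline described above: local histogram equalization on the masked fabric, suppression of periodic texture peaks in the frequency domain, then Gaussian and median filtering. All parameter values (tile size, peak threshold, kernel sizes, number of median passes) are illustrative assumptions rather than the settings used in the original work.

```python
import cv2
import numpy as np

def enhance_wrinkles(gray_fabric, mask):
    """Rough sketch of the classical enhancement/filtering pipeline.
    gray_fabric: 8-bit grayscale image; mask: binary fabric mask of the same size."""
    # 1) Local (tile-based) histogram equalization, restricted to the fabric area
    clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray_fabric)
    enhanced[mask == 0] = 0

    # 2) Suppress the periodic fabric texture in the frequency domain by
    #    replacing strong off-centre peaks with the local average magnitude
    spectrum = np.fft.fftshift(np.fft.fft2(enhanced.astype(np.float32)))
    magnitude = np.abs(spectrum)
    local_mean = cv2.blur(magnitude, (15, 15))
    peaks = magnitude > 4.0 * local_mean            # heuristic peak detector
    h, w = magnitude.shape
    peaks[h // 2 - 5:h // 2 + 5, w // 2 - 5:w // 2 + 5] = False  # keep the DC region
    spectrum[peaks] *= local_mean[peaks] / (magnitude[peaks] + 1e-9)
    filtered = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
    filtered = cv2.normalize(filtered, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # 3) Low-pass Gaussian filter followed by several median filter passes
    filtered = cv2.GaussianBlur(filtered, (5, 5), 0)
    for _ in range(3):
        filtered = cv2.medianBlur(filtered, 5)
    return filtered
```

The need to hand-tune thresholds of this kind for every new lighting condition is precisely the limitation discussed in the next paragraph.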

Considering the steps taken in this process, using traditional image processing techniques has many difficulties, from selecting the right technique at each stage to tuning the parameters of the applied filters, which requires expertise and practical knowledge. Also, the parameters need to be re-tuned if any change happens in lighting conditions, image quality, etc., which hinders the performance of such methods in real time. Thus, it is essential to look for novel methods that can

handle these issues more efficiently.

It should be noted that detecting fabric and gripper during the draping process is simpler

than wrinkle detection because of their regular geometry and color features. So, traditional

techniques can be employed for detecting classes with fewer complexities as a complement to

more advanced methods like convolutional neural networks used for the detection of challenging

objects.


2.5 Visual Sensing Systems

With the increasing demand for efficient, fast and accurate robots in industrial

manufacturing, it is imperative for roboticists to employ unconventional and innovative practices

to solve some non-trivial problems, one of them being the use of Two-Dimensional (2D) image

data to perceive Three-Dimensional (3D) surroundings. The limitations of traditional 2D methods greatly restrict the operability of robots in a practical working environment. The representation of a 3D scene in 2D leads to the loss of some essential features that could play a vital role in addressing the issue.

Robotic engineers have always been inspired by processes occurring in nature, and human binocular vision offers a way to solve problems faced by modern-day industries. This motivates researchers to imitate the human vision system to address the challenges of achieving a robust and well-performing machine vision platform by enabling a more

comprehensive insight into the process. Apart from its deeper semantic understanding of the scene,

such a technique offers the ability to enhance performance on some complex tasks like object

recognition, grasping, handling, manipulating, assembling, etc. Besides, tasks requiring an

external human supervisor can be automated through a practical estimation of the object's pose

and dimension. Thus, the ability to perceive the surrounding environment is crucial to a wide

variety of industries e.g. automation, inspection, process control, robot guidance, etc.

To reach this goal, there is a need to collect visual data by means of appropriate passive optical sensors and to develop a powerful tool to process and analyze the input and obtain the desired output in real time. The technique is also expected to reduce costs, enhance performance by optimizing the process, and avoid heavy computation and long execution times.


To design a vision system capable of performing the object recognition task, some key

parameters need to be chosen.

First, the number, type, and arrangement of visual sensors that will be used. One can use

1) a single visual sensor, 2) a stereo vision system having two visual sensors, or 3) a multi-view

setup including three or more sensors. The visual sensor can be 1) an RGB (Red, Green, Blue) camera that perceives the color features in the images, 2) an RGB-D camera that uses depth features in addition

to the color features, or 3) LIDAR and LIDAR-like systems which use a laser light to measure the

distance from the target. In the case of using more than one sensor, the arrangement of sensors in

the scene can vary considering their relative angle and distance (narrow-baseline or wide-baseline

setups).

Second, choosing the appropriate method for performing the object detection and localization task. There are many options available, ranging from traditional image processing techniques,

Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), to recent deep

learning approaches which are mostly based on CNNs such as Region Proposals, Single Shot

MultiBox Detector (SSD), etc.

Also, it needs to be decided if there is enough labeled data to use supervised learning

methods or a semi-supervised or unsupervised method should be used because of the lack of data.

In the case of using depth features for detection, it should be decided if depth will be added as

another input channel to color channels or it will be used for creating a point cloud or a 3D model

of the scene. The following sections review different methods that can be used for object

recognition in different computer vision applications.



2.6 Object Recognition Algorithms

Before introducing various object recognition algorithms, it is worth distinguishing between different but similar computer vision tasks like image classification, object detection, object localization, instance segmentation, and semantic segmentation, to avoid any confusion later. An image classification method assigns a class label to a whole image, whilst object

localization draws a bounding box around each object present in the image. Object detection is a

combination of these two tasks and assigns a class label to each object of interest after drawing a

bounding box around it. Object recognition is a general term used for referring to all of these tasks

together [13]. Figure 2.10 shows how these concepts are related to each other.

Figure 2.10 - Overview of object recognition tasks in computer vision area

Object detection models are able to create a bounding box for every object present in the

image. But they cannot provide any information about the object’s shape as the bounding boxes

have a rectangular or square shape. In contrast, image segmentation models can create a mask at the pixel level for every object that appears in the image. This method provides us with a more comprehensive understanding of the object(s) present in the image. The image segmentation task itself can be performed in two ways: 1) semantic segmentation, which assigns a label to each pixel of the image according to the class of the object the pixel belongs to (class-aware labeling), and 2) instance segmentation, which identifies object boundaries at the pixel level and distinguishes between separate objects from the same class (instance-aware labeling).

In the following sections, some of the well-known object recognition methods are

described.

I. Scale-Invariant Feature Transform (SIFT)

SIFT is a feature detection algorithm in computer vision that detects and describes local features in an image. It was first published by David Lowe in 1999 [14] and then patented in

Canada by the University of British Columbia in 2004. SIFT applications include object

recognition, 3D modeling, robotic navigation and mapping, video tracking, gesture recognition,

and match moving.

There are multiple steps involved in the SIFT algorithm: 1) finding the potential location

of features (scale-space peak selection), 2) locating the feature key points (key point localization),

3) assigning an orientation to key points (orientation assignment), 4) describing the key points as

a high dimensional vector (key point description), and 5) key point matching.

SIFT starts by extracting the key points of each object from a group of reference images and saving them in a database. To recognize an object in a new image, each feature found in the new image is individually compared with the elements of the database, and the Euclidean distance between their feature vectors is calculated to identify candidate matching features. From the full set of matches, subsets of key points that agree on the object as well as its orientation, location, and scale in the new image are identified as acceptable matches. Then, an implementation of the generalized Hough transform with a hash table is used to determine consistent clusters. Every cluster with at least three features agreeing on an object and its pose is subjected to further detailed model verification and is otherwise discarded as an outlier. In the end, the number of probable false matches and the accuracy of the fit are considered to compute the probability that an observed set of features indicates the presence of an object. An object match is confidently marked as correct if it passes all these tests successfully [15].
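As a rough illustration of this pipeline, the sketch below uses OpenCV's SIFT implementation to extract descriptors from a reference image and a new scene and to match them by Euclidean distance with a ratio test; the image file names are placeholders, and the 0.75 ratio threshold is a commonly used value rather than one prescribed by this work.

```python
# Minimal sketch of SIFT feature matching with OpenCV (image paths are placeholders).
import cv2

reference = cv2.imread("reference_object.png", cv2.IMREAD_GRAYSCALE)
scene = cv2.imread("new_scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(reference, None)   # key points + 128-D descriptors
kp_scene, des_scene = sift.detectAndCompute(scene, None)

# Euclidean (L2) distance between descriptors; keep the two nearest neighbours per feature.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_ref, des_scene, k=2)

# Lowe's ratio test discards ambiguous candidate matches before any clustering step.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} candidate matches survive the ratio test")
```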

II. Viola-Jones Object Detection Framework

The Viola-Jones object detection framework was introduced in 2001 by Paul Viola and

Michael Jones [16], [17] and could achieve competitive real-time object detection rates. It was

primarily developed to address the problem of face detection but also could be used to detect

various object classes. The Viola-Jones algorithm is considered robust because it provides a very high true-positive rate and a very low false-positive rate. Its ability to process at least two frames per second makes it a good candidate for practical real-time applications. However, its performance is limited to the detection task only, so it cannot be considered a complete object recognition method, as detection is just the first step in the whole recognition process.

Viola-Jones is known for its three key contributions. First, it introduced a new image representation, known as the integral image, that allows quick computation of the features used by the detector. Second, an AdaBoost-based learning algorithm was developed, which yields significantly effective classifiers by choosing a small number of crucial visual features from a much larger set [18]. Third, a method was developed to combine increasingly complex classifiers in a cascade, allowing image background regions to be discarded quickly while concentrating computation on regions with a higher probability of containing an object. The cascade statistically ensures that ignored regions are unlikely to contain the desired object [16].
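A minimal sketch of the integral-image idea behind the first contribution is given below, using NumPy only; the image and rectangle coordinates are arbitrary stand-ins.

```python
# Sketch of the integral-image trick that makes Haar-like feature evaluation fast.
import numpy as np

img = np.random.randint(0, 256, size=(480, 640)).astype(np.int64)  # stand-in grayscale image

# Integral image padded with a zero row/column so rectangle sums need no boundary checks.
ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(y, x, h, w):
    """Sum of pixels in the h-by-w rectangle whose top-left corner is (y, x), in O(1)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

# A two-rectangle Haar-like feature: difference between adjacent left and right halves.
feature = rect_sum(100, 100, 24, 12) - rect_sum(100, 112, 24, 12)
print(feature)
```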

III. Histogram of Oriented Gradients (HOG)

The concepts behind HOG were first described by Robert K. McConnell in a patent application in 1986. However, the method received little attention until 2005, when Dalal et al. improved the HOG descriptor and presented their work at the Conference on Computer Vision and Pattern Recognition (CVPR).

HOG is a feature descriptor that is often utilized to extract features from image data and facilitate object detection in a variety of computer vision tasks. The HOG descriptor focuses on the shape or structure of an object and provides both the edge features and the edge direction, which distinguishes it from older methods that could only identify whether a pixel is an edge or not. To do so, it extracts the gradient magnitude and orientation (direction) of the edges. The orientations are calculated in localized portions: the complete image is broken down into smaller regions, the gradients and orientations are calculated for each region, and a histogram is generated for each of these regions separately. The name 'histogram of oriented gradients' is used because the histograms are created from the gradients and orientations of the pixel values.
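The following sketch computes a HOG descriptor with scikit-image; the window size and cell/block parameters are typical values, not ones taken from this thesis.

```python
# Sketch of extracting a HOG descriptor with scikit-image (parameter values are typical defaults).
import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)  # stand-in grayscale patch, e.g. a pedestrian-sized window

descriptor = hog(
    image,
    orientations=9,            # number of gradient-orientation bins per histogram
    pixels_per_cell=(8, 8),    # local region over which each histogram is accumulated
    cells_per_block=(2, 2),    # blocks of cells used for contrast normalisation
    block_norm="L2-Hys",
)
print(descriptor.shape)  # one long feature vector describing the edge structure of the window
```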

IV. Convolutional Neural Networks (CNNs)

CNNs have shown prominent performance on both complex and low-level vision tasks such as instance segmentation [19], stereo depth perception [20], [21], object detection [22], [23], pose estimation [24], [25], image classification [26], optical flow prediction [27], and stereo estimation [28].

The roots of CNNs for image classification go back to the 1980s. This early work focused on the identification of hand-written digits, specifically as related to automated zip code detection [29]–[31]. The work continued through the late 1990s, but with little adoption [30]. The method worked well but suffered from the lack of parallel compute power available at the time [32]. Advances in GPU (graphics processing unit) computation and the increased availability of datasets led to a renewal of interest in 2006 and to technological advances such as the first application of max pooling for dimensionality reduction [33]–[37].

The current enthusiasm for DCNNs was sparked by the ImageNet Large Scale Visual

Recognition Challenge (ILSVRC), wherein the winning entry in 2012 was a DCNN [26]. DCNNs

have since dominated the ILSVRC, and specifically the image classification component [13].

Some of the well-known DCNNs that have been frequently used in recent years are introduced in

the following sections:

i. Region Proposals (R-CNN)

R-CNN [22] is short for “Region-based Convolutional Neural Network”. The R-CNN pipeline is based on two steps. First, a manageable number of bounding-box object regions, known as regions of interest (RoIs), are selected as candidates using a selective search approach. Second, CNN features are extracted from each region independently and used to perform classification.

To overcome the expensive and slow training process of the R-CNN model, Girshick et al. enhanced the training process by joining three independent models together and training the new framework, called Fast R-CNN [38]. This model performs a single CNN forward pass over the whole image and shares the resulting feature matrix across all region proposals instead of extracting features independently for every proposal. The same shared features are then used to learn both the bounding-box regressor and the object classifier, which speeds up R-CNN by taking advantage of computation sharing.

Although Fast R-CNN was noticeably faster during training and testing, the improvement was limited by the high cost of generating the region proposals separately with another model. Faster R-CNN [39] sped up this process by integrating the region proposal algorithm into the CNN model, constructing a single, joined model made of Fast R-CNN and a Region Proposal Network (RPN) that share the same convolutional feature layers.

Later, He et al. extended Faster-RCNN by appending a branch which predicts segmentation

masks in each RoI [40], making it able to identify each object instance for every known object

within an image.

Pixel-level segmentation needs more accurate alignment than bounding boxes. Mask R-CNN therefore improved the RoI pooling layer to provide a more precise mapping of each RoI to the corresponding region of the original image. Mask R-CNN beat all previous records in different parts of the COCO suite of challenges [41]. Its success has led to its employment in a variety of applications, from the detection and segmentation of oral diseases [42] to the detection and classification of road damage using images captured by smartphones [43].
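As an illustration, the sketch below runs instance segmentation with torchvision's Mask R-CNN implementation pre-trained on COCO; the input tensor and the 0.5 score threshold are placeholders, and depending on the torchvision version the pretrained flag may need to be replaced by the newer weights argument.

```python
# Sketch of instance segmentation with torchvision's Mask R-CNN, pre-trained on COCO.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)          # stand-in RGB image, values in [0, 1]
with torch.no_grad():
    output = model([image])[0]           # the model accepts a list of images

# Each detection comes with a box, a class label, a confidence score, and a pixel-level mask.
keep = output["scores"] > 0.5
boxes, labels, masks = output["boxes"][keep], output["labels"][keep], output["masks"][keep]
print(boxes.shape, masks.shape)          # masks: (N, 1, H, W) soft masks per instance
```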

In 2019, a new framework called TensorMask was introduced by Facebook AI for

extremely accurate instance segmentation tasks [44]. It uses a dense, sliding-window technique


and novel architectures and operators to capture the 4D geometric structure with rich and effective

representations for dense images. The main idea is that although the direct sliding-window

paradigm can accurately detect objects in a single stage without requiring a follow-up refinement

step, it is not effective for instance segmentation tasks, where instance masks are complex 2D

geometric structures, not simple rectangles. So, high-dimensional 4D tensors with scale-adaptive

sizes are required to assure effective representation of instance masks while they slide densely on

a 2D regular grid. TensorMask accomplishes this using structured, high-dimensional 4D geometric

tensors, which are composed of sub-tensors having axes with well-defined units of pixels. These

sub-tensors enable geometrically meaningful operations, such as coordinate transformations, up-scaling, down-scaling, and the use of scale pyramids.

ii. Single Shot Multi-Box Detector (SSD)

SSD was introduced by Liu et al. in [45] in 2016 and set a new record in object detection performance. Unlike RPN-based approaches such as the R-CNN series, which generate region proposals and then detect each proposal's object in two separate stages, SSD detects all objects present in the image in a single shot, which lets it run faster.

Single shot means that object classification and localization tasks are both done in

a single forward pass of the CNN. MultiBox is a bounding box regression method developed by

the authors. The detector network first acts as an object detector and then classifies the detected

objects itself. SSD discretizes the output space of bounding boxes into a set of default boxes over various aspect ratios and scales for every feature map location. When predicting objects in new images, the network scores the presence of each object class in every default box and then adjusts each box to better match the shape of the object. In addition, predictions coming from multiple feature maps with different resolutions are combined so that the model can handle objects of different sizes. SSD proved its competitive

accuracy against other techniques with an object proposal step by achieving better results on the

MS COCO, PASCAL VOC, and ILSVRC datasets.
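The sketch below illustrates the default-box idea described above by tiling boxes of several aspect ratios over one feature map; the grid size, scale, and aspect ratios are illustrative values, not the exact SSD configuration.

```python
# Sketch of tiling SSD-style default boxes over a feature map (not the exact SSD recipe).
import numpy as np

feature_map_size = 8          # an 8x8 feature map over a normalised [0, 1] x [0, 1] image
scale = 0.2                   # box scale associated with this feature map
aspect_ratios = [1.0, 2.0, 0.5]

default_boxes = []
for i in range(feature_map_size):
    for j in range(feature_map_size):
        cx = (j + 0.5) / feature_map_size    # box centre for this feature map location
        cy = (i + 0.5) / feature_map_size
        for ar in aspect_ratios:
            w = scale * np.sqrt(ar)          # wider boxes for larger aspect ratios
            h = scale / np.sqrt(ar)
            default_boxes.append([cx, cy, w, h])

default_boxes = np.array(default_boxes)
print(default_boxes.shape)    # (8 * 8 * 3, 4): one (cx, cy, w, h) box per location and ratio
```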

iii. You Only Look Once (YOLO)

Contrary to the R-CNN family, which locates objects by only looking at regions of the image that are more likely to contain an object, the YOLO framework takes the whole image as its input and predicts the coordinates of the bounding boxes and their class probabilities. Three main advantages of YOLO are 1) its very fast speed, processing images at 45 frames per second (fps) in real time; 2) its ability to reason globally, seeing the entire image while making predictions, unlike region proposal-based and sliding-window techniques; and 3) its capability to learn generalized object representations. The working principle of the unified YOLO model is simple: a single convolutional network predicts multiple bounding boxes and their class probabilities simultaneously. YOLO is trained on full images, which lets it directly optimize detection performance.

After the introduction of YOLO by Redmon et al. in 2016 [46], various versions of YOLO

have been developed trying to enhance its performance in different aspects. Fast YOLO is a version

of YOLO using 9 convolutional layers instead of 24 which performs about 3 times faster than

YOLO but has lower mAP (mean Average Precision) scores [47]. YOLO VGG-16 uses VGG-16

as its backbone instead of the original YOLO network, which makes it more accurate but too slow for real-time applications. YOLOv2 focused on reducing the significant number of

localization errors and improving the low recall of original YOLO while maintaining classification

accuracy [48]. The original YOLO can only detect 20 classes which are not enough for a large

group of object detection applications while YOLO9000 is a real-time framework for detecting

more than 9000 object categories by jointly optimizing classification and detection [48]. YOLOv3

is the latest member of the YOLO family with some improvements compared to YOLOv2 such as

a better feature extractor, DarkNet-53 backbone with shortcut connections, and a better object

detector with feature map up-sampling and concatenation [49].
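To make the grid-based prediction scheme concrete, the sketch below decodes a hypothetical YOLO-style output tensor with an S × S grid, B boxes per cell, and C classes; the tensor values and the confidence threshold are stand-ins, not outputs of a trained model.

```python
# Sketch of decoding a YOLO-style output tensor: an S x S grid, B boxes per cell, C classes.
import numpy as np

S, B, C = 7, 2, 20                             # grid size, boxes per cell, classes (original YOLO)
prediction = np.random.rand(S, S, B * 5 + C)   # stand-in network output for one image

for row in range(S):
    for col in range(S):
        cell = prediction[row, col]
        class_probs = cell[B * 5:]                         # conditional class probabilities
        for b in range(B):
            x, y, w, h, objectness = cell[b * 5:(b + 1) * 5]
            # Class-specific confidence = objectness * class probability (YOLO's scoring rule).
            scores = objectness * class_probs
            if scores.max() > 0.6:                         # hypothetical confidence threshold
                print(f"cell ({row},{col}) box {b}: class {scores.argmax()}, score {scores.max():.2f}")
```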

iv. DeepLab

DeepLab is a revolutionary semantic segmentation model designed by Google. It uses atrous convolutions and simply upsamples the output of the last convolutional layer, computing a pixel-wise loss to make dense predictions. Atrous convolution refers to convolution with up-sampled (dilated) filters, which allows the network to efficiently expand the filters' field of view without increasing the computational load or the number of parameters.

DeepLab V1 made some improvements over the previous models by the employment of

Fully Convolutional Networks (FCN) which led to overcoming two main challenges: 1) reduced

feature resolution caused by multiple pooling and down-sampling layers in DCNNs, 2) reduced

localization accuracy caused by DCNNs invariance.

DeepLab V2 attempted to further enhance the performance of DeepLab V1 by addressing

another challenge, the existence of objects at multiple scales. It used a novel technique called

Atrous Spatial Pyramid Pooling (ASPP), wherein multiple atrous convolutions with distinct

sampling rates were applied to the input feature map and the outputs were combined [50].


The next remaining challenge for DeepLab was capturing sharper object boundaries. The DeepLab V3 architecture employed a novel encoder-decoder structure with atrous separable convolution to recover sharper object boundaries and tackle this issue. DeepLab V3 also applied depth-wise separable convolutions to increase computational efficiency [51].

After the great success of DeepLab V2 and DeepLab V3 on the PASCAL VOC 2012

semantic image segmentation benchmark [52], Chen et al. extended their previous work by adding

a decoder module to improve the segmentation performance, particularly along object boundaries

[53]. They took advantage of both ASPP and decoder modules to achieve a more efficient encoder-

decoder architecture and thus improved their performance on the PASCAL VOC 2012 challenge.
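A minimal inference sketch with torchvision's DeepLabV3 implementation is shown below; the input tensor is a placeholder, and the pretrained flag may need to be replaced by the weights argument in newer torchvision versions.

```python
# Sketch of semantic segmentation with torchvision's DeepLabV3 (pre-trained weights assumed).
import torch
import torchvision

model = torchvision.models.segmentation.deeplabv3_resnet50(pretrained=True)
model.eval()

image = torch.rand(1, 3, 512, 512)              # stand-in normalised RGB batch
with torch.no_grad():
    logits = model(image)["out"]                # (1, num_classes, H, W) per-pixel scores

mask = logits.argmax(dim=1)                     # dense prediction: one class label per pixel
print(mask.shape)                               # torch.Size([1, 512, 512])
```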

v. U-Net

U-Net is an encoder-decoder based architecture developed by Ronneberger et al. for

biomedical image segmentation [54]. One of the important challenges in the medical computer

vision field is the limited number of datasets as many are not publicly available to protect patient

confidentiality. Moreover, accurately annotating medical images requires trained personnel. U-

Net employs data augmentation techniques to effectively use the available labeled samples which

makes it a useful tool in cases without large annotated datasets. U-Net has also been shown to

perform well on grayscale datasets.

vi. IC-Net

Zhao et al. introduced an image cascade network, called IC-Net, that combines branches with different resolutions under appropriate label guidance to address the real-time semantic segmentation challenge [55]. The importance of architectures like IC-Net is that they achieve an acceptable trade-off between accuracy and efficiency, allowing image segmentation tasks to be performed in real time, which is a necessity for a variety of applications.

2.7 Transfer Learning

Machine learning models are developed assuming that the training and test data have a

similar feature space, and the same distribution. So, we need to develop a new model and collect

a new set of training data if the feature space or the data distribution changes. Collecting the needed

training data and rebuilding the models is expensive and time-consuming. Thus, finding a way of

transferring knowledge between task domains is of much interest. The idea of transfer learning in deep learning is inspired by the fact that humans can use knowledge gained from learning previous tasks to learn new tasks or address new issues better and faster. In the computer vision field, this is analogous to a CNN that learns basic image features such as shapes, corners, and illumination across various tasks and then employs what it has learned to understand new classes in images more efficiently.

To implement transfer learning, the last prediction layers of the pre-trained model are removed and substituted with new prediction layers adapted to the target task. During training, the weights of the preserved part of the model are kept fixed and are not updated, so the pre-trained model acts as a feature extractor. Training therefore only changes the weights of the last layers, which learn the task-specific features. Using transfer learning improves performance with less training time and reduces the data requirement from huge training sets with tens of thousands of images to smaller sets with only a few hundred images. Figure 2.11 shows how the transfer learning technique can be applied to make a neural network that is pre-trained on the ImageNet dataset, with over 14 million images, able to detect new objects such as the gripper, fabric, and wrinkles by re-training the last layers on a small dataset of only about 8,000 images.

Figure 2.11 - Transfer learning diagram for visual inspection of the draping process using a pre-trained

model
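A minimal sketch of this procedure in PyTorch is shown below, assuming an ImageNet-pre-trained ResNet-50 as the frozen feature extractor; the backbone choice and layer names are assumptions for illustration, not the network used in this thesis.

```python
# Sketch of transfer learning as described above: freeze an ImageNet-pre-trained backbone
# and retrain only a new prediction head for the four classes used in this work.
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(pretrained=True)   # pre-trained feature extractor

for param in model.parameters():                       # keep pre-trained weights fixed
    param.requires_grad = False

# Replace the last (ImageNet-specific) layer with a head for background/gripper/fabric/wrinkle.
model.fc = nn.Linear(model.fc.in_features, 4)          # only this layer will be trained

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```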

2.8 Stereo Vision Image Processing

The increasing demand for machines that build 3D models of their surrounding environment has led to the further development of stereo vision systems, which infer depth information from two (or more) images. Scene understanding includes many different aspects, ranging from object detection and image classification to pose estimation and depth perception; it leads to recovering the physical geometry of the scene and is a key capability for many natural and artificial systems. This field has a wide variety of practical applications, such as obstacle avoidance in autonomous driving vehicles, object detection and manipulation in robotics, navigation and localization of mobile systems, and surface detection for the landing of drones and UAVs (unmanned aerial vehicles).

A stereo vision system is made up of two cameras placed at a fixed distance relative to each other. The final goal of the system is to perceive the depth of each point in the image and create a 3D model of the scene by processing the two simultaneously captured images, based on triangulation or other methods such as machine learning. However, this process is subject to many problems, such as occlusion, photometric distortions and noise, specular surfaces, foreshortening, the uniqueness constraint, perspective distortions, uniform (ambiguous) regions, repetitive patterns, transparent objects, and discontinuities, which make it difficult to find correspondences between the two images.
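The sketch below shows the basic triangulation recipe on a rectified image pair: a block-matching algorithm estimates the disparity d of each pixel, and depth follows from Z = f·B/d. The file names, focal length, and baseline are placeholder values that would normally come from calibration.

```python
# Sketch of recovering depth from a rectified stereo pair via disparity (values are placeholders).
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching finds correspondences along epipolar lines.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0   # SGBM returns fixed-point values

focal_length_px = 1200.0      # assumed focal length in pixels (from calibration)
baseline_m = 0.12             # assumed distance between the two cameras in metres

valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_length_px * baseline_m / disparity[valid]       # Z = f * B / d
```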

Many methods have been proposed that use only a single vision sensor for this purpose. However, human eyesight, a prime example of a stereo vision system, can estimate the size and depth of an object almost instantly. Approaching this level of performance therefore requires additional tools beyond a single sensor.

Recently, several research works have been done to improve different aspects of the stereo

vision technique. Osswald et al. developed a spiking neural network architecture that exploits an

event-based representation to address the stereo correspondence issue [56]. Fererra et al. presented a novel benchmark to compare the performance and limitations of monocular and stereo vision systems for target detection tasks by studying 3D pose estimation of a known target in severely challenging conditions [57]. Many novel approaches have been used to obtain the depth of a particular object from a single image [58]–[61]. However, using a stereo


vision system for object detection is usually better than using a single camera since it is simpler to

calibrate and produces more precise results [62]. Eigen et al. proposed a quick and simple

multiscale model architecture for convolutional networks that performs excellently on three

modalities including surface normal, depth, and semantic labels [63].

Other research works have focused on practical applications of stereo vision systems. Sangeetha et al. designed and implemented a stereo vision system to control robotic arms in space applications [64]. In another work, the size and depth of objects in the surrounding environment were used for the localization and navigation of mobile systems [62].

McGuire et al. presented a computationally efficient stereo optical flow algorithm for obstacle

avoidance and velocity estimation on micro aerial vehicles [65]. Balter et al. employed an adaptive kinematic control method together with a stereo vision system to control a robotic venipuncture device, demonstrating that such systems can enter the realm of modern medicine as a powerful technique [66].

2.9 Multi-View Computer Vision

The fusion of multiple views of the same scene is one of the main techniques in the fields of robotics, pose estimation, and 3D reconstruction. However, it is not a common option for semantic segmentation tasks due to 1) the lack of annotated multi-view datasets, 2) the difficulty of relating the location of each image to a reference coordinate frame, and 3) the complexity of the fusion techniques, especially when performing feature-level image fusion.

In such a system, a single dynamic camera can be used to aggregate several views and

create a semantic reconstruction of the environment, or multiple fixed cameras can be employed

to obtain various aspects of the scene simultaneously.


Ma et al. proposed a new deep learning method that performs semantic segmentation by taking multi-view RGB-D images as input [67]. They show that enforcing multi-view consistency during training leads to a considerable improvement in fusion at test time compared to training the networks on single views and then fusing the predictions.

In some cases, such as wrinkle detection during the draping process, a single view of the scene is not enough for segmentation, as the object is constantly moving and one camera cannot always provide a clear view of the scene. Therefore, employing a setup of multiple cameras fixed at various locations in front of the scene, together with a multi-view deep learning algorithm that analyzes the views at the feature level, can be proposed as a promising solution.

2.10 Depth Estimation

The rapid development of computer vision systems and the emergence of machine learning techniques have been two key factors in improving machine vision in recent years. Although visual sensors can provide high-quality images of the scene in different environments and conditions, obtaining the depth of objects is not as easy as capturing images and requires extra computational effort. A robust depth estimation technique can benefit various applications such as medical imaging, industrial automation, autonomous driving vehicles, and the 3D reconstruction of urban environments.

Researchers have taken different approaches to tackle this issue. Depth estimation using a single image has attracted much attention as it only needs one camera and does not require synchronizing different sensors. However, the required algorithm is more complicated than in multi-view systems and is highly dependent on object size, the global view, and environmental conditions. On the other hand, systems composed of multiple cameras are more robust and accurate because they exploit geometric relations between the images. Nonetheless,

they are more expensive, and the calibration process of cameras is time-consuming, complex and

error-prone.

From the machine learning point of view, although DCNNs have made impressive progress in recent years, their need for large annotated training datasets is a barrier to adapting them to different conditions. That is why researchers often prefer semi-supervised and unsupervised approaches over basic supervised learning methods. However, these techniques are usually based on human knowledge and expertise, which means they need to be hand-tuned for every new situation.

To find the best method for performing depth estimation tasks, one needs to consider different

parameters like:

• Number of available visual sensors

• Range of depth values in the scene

• Desired accuracy and permitted error

• Expected time for performing the task

• The availability of a computational unit for training the neural networks

• The amount of available annotated data for using a supervised learning method

• The presence of experts to tune the network by their knowledge and expertise while

employing semi-supervised or unsupervised learning approaches

• The environmental condition

• The possibility of occlusions and shadows

• The baseline of cameras


The following sections review some of the key research works on the depth estimation task, organized by the machine learning technique employed and the number of views used. After introducing the RGB-D image processing paradigm, the methods are classified into two main categories: depth estimation using a single image or using multiple images. Each category is then divided into three subcategories based on the learning approach employed: supervised, semi-supervised, or unsupervised. Finally, research works that use multiple cameras and a supervised learning method are divided into smaller subcategories based on the width of their camera baseline (narrow-baseline or wide-baseline).

I. RGB-D Image Processing

One of the fundamental goals of computer vision research is to make computers able to look at images in a human-like manner. Consequently, understanding the image at the pixel level, known as semantic image segmentation, has attracted much attention in areas such as object detection [41], [52] and scene understanding [68], [69]. In recent years, the development of DCNNs [23], [26], [70], together with the accessibility of large-scale annotated image datasets, has led to a great enhancement in the performance of semantic segmentation algorithms [50], [71], [72]. RGB semantic pixel-wise labeling can be considered the starting point of recent semantic segmentation work [54], [63], [71]–[76].

Although the emergence of different semantic segmentation algorithms helped computers perceive the environment in a way similar to humans, some geometric information was still missing from the color channels and could only be extracted from depth information. Meanwhile, the availability of an additional depth channel, which supports a better understanding of the geometric information in the scene, resulted in increasing interest in the semantic segmentation of RGB-D images [77], [78]. As a straightforward solution, depth data gathered by cheap depth sensors was added to the input color channels as an extra channel [19], [72]. Considering the

merits of employing depth data, the next works were focused on the development of networks that

were able to jointly learn from depth and color information [79], [80]. These networks used depth

data to separate objects/scenes while using color channels to extract semantic information [81].

After the introduction of CNNs and fully connected networks, they were employed for

extracting the depth features to empower the segmentation of RGB-D images. Couprie et al.

trained a CNN on a mixture of RGB and depth images to synthesize information at distinct

receptive field resolutions [82]. Gupta et al. based their work on the R-CNN network [22] and used

a novel idea to detect objects [19]. They added a depth image to available three channels and called

it HHA, which retains each pixel's height above ground, horizontal disparity, and the angle of the

local surface normal. Then, they attained semantic segmentation of the image by training a

classifier using the features extracted by CNN. Long et al. fused both RGB and HHA images and

proposed an FCN with an up-sampling stage that outputs a high-resolution segmentation by

combining low-resolution predictions [72]. Their method boosts the accuracy of semantic

segmentation compared to other methods that directly fuse the segmentation scores. Eigen et al.

used a multi-task network to estimate surface normal, depth, and semantics for RGB-D images

[63]. Trying to develop an end-to-end solution better than the use of HHA or a direct concatenation

of depth and RGB features, Fusenet was introduced as an encoder-decoder network which tries to

fuse complementary depth information and color cues into a semantic segmentation framework

[83]. The encoder part simultaneously uses RGB and depth images for feature extraction and then

combines RGB feature maps and depth features in different levels of fusion (dense and sparse

fusion) as the network goes deeper. Some other research groups have also focused on the


development of more complex CNN architectures for conducting single image semantic

segmentation [84], [85].

Although CNNs are known for their good performance on single image segmentation,

applying them to 3D reconstruction tasks using multi-view images is not usual. Riegler et al.

presented a novel 3D representation of CNN, called OctNet, for deep learning with high-resolution

inputs that can be used in various tasks such as pose estimation, object categorization, and semantic

segmentation on voxels [86]. McCormac et al. combined CNNs with a SLAM (simultaneous localization and mapping) system, called ElasticFusion, which produces a useful semantic 3D map by fusing the semantic predictions made by the CNN on multiple views into a map [87]. He et al. proposed a superpixel-based multi-view

CNN which employs information obtained from complementary views of the same scene for single

image segmentation [88]. Ma et al. trained a network on multi-view consistency and fused the

outputs from different viewpoints as complementary to the mentioned monocular CNN methods

[67].

II. Depth Estimation Using A Single Image

Recently, many robust algorithms have been developed for finding the depth of points in

an image using stereo vision. However, estimating the depth of an object using only a single image

remains an open issue due to the weak performance of developed algorithms.

Monocular depth estimation techniques rely on cues such as object sizes, global views, line angles, and environmental conditions, which are not easy to obtain reliably. Also, each set of cues can correspond to hundreds of possible world scenes, which are hard to choose between.


On the other hand, monocular vision setups are relatively cheaper and faster. They also are

not dependent on the synchronization of cameras, which is one of the major sources of error in

multi-view setups.

i. Supervised Learning

Development of depth datasets like KITTI, RGB-D Scenes, Sun 3D and NYU Depth, has

facilitated the use of supervised learning techniques for performing depth estimation tasks. Most

of the available RGB-D datasets are composed of images and depth information of objects which

are usually found in houses, offices, streets, etc. These datasets are useful for most depth estimation tasks. However, for more specific cases, researchers must either create a new annotated dataset, which requires human effort, or take semi-supervised or unsupervised approaches, which involve more technical complexity.

Eigen et al. presented a supervised monocular depth estimation method that uses two neural

networks working successively [89]. The first network predicts the depth at a coarse-scale using

original input. Then, the second network takes both the original image and output of the coarse

network and refines the output within local regions. In contrast to stereo matching, local views are

not enough for detecting dominant features in monocular techniques. So, a global understanding

of the scene is necessary to employ cues like object locations, vanishing points, etc. Finding this

global understanding is the main goal of the coarse-scale network while the fine-scale network

works locally.


ii. Semi-supervised Learning

Although supervised deep learning algorithms made huge progress in performing depth

estimation tasks, they are highly dependent on having a large amount of training data to perform

well. Creating the required data needs employment of depth sensors like RGB-D cameras for

indoor environments and 3D laser scanners for outdoor settings. When using these extra sensors, it is undeniable that their error and noise are added to the system. Correlating the data gathered by lasers with images is another problem, as laser output is much sparser than raw images. Finally, tuning the cameras and finding the exact values of the intrinsic and extrinsic parameters at every moment is another challenge in this process. To overcome these problems, semi-supervised learning approaches employ a small amount of labeled data plus a large set of unlabeled data to train the neural network efficiently.

Kuznietsov et al. proposed a semi-supervised, encoder-decoder based deep learning

method for monocular depth perception [90]. The method was based on a deep residual network architecture with an encoder-decoder structure. It used stereo vision geometry principles to learn depth estimation directly from binocular image pairs without requiring dense ground-truth depth.

iii. Unsupervised Learning

The vast demand for hand-labeled data to train the neural networks is the most challenging issue that depth estimation algorithms struggle with. Unsupervised learning refers to machine learning methods that make predictions on datasets without annotated images. This approach enables the network to learn depth directly and establishes an end-to-end solution to depth estimation tasks.


Garg et al. presented an unsupervised method that requires neither pre-training nor annotated ground truth [91]. The main idea is to take the structure of an auto-encoder neural network and train it on a set of image pairs. The only difference

between each pair of images is that the second one is taken after a small, known camera motion.

So, each pair can be considered as a stereo pair captured by two relatively close visual sensors at

the same time.

III. Depth Estimation Using Multiple Images

Depth estimation from multiple images is one of the most challenging tasks in the field of

computer vision. A variety of applications like object grasping using robotic arms, distance

estimation for autonomous driving, 3D reconstruction of urban areas and buildings, product

handling in industrial environments, medical imaging, movie and game making industry, etc.,

require a robust and accurate depth estimation of the scene as an essential component. The

performance of available solutions largely depends on parameters like lighting condition, the type

of cameras, the arrangement of cameras, the rate of changes in sequential images and depth range.

Thus, finding a robust and comprehensive solution to this issue has been of much interest in recent

years.

Single-image reconstruction methods suffer from restrictions related to viewing conditions, reflectance, and the symmetry of the images. Therefore, using multiple cameras and finding the relations between images by calculating disparities becomes an option. The advantage of these methods is that they are quicker and do not require much computational effort. They also usually provide better depth accuracy than single cameras.


However, multi-view learning algorithms need at least several tens to hundreds of properly captured images. Furthermore, they have trouble in cases with objects occluded in several views, similar textures, and color variations.

i. Supervised Learning (Narrow-baseline)

Although multi-view setups benefit from looking at the scene from different views and provide more details, the limited space available to mount the cameras is an issue in some object detection applications. Therefore, developing algorithms that perceive depth using a narrow-baseline camera setup has become of interest.

Wang et al. developed a deep neural network to achieve a real-time multi-view depth

estimation system that can work with two or more images captured from diverse types of cameras

[92]. Their model offers a new solution inspired by classic multi-view systems with improved features. It uses a combination of encoder-decoder networks in a manner similar to auto-encoders, although the two kinds of systems differ in some technical details.

ii. Supervised Learning (Wide-baseline)

In some applications like industrial manufacturing processes, autonomous driving vehicles,

etc., cameras can be located relatively far from each other resulting in a wide-baseline setup. It has

been one of the trending techniques during recent years as it provides a larger Field of View (FOV)

and allows for coverage of a broad area rather than a limited region.

Jorissen et al. presented a novel solution that uses the Image-Based Rendering (IBR)

technique to create enough images for stereo matching [93]. The algorithm extracts structures from images captured by a sparse linear camera setup. Starting the extraction process from the central view and then rendering the depth maps for the adjacent views resolves partial occlusions. Stereo matching algorithms usually work with matching windows and try to match pixels by finding correspondences between images, looping over a set of predefined disparities. To deal with occlusions in multi-camera setups, a confidence value is assigned to each pixel, and pixels with a value lower than a threshold are recalculated using data from neighboring pixels.

iii. Unsupervised Learning

Creating annotated datasets from images captured by multiple cameras is even harder, as the cameras must be perfectly synchronized during the entire data creation process. Some studies use additional depth sensors to improve the accuracy of the depth information. However, a well-established unsupervised learning algorithm can remove all these barriers and provide an end-to-end solution without the need for human effort.

Li et al. proposed a new approach that requires coarse point-cloud samples to create a dense

depth map of the image [94]. The used methodology requires an image, as well as sampled depth

points. These points are generated from the ground truth provided by the dataset by obtaining only

a portion of the overall depth points.

Wang et al. developed a method using unsupervised learning and a perceptual loss to

estimate the depth with a twofold setup [95]. In this setup, the images are first passed through the

depth and pose estimation network which then passes the data to the perception network. The depth

network obtains raw feed from the input images. Using Depth-net, the depth map of each image is

determined. The first image and the next are then sent to Pose-net which helps to recreate a pose

47

estimation for the first image. Finally, two networks are put together to create a reconstructed view

of pose and depth.

IV. Depth Estimation Challenges

The main problems challenging research in this field come from the lack of annotated datasets, errors in the calibration and synchronization of cameras, the lack of expert knowledge in some specific applications, and the complexity of neural network architectures.

It is usually preferred to use supervised learning for depth estimation as it is more robust

to noise and unwanted changes compared to the unsupervised learning methods. However, in more

specific applications that are not explored much yet, it is easier to focus on unsupervised

approaches rather than creating new datasets.

The number of employed cameras depends on the application and varies according to the

available space for mounting the cameras, available visual sensors, size of the region of interest,

and required estimation accuracy.

To improve existing solutions, one can focus on enhancing the architecture of the neural

networks. Recent studies show that using an appropriate loss function can lead to notable improvements in network performance. Furthermore, the data preprocessing stage has a great impact on the estimation results. It is expected that most future work in this field will focus on adjusting CNN architectures, extracting depth cues from intermediate layers, and employing them for the final estimation. Also, the expansion of annotated datasets in the future can help enhance the performance of currently developed solutions.


Chapter 3 : Infrastructure for Dataset Development

3.1 Hardware Setup

3.1.1 Modular Gripper

To drape the cut-pieces, an end-effector system called the "Modular Gripper" was

developed at DLR, which can deform its gripper surface to a double curved geometry with the aid

of a rib-spine design. The process can be broken down into the following steps. First, a textile cut-

piece is placed on a table. Second, the gripper picks up the blank by sucking the fabric against its

surface using an array of suction units and preforms the blank to the double-curved target geometry

by manipulating the suction units. Lastly, the cut-piece is precisely placed into the mold [1].

The gripping system functions based on two main components (Figure 3.1). On one side, 127 suction units form a suction surface which grasps the textiles. For this purpose, each suction unit has a 100 mm × 100 mm effective area, and its suction intensity can be individually varied in eight

stages. The second component consisting of a rib-spine system ensures the deformation of the

suction surface from a flat state to a double-curved geometry. The spine consists of two flexible

glass fiber rods and three linear actuators. The linear actuators pull on the rods and thus create a

curvature that affects the adjustment of the connected suction units. The second curvature is

achieved with 15 ribs. The suction units of one rib are mechanically connected by a shaft. If this

shaft is rotated by an actuator, the distance between two lever arms is shortened, resulting in a

homogeneous change in the angle of attack of all connected active surfaces. This mechanism

makes the rib-spine system able to create various double-curved geometries.


Figure 3.1 - Main components of the modular gripper

3.1.2 IDS XS Camera

The IDS XS camera with autofocus is an industrial camera that can be easily used for a variety of applications. The XS comes with USB 2.0 and Mini-B USB 2.0 connectors, which facilitate its integration. The camera is equipped with OmniVision's 5-megapixel CMOS sensor with a 1.4 µm pixel size, which provides high-quality images with high color reproduction accuracy in both normal and harsh lighting conditions. The small size of the XS, its light weight, compact design, integrated power supply, and ability to capture video at 15 fps (2592 × 1944 pixels) make it a perfect choice for embedded systems and industrial applications.

3.2 Software Setup

3.2.1 Data Generation for Training Image Classification Models


To train the deep neural networks, a dataset of approximately 16,000 images was created.

A stereo imaging module consisting of two IDS XS cameras was used to capture the movements

of the robot holding the gripper with differently sized and shaped fabrics. The image pair (left and

right) from the imaging module was recorded in 14 different scenarios, each at 14 fps while the

duration of each video was around 40 seconds. Then, each frame of the image pair was divided

into 128 (64 for left and 64 for right) smaller images to enhance the features local to the area. Each

acquired frame from the stereo imaging module was 1280 × 720 pixels in size and was then divided into an 8 × 8 grid to facilitate hand-labeling. This process yielded 64 smaller images (160 × 90 pixels in size) per frame.
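A sketch of this tiling step is shown below; the frame array is a stand-in for a captured image, and only the grid arithmetic reflects the procedure described above.

```python
# Sketch of splitting one 1280 x 720 frame into the 8 x 8 grid of 160 x 90 sub-images
# described above (the frame array is a stand-in for a real captured image).
import numpy as np

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # height x width x RGB
rows, cols = 8, 8
tile_h, tile_w = frame.shape[0] // rows, frame.shape[1] // cols   # 90 x 160 pixels

tiles = []
for r in range(rows):
    for c in range(cols):
        tile = frame[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
        tiles.append(((r, c), tile))               # keep the grid index for later labeling

print(len(tiles), tiles[0][1].shape)               # 64 tiles of shape (90, 160, 3)
```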

Hand-labeling 16,000 images is time-consuming and challenging. To accelerate this process, the slow movement of the experimental setup was exploited: images were batched into groups where little movement had occurred between frames and thus the labels were expected to be similar.

The first image in the batch was labeled manually (Figure 3.2), and the label was copied to every

other image in the batch. The batch was then visually inspected for errors. In Figure 3.2, an image

frame is shown before and after the manual labeling process. The image is divided into 64 units,

each indexed based on its position in the image matrix. The indexed images are labeled as

"Background", "Gripper", "Fabric" or "Wrinkle," with each label shown in a different color for

visual inspection. The annotation process was done using MATLAB R2018a software.


Figure 3.2 - Frames before and after the hand-labeling process, with the colored grids in the second image

coded to represent the different classification categories: wrinkle (cyan), gripper (green), fabric (yellow) and

background (black)

3.2.2 Data Generation for Training Image Segmentation Models

To create a multi-view dataset for use in the next steps, an imaging module consisting of 3 IDS XS cameras was placed in front of the gripper setup as shown in Figure 3.3. Then, 206 images were taken while the gripper was moving, to capture different poses of the setup. This stage had to be

quick enough to not interfere with the company's manufacturing timeline. The whole process of

installing the cameras and capturing the images took less than two hours to complete which is a

reasonable and feasible downtime for many composite manufacturing processes.

Annotating the images was done using the Amazon SageMaker Ground Truth Labeling

tool.


Figure 3.3 - Arrangement of cameras in front of the gripper setup

At this step, labelers were asked to mark the exact boundaries of the gripper, fabric, and wrinkle classes in each image. Amazon's labeling tool facilitated this step by providing both polygon and brush tools for annotating images at the pixel level. SageMaker also offers the option to create a labeling task and ask a group of users to participate in the task or supervise the annotation process. The annotation was done by 5 of our colleagues in 2 hours, which is roughly equivalent to 10 hours of work by a single person. An example of the generated mask images is shown

in Figure 3.4.

The generated RGB masks were then converted to gray-scale masks to facilitate the later

conversion of annotations to required formats for training various neural networks (Figure 3.5).

A gray-scale image is an image that only has shades of gray, with each pixel value represented by a number between 0 (black) and 255 (white). Gray-scale images store all the visual data in one channel instead of three, which makes them easier to process than RGB images.


Figure 3.4 - A sample image of the dataset (left) and its annotation mask created in Amazon SageMaker

(right)

Figure 3.5 - Conversion of an RGB mask (left) to its gray-scale equivalent (right)

The dataset was then divided into a test set including 31 images (15% of the entire dataset),

and a train/validation set including 175 images. Data augmentation was used to increase the size of the training set. First, 5% of the images in the train/validation set were randomly put into the validation set and 95% into the training set. The training set, containing 166 images, was then augmented to 7,442 images through vertical flips, horizontal flips, and random cropping. The data augmentation

flow is shown in Figure 3.6.


Figure 3.6 - Data augmentation workflow chart
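A minimal sketch of these augmentations is given below, applying the same random flips and crop to an image and its mask so the labels stay aligned; the crop size and probabilities are illustrative, not the exact settings used to reach 7,442 images.

```python
# Sketch of the augmentation described above (vertical/horizontal flips and random cropping),
# applied identically to an image and its segmentation mask; sizes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
image = np.zeros((720, 1280, 3), dtype=np.uint8)   # stand-in training image
mask = np.zeros((720, 1280), dtype=np.uint8)       # matching gray-scale label mask

def augment(image, mask, crop_h=512, crop_w=512):
    if rng.random() < 0.5:                          # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                          # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    top = rng.integers(0, image.shape[0] - crop_h + 1)   # random crop position
    left = rng.integers(0, image.shape[1] - crop_w + 1)
    return (image[top:top + crop_h, left:left + crop_w],
            mask[top:top + crop_h, left:left + crop_w])

aug_image, aug_mask = augment(image, mask)
print(aug_image.shape, aug_mask.shape)
```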

3.2.3 Synthetic Dataset Generation

Building large-scale labeled datasets is a challenging task considering both collecting and

annotating stages. It is hard to guarantee the gathered images provide a large enough diversity of

scenes and conditions. Annotating the images in pixel-level is even more costly because each

worker has to spend a lot of time adjusting the created label for every single image [96]. Also,

sometimes only the experts in that field are able to find and annotate the target objects. Tumor

detection in medical CT-scans and anomaly detection in manufacturing processes are two

examples of such tasks.

Synthetic data generation is an appealing alternative to manual pixel-level annotation. Recent developments in computer graphics have made it possible to automatically create such synthetic images and semantic per-pixel labels using virtual 3D environments. Synthetic data generation for training DCNNs is therefore a tempting technique for avoiding manual annotation costs. However, the domain distribution shift and mismatch in appearance usually result in a considerable performance reduction when a model trained on synthetic data is tested on real data. Recently, some domain adaptation techniques have been developed to address this issue; however, there is still a lot of work to be done in this field.

To measure the performance drop while using synthetic data instead of real data, the

draping process was simulated in Blender 2.8. Blender is an open-source 3D creation suite supporting the 3D pipeline, including modeling and simulation, rigging, rendering, motion tracking and compositing, 2D animation, and video editing.

To simulate the draping process, the gripper, fabric, and wrinkles were modeled as solid

objects. Then the images taken from the real setup were used to texturize the model. Four virtual

cameras were then placed in front of the model to capture images from different angles. Also, three

spotlights were added to the scene based on the standard three-point lighting method. This technique uses three lights, a key light, a fill light, and a backlight (also called a rim light), which together create a well-illuminated image with controlled shading and shadows. An image taken from the actual DLR facility environment was also used as the simulation background. Blender allows HDR (high dynamic range) maps to be used as background images; however, capturing an HDR image of the actual setup requires a professional camera and photographer, which were not available at the time this research was conducted. Figure 3.7 shows a snapshot of the simulation process in Blender.

To increase data diversity, different fabric textures, wrinkle shapes, and lighting conditions

were used in 10 simulation runs. In each run, the gripper setup conducted the same movement and

four cameras captured the images from different angles. In total, 1384 images were generated.

Each run took about 10 minutes to complete. Some of these generated images are shown in Figure

3.8.


Figure 3.7 - Simulation of the draping process in Blender

Figure 3.8 - Samples of generated images in Blender


To annotate the images, it was only needed to mark the gripper, fabric, and wrinkles with

distinct colors in the 3D model and then run the simulation again. So, instead of hand-labeling

1384 images, only 10 cases had to be manually labeled. The generated masks were finally

converted to gray-scale images, the same as the real dataset, to ease the data preprocessing process.

For conversion to the gray-scale format, pixels were classified as gripper, fabric, or wrinkle if the highest of their RGB intensities was blue, green, or red, respectively. Figure 3.9 shows

a sample of a synthetic image and its RGB and gray-scale labels. Finally, the generated dataset

was distributed into training and test sets including 1315 and 69 images, respectively.

Figure 3.9 - Synthetic data annotation process: main image (top left); image with marked wrinkles (top

right); RGB mask (bottom left); gray-scale mask (bottom right)
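The sketch below illustrates this conversion rule with NumPy: each colored pixel is assigned to the class whose channel dominates, while black pixels remain background. The numeric class ids are assumptions used only for illustration.

```python
# Sketch of the RGB-to-class-map conversion described above: each colored pixel is assigned to
# wrinkle, fabric, or gripper depending on which of its R, G, B intensities is largest.
import numpy as np

rgb_mask = np.zeros((480, 640, 3), dtype=np.uint8)        # stand-in rendered RGB mask (H, W, RGB)

colored = rgb_mask.sum(axis=2) > 0                        # pixels that belong to any marked object
dominant = rgb_mask.argmax(axis=2)                        # 0 = red, 1 = green, 2 = blue channel

class_map = np.zeros(dominant.shape, dtype=np.uint8)      # 0 is kept for background
class_map[colored & (dominant == 0)] = 1                  # red dominant   -> wrinkle (id 1 assumed)
class_map[colored & (dominant == 1)] = 2                  # green dominant -> fabric  (id 2 assumed)
class_map[colored & (dominant == 2)] = 3                  # blue dominant  -> gripper (id 3 assumed)
```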


3.3 Computational Resources

For efficient training of deep convolutional neural networks, powerful GPUs are needed.

All model training in this project was done using two computers. The technical specifications

of these computers are described in Table 3.1.

Table 3.1 - Technical specifications of computational resources

Computer ID | CPU | GPU | RAM | Hard disk
EME 2211 | Intel Xeon® CPU E5520 @ 2.27 GHz × 16 | GeForce GTX 1070/PCIe/SSE2 | 24 GiB | 2 TB
ASC 310 | Intel Xeon® W-2133 CPU @ 3.60 GHz × 12 | TITAN Xp/PCIe/SSE2 | 64 GiB | 2 TB

For managing the training steps, Supervisely, a free and open-source platform focused on computer vision tasks, was used. Supervisely provides an end-to-end solution from image annotation to the deployment of neural network models by taking advantage of parallel computing techniques. After adding the above computers as Supervisely agents, it was possible to run data preprocessing and model training tasks in parallel. In addition, many of the state-of-the-art DCNN models for various image recognition tasks are already implemented and can be freely used on Supervisely. It is worth noting that finding a proper implementation of these DCNN architectures and adapting it to different cases is a time-consuming stage which can be eliminated by using Supervisely's pre-implemented models.


Chapter 4 : Deep Learning for Wrinkle and Fabric Boundary Detection

This chapter describes two different approaches taken in this thesis to develop a deep

learning-based vision system for fabric boundary and wrinkle detection during the draping process

of fiber-reinforced materials with an industrial robot.

This thesis is mainly focused on methods that use convolutional neural networks to avoid

dealing with the limitations of more traditional methods such as HOG and SIFT. The limitations

and complexities of using traditional image processing techniques are assessed in section 2.4

which emphasizes the importance of employing novel methods to address these issues.

Due to the lack of RGB-D and 3D datasets, the difficulty of RGB-D and 3D data generation, and the high cost of accurate depth sensors, the developed solutions do not depend on depth features

for detection and only need 2D RGB images. Furthermore, the currently available CNNs which

require depth data or 3D models as input are not robust enough yet and cannot be easily adapted

to perform different tasks such as wrinkle detection. Considering that the goal of this thesis is to develop a reliable, feasible, and practical solution, it was decided to focus on perceiving the 3D environment using only 2D images to arrive at a more accurate and robust solution for visual inspection of the draping process in composite manufacturing. It is noteworthy that Gupta et al. have

already developed a method in [1], [11] which uses RGB-D features to find the location of fabric

on a yoga ball and then identifies the boundary of wrinkles in a local reference frame. However,

adapting this method to the varying geometry of the modular gripper has its own challenges and

cannot be easily achieved.


Finally, it was decided to only focus on supervised learning methods in this thesis as

currently developed unsupervised/semi-supervised image segmentation methods cannot be easily

applied to new tasks and developing a totally new algorithm is challenging.

4.1 Stage I – Wrinkle Detection Using an Image Classification Model

At stage I, after developing a preliminary, hand-labeled dataset (described in section 3.2.1)

captured on a functioning robotic system used at DLR composite manufacturing facility, a well-

performing DCNN was designed and implemented from scratch to perform image classification.

Also, the idea of combining images from multiple cameras for generalization of the designed

model to different wrinkle properties and environments was evaluated. The proposed method

employs computer vision techniques and belief functions to enhance accuracy without the need

for any additional hand-labeling or re-training of the model. Co-temporal views of the same fabric

are extracted, and individual detection results obtained from the DCNN are fused using the

Dempster-Shafer theory. By the application of the DST rule of combination, the overall wrinkle

detection accuracy was greatly improved in this composite manufacturing facility.

The primary contribution of this stage is a method for combining multiple, co-temporal

views of fabric with defects into a single representation. This combined view can then be mapped

back to each of the original views, providing increased defect detection accuracy. To facilitate this

multi-view inferencing, a method of extracting components and detecting defects using traditional

computer vision techniques is presented. A technique for combining these views is also developed

and described.

The developed method currently requires minimal interaction from an operator for selecting the key points used in perspective correction. Future work will remove this need and produce a fully


automated solution. For practical applicability, automated detection is expected to be independent

of the fabric's shape, orientation, size and even visually dominant characteristics such as color.

Furthermore, detection performance should be robust to change in experimental conditions such

as lighting and location of the fabric with reference to the gripper. To accommodate such design

requirements, the proposed solution enhances a data-driven, automated wrinkle detection module,

developed independently of the hand-crafted features [11].

The work in this stage was executed in three phases:

1. Model development, training, and evaluation (Section 4.1.1)

2. Evaluation of the algorithm's ability to generalize (Section 4.1.2)

3. Development and evaluation of an approach for combining co-temporal views to

improve the algorithm's accuracy in various conditions (Section 4.1.3)

4.1.1 Phase 1 - Model Development

I. DCNN Architecture

The network developed was motivated by the current dominance of CNN image classifiers

and modeled on these recent advances.

The network used for training can be broadly classified into two major parts. The first part

consists of convolutional neural network layers along with batch normalization and max-pooling

which together act as a feature extractor. This part of the network extracts meaningful information

and presents it in a feature latent space to the rest of the network to map the extracted features to

a particular label. The second part of the network is made of fully connected layers that map the extracted latent features to a label.


The input mechanism and the dataset itself consist of highly sequential information from the camera. As such, the network tends to learn only the local features of each segment and gets driven toward local minima. To break the self-correlation in the input stream and the dataset, the training is executed on a mini dataset consisting of randomly sampled input-output pairs from the original dataset. The mini dataset is created in batches and sent to the network until the learning gradient starts to vanish.

The training network employs a data loading mechanism that inputs data from the mini

dataset and performs a random transformation before feeding it to the network. The random

transformation consists of random resizing, flipping, and cropping, followed by normalization with the dataset mean and standard deviation. These transformations help make the features learned by the model independent of the object's position, size, and orientation. The data loader then converts the input to a tensor for the network.
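As an illustration of this preprocessing step, a minimal torchvision sketch is given below; the crop size (assumed here to be 32 pixels) and the normalization statistics are assumptions rather than the exact values used in this work.

from torchvision import transforms

# Random resize/crop and flip, followed by mean/std normalization, applied on the fly
# by the data loader before each sample is fed to the network.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32),                   # random resize and crop (assumed 32x32 input)
    transforms.RandomHorizontalFlip(),                  # random flip
    transforms.ToTensor(),                              # convert the image to a tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # assumed normalization statistics
                         std=[0.229, 0.224, 0.225]),
])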

The input is fed through 14 units of the neural network, with each unit consisting of a convolution layer, batch normalization, and a ReLU (Rectified Linear Unit) activation. Max pooling is executed every 3 to 4 units and, finally, average pooling is applied at the end of the 14th unit. The output of the average pooling is fed through a fully connected layer that outputs the classification probability of each class. The Adam optimizer is used for training and optimization, with the learning rate decaying from 10^-4 to 10^-9 with respect to the number of epochs. Each image is fed through a series of convolutional layers with kernel size = 3, stride = 1, and padding = 1. Detection was conducted with a batch size of 32 and 4 CPU workers for data loading. The number of output features per unit (each consisting of a convolution layer, a normalizer, and an activation layer) increases from 32 to 128 before forming the input of the fully connected layer. To practically implement the developed architecture and use it for training purposes, the PyTorch machine learning library for Python was used.
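A minimal PyTorch sketch of a network following this description is shown below; the exact channel schedule between 32 and 128 features and the positions of the pooling layers are assumptions, since only the overall structure is described above.

import torch
import torch.nn as nn

def unit(in_ch, out_ch):
    # One unit: 3x3 convolution (stride 1, padding 1), batch normalization, ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class WrinkleClassifier(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Assumed channel schedule growing from 32 to 128 output features over 14 units.
        channels = [3, 32, 32, 32, 64, 64, 64, 64, 96, 96, 96, 128, 128, 128, 128]
        layers = []
        for i in range(14):
            layers.append(unit(channels[i], channels[i + 1]))
            if (i + 1) % 4 == 0:               # max pooling every few units (assumed every 4)
                layers.append(nn.MaxPool2d(2))
        layers.append(nn.AdaptiveAvgPool2d(1))  # average pooling after the 14th unit
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(channels[-1], num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)

# Adam optimizer; the learning rate is scheduled to decay from 1e-4 toward 1e-9 over the epochs.
model = WrinkleClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)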

II. Training

The network was trained on three datasets of different sizes to evaluate the effect of training set size on detection accuracy. The small, medium, and large training sets contained

750, 1500 and 3000 images, respectively, all sampled from the custom-created dataset which was

introduced in section 3.2.1. Then each dataset was split into training and testing sets, including

80% and 20% of the entire dataset, respectively.

4.1.2 Phase 2 - Model Generalization

To evaluate the generalization ability of the developed solution, the DCNN was applied to

a new fabric shape with different wrinkle properties. The same gripper was used to grasp the fabric,

and all labels exhibited properties similar to the images in the first dataset. The new images were

hand-labeled as explained in Section 3.2.1. The new dataset was used strictly for evaluation

purposes and the neural network was not re-trained with these new labels. Figure 4.1 shows

samples from old and new datasets, indicating differences in wrinkle shape and lighting conditions

between two datasets.


Figure 4.1 - Examples of wrinkles in the images used for training (left); wrinkles in the new dataset (right)

4.1.3 Phase 3 - Multi-view Inferencing

The observed loss of detection accuracy while testing in new situations motivated Phase 3,

wherein multiple views of the scene are combined to create a robust algorithm. The dataset created

in phase 2, consisting of co-temporal views of the same fabric, was re-used. The pre-trained neural

network model from Phase 1 was used to label each view of the scene separately. To combine

multiple views, an algorithm using traditional computer vision techniques was developed to extract

and overlap fabrics from multiple images corresponding to different viewpoints on a stationary

scene. The Dempster-Shafer combination rule ([97], [98]) was then applied to find the most

probable label of each pixel in the anchor image based on individual views probabilities. Finally,

the combined labels were mapped back to each original view.


I. Fabric Detection

Fabric detection is facilitated by the observation that the gripper forms a convenient frame, reliably separating the fabric from the background. Fabric detection, therefore, begins with locating the gripper and morphologically closing its mask. The largest hole in the gripper mask then indicates the presence of fabric.

Values such as intensity threshold and morphological structuring elements were selected manually

through trial-and-error. Such a method presents problems in finding values that generalize to

various images. The values which produced good results for the largest number of images in the

sample set are presented here. The same values are used for all images and all views.

The contour of the gripper was found by combining three methods:

1. Identification of white gripper squares

2. Identification of green gripper squares

3. Edge detection to supplement boundary detection

Item (1) is accomplished by thresholding in the CIE L*a*b* color space. By accepting low

color values, only white/gray pixels are retained. High values of lightness are retained to separate

white pixels from gray. Good results were achieved by rejecting all pixels where either color

channel is greater than 140 and rejecting lightness values below 180.

Item (2) is accomplished by thresholding in the HSV color space. Since the goal is to retain

only green colors, hue values outside of the range 60 - 100 are rejected, and very low saturation

(less than 80) or very high saturation (greater than 200) are also rejected.

Item (3) is accomplished by performing denoising using the non-local means method [99] with

an aggressive h value of 10. Canny edge detection [100] is then employed to identify significant

edges in the image.
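A minimal OpenCV sketch of the three masks described in items (1)-(3) is given below; the channel scaling (e.g., OpenCV's 0-179 hue range versus degrees) and the Canny thresholds are assumptions, while the numeric limits follow the values reported above.

import cv2
import numpy as np

def gripper_mask(bgr):
    # (1) White squares: threshold in CIE L*a*b*; keep low-chroma, high-lightness pixels.
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    L, a, b = cv2.split(lab)
    white = ((a <= 140) & (b <= 140) & (L >= 180)).astype(np.uint8) * 255

    # (2) Green squares: threshold in HSV on hue and saturation ranges.
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, (60, 80, 0), (100, 200, 255))

    # (3) Significant edges: non-local means denoising (h = 10) followed by Canny edge detection.
    den = cv2.fastNlMeansDenoisingColored(bgr, None, 10, 10)
    edges = cv2.Canny(cv2.cvtColor(den, cv2.COLOR_BGR2GRAY), 50, 150)  # thresholds assumed

    # Combine: a pixel retained in any mask is retained overall.
    return cv2.bitwise_or(cv2.bitwise_or(white, green), edges)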


These three masks are combined, where any pixel which is retained in any mask is overall

retained. The gripper squares are joined through an alternating series of morphological closing

with a disk structuring element of radius 3, and morphological dilation with a disk of radius 1. A

general rule for the number of iterations is difficult to define; for the sample dataset, 35 iterations were performed on all images.

The gripper mask is inverted such that the fabric is kept, and the gripper rejected. The main

component is then separated from the other components and retained. All connected components

are identified, and components not meeting specific requirements are discarded. These

requirements are:

• An area of at least 150,000 pixels

• A maximum distance between the image center and the nearest non-zero pixel in the

component

• A density less than 0.4 (number of white pixels in the bounding box, normalized by

bounding box area)

• A height span or width span greater than 0.7, where the span is the dimension of the

bounding box divided by the dimension of the image.

This filtration removes excessively large or small components. The components are then

ranked by score, and the highest-scoring component is deemed to be the fabric. The score combines

closeness to the center of the image and area. If any component overlaps the center of the image,

it is given maximum score so that it is always retained.
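The following sketch illustrates how such a filtering and scoring step could be implemented with cv2.connectedComponentsWithStats; the distance threshold and the exact way area and centre-closeness are combined into a score are hypothetical, as they are not specified numerically above.

import cv2
import numpy as np

MAX_CENTER_DIST = 400  # hypothetical threshold for the centre-distance requirement

def select_fabric_component(mask):
    h, w = mask.shape
    cx, cy = w / 2.0, h / 2.0
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask.astype(np.uint8))
    best_label, best_score = None, -1.0
    for i in range(1, n):  # label 0 is the background
        x, y, bw, bh, area = stats[i]
        density = area / float(bw * bh)
        span = max(bw / float(w), bh / float(h))
        # Requirements from the list above: minimum area, low density, large span.
        if area < 150_000 or density >= 0.4 or span <= 0.7:
            continue
        # Distance from the image centre to the nearest non-zero pixel of the component.
        ys, xs = np.nonzero(labels == i)
        dist = float(np.min(np.hypot(xs - cx, ys - cy)))
        if dist > MAX_CENTER_DIST:
            continue
        if labels[int(cy), int(cx)] == i:      # component overlaps the image centre
            score = float("inf")
        else:
            score = area / (1.0 + dist)        # assumed combination of area and centre-closeness
        if score > best_score:
            best_label, best_score = i, score
    if best_label is None:
        return None
    return (labels == best_label).astype(np.uint8) * 255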

The morphological closing to connect the gripper squares is then reversed: multiple

iterations of morphological opening with a disk of radius 3 and erosion with a disk of radius 1 are

performed. One fewer iteration than the number of closings is performed, which gives a nice edge


to the fabric. Finally, any holes in the mask are filled by tracing the contour and then flood filling

at the centroid.

II. Centering and Rotating the Image

The orientation and position of the gripper in the image relative to the camera changes

across images, so the fabric mask is of non-uniform position and non-uniform rotation in the final

images. To normalize these differences, the centroid of the mask is found and translated to the

center of the image. The image is then rotated in the range -90 to 90 degrees, and the best

orientation is used. An orientation is preferred if it maximizes the non-zero pixels in the center

column of the image, plus the maximum number of non-zero pixels in any row of the image. The

image is cropped to the rectangle with minimum bounds which contains non-zero pixels.

III. Correlating Multiple Views

Perspective correction is performed to normalize, as much as possible, non-uniform fabric

orientation between images. Since the end result is highly susceptible to small changes in the perspective-correction key points, human intervention is required to make their identification as accurate as possible. The user is first presented with all camera views available for a given period (shown in

Figure 4.2) and must select the one with the least skewing and curvature. Control points are

collected by having the user select four points on each image (top left of the fabric, top right of the

fabric, bottom left of the fabric, and bottom right of the fabric). These points are then used to create

a perspective warp transform for each image.
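A minimal OpenCV sketch of building the perspective warp from the four operator-selected corner points is shown below; mapping the corners to a canonical rectangle is a simplification, since in practice the destination points could equally be taken from the selected reference view.

import cv2
import numpy as np

def warp_to_reference(image, corners, out_size=(1024, 1024)):
    # corners: operator-selected fabric corners in the order
    # [top-left, top-right, bottom-left, bottom-right].
    src = np.float32(corners)
    w, h = out_size
    dst = np.float32([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]])
    M = cv2.getPerspectiveTransform(src, dst)   # 3x3 perspective warp transform
    return cv2.warpPerspective(image, M, (w, h))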


Figure 4.2 - Five co-temporal masks of the same fabric

Figure 4.3 - Five overlaid, co-temporal fabric masks before correction (left) and after correction (right)

Some views are unsuitable for inclusion in processing; sometimes the fabric is too small, too skewed, or has other unacceptable features that prevent inclusion. In such cases, the image is blacklisted and not included in the results.

Fabric size is normalized by resizing each image to the maximum dimension (height and

width) in all images. The aspect ratio is preserved in this resizing, and each image is then zero-

padded as required to make an absolute dimension match for all images. Padding is done evenly

from all sides to keep the fabric centered.

Every acquired view consists of a tensor of probabilities mapping each pixel to a certain class. Once multiple views are overlapped, each pixel is associated with multiple class probability vectors $\vec{m}_{ij}$ coming from different views:

$\vec{m}_{ij} = \begin{bmatrix} m_{ij}(\mathrm{Gripper}) \\ m_{ij}(\mathrm{Fabric}) \\ m_{ij}(\mathrm{Wrinkle}) \\ m_{ij}(\mathrm{Background}) \end{bmatrix}$    (1)

where $i$ is the camera id, $i \in \{1, 2, \dots, 5\}$, and $j$ is the pixel location in the image, $j \in (2592 \times 1944)$. These probabilities are then combined from the different cameras using the Dempster-Shafer rule of combination:

$\vec{m}_{j} = \vec{m}_{1j} \oplus \vec{m}_{2j} \oplus \dots \oplus \vec{m}_{5j}$    (2)

wherein $\oplus$ is the orthogonal sum (also known as the Dempster rule of combination), which combines belief constraints coming from independent belief sources. This produces a single tensor of classification probabilities, $\vec{m}_{j}$, which is mapped back to each source view by reversing the processing steps described above. Each source image is then relabeled based on these new probabilities.
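As a rough illustration of this combination step, the following sketch applies Dempster's rule to per-pixel mass vectors, assuming that mass is assigned only to the four singleton classes (no composite hypotheses); this is a simplification of the general Dempster-Shafer framework.

import numpy as np

def dempster_combine(m1, m2):
    # m1, m2: arrays of shape (4,) holding masses for the four singleton classes.
    joint = np.outer(m1, m2)                  # pairwise products of the two mass assignments
    agreement = np.diag(joint)                # products where both sources pick the same class
    conflict = joint.sum() - agreement.sum()  # mass assigned to conflicting (empty) intersections
    return agreement / (1.0 - conflict)       # normalize out the conflict (Dempster's rule)

def fuse_views(masses):
    # masses: list of per-view mass vectors for one pixel, e.g. five camera views.
    fused = masses[0]
    for m in masses[1:]:
        fused = dempster_combine(fused, m)
    return fused

# Example: five views voting on one pixel (Gripper, Fabric, Wrinkle, Background).
views = [np.array([0.1, 0.2, 0.6, 0.1])] * 5
print(fuse_views(views))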

4.2 Stage II – Wrinkle and Boundary Detection Using Image Segmentation Models

The introduction of well-performing baseline methods like FCN [72], YOLO [46], R-CNN

[22], and Fast/Faster R-CNN [38], [39] has resulted in a huge improvement in the performance of

object detection and semantic segmentation tasks performed by computers. It encourages the

employment of semantic or instance segmentation methods to label images at the pixel level.

Further actions can be taken in the future to associate the obtained visual information to relative

suction units on the gripper surface while considering the spatial information.


In Stage II, four state-of-the-art deep convolutional neural networks for image segmentation are trained and tested on the custom-captured dataset. The IoU (Intersection over Union) scores achieved by the DCNN models are compared against each other, and the best-performing model is selected for use in the next steps. The obtained results show how using a DCNN model and transfer learning can lead to acceptable results while training on a small and inaccurately annotated dataset. Next, the

effect of human annotation quality on the performance of DCNN models has been evaluated by

comparing two different annotations and re-training the models on the union and intersection of

these datasets. The wrinkle detection performance is assessed by calculating the precision, recall,

and F1 metrics while considering the intersection of two annotated datasets as ground truth. Then,

an approach for detecting wrinkles at the early stages of formation is developed and assessed.

Finally, the limitations of using synthetic data for training the models are evaluated. The presented

method can be readily adopted to train DCNN models using other datasets and perform visual

inspection tasks in different manufacturing processes.

4.2.1 Training Image Segmentation Models

As 6688 images were not enough to start training a model from scratch, transfer learning

was used to train the models. In the first stage of training, four state-of-the-art architectures known

for instance segmentation (Mask-RCNN) or semantic segmentation (DeepLab V3+, U-Net, IC-

Net) were trained on the custom-created dataset. Table 4.1 shows the training parameters used for

each architecture. All the training was done on ASC 310 computer (see section 3.3) which has a

single NVIDIA TITAN Xp GPU. One common issue happening while training on an augmented

11 Intersection over Union

71

dataset is that model overfits to the training data and cannot perform well on test set. To ensure

that the trained models are not overfitted, the training and validation loss values were assessed at

the end of each training run to confirm these two values are converging.

Table 4.1 - Training parameters for different models

Models      | Learning rate | Training epochs | Validations/epoch | Batch size | Input size
DeepLab V3+ | 0.0001        | 5               | 2                 | 1          | 512×512
U-Net       | 0.001         | 5               | 2                 | 1          | 256×256
IC-Net      | 0.0001        | 5               | 2                 | 1          | 2048×1024
Mask R-CNN  | 0.001         | 5               | 2                 | 1          | 256×256

For training DeepLab V3+, the Xception65 backbone was used; weight decay, atrous rates, and output stride were set to 0.00004, (8, 12, 18), and 16, respectively. For training U-Net and IC-Net, momentum, patience, and learning divisor were set to 0.9, 1000, and 5, respectively. Other training parameters were set to the default values suggested by the original papers' authors.

After training the four models, the test dataset was used for evaluating them, and the Intersection over Union (IoU) metric was used for measuring the segmentation performance. In general, IoU can be used to evaluate any algorithm that predicts the location of objects by providing bounding boxes (or segmentation masks) as output. To employ IoU for the evaluation of a prediction, only two inputs are needed: the ground-truth and the predicted bounding boxes or masks produced by the model (Figure 4.4).


Figure 4.4 - Calculation of IoU metric for evaluation of object localization accuracy

To calculate the IoU, the area of overlap between the two regions is divided by the area of their union. The IoU value varies between 0 and 1, indicating completely wrong and completely correct predictions, respectively. The IoU score is usually calculated for every class independently, and the average over all classes is then taken to provide a mean IoU score for the whole semantic segmentation process.
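A minimal numpy sketch of the per-class and mean IoU computation on segmentation label maps is given below; the class ids used in the usage note are illustrative.

import numpy as np

def class_iou(pred, gt, class_id):
    # Binary masks for one class in the prediction and the ground truth.
    p = (pred == class_id)
    g = (gt == class_id)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else float("nan")

def mean_iou(pred, gt, class_ids):
    # Average the per-class IoU scores, ignoring classes absent from both masks.
    return np.nanmean([class_iou(pred, gt, c) for c in class_ids])

# Example usage (assumed class ids): 1 = gripper, 2 = fabric, 3 = wrinkle.
# score = mean_iou(pred_mask, gt_mask, [1, 2, 3])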

4.2.2 Assessing Human Annotation Quality

Human labeling of the images is itself a difficult and time-consuming task. The labeling tools used were not particularly accurate, and there was also ambiguity around what exactly constitutes a wrinkle, so the annotations were not precise. This lack of precision was tested by running a second hand-labeling of the same dataset and calculating the IoU score between the two human-annotated datasets.


4.2.3 Evaluation of Wrinkle Detection Performance

The results achieved in section 4.2.1 show that the developed algorithm can perform very well on the gripper and fabric detection tasks. However, it is still difficult to reach high IoU scores for wrinkle detection due to the unclear definition of a wrinkled area. It is noteworthy that although

the IoU scores are relatively low, the visual inspection of the inferencing results shows that the

model can detect most of the wrinkles by labeling the wrinkle core; this is the main target in

performing such a task. The major cause of the low IoU score is the ambiguity in defining a wrinkle

boundary which makes it hard to provide an accurate pixel-level annotation for both humans and

machines.

During the manufacturing process, it is only needed to find the corresponding suction units

located behind each wrinkle area and adjust their suction intensity in a way to remove the wrinkles

while preventing the formation of new wrinkles in the neighboring regions. This simply requires

a region-level annotation instead of an accurate pixel-level annotation, which suggests that the achieved performance is sufficient.

To provide a better assessment of wrinkle detection performance, the intersection of two

human annotations was found and considered as the core of the wrinkles that are the main target

to be detected. DeepLab V3+ was selected because of its strong performance in section 4.2.1, and its test results were used to assess the wrinkle detection quality. For every image in the test

set, connected components were found in both ground truth image and the prediction provided by

DeepLab V3+; each component was considered as a single wrinkle. A single wrinkle in the

prediction was deemed a true positive if there was a 70% overlap with one or more wrinkles in the

ground truth. The overlapping threshold was found experimentally through a grid search from 5%

to 100% with a step of 5%.


Wrinkles in the ground truth having insufficient overlap with the wrinkles in the prediction

were tagged as false-negative predictions. Wrinkles in the prediction which did not sufficiently overlap the ground truth were considered false-positive detections. Precision, recall, and F1

scores were calculated for each image separately. The average of each metric was also calculated

for all images to evaluate the total wrinkle detection performance.

Visual investigation of the false-positive decisions showed that a common mistake made by the DCNN model in this task was labeling small areas as wrinkles that, given their size, could not actually be wrinkles. To mitigate this, the average wrinkle size in the training set was calculated, and any prediction made by the DCNN model that was smaller than 20% of

the average wrinkle size was ignored. The precision, recall, and F1 scores were calculated again

after filtering the small components.
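The following sketch illustrates this component-level evaluation; whether the overlap ratio is measured against the predicted or the ground-truth component area is an interpretation, and the minimum-area filter corresponds to the 20%-of-average-wrinkle-size rule described above.

import cv2
import numpy as np

def wrinkle_scores(pred_mask, gt_mask, overlap_thr=0.7, min_area=0):
    def components(mask):
        # Each connected component is treated as one wrinkle.
        n, labels = cv2.connectedComponents(mask.astype(np.uint8))
        return [(labels == i) for i in range(1, n)]

    preds = [c for c in components(pred_mask) if c.sum() >= min_area]  # small-component filter
    gts = components(gt_mask)

    # A predicted wrinkle is a true positive if it overlaps a ground-truth wrinkle enough.
    tp = sum(1 for p in preds
             if any(np.logical_and(p, g).sum() / p.sum() >= overlap_thr for g in gts))
    fp = len(preds) - tp
    # A ground-truth wrinkle with insufficient overlap is a false negative.
    fn = sum(1 for g in gts
             if not any(np.logical_and(p, g).sum() / g.sum() >= overlap_thr for p in preds))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = (len(gts) - fn) / len(gts) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# min_area could be set to 0.2 * (average wrinkle area in the training set) to reproduce
# the filtering of small components described above.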

4.2.4 An Approach for Early Detection of Wrinkles

The results obtained in section 4.2.2 show ambiguity around what a wrinkle is, as the two human annotations have many differences. In section 4.2.3, the intersection of these two annotation sets was considered as the core of the wrinkle, which includes the pixels where both human annotations agree on the existence of a wrinkle; this is called the wrinkle class. The union

of these two annotations minus their intersection represents the areas where one of the operators

is reporting the existence of a wrinkle while the other one does not agree; this is called the maybe

wrinkle class. These regions may either be a result of the operator annotation error or indicate a

wrinkle at its early formation stage. Early detection of probable defects in composite manufacturing processes is of significant importance for in-situ monitoring and can help reduce both production time and waste by preventing defects from occurring.
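A minimal numpy sketch of deriving the three class masks described above from two binary human annotations is given below; the function and variable names are illustrative.

import numpy as np

def split_wrinkle_classes(ann_a, ann_b):
    ann_a = ann_a.astype(bool)
    ann_b = ann_b.astype(bool)
    wrinkle = np.logical_and(ann_a, ann_b)             # intersection: both annotators agree
    maybe_wrinkle = np.logical_xor(ann_a, ann_b)       # union minus intersection: only one annotator
    wrinkle_plus_maybe = np.logical_or(ann_a, ann_b)   # union of the two annotations
    return wrinkle, maybe_wrinkle, wrinkle_plus_maybe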


To further evaluate the performance of the model on the detection of “wrinkle” and “maybe

wrinkle” classes, three new pairs of training and test sets were created. Each new dataset included

only a single class: “wrinkle”, “maybe wrinkle”, or “wrinkle + maybe wrinkle”. DeepLab V3+

was trained on each of these three datasets separately and then tested on all three test sets

individually.

4.2.5 Training on Synthetic Data

To check the feasibility of using synthetic data for training the neural networks, all four models were trained on the synthetic dataset generated in section 3.2.3. The trained models were then tested on the real test set, which had also been used for testing the models trained on the real data. The obtained results (see section 5.2) show a significant decrease in the models' performance, as expected. The trained models were also tested on a test set taken from the synthetic dataset to make sure the models had learned the features correctly.


Chapter 5 : Experimental Results

5.1 Stage I - Wrinkle Detection Using an Image Classification Model

I. Phase 1

In phase 1, a CNN model was designed and trained to classify input images as wrinkle,

fabric, gripper, or background. The training and test datasets were generated by acquiring images

during the draping process. Each image was then divided into smaller regions to facilitate the efficient hand-labeling of data. Next, supervised learning was employed on the generated sub-

images to train an image classifier. Three training datasets having different sizes were evaluated

and it was observed that past a point, increased size offered diminishing returns. The obtained

results in this phase are presented in Table 5.1. The accuracy metric was used for evaluating the

classification performance at this stage. The accuracy score for each class is easily calculated by

dividing the number of correct predictions of that class by the total number of instances from that

class existing in the dataset.

Table 5.1 - Accuracy of detection on the initial dataset

Accuracy of detection (%)
Label      | Small | Medium | Large
Wrinkle    | 92.5  | 95.5   | 95.5
Fabric     | 94.5  | 98.5   | 99.5
Background | 95.5  | 97.5   | 98.5
Gripper    | 94.5  | 97.5   | 97.5

The best-trained neural network model achieved 95.5% accuracy in detecting wrinkles.

Accuracy for other classes was even higher. The slightly lower level of accuracy in wrinkle


detection can be explained by the fact that, at a very small scale (32 × 32), it is hard to distinguish

between a fabric and a wrinkle.

The results show diminishing returns between the medium and large training sets. This

means increasing the size of the dataset will not enhance the performance beyond a point.

II. Phase 2

In phase 2, a complementary dataset was generated with altered lighting conditions, different wrinkle shapes and sizes, different fabric shapes, and significantly varying distance and location of the camera with respect to the scene. The pre-trained network from phase 1 was

evaluated on this dataset to test the ability of the model to generalize. Results are presented in

Table 5.2. Notably, the classification accuracy of wrinkles decreased to 66.6%, which is

considerably less than the 95.5% accuracy obtained in phase 1.

Table 5.2 - Accuracy of detection against the new dataset

Label      | Accuracy of detection (%)
Wrinkle    | 66.6
Fabric     | 12.3
Background | 85.5
Gripper    | 37.6

The significant difference in the two results indicates that the model had some difficulty in

generalizing to the new dataset. Gripper and fabric detection rates were down significantly because

of the unbalanced quantity of classes in the new dataset plus the ambiguities of labeling images

having portions of different classes. It is noteworthy that the model was trained to label an image as wrinkle even if wrinkles covered only 30% of the image area. This bias toward the wrinkle class led to more accurate wrinkle detection but sacrificed the detection accuracy of the other classes. The


primary goal of this phase was to achieve an acceptable wrinkle detection accuracy and the low

scores for gripper and fabric detection were not explored much.

Visual inspection of the failed wrinkle detection cases showed that the failures were largely due to occlusion of wrinkles combined with poor lighting and shadow effects. These

occlusion and lighting issues changed for different views of the multi-view setup, which led to

certain wrinkles being highly visible, and thus detectable, in one view while missed in another.

III. Phase 3

In phase 3, an approach was devised to increase accuracy when generalizing the model

through the correlation of multiple co-temporal views of the same fabric. View correlation consists

of detecting the fabric using traditional computer vision techniques and overlaying those fabrics

as closely as possible to allow inference across multiple views. Once the fabrics are overlaid, the

Dempster-Shafer theory of evidence [97], [98] was applied to combine votes on each pixel from

each view.

The results of this process are shown in Table 5.3. Detection accuracy was improved

considerably when compared with phase 2, achieving 85.9% success in wrinkle detection. This

was accomplished without the need for any re-training or fine-tuning of the pre-trained model.

Since wrinkle detection accuracy is the main goal, gripper and background labels were not

preserved in phase 3 and are combined in the Other label. As in Phase 2, fabric detection accuracy

is very low. However, the problems with fabric and gripper detection can be solved by employing

alternative methods developed in the other stage of this project (see section 5.2).


Table 5.3 - Accuracy of detection after multi-view inferencing

Label   | Accuracy of detection (%)
Wrinkle | 85.9
Fabric  | 8.6
Other   | 89.7

5.2 Stage II - Wrinkle and Boundary Detection Using Image Segmentation Models

In Stage II, four well-performing DCNN models (Mask-RCNN, U-Net, DeepLab V3+, and

IC-Net) were applied to the custom-created dataset. The best-performing model was found to be DeepLab V3+, which achieved acceptable results for gripper and fabric detection. The lower performance on the wrinkle detection task was explained by comparing two human annotations of the same dataset and showing that humans achieve an IoU of only 0.41; this discrepancy is mainly due to the geometrical nature of the defects, as wrinkles do not have a clearly defined boundary. The model was

then evaluated using a binary predictor based on the Jaccard index between components rather

than pixels; the model achieves a recall rate of 0.71 and a precision score of 0.76. Two

complementary approaches were also introduced for the detection of wrinkles at the early stages

of formation as well as the completely formed wrinkles. Finally, the limitations of using synthetic

data for training the models were evaluated.

I. Training Image Segmentation Models

Four well-performing algorithms (Mask-RCNN, U-Net, DeepLab V3+, and IC-Net) were trained and tested on the custom-captured dataset, and the intersection over union (IoU) metric was used to evaluate the segmentation performance. Table 5.4 shows the IoU values for different classes

gained by each model.


Table 5.4 - Inferencing results for image segmentation models

Intersection over union
Models      | Gripper | Fabric | Wrinkle
DeepLab V3+ | 0.9237  | 0.8564 | 0.4037
U-Net       | 0.8928  | 0.8549 | 0.2394
IC-Net      | 0.8869  | 0.8373 | 0.3263
Mask-RCNN   | 0.7518  | 0.7704 | 0.3527

The models trained at this phase produce acceptable IoU scores of up to 0.92 and 0.86 for gripper and fabric, respectively. Wrinkle detection achieves a significantly lower IoU score of 0.40 in the best case. The reasons behind this lower score are explained in the next section.

Table 5.5 shows the segmentation masks generated by the trained models for 5 of the images

in the test dataset. Gripper, fabric, and wrinkle labels are shown with blue, orange, and green

masks, respectively.

The results show that DeepLab V3+ works best overall, and particularly in detecting

wrinkles, which is the most challenging class to detect in this task. All the networks, except Mask-

RCNN, perform very well when segmenting gripper and fabric. Mask-RCNN has difficulties

identifying sharp edges and boundaries between classes, which led to poor performance in gripper

and fabric segmentation. In principle, instance segmentation only works for "things" (countable objects with clearly defined shapes and boundaries) and does not work on "stuff" (amorphous regions). In the targeted task, the wrinkles behave more like stuff, while the gripper and fabric can be considered things. Given that the task is not to identify one wrinkle from the others, Mask-

RCNN cannot be a suitable choice. Finally, DeepLab V3+ was selected to be used in the next

sections due to its good performance, especially in wrinkle detection.


Table 5.5 - Performance of trained models on the test dataset

[Qualitative comparison for five test images (rows 1-5); columns: Image, Ground Truth, DeepLab V3+, U-Net, IC-Net, Mask R-CNN.]


II. Assessing Human Annotation Quality

Table 5.6 shows the IoU scores between two human-annotated datasets.

Table 5.6 - IoU between two human-annotated datasets

Intersection over union
      | Gripper | Fabric | Wrinkle
Human | 0.9296  | 0.8553 | 0.4097

Human annotation performs relatively poorly, even when compared with the automated

models. The difficulties in using the keyboard and mouse for annotation certainly led to some

inaccuracies between the two datasets. Furthermore, there is also ambiguity around what exactly

is a wrinkle. This matter is visible in Figure 5.1 where not only are the two larger wrinkles different

lengths, but there is also a third smaller wrinkle in one image which is not present in the other at

all. It is notable that the way people define, locate, and label objects in an image can significantly

vary from one person to another.

Figure 5.1 - A sample image of the dataset (top image) and its two annotation masks


III. Evaluation of Wrinkle Detection Performance

The average precision, average recall, and average F1 score values for wrinkle detection

are provided in Table 5.7.

Table 5.7 - Wrinkle detection scores for DeepLab V3+

DeepLab V3+                      | Average precision | Average recall | Average F1 score
Before removing small components | 0.6741            | 0.7145         | 0.6714
After removing small components  | 0.7649            | 0.7145         | 0.7218

Overall, the best average precision score shows that about 76% of the model predictions

have been correct. The average recall score indicates that the model has been able to detect more

than 71% of the present wrinkles.

It can be seen that filtering the small components has significantly increased the average

precision score from 0.6741 to 0.7649 by reducing the number of false-positive predictions. It is

important to keep the number of false-positive predictions as low as possible to prevent

unnecessary pauses in the manufacturing flow.

The effect of the overlapping threshold on the average precision, average recall, and F1

scores is shown in Figure 5.2.


Figure 5.2 - Effect of overlapping threshold on wrinkle detection scores

Lower overlapping threshold values (between 0 and 0.4) provide better scores by ignoring some of the false-positive and false-negative detections and increasing the number of true-positive detections, even when the network has not properly detected a wrinkle. Higher threshold values (0.8 to 1) decrease the scores by requiring highly accurate predictions. Evaluating the above chart shows that thresholds up to 70% can be selected as acceptable overlapping threshold values, as the scores start to decrease rapidly beyond that point.

IV. An Approach for Early Detection of Wrinkles

DeepLab V3+ was trained on three single class training sets and then tested on all three

single class test sets. The inferencing results are presented in Table 5.8.

Table 5.8 - Inferencing results for DeepLab V3+ trained on single-class datasets (IoU scores)

Test set \ Training set | Wrinkle + Maybe wrinkle | Wrinkle | Maybe wrinkle
Wrinkle + Maybe wrinkle | 0.5149                  | 0.4037  | 0.2308
Wrinkle                 | 0.3326                  | 0.4609  | 0.1034
Maybe wrinkle           | 0.1399                  | 0.0970  | 0.1299



Training the model on the union of two human annotation sets (“wrinkle + maybe wrinkle

class”) provides better results on the tasks where the model is looking for both wrinkle and maybe

wrinkle class instances; this includes detection of wrinkles at their early stages of formation

(“maybe wrinkle” class).

In applications where redundant pauses in the manufacturing flow are costly and the model

is only looking for explicit wrinkles, it appears better to train the model on the intersection of two

human annotation sets (“wrinkle” class).

The model trained only on the "maybe wrinkle" class performs poorly in all cases, as this dataset includes considerable noise and mislabeled regions caused by operator error, which makes it hard for the DCNN model to extract meaningful features during training.

V. Training on Synthetic Data

Table 5.9 shows the IoU scores of four models trained on the synthetic training dataset and

tested on the synthetic test dataset. It can be seen that DeepLab V3+ performs very well and

achieves 0.98, 0.94, and 0.51 IoU scores for the segmentation of gripper, fabric, and wrinkle,

respectively.

The lower score for wrinkle detection can be explained by the complexities of defining the

wrinkle boundaries. U-Net and Mask-RCNN perform worse than DeepLab V3+ but still provide acceptable scores. A probable reason for the failure of IC-Net in this task is that it is an architecture designed to perform image segmentation in real time and tries to use its pre-learned feature space to find the target objects as quickly as possible. As a result, it is not an appropriate architecture for learning new feature spaces with drastic distribution changes.


Table 5.10 shows the IoU scores obtained by inferencing the models on the real test set that was also used in the earlier parts of this stage. As expected, the scores decreased sharply because of the domain distribution shift and the mismatch in appearance. This issue can be tackled by

creating larger synthetic datasets with richer features and using domain adaptation techniques to

reduce the distribution shift between synthetic and real feature spaces.

Table 5.9 - Inferencing results on the synthetic test dataset

Intersection over union
Models      | Gripper | Fabric | Wrinkle
DeepLab V3+ | 0.9822  | 0.9392 | 0.5082
U-Net       | 0.8955  | 0.5743 | 0.4002
IC-Net      | 0.3234  | 0.5668 | 0.1232
Mask-RCNN   | 0.8549  | 0.8395 | 0.3669

On the real test set (Table 5.10), Mask R-CNN performs better than the other three models on the segmentation of the gripper and fabric classes. This can be explained by Mask R-CNN's well-known generalization ability and easy adaptation to various datasets, which give it a better understanding of object shapes and boundaries. U-Net achieves a score of 0.1649 for wrinkle detection, which is still very low but more than twice the score of the other models. U-Net is an architecture particularly designed for medical imaging purposes and is known for its strong performance on a variety of anomaly detection tasks, such as detecting tumors or kidney stones in medical images like CT scans. Hence, it is not highly dependent on color features and performs well in cases where geometrical features are more dominant in distinguishing classes, such as the wrinkle detection task.


Table 5.10 - Inferencing results on the real test dataset

Intersection over union
Models      | Gripper | Fabric | Wrinkle
DeepLab V3+ | 0.1327  | 0.2800 | 0.0706
U-Net       | 0.3061  | 0.4939 | 0.1649
IC-Net      | 0.0     | 0.0598 | 0.0268
Mask-RCNN   | 0.3477  | 0.6487 | 0.0694


Chapter 6 : Conclusions

6.1 Summary

The main contribution of this thesis is the development of new solutions for quality control

of composite manufacturing processes, specifically, the visual inspection of the draping process of

fiber-reinforced cut-pieces for manufacturing the rear pressure bulkhead of aircraft in the aviation

industry.

The thesis starts by reviewing visual inspection methods that can be employed to tackle the

issues happening during the draping process. It can be concluded that the conventional

programming techniques to analyze spatio-temporal data from multiple sensory inputs usually

involve hand-crafting certain features of the setup. These techniques are usually limited to specific applications and need a great deal of tuning to adapt to new tasks. On the other hand, using machine learning algorithms overcomes such limitations by allowing machine-based self-interpretation of the process. It was therefore decided to apply convolutional neural networks and deep learning techniques to 2D images to perform object recognition tasks.

To generate the required datasets for training the DCNN models, a multi-camera setup was installed in front of the modular gripper robot, which is the main component performing the

draping process. Then, two datasets were generated, annotated, and augmented for training and

testing purposes as described in Chapter 3.

In the first stage of the project, a deep convolutional neural network was designed and

implemented to perform the image classification task and label small areas of each image as

wrinkle, fabric, gripper, or background. This model was trained and tested on generated datasets

and could provide acceptable results (Phase 1). Also, the effect of dataset size was evaluated at


this phase. To measure the ability of this model to generalize, it was tested on the second dataset

that was not seen by the network during the training. Testing the pre-trained model on a new dataset

with different wrinkle size and shape and different lighting conditions led to a noticeable decrease

in detection performance (Phase 2). The neural network had learned the features which existed in

both datasets, but not features exhibited only in the new dataset. To handle this issue, the idea of

utilizing a combination of multiple low-price cameras and employing their images from different

views to get a higher-dimensional feature map was proposed (Phase 3). Intuitively, looking at the scene from different angles provides more detail, leads to less occlusion, and allows rectification of photometric distortions, noise, discontinuities, etc. Images from five

different views were fed into the trained model and predictions from each view were combined

using the Dempster-Shafer [97] theory of evidence.

Wrinkle detection in the generalized inferencing case was 85.9% when evidential

reasoning was used to fuse multi-view information (Figure 6.1). This is an increase of 19.3 percentage points compared to the base case established in Phase 2.

Figure 6.1 - Wrinkle detection accuracy (%) by phase

The results of this stage are very promising and show that a combination of computer vision

and evidential reasoning techniques can be used to help increase feature detection accuracy when


a DCNN is generalized to previously unseen, but similar, images. Also, the setup can be easily adapted and trained for different experimental scenarios subject to a different arrangement of cameras.

In Stage 2, four state-of-the-art image segmentation models were trained on the generated

dataset to identify the fabric, gripper, and wrinkles during the draping process. The obtained results

were very promising at this stage. Overall, the detection of gripper and fabric was successful with

IoU scores of approximately 0.92 and 0.86 respectively for the best-performing model. DeepLab

V3+ was selected as the best candidate because of its good performance, especially on the wrinkle

detection task.

Wrinkle detection accuracy was significantly lower compared with other classes, with an

IoU score of approximately 0.40 for the best-performing model. In order to explain the poor

performance of the wrinkle-detection task, a second human-annotated dataset was compared with

the first; results, in this case, show a wrinkle detection IoU of only 0.41, which indicates similar

performance to the automated methods. This suggests that limitations in the human annotations

used as ground truth are a significant factor in the final performance of the model. These limitations

are both the result of the crude input of a computer mouse and ambiguity as to how humans

interpret wrinkles.

Further evaluation of wrinkle detection performance showed that the model could

successfully detect more than 71% of the present wrinkles. The average precision score also showed that the DCNN model was correct in 67% of all its predictions. To enhance the average

precision score, a filtration stage was conducted by ignoring any prediction having a wrinkle size

smaller than 20% of the average wrinkle size in the training set. The average precision score

increased to more than 76% without affecting recall. A detection was defined to be successful if the segmentation provided by the DCNN overlapped the ground truth by more than the overlapping threshold. The optimum value for the overlapping threshold was found to be 70% through a grid search experiment.

An approach was also introduced for the detection of wrinkles at the early stages of

formation. It was shown that training on the union of two human-annotated datasets (“wrinkle +

maybe wrinkle” set) was preferred for cases that require early detection of wrinkles. In contrast,

training on the intersection (“wrinkle” set) is a better option for the cases where redundant pauses

in the manufacturing process will be costly and it is preferred to take an action only when a defect

is present with high certainty.

Finally, the limitations of using synthetic data for training the models were evaluated by

generating a synthetic dataset using a 3D modeling software and then training and inferencing the

models on it. The results of testing the DCNN models on both real and synthetic test sets showed

a considerable performance drop while testing on real data because of the domain distribution shift

and mismatch in appearance.

The developed solutions can be used for a variety of composite manufacturing processes

or adapted to other similar tasks by only generating a small dataset and then applying the

techniques proposed in this thesis.

6.2 Future Work

Considering that this thesis is focused on finding deep learning-based approaches that

specifically use 2D images as input, other possible methods have remained unexplored.

Using RGB-D images or 3D models as the input of convolutional neural networks to

perform object detection during the draping process can be an interesting topic of research.


Considering the fact that each wrinkle is just an abnormality in the geometry of the fabric,

developing a model that uses depth features for detection can be very helpful to tackle the issues

related to locating the wrinkle boundary. The introduction of novel deep learning architectures

plus the promotion of 3D and RGB-D datasets in the future can be a key to this problem.

Semi-supervised and unsupervised learning methods can be also evaluated to eliminate the

need for generation and annotation of large datasets.

In Stage I of this thesis, the certainty of classification was increased by combining multiple

votes from several observations of the weaker classifier. A logical extension of this possibility

would be re-training the neural network on the new dataset, allowing it to learn the new features.

The arbitrary position and orientation of the gripper relative to the camera necessitated a process

to overlap fabrics. Key point detection for perspective correction was performed by hand, which

is not desired in an automated process. Moreover, arbitrary geometries result in images with rotation, occlusion, small fabric size, or other undesirable properties that can affect the results. A

fixed configuration of gripper and camera geometry may be able to maximize lighting effects while

minimizing warping, perspective and occlusion issues. Homography could then be used in

transformations, making the overall process automatable, faster, and more robust. The machine

learning model could also be trained on a wider variety of wrinkles to allow greater generalization

accuracy. However, the hand-labeling process used to create a supervised learning dataset is time-

consuming and it would be desirable to create a faster alternative.

To continue the work done in Stage II, other well-known architectures can be evaluated and compared with the ones currently used. Also, different values of training parameters can be tested

to increase the segmentation accuracy by finding the optimum training configuration for each

model. Increasing the size of the dataset and generating more accurate annotations is another


suggestion that can lead to performance enhancement in the future. A sensory system like a lidar

scanner can be installed at the DLR facility to create a precise ground truth free of errors caused by

human annotations. Besides, the availability of more powerful GPUs will make it possible to increase the training batch size, which usually leads to an improvement in segmentation quality.

Using domain adaptation techniques to reduce the domain distribution shift and mismatch

in appearance between synthetic and real datasets is another topic that can be followed in the

future. These techniques take advantage of deep networks and embed domain adaptation in the

deep learning pipeline which results in learning more transferable representations.

Development of an approach to associate the obtained visual information with the corresponding suction units on the gripper surface using spatial information is the next key step toward an end-to-end, fully automated manufacturing pipeline and needs to be done in the future.


Bibliography

[1] K. Gupta, M. Körber, A. Djavadifar, F. Krebs, and H. Najjaran, “Wrinkle and boundary

detection of fiber products in robotic composites manufacturing,” Assem. Autom., vol. 40,

no. 2, pp. 283–291, 2019.

[2] A. Djavadifar, J. B. Graham-Knight, K. Gupta, M. Körber, P. Lasserre, and H. Najjaran,

“Robot-assisted composite manufacturing based on machine learning applied to multi-

view computer vision,” in International Conference on Smart Multimedia, 2019.

[3] F. C. Campbell, “Structural Composite Materials,” 2010.

[4] K. D. Potter, “Understanding the origins of defects and variability in composites

manufacture,” ICCM Int. Conf. Compos. Mater., 2009.

[5] A. Rashidi and A. S. Milani, “Passive control of wrinkles in woven fabric preforms using

a geometrical modification of blank holders,” Compos. Part A Appl. Sci. Manuf., vol. 105,

pp. 300–309, Feb. 2018.

[6] A. Rashidi, H. Montazerian, K. Yesilcimen, and A. S. Milani, “Experimental

characterization of the inter-ply shear behavior of dry and prepreg woven fabrics:

Significance of mixed lubrication mode during thermoset composites processing,”

Compos. Part A Appl. Sci. Manuf., vol. 129, no. November 2019, p. 105725, 2020.

[7] H. Voggenreiter and D. Nieberl, “AZIMUT Abschlussbericht,” TIB, 2015.

[8] H. Montazerian, R. Sourki, M. Ramezankhani, A. Rashidi, M. Koerber, and A. S. Milani, "Digital twining of an automated fabric draping process for industry 4.0 applications: Part I - multi-body simulation and finite element modeling," in CAMX 2019 - Composites and

Advanced Materials Expo, 2019.


[9] M. Körber and C. Frommell, "Automated Planning and Optimization of a Draping Process Within the CATIA Environment Using a Python Software Tool," Procedia Manufacturing, 2019.

[10] M. Körber and C. Frommell, “Sensor-Supported Gripper Surfaces for Optical Monitoring

of Draping Processes,” SAMPE, 2017.

[11] K. Gupta, M. Körber, F. Krebs, and H. Najjaran, “Vision-based deformation and wrinkle

detection for semi-finished fiber products on curved surfaces,” in 2018 IEEE 14th

International Conference on Automation Science and Engineering (CASE), 2018, pp.

618–623.

[12] I. N. Aizenberg and C. Butakoff, “Frequency domain medianlike filter for periodic and

quasi-periodic noise removal,” in Image Processing: Algorithms and Systems, 2002, vol.

4667, no. May 2002, pp. 181–191.

[13] O. Russakovsky et al., “Imagenet large scale visual recognition challenge,” Int. J.

Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.

[14] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of

the IEEE International Conference on Computer Vision, 1999, vol. 2, pp. 1150–1157.

[15] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput.

Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.

[16] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple

features,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, no. July

2014, 2001.

[17] P. Viola and M. Jones, “Robust real-time object detection,” Int. J. Comput. Vis., vol. 4, no.

34–47, p. 4, 2001.


[18] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning

and an Application to Boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.

[19] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D

images for object detection and segmentation,” Lect. Notes Comput. Sci. (including

Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8695 LNCS, no. PART

7, pp. 345–360, 2014.

[20] R. Memisevic and C. Conrad, “Stereopsis via deep learning,” in NIPS Workshop on Deep

Learning, 2011, vol. 1, p. 2.

[21] J. Zbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional

neural network,” in Proceedings of the IEEE conference on computer vision and pattern

recognition, 2015, pp. 1592–1599.

[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate

object detection and semantic segmentation,” Proc. IEEE Comput. Soc. Conf. Comput.

Vis. Pattern Recognit., pp. 580–587, 2014.

[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image

recognition,” arXiv Prepr. arXiv1409.1556, 2014.

[24] M. Osadchy, Y. Le Cun, and M. L. Miller, “Synergistic face detection and pose estimation

with energy-based models,” J. Mach. Learn. Res., vol. 8, no. May, pp. 1197–1215, 2007.

[25] J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network

and a graphical model for human pose estimation,” Adv. Neural Inf. Process. Syst., vol. 2,

no. January, pp. 1799–1807, 2014.

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep

convolutional neural networks,” in Advances in neural information processing systems,


2012, pp. 1097–1105.

[27] A. Dosovitskiy et al., “FlowNet: Learning optical flow with convolutional networks,”

Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp. 2758–2766, 2015.

[28] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016,

pp. 5695–5703.

[29] Y. LeCun and others, “Generalization and network design strategies,” in Connectionism in

perspective, vol. 19, Citeseer, 1989.

[30] Y. Lecun et al., “Handwritten digit recognition with a back-propagation network,” in

Advances in neural information processing systems, 1990, pp. 396–404.

[31] Y. Lecun et al., “Backpropagation applied to handwritten zip code recognition,” Neural

Comput., vol. 1, no. 4, pp. 541–551, 1989.

[32] W. Rawat and Z. Wang, “Deep Convolutional Neural Networks for Image Classification:

A Comprehensive Review,” Neural Comput., vol. 29, no. 9, pp. 2352–2449, 2017.

[33] K. Chellapilla, S. Puri, and P. Simard, “High Performance Convolutional Neural

Networks for Document Processing,” in Tenth International Workshop on Frontiers in

Handwriting Recognition, 2006.

[34] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief

nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.

[35] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural

networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[36] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of

deep networks,” in Advances in neural information processing systems, 2007, pp. 153–


160.

[37] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, “Unsupervised Learning of Invariant

Feature Hierarchies with Applications to Object Recognition,” in 2007 IEEE Conference

on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[38] R. Girshick, “Fast R-CNN,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp.

1440–1448, 2015.

[39] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection

with region proposal networks,” in Advances in neural information processing systems,

2015, pp. 91–99.

[40] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” Proc. IEEE Int. Conf.

Comput. Vis., vol. 2017-Octob, pp. 2980–2988, 2017.

[41] T.-Y. Lin et al., “Microsoft coco: Common objects in context,” in European conference

on computer vision, 2014, pp. 740–755.

[42] R. Anantharaman, M. Velazquez, and Y. Lee, “Utilizing Mask R-CNN for Detection and

Segmentation of Oral Diseases,” in 2018 IEEE International Conference on

Bioinformatics and Biomedicine (BIBM), 2018, pp. 2197–2204.

[43] J. Singh and S. Shekhar, “Road Damage Detection And Classification In Smartphone

Captured Images Using Mask R-CNN,” in IEEE International Conference On Big Data

Cup, 2018, vol. abs/1811.0.

[44] X. Chen, R. Girshick, K. He, and P. Dollar, “TensorMask: A foundation for dense object

segmentation,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2019-Octob, pp. 2061–2069,

2019.

[45] W. Liu et al., “SSD: Single Shot MultiBox Detector,” Dec. 2015.

99

[46] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-

time object detection,” in Proceedings of the IEEE Computer Society Conference on

Computer Vision and Pattern Recognition, 2016, vol. 2016-Decem, pp. 779–788.

[47] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, “Fast YOLO: A Fast You Only Look Once

System for Real-time Embedded Object Detection in Video,” Sep. 2017.

[48] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” Dec. 2016.

[49] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” 2018.

[50] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab:

Semantic image segmentation with deep convolutional nets, atrous convolution, and fully

connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848,

2017.

[51] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for

semantic image segmentation,” in Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition, 2017.

[52] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal

visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338,

2010.

[53] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with

atrous separable convolution for semantic image segmentation,” in Proceedings of the

European conference on computer vision (ECCV), 2018, pp. 801–818.

[54] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical

image segmentation,” in Lecture Notes in Computer Science (including subseries Lecture

Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015, vol. 9351, pp.

100

234–241.

[55] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “ICNet for Real-Time Semantic Segmentation

on High-Resolution Images,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes

Artif. Intell. Lect. Notes Bioinformatics), vol. 11207 LNCS, pp. 418–434, 2018.

[56] M. Osswald, S.-H. Ieng, R. Benosman, and G. Indiveri, “A spiking neural network model

of 3D perception for event-based neuromorphic stereo vision systems,” Sci. Rep., vol. 7, p.

40703, 2017.

[57] P. Ferrara et al., “Wide-angle and long-range real time pose estimation: A comparison

between monocular and stereo vision systems,” J. Vis. Commun. Image Represent., vol.

48, pp. 159–168, 2017.

[58] A. L. Hou, X. Cui, Y. Geng, W. J. Yuan, and J. Hou, “Measurement of safe driving

distance based on stereo vision,” Proc. - 6th Int. Conf. Image Graph. ICIG 2011, pp. 902–

907, 2011.

[59] H. Kim, C.-S. Lin, J. Song, and H. Chae, “Distance measurement using a single camera

with a rotating mirror,” Int. J. Control. Autom. Syst., vol. 3, no. 4, pp. 542–551, 2005.

[60] K. A. Rahman, M. S. Hossain, M. A.-A. Bhuiyan, T. Zhang, M. Hasanuzzaman, and H.

Ueno, “Person to camera distance measurement based on eye-distance,” in 2009 Third

International Conference on Multimedia and Ubiquitous Engineering, 2009, pp. 137–141.

[61] M. N. A. Wahab, N. Sivadev, and K. Sundaraj, “Target distance estimation using

monocular vision system for mobile robot,” in 2011 IEEE Conference on Open Systems,

2011, pp. 11–15.

[62] Y. M. Mustafah, R. Noor, H. Hasbi, and A. W. Azma, “Stereo vision images processing

for real-time object distance and size measurements,” in 2012 international conference on

101

computer and communication engineering (ICCCE), 2012, pp. 659–663.

[63] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a

common multi-scale convolutional architecture,” in Proceedings of the IEEE international

conference on computer vision, 2015, pp. 2650–2658.

[64] G. R. Sangeetha, N. Kumar, P. R. Hari, and S. Sasikumar, “Implementation of a Stereo

vision based system for visual feedback control of Robotic Arm for space manipulations,”

Procedia Comput. Sci., vol. 133, pp. 1066–1073, 2018.

[65] K. McGuire, G. De Croon, C. De Wagter, K. Tuyls, and H. Kappen, “Efficient optical

flow and stereo vision for velocity estimation and obstacle avoidance on an autonomous

pocket drone,” IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 1070–1076, 2017.

[66] M. L. Balter, A. I. Chen, T. J. Maguire, and M. L. Yarmush, “Adaptive kinematic control

of a robotic venipuncture device based on stereo vision, ultrasound, and force guidance,”

IEEE Trans. Ind. Electron., vol. 64, no. 2, pp. 1626–1635, 2017.

[67] L. Ma, J. Stückler, C. Kerl, and D. Cremers, “Multi-view deep learning for consistent

semantic mapping with rgb-d cameras,” in 2017 IEEE/RSJ International Conference on

Intelligent Robots and Systems (IROS), 2017, pp. 598–605.

[68] M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in

Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,

pp. 3213–3223.

[69] R. Mottaghi et al., “The role of context for object detection and semantic segmentation in

the wild,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 891–898,

2014.

[70] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in

102

Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,

pp. 770–778.

[71] S. Zheng et al., “Conditional random fields as recurrent neural networks,” in Proceedings

of the IEEE international conference on computer vision, 2015, pp. 1529–1537.

[72] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic

segmentation,” in Proceedings of the IEEE conference on computer vision and pattern

recognition, 2015, pp. 3431–3440.

[73] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image

segmentation with deep convolutional nets and fully connected crfs,” arXiv Prepr.

arXiv1412.7062, 2014.

[74] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic

segmentation,” in Proceedings of the IEEE international conference on computer vision,

2015, pp. 1520–1528.

[75] V. Badrinarayanan, A. Handa, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-

Decoder Architecture for Robust Semantic Pixel-Wise Labelling,” 2015.

[76] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv

Prepr. arXiv1511.07122, 2015.

[77] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support

inference from rgbd images,” in European Conference on Computer Vision, 2012, pp.

746–760.

[78] S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and recognition of indoor

scenes from RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, 2013, pp. 564–571.

103

[79] F. Husain, H. Schulz, B. Dellen, C. Torras, and S. Behnke, “Combining Semantic and

Geometric Features for Object Class Segmentation of Indoor Scenes,” IEEE Robot.

Autom. Lett., vol. 2, no. 1, pp. 49–55, 2017.

[80] J. Wang, Z. Wang, D. Tao, S. See, and G. Wang, “Learning common and specific features

for RGB-D semantic segmentation with deconvolutional networks,” in Lecture Notes in

Computer Science (including subseries Lecture Notes in Artificial Intelligence and

Lecture Notes in Bioinformatics), 2016, vol. 9909 LNCS, pp. 664–679.

[81] D. Lin, G. Chen, D. Cohen-Or, P.-A. Heng, and H. Huang, “Cascaded feature network for

semantic segmentation of RGB-D images,” in Proceedings of the IEEE International

Conference on Computer Vision, 2017, pp. 1311–1319.

[82] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, “Indoor semantic segmentation using

depth information,” arXiv Prepr. arXiv1301.3572, 2013.

[83] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into

semantic segmentation via fusion-based cnn architecture,” in Asian conference on

computer vision, 2016, pp. 213–228.

[84] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “Lstm-cf: Unifying context

modeling and fusion with lstms for rgb-d scene labeling,” in European conference on

computer vision, 2016, pp. 541–557.

[85] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Exploring Context with Deep

Structured Models for Semantic Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell.,

vol. 40, no. 6, pp. 1352–1366, 2018.

[86] G. Riegler, A. Osman Ulusoy, and A. Geiger, “Octnet: Learning deep 3d representations

at high resolutions,” in Proceedings of the IEEE Conference on Computer Vision and

104

Pattern Recognition, 2017, pp. 3577–3586.

[87] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “Semanticfusion: Dense 3d

semantic mapping with convolutional neural networks,” in 2017 IEEE International

Conference on Robotics and automation (ICRA), 2017, pp. 4628–4635.

[88] Y. He, W.-C. Chiu, M. Keuper, and M. Fritz, “Std2p: Rgbd semantic segmentation using

spatio-temporal data-driven pooling,” in Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, 2017, pp. 4837–4846.

[89] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a

multi-scale deep network,” in Advances in neural information processing systems, 2014,

pp. 2366–2374.

[90] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-supervised deep learning for monocular

depth map prediction,” in Proceedings of the IEEE Conference on Computer Vision and

Pattern Recognition, 2017, pp. 6647–6655.

[91] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth

estimation: Geometry to the rescue,” in European Conference on Computer Vision, 2016,

pp. 740–756.

[92] K. Wang and S. Shen, “MVDepthNet: real-time multiview depth estimation neural

network,” in 2018 International Conference on 3D Vision (3DV), 2018, pp. 248–257.

[93] L. Jorissen, P. Goorts, G. Lafruit, and P. Bekaert, “Multi-view wide baseline depth

estimation robust to sparse input sampling,” in 2016 3DTV-Conference: The True Vision-

Capture, Transmission and Display of 3D Video (3DTV-CON), 2016, pp. 1–4.

[94] Y. Li, K. Qian, T. Huang, and J. Zhou, “Depth estimation from monocular image and

coarse depth points based on conditional gan,” in MATEC Web of Conferences, 2018, vol.

105

175, p. 3055.

[95] A. Wang, Z. Fang, Y. Gao, X. Jiang, and S. Ma, “Depth Estimation of Video Sequences

With Perceptual Losses,” IEEE Access, vol. 6, pp. 30536–30546, 2018.

[96] Y. Chen, W. Li, X. Chen, and L. Van Gool, “Learning semantic segmentation from

synthetic data: A geometrically guided input-output adaptation approach,” in Proceedings

of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,

2019, vol. 2019-June, pp. 1841–1850.

[97] A. P. Dempster, “Upper and Lower Probabilities Induced by a Multivalued Mapping,”

Ann. Math. Stat., vol. 38, no. 2, pp. 325–339, 1967.

[98] G. Shafer, A mathematical theory of evidence, vol. 42. Princeton university press, 1976.

[99] A. Buades, B. Coll, and J.-M. Morel, “Non-Local Means Denoising,” Image Process.

Line, vol. 1, pp. 208–212, 2011.

[100] J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Anal.

Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.