AUTOMATIC DETECTION OF GEOMETRICAL ANOMALIES IN COMPOSITES
MANUFACTURING:
A DEEP LEARNING-BASED COMPUTER VISION APPROACH
by
Abtin Djavadifar
B.A.Sc., Amirkabir University of Technology, 2017
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF APPLIED SCIENCE
in
THE COLLEGE OF GRADUATE STUDIES
(Mechanical Engineering)
THE UNIVERSITY OF BRITISH COLUMBIA
(Okanagan)
April 2020
© Abtin Djavadifar, 2020
The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation entitled:
Automatic Detection of Geometrical Anomalies in Composites Manufacturing: A Deep Learning-Based Computer Vision Approach
submitted by Abtin Djavadifar in partial fulfillment of the requirements for
the degree of Master of Applied Science
in Mechanical Engineering
Examining Committee:
Homayoun Najjaran, School of Engineering Supervisor
Abbas Milani, School of Engineering Supervisory Committee Member
Zheng Liu, School of Engineering Supervisory Committee Member
Additional Examiner
Abstract
This thesis focuses on the development of a machine learning-based vision system for
quality control of composite manufacturing processes. Deep Convolutional Neural Networks
(DCNNs) are used to build a real-time, end-to-end solution for the complex process of draping fiber-reinforced cut-pieces by conducting online visual inspection. The visual inspection will ultimately help with the manufacturing of double-curved composite parts such as the aircraft's rear pressure bulkhead. The developed solution provides accurate and robust measurement without the need for expensive coordinate measuring machines (CMMs) on the shop floor. The development of the inspection software is completed in the following two stages.
In Stage I, after creating a hand-labeled visual dataset acquired from a fabric layup robotic system at the German Aerospace Center (DLR), a DCNN was designed, trained, and tested for
image classification. Then, the idea of combining images from multiple cameras for generalization
of the designed model to different wrinkle properties and environments was evaluated. The
proposed method employs computer vision techniques and Dempster-Shafer Theory (DST) to
enhance wrinkle detection accuracy without the need for any additional hand-labeling or re-
training of the model. By applying the DST rule of combination, the overall wrinkle detection accuracy was greatly improved.
In Stage II, four state-of-the-art image segmentation DCNN models (DeepLab V3+, U-Net,
Mask-RCNN, IC-Net) were evaluated to accurately identify the gripper, fabric, and any probable
wrinkle on a dry fiber product. The results show that using a DCNN model and transfer learning can lead to acceptable results even when training on a small and inaccurately annotated dataset. Also, the
impact of human annotation quality on the performance of DCNN models was evaluated by
comparing two human-annotated datasets. Then, an approach for detection of wrinkles at the early
stages of formation was developed and evaluated. Finally, the challenges of using synthetically
generated data for training the models were assessed by conducting complementary experiments.
The developed solution can be practically used for visual inspection of the draping process
in composite manufacturing facilities. The presented method can be readily adopted to train DCNN
models using other datasets and perform visual inspection tasks in different automated
manufacturing processes.
Lay Summary
The overall goal of this study is to develop a solution for visual quality inspection of certain
steps of automated composites manufacturing processes using robots. To do so, multiple cameras
are installed in proper positions in front of a gripper robot that grabs the fiber-reinforced cut-pieces and places them inside a mold. The images captured by the cameras are then processed by a neural network model to detect wrinkles, misalignment, or other defects in the cut-piece that may occur
during the draping process.
This research is focused on creating the automatic visual inspection system by generating
the required datasets and training the neural networks. Different types of neural network models
are designed and implemented to perform the wrinkle detection task. To enhance the performance
of these models, novel and practical machine learning techniques, such as combining multi-view inputs from different cameras, were developed and successfully evaluated.
Preface
This thesis presents the result of research collaboration between the Advanced Control and
Intelligent Systems (ACIS) laboratory at the School of Engineering, the University of British
Columbia, and the Center for Lightweight Production (ZLP) of the German Aerospace Center
(DLR) in Augsburg, Germany.
A version of Stage II (described in Chapters 3 and 4) has been submitted to Robotics and
Computer-Integrated Manufacturing Journal, and is currently under review (A. Djavadifar, JB.
Graham-Knight, M. Körber, and H. Najjaran, “Automated Visual Detection of Geometrical
Defects in Composite Manufacturing Processes Using Deep Convolutional Neural Networks”).
A version of Chapter 4 has been published in the International Journal of Assembly
Technology and Management, 2019 [1] (K. Gupta, M. Körber, A. Djavadifar, F. Krebs, and H.
Najjaran, “Wrinkle and boundary detection of fiber products in robotic composites
manufacturing”).
A version of Stage I (described in Chapters 3 and 4) has been published and presented at
the International Conference on Smart Multimedia (ICSM), 2019 [2] (A. Djavadifar, JB. Graham-
Knight, K. Gupta, M. Körber, P. Lasserre, and H. Najjaran, “Robot-assisted composite
manufacturing based on machine learning applied to multi-view computer vision”).
A version of Stage I (described in Chapters 3 and 4) has been presented at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), 2019 as a poster (A.
Djavadifar, JB. Graham-Knight, M. Körber, and H. Najjaran, “Robot-assisted composite
manufacturing using deep learning and multi-view computer vision”).
In all of the papers except [1], I, Abtin Djavadifar, was responsible for leading the research, programming, algorithm development, and writing the papers. John Brandon Graham-Knight was involved in the concept formation stage, programming, implementation of the developed algorithms, and writing and editing of the manuscripts. Marian Körber, who works as a researcher at DLR, collaborated in preparing the experimental setup and conducting the tests using DLR facilities. Kashish Gupta was the first author of [1] and helped with concept formation for the other papers. Patricia Lasserre helped by providing the computational resources needed for training the models at UBC. Homayoun Najjaran, the last and supervisory author, was involved throughout the project in concept formation and manuscript editing.
Table of Contents
Abstract ......................................................................................................................................... iii
Lay Summary .................................................................................................................................v
Preface ........................................................................................................................................... vi
Table of Contents ....................................................................................................................... viii
List of Tables ................................................................................................................................ xi
List of Figures .............................................................................................................................. xii
List of Abbreviations ................................................................................................................. xiv
Acknowledgments ...................................................................................................................... xvi
Dedication .................................................................................................................................. xvii
Chapter 1 : Introduction ...............................................................................................................1
1.1 Motivations ..................................................................................................................... 1
1.2 Objectives ....................................................................................................................... 2
1.3 Contributions................................................................................................................... 3
1.4 Thesis Outline ................................................................................................................. 4
Chapter 2 : Literature Review ......................................................................................................8
2.1 Composite Manufacturing .............................................................................................. 8
2.2 Quality Control in Composites Manufacturing............................................................... 9
2.3 Automated Handling of Carbon-Fiber Reinforced Plastics at DLR ............................. 11
2.4 Limitations of Traditional Image Processing Techniques ............................................ 18
2.5 Visual Sensing Systems ................................................................................................ 21
2.6 Object Recognition Algorithms .................................................................................... 23
2.7 Transfer Learning.......................................................................................................... 33
2.8 Stereo Vision Image Processing ................................................................................... 34
2.9 Multi-View Computer Vision ....................................................................................... 36
2.10 Depth Estimation .......................................................................................................... 37
Chapter 3 : Infrastructure for Dataset Development ...............................................................48
3.1 Hardware Setup ............................................................................................................. 48
3.1.1 Modular Gripper ....................................................................................................... 48
3.1.2 IDS XS Camera......................................................................................................... 49
3.2 Software Setup .............................................................................................................. 49
3.2.1 Data Generation for Training Image Classification Models ..................................... 49
3.2.2 Data Generation for Training Image Segmentation Models ..................................... 51
3.2.3 Synthetic Dataset Generation .................................................................................... 54
3.3 Computational Resources ............................................................................................. 58
Chapter 4 : Deep Learning for Wrinkle and Fabric Boundary Detection .............................59
4.1 Stage I – Wrinkle Detection Using an Image Classification Model ............................. 60
4.1.1 Phase 1 - Model Development .................................................................................. 61
4.1.2 Phase 2 - Model Generalization ................................................................................ 63
4.1.3 Phase 3 - Multi-view Inferencing ............................................................................. 64
4.2 Stage II – Wrinkle and Boundary Detection Using Image Segmentation Models ....... 69
4.2.1 Training Image Segmentation Models ...................................................................... 70
4.2.2 Assessing Human Annotation Quality ...................................................................... 72
4.2.3 Evaluation of Wrinkle Detection Performance ......................................................... 73
4.2.4 An Approach for Early Detection of Wrinkles ......................................................... 74
4.2.5 Training on Synthetic Data ....................................................................................... 75
Chapter 5 : Experimental Results ..............................................................................................76
5.1 Stage I - Wrinkle Detection Using an Image Classification Model .............................. 76
5.2 Stage II - Wrinkle and Boundary Detection Using Image Segmentation Models ........ 79
Chapter 6 : Conclusions ..............................................................................................................88
6.1 Summary ....................................................................................................................... 88
6.2 Future Work .................................................................................................................. 91
Bibliography .................................................................................................................................94
List of Tables
Table 3.1 - Technical specifications of computational resources ................................................. 58
Table 4.1 - Training parameters for different models ................................................................... 71
Table 5.1 - Accuracy of detection on the initial dataset ............................................................... 76
Table 5.2 - Accuracy of detection against the new dataset ........................................................... 77
Table 5.3 - Accuracy of detection after multi-view inferencing ................................................... 79
Table 5.4 - Inferencing results for image segmentation models ................................................... 80
Table 5.5 - Performance of trained models on the test dataset ..................................................... 81
Table 5.6 - IoU between two human-annotated datasets .............................................................. 82
Table 5.7 - Wrinkle detection scores for DeepLab V3+ ............................................................... 83
Table 5.8 – Inferencing results for DeepLab V3+ trained on single class datasets (IoU scores) . 84
Table 5.9 - Inferencing results on the synthetic test dataset ......................................................... 86
Table 5.10 - Inferencing results on the real test dataset ................................................................ 87
List of Figures
Figure 1.1 - Organization of the thesis ............................................................................................ 7
Figure 2.1 - Manufacturing of CFRP components in aerospace manufacturing .......................... 10
Figure 2.2 - Preform process of aircraft's rear pressure bulkhead ................................................ 12
Figure 2.3 – Draping of cut-pieces at DLR................................................................................... 12
Figure 2.4 – Comparison between the boundary of the fabric in the simulation and the actual
process........................................................................................................................................... 15
Figure 2.5 - Validation of the draping process simulations .......................................................... 16
Figure 2.6 - Simulation of the draping process inside the mold ................................................... 17
Figure 2.7 - Fabric on the gripper ................................................................................................. 18
Figure 2.8 - Fabric with contrast-enhanced using histogram equalization (left) and local
histogram equalization (right) methods ........................................................................................ 19
Figure 2.9 - Filtered contrast-enhanced fabric without first using [12] for frequency-domain
removal of periodic noise (left) and with noise removal first (right) ........................................... 20
Figure 2.10 - Overview of object recognition tasks in computer vision area ............................... 23
Figure 2.11 - Transfer learning diagram for visual inspection of the draping process using a pre-
trained model ................................................................................................................................ 34
Figure 3.1 - Main components of the modular gripper ................................................................. 49
Figure 3.2 - Frames before and after the hand-labeling process, with the colored grids in the
second image coded to represent the different classification categories: wrinkle (cyan), gripper
(green), fabric (yellow) and background (black) .......................................................................... 51
Figure 3.3 - Arrangement of cameras in front of the gripper setup .............................................. 52
Figure 3.4 - A sample image of the dataset (left) and its annotation mask created in Amazon
SageMaker (right) ......................................................................................................................... 53
Figure 3.5 - Conversion of an RGB mask (left) to its gray-scale equivalent (right) .................... 53
Figure 3.6 - Data augmentation workflow chart ........................................................................... 54
Figure 3.7 - Simulation of the draping process in Blender ........................................................... 56
Figure 3.8 - Samples of generated images in Blender .................................................................. 56
Figure 3.9 - Synthetic data annotation process: main image (top left); image with marked
wrinkles (top right); RGB mask (bottom left); gray-scale mask (bottom right) ........................... 57
Figure 4.1 - Examples of wrinkles in the images used for training (left); wrinkles in the new
dataset (right) ................................................................................................................................ 64
Figure 4.2 - Five co-temporal masks of the same fabric ............................................................... 68
Figure 4.3 - Five overlaid, co-temporal fabric masks before correction (left) and after correction
(right) ............................................................................................................................................ 68
Figure 4.4 - Calculation of IoU metric for evaluation of object localization accuracy ................ 72
Figure 5.1 - A sample image of the dataset (top image) and its two annotation masks ............... 82
Figure 5.2 - Effect of overlapping threshold on wrinkle detection scores .................................... 84
Figure 6.1 - Wrinkle detection accuracy (%) by phase ................................................................. 89
List of Abbreviations
Abbreviation Definition
2D Two Dimensional
3D Three Dimensional
ACIS Advanced Control and Intelligent Systems
ASPP Atrous Spatial Pyramid Pooling
CFRP Carbon-Fiber Reinforced Plastic
CNN Convolutional Neural Network
DCNN Deep Convolutional Neural Network
DST Dempster-Shafer Theory
FCN Fully Convolutional Networks
FOV Field of View
FRP Fiber-Reinforced Plastic/Polymer
GPU Graphics Processing Unit
HDR High Dynamic Range
HOG Histogram of Oriented Gradients
IBR Image-Based Rendering
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IoU Intersection Over Union
LIDAR Light Imaging, Detection, and Ranging
mAP Mean Average Precision
mIoU Mean Intersection Over Union
ML Machine Learning
RGB Red, Green, Blue
RGB-D Red, Green, Blue - Depth
RoI Region of Interest
SIFT Scale-Invariant Feature Transform
SLAM Simultaneous Localization and Mapping
UAV Unmanned Aerial Vehicle
Acknowledgments
First, I would like to thank my research advisor, Dr. Homayoun Najjaran of the School of Engineering at the University of British Columbia. Prof. Najjaran was always kindly available whenever I faced a problem in my research or had a question regarding my writing. He allowed me to make this thesis my own work while guiding me in the proper direction with his constant advice.
I would also like to thank the professors who helped me in the validation process of this research: Dr. Abbas Milani and Dr. Zheng Liu. This work could not have been successfully conducted without their valuable involvement and input.
Finally, I am grateful to my parents for their immeasurable support and persistent encouragement during my studies. This achievement would not have been possible without them.
Dedication
I dedicate this accomplishment to my parents who were the shining lights of my life,
constantly showing me the right direction and supporting me to overcome the difficulties of this
journey and reach my goals.
I also dedicate this work to my great friends who always supported me and helped me with
writing this dissertation. I highly appreciate all they have done, especially John Brandon Graham-
Knight, Marian Körber, Kyle Low, and Kashish Gupta for assisting me during my research.
Chapter 1 : Introduction
1.1 Motivations
The application of Fiber-Reinforced Plastics or Polymers (FRP) has steadily diversified, from racing cars and sports equipment to helicopters and aircraft, during the last four decades. In each composite structure, two or more dissimilar materials are used together to either combine their best properties or impart a new set of characteristics that neither of the components could achieve on its own.
fiber-reinforced composites is that the material properties and the structure are created at the same
time. So, any probable defect that may happen during the manufacturing process will directly
influence the stiffness and strength of both the material and the structure. This increases the importance of adopting an efficient quality control strategy that helps find defects as quickly as possible, so that they can either be fixed or the defective part can be rejected to avoid further waste. Defect detection on dry fiber fabrics in the aviation industry is one of the complex issues slowing down the manufacturing flow of various products such as the aircraft's rear pressure bulkhead. Consequently, developing an automated quality control solution that visually finds these defects so that they can be removed has been of much interest in recent years.
Computer vision applications in manufacturing processes have increased in importance
with the development of powerful computer systems and high-resolution and fast camera systems.
Vision applications can be used for process control, quality assurance and documentation
purposes. To process the gathered visual data, classical algorithms and, more recently, Machine Learning (ML) methods are used. In many applications, components or products are processed at high rates, which leads to a high volume of data. This data volume enables efficient algorithm development and serves as a training data source for the development of ML models. A special challenge arises when only a small amount of data is, or can become, available. The reason for this can be small manufacturing batches, which are common in the aerospace industry. Particularly in the
development of ML applications, special methods must be used to compensate for the lack of
extensive training data. Thus, the process of finding the appropriate tools and algorithms for
addressing a new problem such as geometrical defect detection in a composite manufacturing
process is new and challenging. This motivates us to research a novel method that takes advantage
of deep learning techniques and modern computer vision systems to detect geometrical defects
happening on fiber fabrics during composite manufacturing processes using only a small dataset.
1.2 Objectives
This thesis aims to develop a data-driven, vision-based, automated wrinkle detection setup that provides an end-to-end solution to the issues related to the multi-variable process of draping the fiber-reinforced cut-pieces by conducting online quality control. This setup can be used for quality inspection of the manufacturing process of the aircraft's rear pressure bulkhead. The developed vision system will be an alternative to expensive and complicated technologies such as high-resolution cameras, RGB-D (Red, Green, Blue - Depth) sensors, and LiDAR (Light Imaging, Detection, and Ranging) sensors, which may not be affordable for companies or adaptable to different industrial environments.
Specifically, the proposed technique employs deep learning tools, namely Convolutional Neural Networks (CNNs), to comprehend the process through images and understand its interaction with the setup under different conditions and scenarios. For this purpose:
1. A practical setup of visual sensors needs to be designed and assembled.
2. A data acquisition, registration, and preprocessing pipeline must be defined to convert the captured visual data into a format perceptible to the CNN.
3. A suitable object recognition model must be adopted and optimized, or developed, to detect the gripper and fabric, as well as any wrinkles occurring during the process, quickly, accurately, and robustly.
1.3 Contributions
This thesis aims to develop a visual quality assessment tool for composite manufacturing
processes, particularly wrinkle detection on fiber fabric cut-pieces grasped by a modular gripper during the draping process, using simple optical sensors. A deep learning, vision-based method for detecting the fabric, gripper, and probable wrinkles is developed as an end-to-end solution that can
run in real-time with acceptable accuracy. For this purpose:
1. A practical setup of visual sensors was designed, assembled and mounted in front of the
modular gripper to provide visual data during the draping process.
2. A data generation and preparation pipeline was defined for 1) gathering the captured visual
data from the sensors, 2) annotating, augmenting, and cleaning the data, 3) projecting
images from multi-views in one anchor image (for multi-view setup), and 4) converting
the processed data to an appropriate format for training deep convolutional neural networks
(DCNN). Finally, two datasets were generated for training the DCNN models and
evaluating their performance. A virtual simulation method for synthetic data generation
was also developed and implemented.
3. Two practical object recognition models were designed and optimized to detect the fabric
boundary and wrinkles happening during the process quickly, accurately, and robustly.
a. First, an object classification neural network was designed and implemented. To
generalize the developed model, a technique for combining co-temporal views and the Dempster–Shafer theory of evidence was used to employ data captured from different views. It is shown that this technique significantly increases accuracy.
b. Second, four state-of-the-art image segmentation networks were trained on the custom-generated dataset and tested to label the gripper, fabric, and wrinkles at the pixel level. The results achieved by the best-performing model show remarkable performance on the gripper and fabric detection tasks. The limitations of human annotations were also evaluated to explain the reason behind the low score obtained for the wrinkle class. Further evaluations showed that the model is able to provide acceptable results for wrinkle detection if an appropriate evaluation metric is used. A method for detection of wrinkles at their early stages of formation was also developed and evaluated. Finally, the challenges of using a synthetic dataset, which eliminates the highly time-consuming data annotation stage, for training the DCNN models were assessed by training and testing the models on such a dataset.
1.4 Thesis Outline
The thesis is organized as follows.
Chapter 2 reviews the background research conducted on quality control of composite
manufacturing processes, as well as studies concerning the use of computer vision systems
for performing object recognition tasks.
Chapter 3 describes the dataset development infrastructure used for this project by
explaining the modular gripper configuration, the camera setup used for capturing the images, the
data processing pipeline including the data annotation and augmentation stages, the developed
virtual simulation method for synthetic data generation, and the computational resources used for
training and testing of the neural networks.
Chapter 4 first explains the limitations of traditional image processing techniques and then
demonstrates two different methods used to develop a deep learning-based vision system for
boundary and wrinkle detection, in two separate sections.
Section 4.1 explains the steps taken to design and implement an image classification neural
network. It also describes how multi-views of the scene were combined and Dempster-Shafer
theory of evidence was used for fusing the votes coming from different views to generalize the
developed model to other datasets that are visually different from the custom-created dataset.
Section 4.2 describes the training procedure of four state-of-the-art DCNNs for performing
instance segmentation and semantic segmentation tasks on the custom-generated dataset. Then, the
limitations of human annotations and their effects on the evaluation process are discussed. The
wrinkle detection performance is evaluated again by using more appropriate metrics. A method
for detection of wrinkles at their early stages of formation is also developed. At the end of this
section, a method for training the models on synthetic images is presented, and the challenges that arose and the reasons behind them are explained.
Chapter 5 uses figures and tables to present the results of each stage explained in Chapter 4, in separate phases. It then discusses the results obtained in each phase and explains how the findings of one phase informed the next steps to be taken in the later phases of the project.
Chapter 6 briefly summarizes the work done in this thesis and highlights the contributions and achievements. Finally, it discusses future research directions and suggests
further improvements. Figure 1.1 illustrates the organizational framework of the thesis.
Chapter 2 : Literature Review
2.1 Composite Manufacturing
Composites manufacturing has been distinguished as a key manufacturing technology with the potential to impact a broad range of industries over the last four decades because of the light weight, high stiffness, and superior strength of composite materials. A composite is a structure composed of two or more materials that form a new material with improved properties while preserving the micro-structure of each constituent [3]. Production of FRP composites starts by combining strong reinforcing fibers with a polymer resin. The use of lightweight materials helps reduce carbon emissions and save energy in many applications; for example, more efficient operation of wind turbines, which are an alternative source of energy, and higher fuel savings due to lighter-weight vehicles and compressed gas tanks.
Typically, a composite material is made of two main components: 1) the reinforcement, which transfers load in the composite and provides the mechanical strength, and 2) the matrix, which keeps the reinforcement material bonded and aligned and protects it from environmental effects and abrasion. This combination creates products lighter than monolithic materials (e.g., metals) with the same or better properties. The utilization of FRP materials is increasing in a variety of applications such as automotive, compressed gas storage, industrial equipment like pipelines and heat exchangers, structural materials for buildings, wind turbine blades, hydrokinetic energy generation, shipping containers, and support structures for any kind of system that can take advantage of the lower cost, higher strength, lighter weight, increased stiffness, and better corrosion resistance of composite materials.
Carbon Fiber-Reinforced Plastic (CFRP) composites have even higher stiffness-to-weight
and strength-to-weight ratios that lead to significant energy savings during production and improve
their performance. These characteristics make CFRPs a suitable candidate to be used in more
specific applications like aircraft manufacturing in the aviation industry.
2.2 Quality Control in Composites Manufacturing
A major issue with the production of composite materials is the detection of manufacturing
defects in the structures and composite components. The fact that the constituents of a composite
material maintain their primary properties after forming a new material makes it difficult to discern defects in an inhomogeneous composite material. Undesired defects in composites can considerably reduce part quality, cause a fatal failure, or lead to a significant waste of money and energy in the case of late detection. Thus, the development of non-destructive evaluation methods and in-situ sensors for process control is highly desirable to understand as-manufactured part performance and hinder defect formation. Although some technologies are already being used to evaluate the quality of composites in a non-destructive manner, the development of novel methods and the improvement of current technologies are essential to increase the production speed and facilitate the
production of larger components. Figure 2.1 shows a typical workflow used in the aviation
manufacturing industry.
Figure 2.1 - Manufacturing of CFRP components in aerospace manufacturing
Figure 2.1 shows how quality control at intermediate stages, e.g., the preforming stage, can prevent further costs by rejecting a defective part before it enters subsequent stages, e.g., vacuum bagging and infiltration, curing, and machining.
As the volume of advanced composite materials used in the aviation industry is rising
constantly and main structural parts are increasingly made from composites, it is extremely
important to develop practical means for quality control of design practices, materials, and
production processes employed to build composite structures in this industry. As production rates
have been expanding markedly in recent years, the quality target must also be adjusted to zero
rework and repair, zero defects, and zero scrap.
Composites manufacturing procedures are typically manual, so defects or unwanted
features in moldings are usually the result of human error that can be prevented by more rigorous
quality control at every step of the process. In practice, the situation is more complicated because
of many interactions between the process design, part design decisions, and the variabilities in
processes and materials. Thus, it would be challenging to find a precise approach for identifying
the sources of variability and the probable defects that can arise from them [4]. To decrease the
probability of generating poor quality and costly parts, the mentioned defects must be properly
found and effectively removed during the manufacturing process. Also, further evaluations can be
done to understand the variabilities in development and design stages which can result in defect
formation.
The use of a visual inspection approach offers a framework within which these defects can
be identified quickly and properly. So, immediate actions can be taken to remove the defective
parts as early in the process as possible to avoid more costs. Further analysis will help to make
rational decisions about potential routes to develop zero-defect manufacturing processes, design,
and materials.
2.3 Automated Handling of Carbon-Fiber Reinforced Plastics at DLR
With the increasing importance of fiber-reinforced materials in the modern aviation
industry, many composite manufacturing facilities including the German Aerospace Center (DLR)
desire to develop a full-scale closed production chain from the raw materials to the finished
component. Manual handling of composite plies is capable of producing high performance,
complex parts; however, manual handling, is also an expensive, time-consuming process. Delays
in handling may lead to delays in the manufacturing flow. In addition, manually handling the
material on large-scale composite panels introduces the possibility of human error. Hence, large-
scale part manufacturing can greatly benefit from automated solutions to inform and control the
process [5]. However, the robot-assisted handling of these materials in the manufacturing process
of CFRP components is challenging due to the probable defects, such as wrinkles, that may happen
on the fabric, or misalignment of the actual and targeted boundary of fabric [6].
DLR conducts a large amount of research involving the use of robotic arms for smart
automation of composites manufacturing tasks. One of the key challenges is handling and draping dry fiber fabrics to form large double-curved components, which are essential parts of various products such as the aircraft's rear pressure bulkhead (Figure 2.2).
Figure 2.2 - Preform process of aircraft's rear pressure bulkhead
A significant part of this task is the automated production of a preform created from dry
carbon fiber cut-pieces. The complexity of this handling process lies in the double-curved target
geometry. The cut-pieces are gripped in the flat state and transferred to the target geometry during
the deformation process, also known as the draping process (Figure 2.3).
Figure 2.3 – Draping of cut-pieces at DLR
13
To drape the cut-pieces, an end-effector system called "Modular Gripper" was developed
in the AZIMUT project [7], which can deform its gripper surface to a double-curved geometry
with the aid of a rib-spine design. The modular gripper is capable of picking up a flat fiber cut-
piece, transferring it to a double-curved geometry and depositing it at the intended position in the
mold [7]. The spine consists of two glass fiber rods connected to the gripper structure by three
linear actuators. The rods are bent by shortening the linear axes, causing an inhomogeneous
curvature of the suction surface. The 15 ribs are able to create independent curvatures; together,
ribs and spine produce a double curvature. The suction surface consists of 127 suction units that
are individually adjustable in their suction intensity. The ribs and spine are deformed to selectively
manipulate the carbon fiber fabric so that the cut-pieces match the predefined boundary edge
geometry as well as the predefined fiber orientation [7].
During deformation, undesired wrinkles can form, which need to be identified and
removed. The aerospace industry, having little tolerance for error, requires a high product quality.
This makes it imperative to establish an automated method to identify wrinkles if they form on the
fabric throughout the draping process.
The automated draping process by the modular gripper has proven to be very flexible but
also very complex. A total of 145 process parameters must be selected for draping a cut-piece: 3 parameters for the spine motor positions, 15 parameters for the rib motor positions, and 127 parameters for the intensities of the individual suction units, all of which have a significant impact on the quality of the draping. This is due to two effects that occur during the deformation of the gripper surface. The stresses arising in the material are relieved either in the fiber direction by buckling of the material or by shearing the textile transverse to the fiber direction. Both effects can be handled with the aid of a targeted selection of suction intensities, by gripping less firmly in areas of the textile with expected relative movements. The suction intensity must be increased in areas where the shearing force is applied, to make the grip tighter. The quality of the drapery is characterized by checking two main items: 1) whether any fold forms in the fiber cut-piece, and 2) how accurately the edge contour of the cut-piece corresponds to the predefined contour.
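For illustration only, the parameter breakdown above can be represented as a small data structure; the sketch below is a hypothetical Python representation of the 145-dimensional draping configuration, not DLR's actual control interface, and the default values are arbitrary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrapingParameters:
    """Hypothetical container for the 145 draping process parameters."""
    spine_motor_positions: List[float] = field(default_factory=lambda: [0.0] * 3)   # 3 spine linear actuators
    rib_motor_positions: List[float] = field(default_factory=lambda: [0.0] * 15)    # 15 independently curved ribs
    suction_intensities: List[float] = field(default_factory=lambda: [1.0] * 127)   # 127 adjustable suction units

    def as_vector(self) -> List[float]:
        """Flatten the configuration into a single 145-dimensional vector."""
        return self.spine_motor_positions + self.rib_motor_positions + self.suction_intensities


params = DrapingParameters()
assert len(params.as_vector()) == 3 + 15 + 127 == 145
```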
Misalignment of the actual and predefined fabric boundaries can result in cut-pieces overlapping each other inside the mold and decrease the final product quality. Moreover, wrinkles can negatively affect the mechanical properties of a carbon-fiber-reinforced component by letting the fibers fold when the vacuum is applied. Folded fibers cannot resist tension, and the strength of the final component in the fold area is reduced.
Development of an automated solution that visually finds these misalignments and defects
early in the process has been of much interest recently. Such a solution can be used for online
quality inspection of the draping process either when the fabric is grasped by the gripper and is
being transferred to the mold or later when it is placed inside the mold (middle and right images
in Figure 2.3). Also, finding the exact location of the fabric on the gripper can help to adjust the
gripper motion planning and increase the accuracy of fabric placement inside the mold.
Irrespective of how these 127 suction intensities are optimized, whether manually or
automatically, a process is required that validates the drapery according to the selected intensities.
The first step towards such validation is the detection of the cut-piece edge curve and any wrinkles
that may have formed on the fabric surface in the draping process. The user can use this data to
monitor the composite production process and apply the necessary adjustments online.
Due to the high number of parameters affecting the process, identifying the underlying
relationship between 145 process parameters (i.e., suction unit intensities and the gripper surface
geometry) and the final mechanical and physical properties of the dry fabric after the draping
process is a challenging task. To tackle this issue, a finite element-based simulation was done by
Montazerian et al. in [8]. To validate the simulation results, the boundary edges of the fabric in
both the simulation and the actual process must be compared. The simulation provides the
boundary edge of the textile over each suction surface by dividing each suction unit into 100
smaller portions along both x and y directions (see Figure 2.4). The fabric contour is then
documented according to this coordinate system. Although the actual contour can be documented
by manually inspecting the images captured from the setup, an automated evaluation method is needed, as the manual process is extremely time-consuming and costly. The developed
method in this thesis can be used to find the actual contour of the fabric.
Figure 2.4 – Comparison between the boundary of the fabric in the simulation and the actual process
As shown in Figure 2.5, the actual boundary can be identified by the DCNN model using the real images and then compared with the simulated boundary. The validation result lets the user tune the suction intensities according to the observed deviations. Increasing the suction intensity in a selected area will prevent the fabric from sliding on the suction surface. Reducing the suction intensity allows the textile to move across the surface. Using this approach, the user can ensure the desired boundary edge is achieved. The formation of wrinkles can also be controlled by adjusting the suction configuration, as folds always indicate an excessive accumulation of material in an area.
Figure 2.5 - Validation of the draping process simulations
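As a rough illustration of this validation step, the sketch below compares a simulated boundary contour with a boundary detected from the camera images and flags regions where the deviation exceeds a tolerance. The data format (2D point arrays in the shared suction-surface coordinate system of Figure 2.4), the function name, and the tolerance value are assumptions for illustration, not the actual DLR validation code.

```python
import numpy as np


def contour_deviation(simulated: np.ndarray, detected: np.ndarray) -> np.ndarray:
    """Distance from each simulated boundary point to the nearest detected boundary point.

    Both inputs are (N, 2) / (M, 2) arrays of (x, y) points expressed in the same
    coordinate system, e.g. the per-suction-unit grid of Figure 2.4.
    """
    diffs = simulated[:, None, :] - detected[None, :, :]   # (N, M, 2) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)                 # (N, M) pairwise distances
    return dists.min(axis=1)                               # nearest-neighbour distance per point


# Toy usage: flag boundary regions that deviate too far from the simulated contour,
# so the corresponding suction intensities can be adjusted.
simulated_contour = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])
detected_contour = np.array([[0.0, 0.2], [1.1, 0.3], [2.0, 0.8]])
deviation = contour_deviation(simulated_contour, detected_contour)
needs_adjustment = deviation > 0.5   # assumed tolerance in grid units
print(deviation, needs_adjustment)
```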
To validate the draping quality inside the mold, Körber et al. in [9] designed a method to
calculate the target contours with the aid of CAD-based optimization tools. They have shown how
the gripper’s deformation and position inside the mold can be iteratively calculated and optimized
by simulating the process in the CAD environment. For this purpose, the mold model and a
simplified gripper model were integrated into a CAD model using CATIA. With the help of several
CATScript macros and a Python framework, the gripper model was moved to the location of the
targeted cut-piece. Afterward, the ribs and spine of the gripper model were deformed and the
distances between the suction surfaces and the mold surface were measured after every
deformation step. The optimization was terminated when a termination criterion was met.
Figure 2.6 shows the mold and the gripper model, the targeted cut-piece location before
the simulation run (left image) and the optimized location (right image). The colored surfaces
indicate the distance between the suction surfaces and the mold surface. Green areas are above the
mold surface, yellow areas are in contact with it and red areas penetrate it. This simulation can
provide the desired position and orientation of the gripper, the proper deformation parameters, and
the target contour of the cut-piece on the gripper. The position and deformation parameters are
needed for the automation of the process while the target contour is required for the validation
purposes. To validate the simulation, the actual draping result needs to be captured using a visual
system. The developed method in this thesis can be useful at the validation step by providing the
boundary of the fabric inside the mold.
Figure 2.6 - Simulation of the draping process inside the mold
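The iterative placement optimization described above can be summarized as a simple loop: measure the suction-surface-to-mold distances, stop if the worst deviation is below a tolerance, otherwise update the pose and repeat. The sketch below is a hypothetical Python outline of that loop, not the CATIA/CATScript implementation from [9]; the callables and the toy numbers only stand in for the CAD measurements.

```python
from typing import Callable, List, Tuple


def optimize_gripper_placement(
    pose: List[float],
    measure_distances: Callable[[List[float]], List[float]],
    update_pose: Callable[[List[float], List[float]], List[float]],
    max_iters: int = 100,
    tol: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Hypothetical outline of the iterative placement optimization described in [9]:
    adjust the gripper pose until every suction surface lies within `tol` of the
    mold surface or the iteration budget is exhausted."""
    distances = measure_distances(pose)
    for _ in range(max_iters):
        if max(abs(d) for d in distances) < tol:   # termination criterion met
            break
        pose = update_pose(pose, distances)        # move/deform the gripper model
        distances = measure_distances(pose)
    return pose, distances


# Toy usage with stand-in callables (in [9] the distances come from the CAD simulation):
final_pose, final_distances = optimize_gripper_placement(
    pose=[0.0],
    measure_distances=lambda p: [10.0 - p[0]],      # placeholder distance model
    update_pose=lambda p, d: [p[0] + 0.5 * d[0]],   # simple proportional correction
)
print(final_pose, final_distances)
```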
The Center for Lightweight Production Technologies (ZLP) in Augsburg, Germany,
developed an optical sensor system which provides information about relative movements between
the suction surfaces and the fabric during the draping process [10]. However, this setup was limited
in its ability to detect the fabric boundary geometry and probable wrinkles. The complicated shear
stresses happening while draping a flexible fabric material on a curvature can lead to wrinkling of
the fabric. Wrinkles greatly affect the manufacturing process and reduce the quality of the final
product. Figure 2.7 is an example showing the fabric, gripper, and linear wrinkles that possibly
appear within the draping process (one of the wrinkles is shown with a red bounding box).
Gupta et al. in [1], [11] aimed to address this issue by automatically finding the 3D geometry and boundary edges of the cut-piece while it was mounted on a yoga ball. Then, they used this data to predict the material behavior and the draping quality during the deformation. They gathered both RGB and infrared images to take advantage of both color features and depth measurements in the detection of the boundary of composite products and the wrinkles that may occur.
Figure 2.7 - Fabric on the gripper
Their experimental results demonstrated the robustness of their solution for parameters estimated according to the experimental setup conditions; however, it remained highly dependent on tuned parameters that must be re-calibrated for every new situation.
2.4 Limitations of Traditional Image Processing Techniques
Wrinkle detection is based on lighting effects, where one side of the wrinkle will be
significantly lighter than the other. In the original image, subtle wrinkles are difficult to identify,
even to the human eye. To enhance these lighting effects, contrast enhancement through histogram
equalization is performed on the masked area of the fabric. Because the fabric is not oriented in a
fixed position relative to the lighting, one portion of the fabric can have an overall higher light
intensity than another. To help distinguish large-scale from small-scale lighting effects (i.e.
wrinkles), a localized histogram equalization technique is used. The results of this process can be
seen in Figure 2.8.
Figure 2.8 - Fabric with contrast enhanced using histogram equalization (left) and local histogram equalization (right) methods
Aggressive contrast enhancement succeeds in making the luminosity differences of
wrinkles more visible; however, it also makes the texture of the fabric apparent. The texture can
be seen in the top-right to bottom-left diagonal stripes in Figure 2.8. These stripes are highly
periodic, and as such are a good candidate for filtering in the frequency domain. The method
presented in [12] was utilized, with one change: it was found that tuning the normalizing divisor
(δ) for that method was difficult and image-specific. As such, rather than using a constant divisor,
the magnitude of any identified peak is set to the average of the kernel area. A low-pass Gaussian
filter and several passes with a median filter are then applied to remove high-frequency
components. Figure 2.9 shows the results of these filtering operations with and without the
frequency-domain filtering.
Figure 2.9 - Filtered contrast-enhanced fabric without first using [12] for frequency-domain removal of
periodic noise (left) and with noise removal first (right)
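A minimal OpenCV sketch of this kind of enhancement pipeline is given below for illustration; it uses CLAHE as a stand-in for the localized histogram equalization step and omits the frequency-domain removal of periodic texture from [12]. The kernel sizes, clip limit, and number of median passes are assumed values, not the parameters used in this work.

```python
import cv2
import numpy as np


def enhance_wrinkle_contrast(gray_fabric: np.ndarray) -> np.ndarray:
    """Contrast-enhance and smooth a grayscale fabric image so that wrinkle
    shading becomes easier to see (illustrative parameters only)."""
    # Local (adaptive) histogram equalization, so large-scale illumination
    # differences across the fabric do not dominate the enhancement.
    clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray_fabric)

    # Low-pass Gaussian filter followed by several median-filter passes to
    # suppress the high-frequency fabric texture that the equalization amplified.
    smoothed = cv2.GaussianBlur(enhanced, (9, 9), 0)
    for _ in range(3):
        smoothed = cv2.medianBlur(smoothed, 5)
    return smoothed


# Example usage on an arbitrary image file (the path is a placeholder):
# img = cv2.imread("fabric.png", cv2.IMREAD_GRAYSCALE)
# result = enhance_wrinkle_contrast(img)
```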
Considering the steps taken in this process, using traditional image processing techniques involves many difficulties, from selecting the right technique at each stage to tuning the parameters of the applied filters, which requires expertise and practical knowledge. Also, the parameters need to be re-tuned whenever the lighting conditions, image quality, etc. change, which hinders the performance of such methods in real time. Thus, it is essential to look for novel methods that can
handle these issues more efficiently.
It should be noted that detecting fabric and gripper during the draping process is simpler
than wrinkle detection because of their regular geometry and color features. So, traditional
techniques can be employed for detecting less complex classes as a complement to
more advanced methods like convolutional neural networks used for the detection of challenging
objects.
2.5 Visual Sensing Systems
With the increasing demand for efficient, fast and accurate robots in industrial
manufacturing, it is imperative for roboticists to employ unconventional and innovative practices
to solve some non-trivial problems, one of them being the use of Two-Dimensional (2D) image
data to perceive Three-Dimensional (3D) surroundings. The limitations of traditional 2D methods greatly restrict the operability of robots in a practical working environment. The representation of a 3D scene in 2D leads to the loss of essential features that could otherwise play a vital role in solving the problem.
Robotic engineers have always been inspired by processes occurring in nature, and human binocular vision offers a model for solving the problems faced by modern-day industries. This motivates researchers to imitate the human vision system to address the challenges on the way to a robust and well-performing machine vision platform that enables a more comprehensive insight into the process. Apart from a deeper semantic understanding of the scene, such a technique offers the ability to enhance performance on some complex tasks like object recognition, grasping, handling, manipulating, and assembling. In addition, tasks requiring an external human supervisor can be automated through a practical estimation of the object's pose and dimensions. Thus, the ability to perceive the surrounding environment is crucial to a wide variety of industrial applications, e.g., automation, inspection, process control, and robot guidance.
To reach this goal, there is a need to collect visual data by means of appropriate passive optical sensors and to develop a powerful tool to process and analyze the input and obtain the desired output in real time. The technique is also expected to reduce costs, enhance performance by optimizing the process, and avoid high computational loads and long execution times.
To design a vision system capable of performing the object recognition task, some key
parameters need to be chosen.
First, the number, type, and arrangement of visual sensors that will be used. One can use
1) a single visual sensor, 2) a stereo vision system having two visual sensors, or 3) a multi-view
setup including three or more sensors. The visual sensor itself can be 1) an RGB (Red, Green, Blue) camera that
perceives the color features in the images, 2) an RGB-D camera that uses depth features in addition
to the color features, or 3) LIDAR and LIDAR-like systems which use a laser light to measure the
distance from the target. In the case of using more than one sensor, the arrangement of sensors in
the scene can vary considering their relative angle and distance (narrow-baseline or wide-baseline
setups).
Second, the appropriate method for performing the object detection and localization task must be chosen. There are many options available, ranging from traditional image processing techniques, the Histogram of Oriented Gradients (HOG), and the Scale-Invariant Feature Transform (SIFT), to recent deep learning approaches, mostly based on CNNs, such as region proposal methods, the Single Shot MultiBox Detector (SSD), etc.
Also, it needs to be decided whether there is enough labeled data to use supervised learning methods, or whether a semi-supervised or unsupervised method should be used because of the lack of data. In the case of using depth features for detection, it should be decided whether depth will be added as another input channel alongside the color channels or used for creating a point cloud or a 3D model of the scene. The following sections review different methods that can be used for object
recognition in different computer vision applications.
2.6 Object Recognition Algorithms
Before introducing various object recognition algorithms, it is worth distinguishing between related computer vision tasks such as image classification, object detection, object localization, instance segmentation, and semantic segmentation, to avoid confusion later. An image classification method assigns a class label to a whole image, whilst object
localization draws a bounding box around each object present in the image. Object detection is a
combination of these two tasks and assigns a class label to each object of interest after drawing a
bounding box around it. Object recognition is a general term used for referring to all of these tasks
together [13]. Figure 2.10 shows how these concepts are related to each other.
Figure 2.10 - Overview of object recognition tasks in computer vision area
Object detection models are able to create a bounding box for every object present in the image, but they cannot provide any information about the object's shape, as the bounding boxes are rectangular. In contrast, image segmentation models can create a pixel-level mask for every object that appears in the image. This method provides a more comprehensive understanding of the object(s) present in the image. The image segmentation task itself can be performed in two ways: 1) semantic segmentation, which assigns a label to each pixel of the image according to the class of the object that pixel belongs to (class-aware labeling), and 2) instance segmentation, which identifies object boundaries at the pixel level and distinguishes between separate objects of the same class (instance-aware labeling).
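To make the distinction concrete, the toy sketch below contrasts the two output formats on a tiny image: a single per-pixel class map for semantic segmentation versus one binary mask and class label per object for instance segmentation. The class indices and array sizes are arbitrary illustrative choices, not the format used by the models in this thesis.

```python
import numpy as np

# Semantic segmentation output: one class index per pixel (class-aware labeling).
# Toy labels: 0 = background, 1 = gripper, 2 = fabric, 3 = wrinkle.
semantic_map = np.array([
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 3, 3, 0],
    [2, 2, 0, 0],
])

# Instance segmentation output: a binary mask plus a class label per object
# (instance-aware labeling), so two separate wrinkles remain distinct objects.
wrinkle_a = np.zeros((4, 4), dtype=bool)
wrinkle_b = np.zeros((4, 4), dtype=bool)
wrinkle_a[2, 1] = True
wrinkle_b[2, 2] = True
instances = [
    {"class": "wrinkle", "mask": wrinkle_a},
    {"class": "wrinkle", "mask": wrinkle_b},
]

print(np.unique(semantic_map))                      # classes present in the semantic map
print([inst["mask"].sum() for inst in instances])   # pixel count per instance
```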
In the following sections, some of the well-known object recognition methods are
described.
I. Scale-Invariant Feature Transform (SIFT)
SIFT is a feature detection algorithm in computer vision that detects and describes local features in an image. It was first published by David Lowe in 1999 [14] and then patented in
Canada by the University of British Columbia in 2004. SIFT applications include object
recognition, 3D modeling, robotic navigation and mapping, video tracking, gesture recognition,
and match moving.
There are multiple steps involved in the SIFT algorithm: 1) finding the potential location
of features (scale-space peak selection), 2) locating the feature key points (key point localization),
3) assigning an orientation to key points (orientation assignment), 4) describing the key points as
a high dimensional vector (key point description), and 5) key point matching.
SIFT starts by extracting the key points of each object from a group of reference images and
saving them in a database. To recognize an object in a new image, each available feature is
individually compared with the elements of the database, and the Euclidean distance between feature
vectors is calculated to identify candidate matching features. From the full set of matches, subsets
of key points that agree on the object as well as its orientation, location, and scale in the new
image are kept as acceptable matches. Then, an implementation of the generalized Hough transform
using a hash table is used to determine consistent clusters. Each cluster of at least three features
agreeing on an object and its pose is subjected to further detailed model verification, and outliers
are discarded. Finally, the number of probable false matches and the accuracy of fit are considered
to compute the probability that the observed set of features indicates the presence of the object. An
object match is confidently marked as correct if it passes all of these tests [15].
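As a rough illustration of this pipeline, the short Python sketch below detects and matches SIFT key points between a reference image and a new image using OpenCV; the file names and the 0.75 ratio-test threshold are illustrative assumptions rather than settings used in this thesis.

    import cv2

    # Hypothetical input files: a reference view of the object and a new scene.
    reference = cv2.imread("reference_object.png", cv2.IMREAD_GRAYSCALE)
    scene = cv2.imread("new_scene.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp_ref, desc_ref = sift.detectAndCompute(reference, None)    # key point localization + description
    kp_scene, desc_scene = sift.detectAndCompute(scene, None)

    # Compare feature vectors by Euclidean (L2) distance and keep only matches
    # whose best candidate is clearly closer than the second best (ratio test).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    candidates = matcher.knnMatch(desc_ref, desc_scene, k=2)
    good_matches = [m for m, n in candidates if m.distance < 0.75 * n.distance]
    print(f"{len(good_matches)} candidate matches survive the ratio test")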
II. Viola-Jones Object Detection Framework
The Viola-Jones object detection framework was introduced in 2001 by Paul Viola and
Michael Jones [16], [17] and achieved competitive real-time object detection rates. It was
primarily developed to address the problem of face detection but can also be used to detect
various object classes. The Viola-Jones algorithm is known for its robustness, providing
a very high true-positive rate and a very low false-positive rate. Its ability to process at least
two frames per second makes it a good candidate for practical real-time applications.
However, its performance is limited to the detection task only, so it cannot be considered a
complete object recognition method, as detection is just the first step in the whole
recognition process.
Viola-Jones is known for three key contributions. First, it introduced a new image
representation, known as the integral image, that allows quick computation of the
features used by the detector. Second, an AdaBoost-based learning algorithm was developed,
which yields highly effective classifiers by choosing a few crucial visual features from a
larger set [18]. Third, a method was developed to progressively combine more complex classifiers
in a cascade, allowing image background regions to be quickly discarded while concentrating
computation on regions with a higher probability of containing an object. The cascade statistically
ensures that the ignored regions are unlikely to contain the desired object [16].
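The cascade detector is available in OpenCV together with pre-trained Haar cascades; the minimal sketch below runs the bundled frontal-face cascade on a single image (the image file name is a placeholder).

    import cv2

    # Load the frontal-face Haar cascade that ships with opencv-python.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)

    gray = cv2.imread("example_photo.png", cv2.IMREAD_GRAYSCALE)
    # Each detection is returned as an (x, y, width, height) bounding box.
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"Detected {len(faces)} face(s)")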
III. Histogram of Oriented Gradients (HOG)
The concepts behind HOG were first described by Robert K. McConnell in a patent
application in 1986. However, the method received little attention until 2005, when Dalal et al.
improved the HOG descriptor and presented their work at the Conference on
Computer Vision and Pattern Recognition (CVPR).
HOG is a feature descriptor that is often used to extract features from image data for
object detection in a variety of computer vision tasks. The HOG descriptor focuses on the
shape or structure of an object and provides both edge features and edge directions, which
distinguishes it from older methods that could only indicate whether a pixel lies on an edge. To do
so, it extracts the gradient and orientation (magnitude and direction) of the edges. The
orientations are calculated in localized portions by breaking the complete image into smaller
regions, computing the gradients and orientations for each region, and generating a histogram
for each region separately. The name 'histogram of oriented gradients' comes from the fact that
the histograms are created from the gradients and orientations of the pixel values.
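For illustration, the sketch below extracts a HOG descriptor with scikit-image; the cell and block sizes are the commonly used defaults from Dalal and Triggs, not parameters taken from this thesis.

    from skimage import color, data
    from skimage.feature import hog

    image = color.rgb2gray(data.astronaut())    # any grayscale image works here

    features, hog_image = hog(
        image,
        orientations=9,              # number of gradient-orientation bins per histogram
        pixels_per_cell=(8, 8),      # local region over which each histogram is computed
        cells_per_block=(2, 2),      # blocks used for contrast normalization
        visualize=True,              # also return an image visualizing the descriptor
    )
    print(features.shape)            # one long feature vector describing local edge structure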
IV. Convolutional Neural Networks (CNNs)
CNNs have shown a prominent performance on both complex and low-level vision tasks
such as instance segmentation [19], stereo depth perception [20], [21], object detection [22], [23],
pose estimation [24], [25], image classification [26], optical flow prediction [27], and stereo
estimation [28].
The roots of CNNs for image classification started in the 1980s. This early work focused
on the identification of hand-written digits, specifically as related to automated zip code detection
[29]–[31]. This work continued through the late 1990s, but with little adoption [30]. The method
worked well but suffered from a lack of parallel compute power available at the time [32].
Advancements in GPU4 computation and the increased availability of datasets led to a renewal of
interest in 2006 and to technological advances such as the first application of maximum pooling
for dimensionality reduction [33]–[37].
The current enthusiasm for DCNNs was sparked by the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC), wherein the winning entry in 2012 was a DCNN [26]. DCNNs
have since dominated the ILSVRC, and specifically the image classification component [13].
Some of the well-known DCNNs that have been frequently used in recent years are introduced in
the following sections:
i. Region Proposals (R-CNN)
R-CNN [22] is short for "Region-based Convolutional Neural Network". The R-CNN
structure is based on two steps: first, selecting a feasible number of bounding-box object
regions as candidates (also known as RoIs5) using a selective search approach; second, extracting
CNN features from each region independently to perform classification.
4 Graphical Processing Unit
5 Region of Interest
To overcome the expensive and slow training process of the R-CNN model, Girshick et al.
unified three independent models into a single framework, called Fast R-CNN, and trained it
jointly [38]. This model makes one CNN forward pass over the whole image, and the region
proposals share the resulting feature matrix instead of extracting features independently for
every proposal. The same shared feature matrix is then used to learn the bounding-box regressor
and the object classifier, which speeds up R-CNN by taking advantage of computation sharing.
Although Fast R-CNN was noticeably faster during training and testing, the improvement
was limited by the high cost of generating the region proposals separately with another model.
Faster R-CNN [39] sped up this process by integrating the region proposal algorithm into the
CNN model, constructing a single, joint model composed of Fast R-CNN and a Region Proposal
Network (RPN) that share the same convolutional feature layers.
Later, He et al. extended Faster R-CNN by appending a branch that predicts a segmentation
mask in each RoI [40], making it able to identify each instance of every known object
within an image.
Pixel-level segmentation requires more accurate spatial alignment than bounding boxes, so
Mask R-CNN improved the RoI pooling layer to provide a more precise mapping of each RoI to the
corresponding region of the original image. Mask R-CNN beat the existing records in different parts
of the COCO suite of challenges [41]. Its success has led to its use in a variety of
applications, from the detection and segmentation of oral diseases [42] to the detection and
classification of road damage in images captured by smartphones [43].
In 2019, a new framework called TensorMask was introduced by Facebook AI for
highly accurate instance segmentation [44]. It uses a dense sliding-window technique
and novel architectures and operators to capture 4D geometric structure with rich and effective
representations for dense images. The main idea is that although the direct sliding-window
paradigm can accurately detect objects in a single stage without requiring a follow-up refinement
step, it is not effective for instance segmentation, where instance masks are complex 2D
geometric structures rather than simple rectangles. High-dimensional 4D tensors with scale-adaptive
sizes are therefore required to represent instance masks effectively while sliding densely over
a 2D regular grid. TensorMask accomplishes this using structured, high-dimensional 4D geometric
tensors composed of sub-tensors whose axes have well-defined units of pixels. These
sub-tensors enable geometrically meaningful operations such as coordinate transformations, up-
scaling, down-scaling, and the use of scale pyramids.
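As a hedged example of how an off-the-shelf instance segmentation model from this family can be applied, the sketch below runs a COCO-pretrained Mask R-CNN from torchvision on a single image and returns per-instance boxes, scores, and pixel-level masks; the image file name is a placeholder and the 0.5 score threshold is an arbitrary choice.

    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    image = to_tensor(Image.open("example_scene.jpg").convert("RGB"))   # hypothetical input image
    with torch.no_grad():
        prediction = model([image])[0]

    # Keep only the instances the model is reasonably confident about.
    keep = prediction["scores"] > 0.5
    boxes = prediction["boxes"][keep]    # rectangular detections
    masks = prediction["masks"][keep]    # one soft pixel-level mask per kept instance
    print(boxes.shape, masks.shape)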
ii. Single Shot Multi-Box Detector (SSD)
SSD was introduced by Liu et al. in [45] in 2016 and reached a new record in
object detection performance. Unlike RPN-based approaches such as the R-CNN series, which
generate region proposals and classify the object in each proposal in two separate stages, SSD
detects the multiple objects present in an image in a single shot, which lets it run faster.
Single shot means that the object classification and localization tasks are both done in
a single forward pass of the CNN, and MultiBox is the bounding-box regression method developed by
the authors. The network acts as an object detector that also classifies the detected
objects itself. SSD discretizes the output space of bounding boxes into a group of
default boxes with various aspect ratios and scales for every feature map location. When
predicting objects in new images, the network assigns a score to the probability of each object
class being present in every default box and then adjusts each box to better match the shape of
the object. Predictions from feature maps of different resolutions are also
combined, allowing the model to handle objects of different sizes. SSD proved its accuracy
competitive against techniques with an explicit object proposal step by achieving better results
on the MS COCO, PASCAL VOC, and ILSVRC datasets.
iii. You Only Look Once (YOLO)
Unlike the R-CNN family, which locates objects by only looking at the image regions that
are more likely to contain an object, the YOLO framework takes the whole image as its input and
predicts the coordinates of the bounding boxes together with their class probabilities. The three
main advantages of YOLO are 1) its speed, processing images at 45 fps6 in real time, 2) its ability
to reason globally by seeing the entire image while making predictions, unlike region proposal-based
and sliding window techniques, and 3) its ability to learn generalizable object representations.
The working principle of the unified YOLO model is simple: a single convolutional network predicts
multiple bounding boxes and their class probabilities simultaneously. YOLO is trained on full
images, which allows it to directly optimize detection performance.
After the introduction of YOLO by Redmon et al. in 2016 [46], various versions of YOLO
have been developed to enhance its performance in different aspects. Fast YOLO is a version
of YOLO with 9 convolutional layers instead of 24, which runs about 3 times faster than
YOLO but has lower mAP (mean Average Precision) scores [47]. YOLO VGG-16 uses VGG-16
as its backbone instead of the original YOLO network, which makes it more accurate but too slow
for real-time application. YOLOv2 focuses on reducing the significant number of
localization errors and improving the low recall of the original YOLO while maintaining
classification accuracy [48]. The original YOLO can only detect 20 classes, which is not enough for
a large group of object detection applications, whereas YOLO9000 is a real-time framework for
detecting more than 9000 object categories by jointly optimizing classification and detection [48].
YOLOv3 is the latest member of the YOLO family, with improvements over YOLOv2 such as
a better feature extractor, a DarkNet-53 backbone with shortcut connections, and a better object
detector with feature map up-sampling and concatenation [49].
6 Frames per second
iv. DeepLab
DeepLab is a revolutionary semantic segmentation model designed by Google. It uses
atrous convolutions, upsamples the output of the last convolutional layer, and computes a
pixel-wise loss to make dense predictions. Atrous convolution refers to a convolution
with up-sampled (dilated) filters, which allows the network to efficiently expand the filters'
field of view without increasing the computational load or the number of parameters.
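In PyTorch, atrous convolution corresponds to the dilation argument of a standard convolution layer; the short sketch below (illustrative only) shows that a dilated 3 × 3 convolution has the same number of parameters as a regular one while covering a larger field of view.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 64, 64)    # a dummy feature map (batch, channels, H, W)

    standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)               # 3x3 field of view
    atrous = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)     # effectively 5x5 field of view

    # Both layers have the same number of weights and produce same-sized outputs.
    print(sum(p.numel() for p in standard.parameters()),
          sum(p.numel() for p in atrous.parameters()))
    print(standard(x).shape, atrous(x).shape)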
DeepLab V1 improved on previous models by employing Fully Convolutional Networks
(FCNs), which helped overcome two main challenges: 1) reduced
feature resolution caused by the multiple pooling and down-sampling layers in DCNNs, and 2) reduced
localization accuracy caused by the spatial invariance of DCNNs.
DeepLab V2 attempted to further enhance the performance of DeepLab V1 by addressing
another challenge, the existence of objects at multiple scales. It used a novel technique called
Atrous Spatial Pyramid Pooling (ASPP), wherein multiple atrous convolutions with distinct
sampling rates were applied to the input feature map and the outputs were combined [50].
The next remaining challenge for DeepLab was capturing sharper object
boundaries. The DeepLab V3 architecture employed a novel encoder-decoder with atrous separable
convolution, which could obtain sharper object boundaries to tackle this issue. DeepLab V3
also applied depth-wise separable convolutions to increase computational efficiency [51].
After the great success of DeepLab V2 and DeepLab V3 on the PASCAL VOC 2012
semantic image segmentation benchmark [52], Chen et al. extended their previous work by adding
a decoder module to improve segmentation performance, particularly along object boundaries
[53]. They took advantage of both the ASPP and decoder modules to achieve a more efficient encoder-
decoder architecture and thus improved their performance on the PASCAL VOC 2012 challenge.
v. U-Net
U-Net is an encoder-decoder based architecture developed by Ronneberger et al. for
biomedical image segmentation [54]. One of the important challenges in the medical computer
vision field is the limited number of datasets as many are not publicly available to protect patient
confidentiality. Moreover, accurately annotating medical images requires trained personnel. U-
Net employs data augmentation techniques to effectively use the available labeled samples which
makes it a useful tool in cases without large annotated datasets. U-Net has also been shown to
perform well on grayscale datasets.
vi. IC-Net
Zhao et al. introduced an image cascade network, called IC-Net, that combines branches
with different resolutions under appropriate label guidance to address the real-time semantic
segmentation challenge [55]. The importance of architectures like IC-Net is that they achieve an
acceptable trade-off between accuracy and efficiency, allowing image segmentation tasks to be
performed in real time, which is a necessity for a variety of applications.
2.7 Transfer Learning
Machine learning models are developed assuming that the training and test data share a
similar feature space and the same distribution. Consequently, a new model must be developed and
a new set of training data collected if the feature space or the data distribution changes.
Collecting the needed training data and rebuilding the models is expensive and time-consuming.
Thus, finding a way of transferring knowledge between task domains is of much interest. The idea
of transfer learning in deep learning is inspired by the fact that humans use knowledge gained
from learning other tasks to learn new tasks or address new problems better and faster. In the
computer vision field, this corresponds to a CNN that learns basic image features such as shapes,
corners, and illumination across various tasks and then employs what it has learned to more
efficiently understand new classes in images.
To implement transfer learning, the last prediction layers of the pre-trained model are
removed and replaced with new prediction layers adjusted to the task. During training, the weights
of the preserved part of the model are kept fixed and are not updated, which allows the pre-trained
model to act as a feature extractor. Thus, training only changes the weights of the last layers,
which learn the task-specific features. Using transfer learning improves performance with
less training time and reduces the need for huge training sets of tens of thousands of images
down to smaller sets with only a few hundred images. Figure 2.11 shows how the transfer learning
technique can be applied to make a neural network pre-trained on the ImageNet dataset, with
over 14 million images, able to detect new objects such as the gripper, fabric, and wrinkles by re-
training the last layers on a small dataset of only about 8000 images.
Figure 2.11 - Transfer learning diagram for visual inspection of the draping process using a pre-trained
model
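A minimal PyTorch sketch of this procedure is given below, assuming an ImageNet-pretrained ResNet-18 as the frozen feature extractor and the four classes used in this work; the backbone choice, learning rate, and omitted training loop are illustrative assumptions rather than the exact configuration used in this thesis.

    import torch
    import torch.nn as nn
    import torchvision

    num_classes = 4                                     # background, gripper, fabric, wrinkle
    model = torchvision.models.resnet18(pretrained=True)

    # Freeze the pre-trained weights so the backbone acts as a fixed feature extractor.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the last prediction layer with one sized for the new task.
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Only the weights of the new layer are updated during training.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)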
2.8 Stereo Vision Image Processing
The increasing demand for machines that build 3D models of their surroundings has driven
the further development of stereo vision systems, which infer depth information from two (or more)
images. Scene understanding includes many different aspects, ranging from object detection and
image classification to pose estimation and depth perception, that lead to obtaining the physical
geometry of the scene, and it is a key capability for many natural and artificial systems. There is
a wide variety of practical applications in this field, such as obstacle avoidance in autonomous
driving vehicles, object detection and manipulation in robotics, navigation and localization of
mobile systems, and surface detection for the landing of drones and UAV7s.
7 Unmanned Aerial Vehicle
A stereo vision system is made up of two cameras placed at a fixed distance relative to
each other. The goal of the system is to perceive the depth of each point in the image and create
a 3D model of the scene by processing the two simultaneously captured images using trigonometry
or other methods such as machine learning. However, this process is subject to many problems, such
as occlusion, photometric distortions and noise, specular surfaces, foreshortening, violations of
the uniqueness constraint, perspective distortions, uniform (ambiguous) regions, repetitive
patterns, transparent objects, and depth discontinuities, which make it difficult to find the
correspondences between the two images.
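As a hedged illustration of the trigonometric principle, the sketch below computes a disparity map for a rectified image pair with OpenCV's semi-global block matching and converts it to depth through Z = f · B / d; the file names, focal length, baseline, and matcher parameters are placeholders, not values from the setup used in this thesis.

    import cv2

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # hypothetical rectified pair
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
    disparity = stereo.compute(left, right).astype("float32") / 16.0   # SGBM outputs fixed-point disparities

    focal_length_px = 700.0    # placeholder focal length in pixels, from calibration
    baseline_m = 0.12          # placeholder distance between the two cameras in meters

    # Triangulation: Z = f * B / d; pixels with non-positive disparity are invalid
    # (occluded or unmatched) and should be masked out in practice.
    depth_m = focal_length_px * baseline_m / (disparity + 1e-6)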
Many methods have been proposed that use only a single vision sensor for this purpose.
However, human eyesight, an excellent example of a stereo vision system, can estimate the size and
depth of an object almost instantly. Approaching that level of performance therefore motivates the
use of stereo, rather than monocular, vision.
Recently, several research works have aimed to improve different aspects of the stereo
vision technique. Osswald et al. developed a spiking neural network architecture that exploits an
event-based representation to address the stereo correspondence problem [56]. A novel benchmark
was presented by Fererra et al. to compare the limitations and performance of monocular and stereo
vision systems for target detection tasks by studying 3D pose estimation of a known target in
intensely rough conditions [57]. Many novel approaches have been used to obtain the depth of a
particular object from a single image [58]–[61]. However, using a stereo vision system for object
detection is usually better than using a single camera, since it is simpler to calibrate and
produces more precise results [62]. Eigen et al. proposed a quick and simple multiscale
convolutional network architecture that performs well on three modalities: surface normals, depth,
and semantic labels [63].
Other studies have focused on developing practical applications of stereo
vision systems. Sangeetha et al. designed and implemented a stereo vision system to handle robotic
arms in space applications [64]. In another work, the size and depth of the objects in the
surrounding environment were used for the localization and navigation of mobile systems [62].
McGuire et al. presented a computationally efficient stereo optical flow algorithm for obstacle
avoidance and velocity estimation on micro aerial vehicles [65]. Balter et al. employed an adaptive
kinematic control method together with a stereo vision system to control a robotic venipuncture
device, demonstrating that such systems can enter the realm of modern medicine as a powerful
technique [66].
2.9 Multi-View Computer Vision
The fusion of multiple views of the same scene is one of the main techniques in the field
of robotics, pose estimation, and 3D reconstruction. However, it is not a common option for
semantic segmentation tasks due to 1) the lack of annotated multi-view datasets, 2) the
difficulty of relating the location of each image to a reference coordinate frame, and 3) the
complexity of the fusion techniques, especially when performing feature-level image fusion.
In such a system, a single moving camera can be used to aggregate several views and
create a semantic reconstruction of the environment, or multiple fixed cameras can be employed
to capture various aspects of the scene simultaneously.
Ma et al. proposed a new deep learning method that performs semantic segmentation
by taking multi-view RGB-D images as input [67]. They show that enforcing multi-view consistency
at training time leads to a great improvement in fusion at test time compared to training the
networks on single views and then fusing the predictions.
In some cases, such as wrinkle detection during the draping process, a single view of the
scene is not enough for segmentation, because the object is constantly moving and the camera
cannot always provide a clear sight of the scene. So, employing a setup of multiple cameras
fixed at various locations in front of the scene, in addition to a multi-view deep learning
algorithm that analyzes the views at the feature level, can be proposed as a promising solution.
2.10 Depth Estimation
The rapid development of computer vision systems, together with the emergence of
machine learning techniques, has been a key factor in improving machine vision capabilities in
recent years. Although visual sensors can provide high-quality images of the scene in
different environments and conditions, obtaining the depth of objects is not as easy as capturing
images and requires extra computational effort. A robust depth estimation technique can be helpful
in various applications such as medical imaging, industrial automation, autonomous driving
vehicles, and 3D reconstruction of urban environments.
Researchers have taken different approaches to tackle this issue. Depth estimation
using a single image is attracting much attention, as it only needs one camera and does not require
synchronizing multiple sensors. However, the required algorithms are more
complicated than those of multi-view systems and are highly dependent on object size, global view,
and environmental conditions. On the other hand, systems composed of multiple cameras are more
robust and accurate because they work with the geometric relations between images. Nonetheless,
they are more expensive, and the camera calibration process is time-consuming, complex, and
error-prone.
From the machine learning point of view, although DCNNs have made impressive progress
in recent years, their need for large annotated training datasets is a barrier to adapting
them to different conditions. That is why researchers often prefer semi-supervised and
unsupervised approaches over basic supervised learning methods. However, these techniques
usually rely on human knowledge and expertise, which means they need to be hand-tuned
for every new situation.
To find the best method for performing depth estimation tasks, one needs to consider different
parameters like:
• Number of available visual sensors
• Range of depth values in the scene
• Desired accuracy and permitted error
• Expected time for performing the task
• The availability of a computational unit for training the neural networks
• The amount of available annotated data for using a supervised learning method
• The presence of experts to tune the network by their knowledge and expertise while
employing semi-supervised or unsupervised learning approaches
• The environmental condition
• The possibility of occlusions and shadows
• The baseline of cameras
The following sections review some of the pivotal studies that address the depth
estimation task using different machine learning techniques and different numbers of
views. After introducing the RGB-D image processing paradigm, the methods are classified into
two main categories: depth estimation using a single image and depth estimation using multiple
images. Each category is then divided into three subcategories based on the learning approach
employed: supervised, semi-supervised, or unsupervised. Finally, works that use multiple cameras
and a supervised learning method are further divided based on the width of their
cameras' baseline (narrow-baseline or wide-baseline).
I. RGB-D Image Processing
One of the fundamental goals in computer vision research is enabling computers to look at
images in a human-like manner. Consequently, understanding the image at the pixel level, known as
semantic image segmentation, has attracted much attention in areas such as object detection [41],
[52] and scene understanding [68], [69]. In recent years, the development of DCNNs [23], [26],
[70], together with the accessibility of large-scale annotated image datasets, has led to a great
enhancement in the performance of semantic segmentation algorithms [50], [71], [72]. RGB semantic
pixel-wise labeling can be considered the starting point of semantic segmentation in recent years
[54], [63], [71]–[76].
Although the emergence of different semantic segmentation algorithms helped computers
perceive the environment in a way similar to humans, some geometric information is missing from
the color channels and can only be extracted from depth information. Meanwhile, the availability
of an additional depth channel, which helps capture the geometric information in the scene,
resulted in increasing interest in semantic segmentation of RGB-D images [77], [78]. As a
straightforward solution, depth data gathered by cheap depth sensors was added to the input color
channels as an extra channel [19], [72]. Considering the merits of employing depth data,
subsequent works focused on the development of networks able to jointly learn from depth and
color information [79], [80]. These networks used depth data to separate objects and scenes while
using the color channels to extract semantic information [81].
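The "straightforward solution" mentioned above amounts to stacking the depth map onto the color channels, as in the illustrative PyTorch sketch below (all tensors are dummy data).

    import torch
    import torch.nn as nn

    rgb = torch.randn(1, 3, 240, 320)       # dummy color image (batch, channels, H, W)
    depth = torch.randn(1, 1, 240, 320)     # dummy depth map aligned with the color image

    rgbd = torch.cat([rgb, depth], dim=1)   # 4-channel RGB-D input

    # The first convolution of the network simply accepts four input channels.
    first_conv = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=3, padding=1)
    features = first_conv(rgbd)
    print(features.shape)                   # torch.Size([1, 64, 240, 320])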
After the introduction of CNNs and fully convolutional networks, they were employed to
extract depth features and thereby strengthen the segmentation of RGB-D images. Couprie et al.
trained a CNN on a mixture of RGB and depth images to synthesize information at distinct
receptive field resolutions [82]. Gupta et al. based their work on the R-CNN network [22] and used
a novel idea to detect objects [19]. They encoded the depth image as three additional channels,
called HHA, which capture each pixel's horizontal disparity, height above ground, and the angle of
the local surface normal. They then obtained a semantic segmentation of the image by training a
classifier on the features extracted by the CNN. Long et al. fused RGB and HHA images and
proposed an FCN with an up-sampling stage that outputs a high-resolution segmentation by
combining low-resolution predictions [72]. Their method boosts the accuracy of semantic
segmentation compared to other methods that directly fuse the segmentation scores. Eigen et al.
used a multi-task network to estimate surface normals, depth, and semantics for RGB-D images
[63]. Seeking an end-to-end solution better than the use of HHA or a direct concatenation of
depth and RGB features, FuseNet was introduced as an encoder-decoder network that fuses
complementary depth information and color cues into a semantic segmentation framework [83].
The encoder simultaneously uses RGB and depth images for feature extraction and then combines the
RGB feature maps and depth features at different levels of fusion (dense and sparse fusion) as
the network goes deeper. Other research groups have also focused on the development of more
complex CNN architectures for single-image semantic segmentation [84], [85].
Although CNNs are known for their good performance on single-image segmentation,
applying them to 3D reconstruction tasks using multi-view images is uncommon. Riegler et al.
presented a novel 3D CNN representation, called OctNet, for deep learning with high-resolution
inputs that can be used in various tasks such as pose estimation, object categorization, and
semantic segmentation on voxels [86]. McCormac et al. combined CNNs with a SLAM8 system, called
ElasticFusion, which produces a useful semantic 3D map by fusing the semantic predictions made by
the CNN across multiple views into a single map [87]. He et al. proposed a superpixel-based
multi-view CNN that employs information obtained from complementary views of the same scene for
single-image segmentation [88]. Complementing these monocular CNN methods, Ma et al. trained a
network with multi-view consistency and fused the outputs from different viewpoints [67].
II. Depth Estimation Using A Single Image
Recently, many robust algorithms have been developed for finding the depth of points in
an image using stereo vision. However, estimating the depth of an object using only a single image
remains an open problem due to the weak performance of the algorithms developed so far.
Monocular depth estimation techniques rely on cues such as object sizes, global views, line angles,
and environmental conditions, which are not easy to obtain. Moreover, each set of such cues
corresponds to hundreds of possible world scenes that are hard to choose between.
8 Simultaneous Localization and Mapping
On the other hand, monocular vision setups are relatively cheaper and faster. They also do
not depend on the synchronization of cameras, which is one of the major sources of error in
multi-view setups.
i. Supervised Learning
The development of depth datasets such as KITTI, RGB-D Scenes, SUN 3D, and NYU Depth has
facilitated the use of supervised learning techniques for depth estimation tasks. Most
of the available RGB-D datasets are composed of images and depth information for objects
typically found in houses, offices, streets, and similar environments. These datasets are useful
for most depth estimation tasks. However, for more specific cases, researchers must either create
a new annotated dataset, which requires human effort, or take semi-supervised or unsupervised
approaches, which have more technical complexities.
Eigen et al. presented a supervised monocular depth estimation method that uses two neural
networks working in succession [89]. The first network predicts the depth at a coarse scale using
the original input. The second network then takes both the original image and the output of the
coarse network and refines the prediction within local regions. In contrast to stereo matching,
local views are not enough for detecting dominant features in monocular techniques, so a global
understanding of the scene is necessary to exploit cues such as object locations and vanishing
points. Finding this global understanding is the main goal of the coarse-scale network, while the
fine-scale network works locally.
ii. Semi-supervised Learning
Although supervised deep learning algorithms have made huge progress in depth
estimation tasks, they are highly dependent on a large amount of training data to perform
well. Creating the required data requires depth sensors such as RGB-D cameras for
indoor environments and 3D laser scanners for outdoor settings. When using these extra sensors,
it is undeniable that their error and noise are added to the system. Correlating
the data gathered by lasers with images is another problem, as laser output is inherently sparser
than raw images. Finally, calibrating the cameras and finding the exact values of the intrinsic and
extrinsic parameters at each moment is another challenge in this process. To overcome these
problems, semi-supervised learning approaches employ a small amount of labeled data together with
a large set of unlabeled data to train the neural network efficiently.
Kuznietsov et al. proposed a semi-supervised deep learning method for monocular depth
perception based on a deep residual network with an encoder-decoder structure [90]. For the
unsupervised part of the training, it used stereo vision geometry principles to learn depth
estimation directly from binocular image pairs.
iii. Unsupervised Learning
The vast demand for hand-labeled data to train neural networks is the most challenging
issue that depth estimation algorithms struggle with. Unsupervised learning is a machine
learning approach used to make predictions on datasets without annotated images. This technique
allows the network to learn depth directly and establishes an end-to-end solution for such
tasks.
Garg et al. presented an unsupervised method that works without pre-training
or annotated ground truth [91]. The main idea is to take the structure
of an auto-encoder neural network and train it on a set of image pairs. The only difference
between the two images in each pair is that the second one is taken after a small, known camera
motion, so each pair can be treated as a stereo pair captured at the same time by two closely
spaced visual sensors.
III. Depth Estimation Using Multiple Images
Depth estimation from multiple images is one of the most challenging tasks in the field of
computer vision. A variety of applications, such as object grasping with robotic arms, distance
estimation for autonomous driving, 3D reconstruction of urban areas and buildings, product
handling in industrial environments, medical imaging, and the movie and game industries,
require robust and accurate depth estimation of the scene as an essential component. The
performance of available solutions largely depends on parameters such as the lighting conditions,
the type and arrangement of the cameras, the rate of change between sequential images, and the
depth range. Thus, finding a robust and comprehensive solution to this problem has been of much
interest in recent years.
Single-image reconstruction methods suffer from restrictions related to viewing
conditions, reflectance, and the symmetry of the images. Using multiple cameras and finding the
relations between images by calculating disparities therefore becomes an option. The advantage of
these methods is that they are quicker and do not require much computational effort, and they
usually provide better depth accuracy than single cameras.
However, multi-view learning algorithms need at least several tens to hundreds of properly
captured images. Furthermore, they have difficulties in cases with objects occluded in several
views and with similar textures and color variations.
i. Supervised Learning (Narrow-baseline)
Although multi-view setups benefit from looking at the scene from different viewpoints
and provide more detail, the limited space available to mount cameras in some object detection
applications is an issue. Consequently, developing algorithms that perceive depth using a narrow-
baseline camera setup has become of interest.
Wang et al. developed a deep neural network to achieve a real-time multi-view depth
estimation system that can work with two or more images captured by diverse types of cameras
[92]. Their model offers a new solution inspired by classic multi-view systems with
improved features. It uses a combination of encoder-decoder networks in a manner similar to auto-
encoders, although the two kinds of systems differ in some technical details.
ii. Supervised Learning (Wide-baseline)
In some applications, such as industrial manufacturing processes and autonomous driving
vehicles, cameras can be located relatively far from each other, resulting in a wide-baseline
setup. This has been one of the trending techniques in recent years, as it provides a larger Field
of View (FOV) and allows coverage of a broad area rather than a limited region.
Jorissen et al. presented a novel solution that uses the Image-Based Rendering (IBR)
technique to create enough images for stereo matching [93]. The introduced algorithm extracts
structures from images captured by a sparse linear camera setup. Starting the extraction process
from the central view and then rendering the depth maps for the adjacent views resolves partial
occlusions. Stereo matching algorithms usually work with matching windows and try to match
pixels by finding correspondences between images after looping over predefined disparities. To
deal with occlusions in multi-camera setups, a confidence value is assigned to each pixel, and
pixels with values lower than a threshold are recalculated using data from neighboring pixels.
iii. Unsupervised Learning
Creating annotated datasets from images captured by multiple cameras is even harder,
as the cameras must be perfectly synchronized throughout the whole data creation process. Some
studies use additional depth sensors to improve the accuracy of the depth information. However, a
well-established unsupervised learning algorithm can remove all these barriers and provide an
end-to-end solution that requires no human effort.
Li et al. proposed a new approach that uses coarse point-cloud samples to create a dense
depth map of the image [94]. The methodology requires an image as well as sampled depth
points, which are generated by taking only a portion of the overall depth points from the ground
truth provided by the dataset.
Wang et al. developed a method using unsupervised learning and a perceptual loss to
estimate depth with a two-part setup [95]. In this setup, the images are first passed through the
depth and pose estimation networks, which then pass the data to the perception network. The depth
network takes the raw input images, and Depth-net determines the depth map of each image. Each
image and the next one are then sent to Pose-net, which helps recreate a pose estimate for the
first image. Finally, the two networks are combined to create a reconstructed view of pose and
depth.
IV. Depth Estimation Challenges
The main challenges for research in this field come from the lack of annotated
datasets, errors in the calibration and synchronization of cameras, the lack of expert knowledge
in some specific applications, and the complexity of the neural network architectures.
Supervised learning is usually preferred for depth estimation, as it is more robust to noise and
unwanted changes than unsupervised learning methods. However, in more specific applications that
have not been explored much yet, it is easier to focus on unsupervised approaches than to create
new datasets.
The number of employed cameras depends on the application and varies according to the
available space for mounting the cameras, available visual sensors, size of the region of interest,
and required estimation accuracy.
To improve existing solutions, one can focus on enhancing the architecture of the neural
networks. Recent research shows that using an appropriate loss function can lead to notable
improvements in network performance. Furthermore, the data preprocessing stage has a great
impact on the estimation results. Most future work in this field is expected to focus on
adjusting CNN architectures and on finding depth cues in intermediate layers to employ for the
final estimation. Also, the expansion of annotated datasets in the future can help enhance the
performance of currently developed solutions.
Chapter 3 : Infrastructure for Dataset Development
3.1 Hardware Setup
3.1.1 Modular Gripper
To drape the cut-pieces, an end-effector system called the "Modular Gripper" was
developed at DLR, which can deform its gripper surface to a double curved geometry with the aid
of a rib-spine design. The process can be broken down into the following steps. First, a textile cut-
piece is placed on a table. Second, the gripper picks up the blank by sucking the fabric against its
surface using an array of suction units and preforms the blank to the double-curved target geometry
by manipulating the suction units. Lastly, the cut-piece is precisely placed into the mold [1].
The gripping system is based on two main components (Figure 3.1). On one side, 127
suction units form a suction surface that grasps the textiles. For this purpose, each suction unit
has a 100 mm × 100 mm effective area, and its suction intensity can be individually varied in eight
stages. The second component, a rib-spine system, ensures the deformation of the
suction surface from a flat state to a double-curved geometry. The spine consists of two flexible
glass fiber rods and three linear actuators. The linear actuators pull on the rods and thus create a
curvature that adjusts the connected suction units. The second curvature is
achieved with 15 ribs. The suction units of one rib are mechanically connected by a shaft. If this
shaft is rotated by an actuator, the distance between two lever arms is shortened, resulting in a
homogeneous change in the angle of attack of all connected active surfaces. This mechanism
enables the rib-spine system to create various double-curved geometries.
Figure 3.1 - Main components of the modular gripper
3.1.2 IDS XS Camera
The IDS XS camera with autofocus is an industrial camera that can be easily used for a
variety of applications. The XS comes with USB 2.0 and Mini-B USB 2.0 connectors, which facilitate
its integration. The camera is equipped with OmniVision's 5 Megapixel CMOS sensor with a 1.4 µm
pixel size, which provides high-quality images with accurate color reproduction in both normal and
harsh lighting conditions. The small size of the XS, its light weight, compact design, integrated
power supply, and its ability to capture video at 15 fps (2592 × 1944 pixels) make it a good choice
for embedded systems and industrial applications.
3.2 Software Setup
3.2.1 Data Generation for Training Image Classification Models
To train the deep neural networks, a dataset of approximately 16,000 images was created.
A stereo imaging module consisting of two IDS XS cameras was used to capture the movements
of the robot holding the gripper with fabrics of different sizes and shapes. The image pair (left
and right) from the imaging module was recorded in 14 different scenarios, each at 14 fps, with
each video lasting around 40 seconds. Each frame of the image pair was then divided
into 128 (64 left and 64 right) smaller images to emphasize locally relevant features. Each
acquired frame from the stereo imaging module was 1280 × 720 pixels in size and was subjected to
an 8 × 8 grid to facilitate hand-labeling. This process yielded 64 smaller images
(160 × 90 pixels in size) per frame.
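A minimal sketch of this tiling step is shown below; it assumes NumPy and a frame already loaded as an array, and simply cuts a 1280 × 720 frame into the 8 × 8 grid of 160 × 90 sub-images described above.

    import numpy as np

    frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # stand-in for one captured frame
    tile_h, tile_w = 90, 160

    tiles = []
    for row in range(8):
        for col in range(8):
            tile = frame[row * tile_h:(row + 1) * tile_h,
                         col * tile_w:(col + 1) * tile_w]
            tiles.append(tile)              # indexed by position in the 8 x 8 grid

    print(len(tiles), tiles[0].shape)       # 64 tiles, each 90 x 160 x 3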
Hand-labeling 16,000 images is time-consuming and challenging. To accelerate this process,
the slow movement of the experimental setup was exploited; images were batched into groups
where little movement had occurred between frames and the labels were therefore expected to be
similar. The first image in the batch was labeled manually (Figure 3.2), and the label was copied
to every other image in the batch. The batch was then visually inspected for errors. In Figure 3.2,
an image frame is shown before and after the manual labeling process. The image is divided into 64
units, each indexed based on its position in the image matrix. The indexed images are labeled as
"Background", "Gripper", "Fabric", or "Wrinkle", with each label shown in a different color for
visual inspection. The annotation process was done using MATLAB R2018a software.
Figure 3.2 - Frames before and after the hand-labeling process, with the colored grids in the second image
coded to represent the different classification categories: wrinkle (cyan), gripper (green), fabric (yellow) and
background (black)
3.2.2 Data Generation for Training Image Segmentation Models
To create a multi-view dataset for use in the next steps, an imaging module consisting of 3
IDS XS cameras was placed in front of the gripper setup as shown in Figure 3.3. Then 206 images
were taken while the gripper was moving, to capture different poses of the setup. This stage had to
be quick enough not to interfere with the company's manufacturing timeline. The whole process of
installing the cameras and capturing the images took less than two hours to complete, which is a
reasonable and feasible downtime for many composite manufacturing processes.
Annotating the images was done using the Amazon SageMaker Ground Truth labeling
tool.
Figure 3.3 - Arrangement of cameras in front of the gripper setup
At this step, labelers were asked to mark the exact boundaries of the gripper, fabric, and
wrinkle classes in each image. Amazon's labeling tool facilitated this step by providing both
polygon and brush tools for annotating images at the pixel level. SageMaker also offers the option
to create a labeling task and ask a group of users to participate in the task or supervise the
annotation process. The annotation was done by 5 of our colleagues in 2 hours, which is roughly
equal to a single person performing the task in 10 hours. An example of the generated mask images
is shown in Figure 3.4.
The generated RGB masks were then converted to gray-scale masks to facilitate the later
conversion of the annotations to the formats required for training various neural networks
(Figure 3.5). A gray-scale image only has shades of gray, and each pixel value is represented by a
number between 0 (black) and 255 (white). Gray-scale images store all the visual data in one
channel instead of three, which makes them easier to process than RGB images.
Figure 3.4 - A sample image of the dataset (left) and its annotation mask created in Amazon SageMaker
(right)
Figure 3.5 - Conversion of an RGB mask (left) to its gray-scale equivalent (right)
The dataset was then divided into a test set of 31 images (15% of the entire dataset)
and a train/validation set of 175 images. Data augmentation was used to increase the
size of the training set. First, 5% of all images in the train/validation set were randomly put
into the validation set and 95% into the training set. The training set of 166 images was then
augmented to 7442 images through vertical flips, horizontal flips, and random cropping. The data
augmentation flow is shown in Figure 3.6.
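The sketch below illustrates the listed augmentation operations with torchvision transforms; the crop size and flip probabilities are illustrative assumptions, the actual augmentation was performed in Supervisely, and for segmentation data the same geometric transform must also be applied to the mask.

    from PIL import Image
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomVerticalFlip(p=0.5),
        transforms.RandomCrop((648, 1152)),    # hypothetical crop size, roughly 90% of a 720 x 1280 frame
    ])

    image = Image.open("train_image.png")      # hypothetical training image
    variants = [augment(image) for _ in range(4)]   # several augmented variants per source image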
Figure 3.6 - Data augmentation workflow chart
3.2.3 Synthetic Dataset Generation
Building large-scale labeled datasets is a challenging task in both the collecting and
annotating stages. It is hard to guarantee that the gathered images provide a large enough
diversity of scenes and conditions. Annotating the images at the pixel level is even more costly
because each worker has to spend a lot of time adjusting the created label for every single image
[96]. Also, sometimes only experts in the field are able to find and annotate the target objects.
Tumor detection in medical CT scans and anomaly detection in manufacturing processes are two
examples of such tasks.
Synthetic data generation is an appealing alternative to manual pixel-level annotation.
Recent developments in computer graphics have made it possible to automatically create such
synthetic images and semantic per-pixel labels using virtual 3D environments. Synthetic data
generation for training DCNNs is thus a tempting technique to avoid manual annotation costs.
However, the domain distribution shift and mismatch in appearance usually result in a considerable
performance drop when a model trained on synthetic data is tested on real data. Recently,
several domain adaptation techniques have been developed to address this issue, but there is still
a lot of work to be done in this field.
To measure the performance drop when using synthetic data instead of real data, the
draping process was simulated in Blender 2.8. Blender is an open-source 3D creation suite
supporting the full 3D pipeline: modeling and simulation, rigging, rendering, motion tracking and
compositing, 2D animation, and video editing.
To simulate the draping process, the gripper, fabric, and wrinkles were modeled as solid
objects. Then the images taken from the real setup were used to texturize the model. Four virtual
cameras were then placed in front of the model to capture images from different angles. Also, three
spotlights were added to the scene based on the standard three-point lighting method. This
technique uses three lights, a key light, a fill light, and a backlight (also called a rim light),
which together create a well-illuminated image with controlled shading and shadows. An image taken
from the actual DLR facility environment was also used as the simulation background. Blender
allows HDR9 maps to be used as background images; however, capturing an HDR image of the actual
setup requires a professional camera and photographer, which were not available at the time of
this research. Figure 3.7 shows a snapshot of the simulation process in Blender.
To increase data diversity, different fabric textures, wrinkle shapes, and lighting conditions
were used in 10 simulation runs. In each run, the gripper setup conducted the same movement and
four cameras captured the images from different angles. In total, 1384 images were generated.
Each run took about 10 minutes to complete. Some of these generated images are shown in Figure
3.8.
9 High Dynamic Range
Figure 3.7 - Simulation of the draping process in Blender
Figure 3.8 - Samples of generated images in Blender
To annotate the images, it was only necessary to mark the gripper, fabric, and wrinkles with
distinct colors in the 3D model and then run the simulation again. So, instead of hand-labeling
1384 images, only 10 cases had to be labeled manually. The generated masks were finally
converted to gray-scale images, in the same way as the real dataset, to simplify data preprocessing.
For the conversion to gray-scale format, pixels were classified as gripper, fabric, or wrinkle if
the highest of their RGB intensities was blue, green, or red, respectively. Figure 3.9 shows
a sample synthetic image with its RGB and gray-scale labels. Finally, the generated dataset
was split into training and test sets of 1315 and 69 images, respectively.
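The channel-argmax rule described above can be written as the short NumPy sketch below; the gray-scale class index values are illustrative, and a separate background check (for example, pixels with all channels near zero) would be needed in practice but is omitted here.

    import numpy as np

    def rgb_mask_to_grayscale(rgb_mask):
        # rgb_mask: H x W x 3 array in R, G, B channel order
        dominant = np.argmax(rgb_mask, axis=2)    # 0 = red, 1 = green, 2 = blue
        gray = np.zeros(rgb_mask.shape[:2], dtype=np.uint8)
        gray[dominant == 2] = 1                   # blue-dominant pixel  -> gripper
        gray[dominant == 1] = 2                   # green-dominant pixel -> fabric
        gray[dominant == 0] = 3                   # red-dominant pixel   -> wrinkle
        return gray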
Figure 3.9 - Synthetic data annotation process: main image (top left); image with marked wrinkles (top
right); RGB mask (bottom left); gray-scale mask (bottom right)
3.3 Computational Resources
For efficient training of deep convolutional neural networks, powerful GPUs are needed.
All model training in this project was done using two computers, whose technical
specifications are described in Table 3.1.
Table 3.1 - Technical specifications of computational resources
Computer ID | CPU | GPU | RAM | Hard disk
EME 2211 | Intel Xeon® CPU E5520 @ 2.27 GHz * 16 | GeForce GTX 1070/PCIe/SSE2 | 24 GiB | 2 TB
ASC 310 | Intel Xeon® W-2133 CPU @ 3.60 GHz * 12 | TITAN Xp/PCIe/SSE2 | 64 GiB | 2 TB
For managing the training steps, Supervisely, a free and open-source platform focused on
computer vision tasks, was used. Supervisely provides an end-to-end solution from
image annotation to the deployment of neural network models by taking advantage of parallel
computing techniques. After adding the two computers as Supervisely agents, it was
possible to run data preprocessing and model training tasks in parallel. Also, many
state-of-the-art DCNN models for various image recognition tasks are already
implemented and can be freely used on Supervisely. Notably, finding a proper
implementation of these DCNN architectures and adapting them to different cases is a time-
consuming stage that can be eliminated by using Supervisely's pre-implemented models.
Chapter 4 : Deep Learning for Wrinkle and Fabric Boundary Detection
This chapter describes two different approaches taken in this thesis to develop a deep
learning-based vision system for fabric boundary and wrinkle detection during the draping process
of fiber-reinforced materials with an industrial robot.
This thesis is mainly focused on methods that use convolutional neural networks to avoid
dealing with the limitations of more traditional methods such as HOG and SIFT. The limitations
and complexities of using traditional image processing techniques are assessed in section 2.4
which emphasizes the importance of employing novel methods to address these issues.
Due to the lack of RGB-D and 3D datasets, the difficulty of RGB-D and 3D data generation,
and the high cost of accurate depth sensors, the developed solutions do not depend on depth
features for detection and only need 2D RGB images. Furthermore, the currently available CNNs that
require depth data or 3D models as input are not yet robust enough and cannot be easily adapted
to perform different tasks such as wrinkle detection. Considering that the goal of this thesis is
to develop a reliable, feasible, and practical solution, it was decided to focus on perceiving the
3D environment using only 2D images to arrive at a more accurate and robust solution for visual
inspection of the draping process in composite manufacturing. It is noteworthy that Gupta et al.
have already developed a method in [1], [11] that uses RGB-D features to find the location of
fabric on a yoga ball and then identifies the boundary of wrinkles in a local reference frame.
However, adapting this method to the varying geometry of the modular gripper has its own
challenges and cannot be easily achieved.
Finally, it was decided to only focus on supervised learning methods in this thesis as
currently developed unsupervised/semi-supervised image segmentation methods cannot be easily
applied to new tasks and developing a totally new algorithm is challenging.
4.1 Stage I – Wrinkle Detection Using an Image Classification Model
In stage I, after developing a preliminary, hand-labeled dataset (described in section 3.2.1)
captured on a functioning robotic system at the DLR composite manufacturing facility, a well-
performing DCNN was designed and implemented from scratch to perform image classification.
Also, the idea of combining images from multiple cameras for generalization of the designed
model to different wrinkle properties and environments was evaluated. The proposed method
employs computer vision techniques and belief functions to enhance accuracy without the need
for any additional hand-labeling or re-training of the model. Co-temporal views of the same fabric
are extracted, and individual detection results obtained from the DCNN are fused using the
Dempster-Shafer theory. By the application of the DST rule of combination, the overall wrinkle
detection accuracy was greatly improved in this composite manufacturing facility.
The primary contribution of this stage is a method for combining multiple, co-temporal
views of fabric with defects into a single representation. This combined view can then be mapped
back to each of the original views, providing increased defect detection accuracy. To facilitate this
multi-view inferencing, a method of extracting components and detecting defects using traditional
computer vision techniques is presented. A technique for combining these views is also developed
and described.
The developed method currently requires minimal interaction from an operator, who selects
the key points used in perspective correction. Future work will remove this need and produce a
fully automated solution. For practical applicability, automated detection is expected to be
independent of the fabric's shape, orientation, size, and even visually dominant characteristics
such as color. Furthermore, detection performance should be robust to changes in experimental
conditions such as lighting and the location of the fabric with respect to the gripper. To
accommodate such design requirements, the proposed solution enhances a data-driven, automated
wrinkle detection module developed independently of hand-crafted features [11].
The work in this stage is executed in three phases:
1. Model development, training, and evaluation (Section 4.1.1)
2. Evaluation of the algorithm's ability to generalize (Section 4.1.2)
3. Development and evaluation of an approach for combining co-temporal views to
improve the algorithm's accuracy in various conditions (Section 4.1.3)
4.1.1 Phase 1 - Model Development
I. DCNN Architecture
The network developed was motivated by the current dominance of CNN image classifiers
and modeled on these recent advances.
The network used for training can be broadly classified into two major parts. The first part
consists of convolutional neural network layers along with batch normalization and max-pooling
which together act as a feature extractor. This part of the network extracts meaningful information
and presents it in a feature latent space to the rest of the network to map the extracted features to
a particular label. The second part of the network is made of fully connected layers that map the extracted latent-space features to a label.
The input mechanism and the dataset itself consist of highly sequential information from
the camera. As such, the network tends to learn only the local features of each segment while
driving itself to local minima. To break the self-correlation in the input stream and the dataset, the
training is executed on a mini dataset, consisting of randomly sampled input-output pairs from the
original dataset. The mini dataset is created in batches and sent to the network until the learning
gradient starts to disappear.
The training network employs a data loading mechanism that inputs data from the mini
dataset and performs a random transformation before feeding it to the network. The random
transformation consists of random resize, flip and crop with a mean and standard normalization.
These transformations help make the features learned by the model independent of the object's position, size, and orientation. The data loader then converts the input to a tensor for the network.
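A minimal sketch of such an augmentation pipeline, assuming torchvision-style transforms and illustrative crop size and normalization statistics (the exact values used in this work are not restated here), could look like the following:

```python
from torchvision import transforms

# Illustrative augmentation pipeline: random resize/crop and flip followed by
# normalization, so the learned features become less sensitive to the object's
# position, size, and orientation. The crop size and per-channel statistics
# below are assumed values, not the thesis settings.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32),        # random resize and crop to a 32x32 patch
    transforms.RandomHorizontalFlip(p=0.5),  # random flip
    transforms.ToTensor(),                   # convert the PIL image to a tensor for the network
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```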
The input is fed through 14 units of the neural network, with each unit consisting of a convolution layer, batch normalization, and a ReLU (Rectified Linear Unit) activation. Max pooling is executed every 3 to 4 units, and an average pooling is applied at the end of the 14th unit. The output of the average pooling is fed through a fully connected layer that outputs the classification probability of each class. The Adam optimizer is used for training, with the learning rate decaying from 10⁻⁴ to 10⁻⁹ with respect to the number of epochs. Each input image is fed through a series of convolutional layers with kernel size = 3, stride = 1, and padding = 1. Detection was conducted with a batch size of 32 and 4 CPU workers for data loading. The units (each consisting of a convolution layer, a normalizer, and an activation layer) increment from 32 to 128 output features before forming the input to the fully connected layer. To practically implement the developed architecture and use it for training, the PyTorch machine learning library for Python was used.
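As a rough illustration of this architecture in PyTorch, the following sketch stacks 14 convolution-normalization-activation units with periodic max pooling and a final fully connected layer. The exact channel schedule and pooling positions are assumptions, since only the overall structure (kernel size 3, stride 1, padding 1, features growing from 32 to 128, four output classes) is described above:

```python
import torch
import torch.nn as nn

class WrinkleClassifier(nn.Module):
    """Sketch of the Stage I classifier: 14 conv-BN-ReLU units with periodic
    max pooling, global average pooling, and a fully connected output layer.
    Channel counts per unit and pooling positions are illustrative assumptions."""

    def __init__(self, num_classes=4):
        super().__init__()
        layers = []
        in_ch = 3
        channels = [32] * 5 + [64] * 5 + [128] * 4   # 14 units, growing from 32 to 128 features (assumed split)
        pool_after = {3, 7, 11}                       # max pooling every 3-4 units (assumed positions)
        for i, out_ch in enumerate(channels):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            if i in pool_after:
                layers.append(nn.MaxPool2d(2))
            in_ch = out_ch
        layers.append(nn.AdaptiveAvgPool2d(1))        # average pooling at the end of the 14th unit
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(128, num_classes)  # wrinkle / fabric / gripper / background

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = WrinkleClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial rate only; decayed toward 1e-9 over epochs
```

In the thesis, training used a batch size of 32 and a learning rate decayed from 10⁻⁴ toward 10⁻⁹ over the epochs; the optimizer call above only sets the initial rate.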
II. Training
The network was trained on three datasets of different sizes to evaluate the effect of training-set size on detection accuracy. The small, medium, and large training sets contained
750, 1500 and 3000 images, respectively, all sampled from the custom-created dataset which was
introduced in section 3.2.1. Then each dataset was split into training and testing sets, including
80% and 20% of the entire dataset, respectively.
4.1.2 Phase 2 - Model Generalization
To evaluate the generalization ability of the developed solution, the DCNN was applied to
a new fabric shape with different wrinkle properties. The same gripper was used to grasp the fabric,
and all labels exhibited properties similar to the images in the first dataset. The new images were
hand-labeled as explained in Section 3.2.1. The new dataset was used strictly for evaluation
purposes and the neural network was not re-trained with these new labels. Figure 4.1 shows
samples from old and new datasets, indicating differences in wrinkle shape and lighting conditions
between two datasets.
Figure 4.1 - Examples of wrinkles in the images used for training (left); wrinkles in the new dataset (right)
4.1.3 Phase 3 - Multi-view Inferencing
The observed loss of detection accuracy while testing in new situations motivated Phase 3,
wherein multiple views of the scene are combined to create a robust algorithm. The dataset created
in phase 2, consisting of co-temporal views of the same fabric, was re-used. The pre-trained neural
network model from Phase 1 was used to label each view of the scene separately. To combine
multiple views, an algorithm using traditional computer vision techniques was developed to extract
and overlap fabrics from multiple images corresponding to different viewpoints on a stationary
scene. The Dempster-Shafer combination rule ([97], [98]) was then applied to find the most
probable label of each pixel in the anchor image based on the individual views' probabilities. Finally,
the combined labels were mapped back to each original view.
I. Fabric Detection
Fabric detection is facilitated by the observation that the gripper forms a nice frame,
reliably separating the fabric from the background. Fabric detection, therefore, begins with
locating and morphologically closing the gripper. The largest hole in the resulting gripper mask is then taken to be the fabric region.
Values such as intensity threshold and morphological structuring elements were selected manually
through trial-and-error. Such a method presents problems in finding values that generalize to
various images. The values which produced good results for the largest number of images in the
sample set are presented here. The same values are used for all images and all views.
The contour of the gripper was found by combining three methods:
1. Identification of white gripper squares
2. Identification of green gripper squares
3. Edge detection to supplement boundary detection
Item (1) is accomplished by thresholding in the CIE L × a × b color space. By accepting low
color values, only white/gray pixels are retained. High values of lightness are retained to separate
white pixels from gray. Good results were achieved by rejecting all pixels where either color
channel is greater than 140 and rejecting lightness values below 180.
Item (2) is accomplished by thresholding in the HSV color space. Since the goal is to retain
only green colors, hue values outside of the range 60 - 100 are rejected, and very low saturation
(less than 80) or very high saturation (greater than 200) are also rejected.
Item (3) is accomplished by performing denoising using the non-local means method [99] with
an aggressive h value of 10. Canny edge detection [100] is then employed to identify significant
edges in the image.
These three masks are combined, where any pixel which is retained in any mask is overall
retained. The gripper squares are joined through an alternating series of morphological closing
with a disk structuring element of radius 3, and morphological dilation with a disk of radius 1. A
general solution for the number of iterations is difficult. For the sample dataset, 35 iterations were
performed on all images.
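A condensed sketch of these three masks and their combination, assuming OpenCV with its 0-255 Lab encoding and 0-179 hue scale, is given below; the helper name, the Canny thresholds, and the exact structuring-element approximations are assumptions, while the color thresholds and the 35 iterations follow the text:

```python
import cv2
import numpy as np

def gripper_contour_mask(bgr):
    """Sketch of the three-mask gripper localization described above."""
    # (1) White gripper squares: thresholding in the CIE Lab color space.
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    white = cv2.inRange(lab, (180, 0, 0), (255, 140, 140))   # keep L >= 180 and a, b <= 140

    # (2) Green gripper squares: thresholding in the HSV color space.
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, (60, 80, 0), (100, 200, 255))   # hue 60-100, saturation 80-200

    # (3) Significant edges: non-local means denoising (h = 10) followed by Canny.
    denoised = cv2.fastNlMeansDenoisingColored(bgr, None, 10, 10)
    gray = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                          # assumed Canny thresholds

    # Any pixel retained in any of the three masks is retained overall.
    combined = cv2.bitwise_or(cv2.bitwise_or(white, green), edges)

    # Join the gripper squares: alternating closing (disk radius 3) and dilation (disk radius 1).
    close_k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    dilate_k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    for _ in range(35):                                       # 35 iterations used for the sample dataset
        combined = cv2.morphologyEx(combined, cv2.MORPH_CLOSE, close_k)
        combined = cv2.dilate(combined, dilate_k)
    return combined
```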
The gripper mask is inverted such that the fabric is kept, and the gripper rejected. The main
component is then separated from the other components and retained. All connected components
are identified, and components not meeting specific requirements are discarded. These
requirements are:
• An area of at least 150,000 pixels
• A maximum distance between the image center and the nearest non-zero pixel in the
component
• A density less than 0.4 (number of white pixels in the bounding box, normalized by
bounding box area)
• A height span or width span greater than 0.7, where the span is the dimension of the
bounding box divided by the dimension of the image.
This filtration removes excessively large or small components. The components are then
ranked by score, and the highest-scoring component is deemed to be the fabric. The score combines
closeness to the center of the image and area. If any component overlaps the center of the image,
it is given maximum score so that it is always retained.
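A sketch of this component filtering and scoring step, using OpenCV connected-component statistics, is shown below. The maximum allowed distance to the image center and the exact score weighting are assumptions, while the area, density, and span thresholds follow the text:

```python
import cv2
import numpy as np

def select_fabric_component(mask):
    """Sketch of the connected-component filtering and scoring step."""
    h, w = mask.shape
    cy, cx = h / 2.0, w / 2.0
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)

    best_label, best_score = None, -1.0
    for i in range(1, n):                                    # label 0 is the background
        x, y, bw, bh, area = stats[i]
        ys, xs = np.nonzero(labels == i)
        center_dist = np.min(np.hypot(ys - cy, xs - cx))     # image center to nearest component pixel
        density = area / float(bw * bh)                      # pixels in bounding box / bounding-box area
        span = max(bh / float(h), bw / float(w))             # bounding-box dimension / image dimension

        # Thresholds 150,000 px, 0.4, and 0.7 follow the text; the center-distance
        # limit of 0.25*min(h, w) is an assumed value.
        if area < 150_000 or density >= 0.4 or span <= 0.7 or center_dist > 0.25 * min(h, w):
            continue

        overlaps_center = labels[int(cy), int(cx)] == i
        # Score favors large components close to the image center (assumed weighting);
        # a component overlapping the center always wins.
        score = float("inf") if overlaps_center else area / (1.0 + center_dist)
        if score > best_score:
            best_label, best_score = i, score

    if best_label is None:
        return np.zeros_like(mask)
    return (labels == best_label).astype(np.uint8) * 255
```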
The morphological closing to connect the gripper squares is then reversed: multiple
iterations of morphological opening with a disk of radius 3 and erosion with a disk of radius 1 are
performed. One fewer iteration than the number of closings is performed, which gives a nice edge
to the fabric. Finally, any holes in the mask are filled by tracing the contour and then flood filling
at the centroid.
II. Centering and Rotating the Image
The orientation and position of the gripper in the image relative to the camera changes
across images, so the fabric mask is of non-uniform position and non-uniform rotation in the final
images. To normalize these differences, the centroid of the mask is found and translated to the
center of the image. The image is then rotated in the range -90 to 90 degrees, and the best
orientation is used. An orientation is preferred if it maximizes the non-zero pixels in the center
column of the image, plus the maximum number of non-zero pixels in any row of the image. The
image is cropped to the rectangle with minimum bounds which contains non-zero pixels.
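A sketch of this normalization step is given below, assuming OpenCV for the centroid shift and rotation; the 1-degree search step is an assumption, since the step size is not specified:

```python
import cv2
import numpy as np

def normalize_orientation(mask, step=1):
    """Sketch of the centering, rotation-search, and cropping step."""
    h, w = mask.shape
    # Translate the mask centroid to the image center.
    m = cv2.moments(mask, binaryImage=True)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    shift = np.float32([[1, 0, w / 2 - cx], [0, 1, h / 2 - cy]])
    centered = cv2.warpAffine(mask, shift, (w, h))

    # Search rotations in [-90, 90] degrees; prefer the orientation maximizing the
    # non-zero pixels in the center column plus the maximum non-zero count in any row.
    best_angle, best_value = 0, -1
    for angle in range(-90, 91, step):
        rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(centered, rot, (w, h))
        value = np.count_nonzero(rotated[:, w // 2]) + int((rotated > 0).sum(axis=1).max())
        if value > best_value:
            best_angle, best_value = angle, value

    rot = cv2.getRotationMatrix2D((w / 2, h / 2), best_angle, 1.0)
    rotated = cv2.warpAffine(centered, rot, (w, h))

    # Crop to the minimal bounding rectangle containing non-zero pixels.
    ys, xs = np.nonzero(rotated)
    return rotated[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```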
III. Correlating Multiple Views
Perspective correction is performed to normalize, as much as possible, non-uniform fabric
orientation between images. Since the end-result is highly susceptible to small changes in
perspective correction key points, human intervention is required to make identification as accurate
as possible. The user is first presented with all camera views available for a period (shown in
Figure 4.2) and must select the one with the least skewing and curvature. Control points are
collected by having the user select four points on each image (top left of the fabric, top right of the
fabric, bottom left of the fabric, and bottom right of the fabric). These points are then used to create
a perspective warp transform for each image.
Figure 4.2 - Five co-temporal masks of the same fabric
Figure 4.3 - Five overlaid, co-temporal fabric masks before correction (left) and after correction (right)
Some views are of insufficient quality to be included in processing. That is, sometimes the fabric is
too small, too skewed, or has some other unacceptable features which prevent inclusion. In such a
case, the image is blacklisted and not included in results.
Fabric size is normalized by resizing each image to the maximum dimension (height and
width) in all images. The aspect ratio is preserved in this resizing, and each image is then zero-
padded as required to make an absolute dimension match for all images. Padding is done evenly
from all sides to keep the fabric centered.
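A small sketch of this size normalization, assuming OpenCV (the target dimensions would be the maximum height and width over all views):

```python
import cv2

def resize_and_pad(mask, target_h, target_w):
    """Sketch: resize preserving aspect ratio to the common target size,
    then zero-pad evenly so the fabric stays centered."""
    h, w = mask.shape[:2]
    scale = min(target_h / h, target_w / w)
    resized = cv2.resize(mask, (int(round(w * scale)), int(round(h * scale))))
    pad_h, pad_w = target_h - resized.shape[0], target_w - resized.shape[1]
    top, left = pad_h // 2, pad_w // 2
    return cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT, value=0)
```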
Every acquired view consists of a tensor of probabilities mapping each pixel to a certain
class. Once multiple views are overlapped, each pixel consists of multiple class probability tensors
(\vec{m}_{ij}) coming from different views:

\vec{m}_{ij} = \left[\, m_{ij}(\text{Gripper}),\; m_{ij}(\text{Fabric}),\; m_{ij}(\text{Wrinkle}),\; m_{ij}(\text{Background}) \,\right]^{T}    (1)

where i is the camera id, i \in \{1, 2, \dots, 5\}, and j is the pixel location in the image, j \in (2592 \times 1944). These probabilities are then combined across the different cameras using the Dempster-Shafer rule of combination:

\vec{m}_{j} = \vec{m}_{1j} \oplus \vec{m}_{2j} \oplus \dots \oplus \vec{m}_{5j}    (2)

wherein \oplus denotes the orthogonal sum (also known as the Dempster rule of combination), which combines belief constraints coming from independent belief sources. This produces a single tensor of classification probabilities (\vec{m}_{j}), which is mapped back to each source view by reversing
the processing steps described above. Each source image is then relabeled based on these new
probabilities.
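Because each view assigns mass only to the four singleton classes, the orthogonal sum reduces to a normalized element-wise product of the per-pixel mass vectors. A vectorized sketch of this fusion is shown below; the array shapes and the small demo size are illustrative (the real images are 2592 × 1944):

```python
import numpy as np

def dempster_combine(masses):
    """Per-pixel Dempster-Shafer fusion for masses of shape
    (num_views, H, W, num_classes). Assumes mass is assigned only to singleton
    classes, so the orthogonal sum is a normalized element-wise product."""
    combined = np.prod(masses, axis=0)                   # element-wise product over views
    normalizer = combined.sum(axis=-1, keepdims=True)    # 1 - K, the non-conflicting mass
    return combined / np.clip(normalizer, 1e-12, None)   # guard against total conflict

# Illustrative usage with five random views (demo size only).
views = np.random.dirichlet(np.ones(4), size=(5, 64, 64))
fused = dempster_combine(views)
labels = fused.argmax(axis=-1)   # e.g. 0: gripper, 1: fabric, 2: wrinkle, 3: background
```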
4.2 Stage II – Wrinkle and Boundary Detection Using Image Segmentation Models
The introduction of well-performing baseline methods like FCN [72], YOLO [46], R-CNN
[22], and Fast/Faster R-CNN [38], [39] has resulted in a huge improvement in the performance of
object detection and semantic segmentation tasks performed by computers. It encourages the
employment of semantic or instance segmentation methods to label images at the pixel level.
Further actions can be taken in the future to associate the obtained visual information to relative
suction units on the gripper surface while considering the spatial information.
In Stage II, four state-of-the-art deep convolutional neural networks for image segmentation are trained and tested on the custom-captured dataset. The IoU (Intersection over Union) scores achieved by the DCNN models are compared against each other, and the best-performing model is selected for use in the next steps. The obtained results show how using a DCNN model and transfer learning can
lead to acceptable results while training on a small and inaccurately annotated dataset. Next, the
effect of human annotation quality on the performance of DCNN models has been evaluated by
comparing two different annotations and re-training the models on the union and intersection of
these datasets. The wrinkle detection performance is assessed by calculating the precision, recall,
and F1 metrics while considering the intersection of two annotated datasets as ground truth. Then,
an approach for detecting wrinkles at the early stages of formation is developed and assessed.
Finally, the limitations of using synthetic data for training the models are evaluated. The presented
method can be readily adopted to train DCNN models using other datasets and perform visual
inspection tasks in different manufacturing processes.
4.2.1 Training Image Segmentation Models
As 6688 images were not enough to start training a model from scratch, transfer learning
was used to train the models. In the first stage of training, four state-of-the-art architectures known
for instance segmentation (Mask-RCNN) or semantic segmentation (DeepLab V3+, U-Net, IC-
Net) were trained on the custom-created dataset. Table 4.1 shows the training parameters used for
each architecture. All the training was done on the ASC 310 computer (see section 3.3), which has a single NVIDIA TITAN Xp GPU. One common issue when training on an augmented dataset is that the model overfits to the training data and cannot perform well on the test set. To ensure that the trained models were not overfitted, the training and validation loss values were assessed at the end of each training run to confirm that the two values converged.
Table 4.1 - Training parameters for different models
Models        Learning rate   Training epochs   Validations/epoch   Batch size   Input size
DeepLab V3    0.0001          5                 2                   1            512 × 512
U-Net         0.001           5                 2                   1            256 × 256
IC-Net        0.0001          5                 2                   1            2048 × 1024
Mask RCNN     0.001           5                 2                   1            256 × 256
For training DeepLab V3, the Xception-65 backbone was used; weight decay, atrous rates, and output stride were set to 0.00004, (8, 12, 18), and 16, respectively. For training U-Net and IC-Net,
momentum, patience, and learning divisor were set to 0.9, 1000, and 5, respectively. Other training
parameters were set to their default values suggested by the original papers' authors.
After training four models, the test dataset was used for evaluating the trained models and
Intersection over Union (IoU) metric was used for measuring the segmentation performance.
Basically, IoU can be used to evaluate any algorithm that predicts the location of objects by
providing bounding boxes as output. To employ IoU for the evaluation of a prediction, only two inputs are needed: the ground-truth bounding boxes and the predicted bounding boxes produced by the model (Figure 4.4).
Figure 4.4 - Calculation of IoU metric for evaluation of object localization accuracy
To calculate the IoU, the area of overlap between two bounding boxes must be divided by
the area of the union. The IoU value varies between 0 and 1, indicating completely wrong and completely correct predictions, respectively. The IoU score is usually calculated for every class independently and
then the average of all classes is calculated to provide a mean IoU score for the whole semantic
segmentation process.
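A minimal sketch of this computation on binary masks (treating bounding boxes or segmentation masks alike as pixel sets; the empty-vs-empty convention is an assumption):

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """IoU for a single class: area of overlap divided by area of union."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                       # both empty: treated as a perfect match (assumed convention)
    return np.logical_and(pred, gt).sum() / union

def mean_iou(pred_labels, gt_labels, num_classes):
    """Per-class IoU averaged over all classes (mean IoU)."""
    return float(np.mean([iou(pred_labels == c, gt_labels == c) for c in range(num_classes)]))
```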
4.2.2 Assessing Human Annotation Quality
Human labeling of the images is itself a difficult and time-consuming task. The labeling tools used were not particularly accurate, and there was also ambiguity around what exactly constitutes a wrinkle, so the annotations were not precise. This lack of precision was tested by running a second
hand-labeling of the same dataset and calculating the IoU score between the two human-annotated
datasets.
4.2.3 Evaluation of Wrinkle Detection Performance
The achieved results in section 4.2.1 show that the developed algorithm can perform very
well on gripper and fabric detection tasks. However, it is still tough to reach high IoU scores for
wrinkle detection due to the unclear definition of a wrinkled area. It is noteworthy that although
the IoU scores are relatively low, the visual inspection of the inferencing results shows that the
model can detect most of the wrinkles by labeling the wrinkle core; this is the main target in
performing such a task. The major cause of the low IoU score is the ambiguity in defining a wrinkle
boundary which makes it hard to provide an accurate pixel-level annotation for both humans and
machines.
During the manufacturing process, it is only necessary to find the corresponding suction units located behind each wrinkle area and adjust their suction intensity so as to remove the wrinkles while preventing the formation of new wrinkles in the neighboring regions. This requires only a region-level annotation instead of an accurate pixel-level annotation, which suggests that the achieved performance is sufficient.
To provide a better assessment of wrinkle detection performance, the intersection of two
human annotations was found and considered as the core of the wrinkles that are the main target
to be detected. DeepLab V3+ was selected because of its superior performance in section 4.2.1
and its test results were used to assess the wrinkle detection quality. For every image in the test
set, connected components were found in both ground truth image and the prediction provided by
DeepLab V3+; each component was considered as a single wrinkle. A single wrinkle in the
prediction was deemed a true positive if there was a 70% overlap with one or more wrinkles in the
ground truth. The overlapping threshold was found experimentally through a grid search from 5%
to 100% with a step of 5%.
Wrinkles in the ground truth having insufficient overlap with the wrinkles in the prediction
were tagged as false-negative predictions. Wrinkles in the prediction which did not sufficiently
overlap the ground truth were considered false-positive detections. Precision, recall, and F1
scores were calculated for each image separately. The average of each metric was also calculated
for all images to evaluate the total wrinkle detection performance.
Visually investigating the false-positive decisions showed that a common mistake made by the DCNN model in this task was labeling small areas as wrinkles that, given their size, could not actually be wrinkles. To prevent this phenomenon, the average wrinkle size in the training
set was calculated and any prediction made by the DCNN model which was smaller than 20% of
the average wrinkle size was ignored. The precision, recall, and F1 scores were calculated again
after filtering the small components.
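A sketch of this component-level evaluation is given below. Connected components stand in for individual wrinkles; the 70% overlap threshold and the small-component filter follow the text, while computing overlap as the fraction of a component's pixels covered by the other mask is an assumption about the exact definition:

```python
import cv2
import numpy as np

def wrinkle_detection_scores(pred_mask, gt_mask, overlap_thr=0.70, min_area=0):
    """Component-level precision/recall/F1 for wrinkle detection (sketch).
    `min_area` would be set to 20% of the mean training-set wrinkle size."""
    def components(mask):
        n, labels, _, _ = cv2.connectedComponentsWithStats(mask.astype(np.uint8), connectivity=8)
        return [labels == i for i in range(1, n)]            # boolean mask per component

    preds = [c for c in components(pred_mask) if c.sum() >= min_area]   # drop small predictions
    gts = components(gt_mask)

    # A predicted wrinkle is a true positive if enough of it overlaps the ground truth.
    tp = sum(1 for p in preds if np.logical_and(p, gt_mask > 0).sum() / p.sum() >= overlap_thr)
    fp = len(preds) - tp
    # A ground-truth wrinkle with insufficient overlap with the prediction is a false negative.
    matched_gt = sum(1 for g in gts if np.logical_and(g, pred_mask > 0).sum() / g.sum() >= overlap_thr)
    fn = len(gts) - matched_gt

    precision = tp / (tp + fp) if preds else 0.0
    recall = matched_gt / (matched_gt + fn) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

The per-image scores returned by such a routine would then be averaged over the test set, as described above.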
4.2.4 An Approach for Early Detection of Wrinkles
The obtained results in section 4.2.2 show ambiguities around what a wrinkle is, as the two human annotations have many differences. In section 4.2.3, the intersection of these two
annotation sets was considered as the core of the wrinkle which includes the pixels where both
human annotations agree on the existence of a wrinkle; this is called the wrinkle class. The union
of these two annotations minus their intersection represents the areas where one of the operators
is reporting the existence of a wrinkle while the other one does not agree; this is called the maybe
wrinkle class. These regions may either be a result of operator annotation error or indicate a wrinkle at its early formation stage. Early detection of probable defects in composite manufacturing processes is of significant importance for in-situ monitoring and can help reduce both production time and waste by preventing defects from occurring.
To further evaluate the performance of the model on the detection of “wrinkle” and “maybe
wrinkle” classes, three new pairs of training and test sets were created. Each new dataset included
only a single class: “wrinkle”, “maybe wrinkle”, or “wrinkle + maybe wrinkle”. DeepLab V3+
was trained on each of these three datasets separately and then tested on all three test sets
individually.
4.2.5 Training on Synthetic Data
To check the feasibility of using synthetic data for training the neural networks, all four models were trained on the synthetic dataset generated in section 3.2.3. The trained models were then tested on the real test set, which was also used for testing the models trained on the real data. The obtained results (see section 5.2) show a significant decrease in the models' performance, as expected. The trained models were also tested on a test set taken from the synthetic dataset to make sure the models had learned the features correctly.
Chapter 5 : Experimental Results
5.1 Stage I - Wrinkle Detection Using an Image Classification Model
I. Phase 1
In phase 1, a CNN model was designed and trained to classify input images as wrinkle,
fabric, gripper, or background. The training and test datasets were generated by acquiring images
during the draping process. Each image was then divided into smaller regions to facilitate efficient hand-labeling of the data. Next, supervised learning was employed on the generated sub-images to train an image classifier. Three training datasets having different sizes were evaluated
and it was observed that past a point, increased size offered diminishing returns. The obtained
results in this phase are presented in Table 5.1. The accuracy metric was used for evaluating the
classification performance at this stage. The accuracy score for each class is easily calculated by
dividing the number of correct predictions of that class by the total number of instances from that
class existing in the dataset.
Table 5.1 - Accuracy of detection on the initial dataset
Accuracy of detection (%)
Label        Small   Medium   Large
Wrinkle      92.5    95.5     95.5
Fabric       94.5    98.5     99.5
Background   95.5    97.5     98.5
Gripper      94.5    97.5     97.5
The best-trained neural network model achieved 95.5% accuracy in detecting wrinkles.
Accuracy for other classes was even higher. The slightly lower level of accuracy in wrinkle
detection can be explained by the fact that, at a very small scale (32 × 32), it is hard to distinguish
between a fabric and a wrinkle.
The results show diminishing returns between the medium and large training sets. This
means increasing the size of the dataset will not enhance the performance beyond a point.
II. Phase 2
In phase 2, a complementary dataset was generated with altered lighting conditions, different wrinkle shapes and sizes, differing fabric shapes, and significantly varying camera distance and location with respect to the scene. The pre-trained network from phase 1 was
evaluated on this dataset to test the ability of the model to generalize. Results are presented in
Table 5.2. Notably, the classification accuracy of wrinkles decreased to 66.6%, which is
considerably less than the 95.5% accuracy obtained in phase 1.
Table 5.2 - Accuracy of detection against the new dataset
Label        Accuracy of detection (%)
Wrinkle      66.6
Fabric       12.3
Background   85.5
Gripper      37.6
The significant difference in the two results indicates that the model had some difficulty in
generalizing to the new dataset. Gripper and fabric detection rates were down significantly because
of the unbalanced quantity of classes in the new dataset plus the ambiguities of labeling images
having portions of different classes. It is noteworthy that the model was trained to label an image as wrinkle even if wrinkles covered only 30% of the image area. This bias toward the wrinkle class led to more accurate wrinkle detection but sacrificed the detection accuracy of the other classes. The
primary goal of this phase was to achieve an acceptable wrinkle detection accuracy and the low
scores for gripper and fabric detection were not explored much.
Visual inspection of the failed wrinkle detection cases showed that the failures were largely due to occlusion of wrinkles combined with poor lighting and shadow effects. These
occlusion and lighting issues changed for different views of the multi-view setup, which led to
certain wrinkles being highly visible, and thus detectable, in one view while missed in another.
III. Phase 3
In phase 3, an approach was devised to increase accuracy when generalizing the model
through the correlation of multiple co-temporal views of the same fabric. View correlation consists
of detecting the fabric using traditional computer vision techniques and overlaying those fabrics
as closely as possible to allow inference across multiple views. Once the fabrics are overlaid, the
Dempster-Shafer theory of evidence [97], [98] was applied to combine votes on each pixel from
each view.
The results of this process are shown in Table 5.3. Detection accuracy was improved
considerably when compared with phase 2, achieving 85.9% success in wrinkle detection. This
was accomplished without the need for any re-training or fine-tuning of the pre-trained model.
Since wrinkle detection accuracy is the main goal, gripper and background labels were not
preserved in phase 3 and are combined in the Other label. As in Phase 2, fabric detection accuracy
is very low. However, the problems with fabric and gripper detection can be solved by employing
alternative methods developed in the other stage of this project (see section 5.2).
Table 5.3 - Accuracy of detection after multi-view inferencing
Label     Accuracy of detection (%)
Wrinkle   85.9
Fabric    8.6
Other     89.7
5.2 Stage II - Wrinkle and Boundary Detection Using Image Segmentation Models
In Stage II, four well-performing DCNN models (Mask-RCNN, U-Net, DeepLab V3+, and
IC-Net) were applied to the custom-created dataset. The best-performing model was found to be
DeepLab V3+, which achieved acceptable results for gripper and fabric detection. The lower performance on the wrinkle detection task was explained by comparing two human annotations of the same dataset and showing that the humans achieve an IoU of only 0.41 against each other; this discrepancy arises mainly because wrinkles, as geometrical defects, do not have a clearly defined boundary. The model was
then evaluated using a binary predictor based on the Jaccard index between components rather
than pixels; the model achieves a recall rate of 0.71 and a precision score of 0.76. Two
complementary approaches were also introduced for the detection of wrinkles at the early stages
of formation as well as the completely formed wrinkles. Finally, the limitations of using synthetic
data for training the models were evaluated.
I. Training Image Segmentation Models
Four well-performing algorithms (Mask-RCNN, U-Net, DeepLab V3+, and IC-Net) are
trained and tested on a custom-captured dataset and the intersection over union (IoU) metric was
used to evaluate the segmentation performance. Table 5.4 shows IoU values for different classes
gained by each model.
Table 5.4 - Inferencing results for image segmentation models
Intersection over union
Models        Gripper   Fabric   Wrinkle
DeepLab V3    0.9237    0.8564   0.4037
U-Net         0.8928    0.8549   0.2394
IC-Net        0.8869    0.8373   0.3263
Mask-RCNN     0.7518    0.7704   0.3527
Trained models at this phase produce acceptable IoU scores of up to 0.92 and 0.86 for
gripper and fabric, respectively. Wrinkle detection achieves a significantly lower IoU score of 0.40
in the best case. The reasons behind this lower score are explained in the next section.
Table 5.5 shows the generated segmentation masks by trained models for 5 of the images
in the test dataset. Gripper, fabric, and wrinkle labels are shown with blue, orange, and green
masks, respectively.
The results show that DeepLab V3+ works best overall, and particularly in detecting
wrinkles, which is the most challenging class to detect in this task. All the networks, except Mask-
RCNN, perform very well when segmenting gripper and fabric. Mask-RCNN has difficulties
identifying sharp edges and boundaries between classes, which led to poor performance in gripper
and fabric segmentation. In principle, instance segmentation only works for "objects" with clearly
defined shapes and boundaries and does not work on "things" with amorphous background regions.
In the targeted task, the wrinkles might act like things, while the gripper and fabric can be
considered as objects. Given that the task is not to identify one wrinkle from the others, Mask-
RCNN cannot be a suitable choice. Finally, DeepLab V3+ was selected to be used in the next
sections due to its good performance, especially in wrinkle detection.
Table 5.5 - Performance of trained models on the test dataset
(Columns: ID, Image, Ground Truth, DeepLab V3+, U-Net, IC-Net, Mask RCNN; rows 1–5 show the segmentation masks for five test images. Images not reproduced.)
II. Assessing Human Annotation Quality
Table 5.6 shows the IoU scores between two human-annotated datasets.
Table 5.6 - IoU between two human-annotated datasets
Intersection over union
          Gripper   Fabric   Wrinkle
Human     0.9296    0.8553   0.4097
Human annotation performs relatively poorly, even when compared with the automated
models. The difficulties in using the keyboard and mouse for annotation certainly led to some
inaccuracies between the two datasets. Furthermore, there is also ambiguity around what exactly
is a wrinkle. This matter is visible in Figure 5.1 where not only are the two larger wrinkles different
lengths, but there is also a third smaller wrinkle in one image which is not present in the other at
all. It is notable that the way people define, locate, and label objects in an image can significantly
vary from one person to another.
Figure 5.1 - A sample image of the dataset (top image) and its two annotation masks
III. Evaluation of Wrinkle Detection Performance
The average precision, average recall, and average F1 score values for wrinkle detection
are provided in Table 5.7.
Table 5.7 - Wrinkle detection scores for DeepLab V3+
DeepLab V3+                        Average precision   Average recall   Average F1 score
Before removing small components   0.6741              0.7145           0.6714
After removing small components    0.7649              0.7145           0.7218
Overall, the best average precision score shows that about 76% of the model predictions
have been correct. The average recall score indicates that the model has been able to detect more
than 71% of the present wrinkles.
It can be seen that filtering the small components has significantly increased the average
precision score from 0.6741 to 0.7649 by reducing the number of false-positive predictions. It is
important to keep the number of false-positive predictions as low as possible to prevent
unnecessary pauses in the manufacturing flow.
The effect of the overlapping threshold on the average precision, average recall, and F1
scores is shown in Figure 5.2.
Figure 5.2 - Effect of overlapping threshold on wrinkle detection scores
Lower overlapping threshold values (between 0 and 0.4) will provide better scores by ignoring some of the false-positive and false-negative detections and increasing the number of true-positive detections, even if the network has not been able to properly detect a wrinkle. Conversely, higher threshold values (0.8 to 1) will decrease the scores by requiring highly accurate predictions. Evaluating the chart shows that thresholds of up to 70% can be selected as acceptable overlapping threshold values, as the scores start to decrease rapidly beyond that point.
IV. An Approach for Early Detection of Wrinkles
DeepLab V3+ was trained on three single class training sets and then tested on all three
single class test sets. The inferencing results are presented in Table 5.8.
Table 5.8 – Inferencing results for DeepLab V3+ trained on single class datasets (IoU scores)
                           Test set
Training set               Wrinkle + Maybe wrinkle   Wrinkle   Maybe wrinkle
Wrinkle + Maybe wrinkle    0.5149                    0.4037    0.2308
Wrinkle                    0.3326                    0.4609    0.1034
Maybe wrinkle              0.1399                    0.0970    0.1299
Training the model on the union of two human annotation sets (“wrinkle + maybe wrinkle
class”) provides better results on the tasks where the model is looking for both wrinkle and maybe
wrinkle class instances; this includes detection of wrinkles at their early stages of formation
(“maybe wrinkle” class).
In applications where redundant pauses in the manufacturing flow are costly and the model
is only looking for explicit wrinkles, it appears better to train the model on the intersection of two
human annotation sets (“wrinkle” class).
The model trained only on the “maybe wrinkle” class performs poorly in all cases, as this dataset includes considerable noise and mislabeled portions caused by operator error, which makes it hard for the DCNN model to extract meaningful features during training.
V. Training on Synthetic Data
Table 5.9 shows the IoU scores of four models trained on the synthetic training dataset and
tested on the synthetic test dataset. It can be seen that DeepLab V3+ performs very well and
achieves 0.98, 0.94, and 0.51 IoU scores for the segmentation of gripper, fabric, and wrinkle,
respectively.
The lower score for wrinkle detection can be explained by the complexities of defining the
wrinkle boundaries. U-Net and Mask-RCNN perform worse than DeepLab V3+ but still provide acceptable scores. A probable reason for IC-Net's failure in this task is that it is an architecture designed to perform image segmentation in real time and tries to use its pre-learned feature space to find the target objects as quickly as possible. As a result, it is not an appropriate architecture for learning new feature spaces with drastic distribution changes.
Table 5.10 shows the IoU scores obtained by testing the models on the real test set that was also used in the earlier parts of this stage. As expected, the scores decreased sharply because of the domain distribution shift and mismatch in appearance. This issue can be tackled by
creating larger synthetic datasets with richer features and using domain adaptation techniques to
reduce the distribution shift between synthetic and real feature spaces.
Table 5.9 - Inferencing results on the synthetic test dataset
Intersection over union
Models        Gripper   Fabric   Wrinkle
DeepLab V3    0.9822    0.9392   0.5082
U-Net         0.8955    0.5743   0.4002
IC-Net        0.3234    0.5668   0.1232
Mask-RCNN     0.8549    0.8395   0.3669
On the real test set (Table 5.10), Mask R-CNN performs better than the other three models on the segmentation of the gripper and fabric classes. This can be explained by Mask R-CNN's well-known generalization ability and easy adaptation to various datasets, which give it a better understanding of object shapes and boundaries. U-Net achieves a score of 0.1649 for wrinkle detection, which is still very low but more than twice the score of the other models. U-Net is an architecture particularly designed for medical imaging purposes and is known for its strong performance on a variety of anomaly detection tasks, such as detecting tumors or kidney stones in medical images like CT scans. It is therefore not highly dependent on color features and performs well in cases where geometrical features are more dominant in distinguishing classes, such as the wrinkle detection task.
Table 5.10 - Inferencing results on the real test dataset
Intersection over union
Models        Gripper   Fabric   Wrinkle
DeepLab V3    0.1327    0.2800   0.0706
U-Net         0.3061    0.4939   0.1649
IC-Net        0.0       0.0598   0.0268
Mask-RCNN     0.3477    0.6487   0.0694
Chapter 6 : Conclusions
6.1 Summary
The main contribution of this thesis is the development of new solutions for quality control
of composite manufacturing processes, specifically, the visual inspection of the draping process of
fiber-reinforced cut-pieces for manufacturing the rear pressure bulkhead of aircraft in the aviation
industry.
The thesis starts by reviewing visual inspection methods that can be employed to tackle the
issues happening during the draping process. It can be concluded that conventional programming techniques for analyzing spatio-temporal data from multiple sensory inputs usually involve hand-crafting certain features of the setup. These techniques are usually limited to specific applications and need a great deal of tuning when adapting to new tasks. On the other hand, using machine learning algorithms allows the process to overcome such limitations through machine-based self-interpretation of the process. It was therefore decided to apply convolutional neural networks and deep learning techniques to 2D images to perform object recognition tasks.
To generate the required datasets for training the DCNN models, a multi-camera setup was installed in front of the modular gripper robot, which is the main component performing the draping process. Then, two datasets were generated, annotated, and augmented for training and
testing purposes as described in Chapter 3.
In the first stage of the project, a deep convolutional neural network was designed and
implemented to perform the image classification task and label small areas of each image as
wrinkle, fabric, gripper, or background. This model was trained and tested on generated datasets
and could provide acceptable results (Phase 1). Also, the effect of dataset size was evaluated at
this phase. To measure the ability of this model to generalize, it was tested on the second dataset
that was not seen by the network during the training. Testing the pre-trained model on a new dataset
with different wrinkle size and shape and different lighting conditions led to a noticeable decrease
in detection performance (Phase 2). The neural network had learned the features which existed in
both datasets, but not features exhibited only in the new dataset. To handle this issue, the idea of utilizing a combination of multiple low-cost cameras and employing their images from different views to obtain a higher-dimensional feature map was proposed (Phase 3). Intuitively, looking at the scene from different angles provides more details, leads to less occlusion, and allows rectification for photometric distortions, noise, discontinuities, etc. Images from five
different views were fed into the trained model and predictions from each view were combined
using the Dempster-Shafer [97] theory of evidence.
Wrinkle detection in the generalized inferencing case was 85.9% when evidential
reasoning was used to fuse multi-view information (Figure 6.1). This is an increase of 19.3%
compared to the base case established in Phase 2.
Figure 6.1 - Wrinkle detection accuracy (%) by phase
The results of this stage are very promising and show that a combination of computer vision
and evidential reasoning techniques can be used to help increase feature detection accuracy when
a DCNN is generalized to previously unseen, but similar, images. Also, the setup can be easily
adapted and trained for different experimental scenarios subjected to a different arrangement of
cameras.
In Stage 2, four state-of-the-art image segmentation models were trained on the generated
dataset to identify the fabric, gripper, and wrinkles during the draping process. The obtained results
were very promising at this stage. Overall, the detection of gripper and fabric was successful with
IoU scores of approximately 0.92 and 0.86 respectively for the best-performing model. DeepLab
V3+ was selected as the best candidate because of its good performance, especially on the wrinkle
detection task.
Wrinkle detection accuracy was significantly lower compared with other classes, with an
IoU score of approximately 0.40 for the best-performing model. In order to explain the poor
performance of the wrinkle-detection task, a second human-annotated dataset was compared with
the first; results, in this case, show a wrinkle detection IoU of only 0.41, which indicates similar
performance to the automated methods. This suggests that limitations in the human annotations
used as ground truth are a significant factor in the final performance of the model. These limitations
are both the result of the crude input of a computer mouse and ambiguity as to how humans
interpret wrinkles.
Further evaluation of wrinkle detection performance showed that the model could
successfully detect more than 71% of the present wrinkles. The average precision score also
showed that the DCNN model was correct in 67% of all its predictions. To enhance the average
precision score, a filtration stage was conducted by ignoring any prediction having a wrinkle size
smaller than 20% of the average wrinkle size in the training set. The average precision score
increased to more than 76% without affecting recall. A detection was defined to be successful if
the segmentation provided by the DCNN overlapped the ground truth by more than the overlapping threshold. The optimum value for the overlapping threshold was found to be 70% through a grid search experiment.
An approach was also introduced for the detection of wrinkles at the early stages of
formation. It was shown that training on the union of two human-annotated datasets (“wrinkle +
maybe wrinkle” set) was preferred for cases that require early detection of wrinkles. In contrast,
training on the intersection (“wrinkle” set) is a better option for the cases where redundant pauses
in the manufacturing process will be costly and it is preferred to take an action only when a defect
is present with high certainty.
Finally, the limitations of using synthetic data for training the models were evaluated by
generating a synthetic dataset using 3D modeling software and then training and evaluating the
models on it. The results of testing the DCNN models on both real and synthetic test sets showed
a considerable performance drop while testing on real data because of the domain distribution shift
and mismatch in appearance.
The developed solutions can be used for a variety of composite manufacturing processes
or adapted to other similar tasks by only generating a small dataset and then applying the
techniques proposed in this thesis.
6.2 Future Work
Considering that this thesis is focused on finding deep learning-based approaches that
specifically use 2D images as input, other possible methods have remained unexplored.
Using RGB-D images or 3D models as the input of convolutional neural networks to
perform object detection during the draping process can be an interesting topic of research.
Considering the fact that each wrinkle is just an abnormality in the geometry of the fabric,
developing a model that uses depth features for detection can be very helpful to tackle the issues
related to locating the wrinkle boundary. The introduction of novel deep learning architectures
plus the promotion of 3D and RGB-D datasets in the future can be a key to this problem.
Semi-supervised and unsupervised learning methods can be also evaluated to eliminate the
need for generation and annotation of large datasets.
In Stage I of this thesis, the certainty of classification was increased by combining multiple
votes from several observations of the weaker classifier. A logical extension of this possibility
would be re-training the neural network on the new dataset, allowing it to learn the new features.
The arbitrary position and orientation of the gripper relative to the camera necessitated a process
to overlap fabrics. Key point detection for perspective correction was performed by hand, which
is not desired in an automated process. Moreover, arbitrary geometries result in images with rotation, occlusion, small fabric size, or other undesirable properties that can affect the results. A
fixed configuration of gripper and camera geometry may be able to maximize lighting effects while
minimizing warping, perspective and occlusion issues. Homography could then be used in
transformations, making the overall process automatable, faster, and more robust. The machine
learning model could also be trained on a wider variety of wrinkles to allow greater generalization
accuracy. However, the hand-labeling process used to create a supervised learning dataset is time-
consuming and it would be desirable to create a faster alternative.
To continue the works done in stage II, other famous architectures can be evaluated and
compared with the currently used ones. Also, different values of training parameters can be tested
to increase the segmentation accuracy by finding the optimum training configuration for each
model. Increasing the size of the dataset and generating more accurate annotations is another
suggestion that can lead to performance enhancement in the future. A sensory system like a lidar
scanner can be installed at DLR facility to create a precise ground truth free of errors caused by
human annotations. Besides, the availability of more powerful GPUs will make it possible to
increase the training batch size, which usually leads to an improvement in segmentation quality.
Using domain adaptation techniques to reduce the domain distribution shift and mismatch
in appearance between synthetic and real datasets is another topic that can be followed in the
future. These techniques take advantage of deep networks and embed domain adaptation in the
deep learning pipeline which results in learning more transferable representations.
Development of an approach to associate the obtained visual information with the corresponding suction units on the gripper surface using the spatial information is the next key step toward an end-to-end, fully automated manufacturing pipeline and needs to be done in the future.
Bibliography
[1] K. Gupta, M. Körber, A. Djavadifar, F. Krebs, and H. Najjaran, “Wrinkle and boundary
detection of fiber products in robotic composites manufacturing,” Assem. Autom., vol. 40,
no. 2, pp. 283–291, 2019.
[2] A. Djavadifar, J. B. Graham-Knight, K. Gupta, M. Körber, P. Lasserre, and H. Najjaran,
“Robot-assisted composite manufacturing based on machine learning applied to multi-
view computer vision,” in International Conference on Smart Multimedia, 2019.
[3] F. C. Campbell, “Structural Composite Materials,” 2010.
[4] K. D. Potter, “Understanding the origins of defects and variability in composites
manufacture,” ICCM Int. Conf. Compos. Mater., 2009.
[5] A. Rashidi and A. S. Milani, “Passive control of wrinkles in woven fabric preforms using
a geometrical modification of blank holders,” Compos. Part A Appl. Sci. Manuf., vol. 105,
pp. 300–309, Feb. 2018.
[6] A. Rashidi, H. Montazerian, K. Yesilcimen, and A. S. Milani, “Experimental
characterization of the inter-ply shear behavior of dry and prepreg woven fabrics:
Significance of mixed lubrication mode during thermoset composites processing,”
Compos. Part A Appl. Sci. Manuf., vol. 129, no. November 2019, p. 105725, 2020.
[7] H. Voggenreiter and D. Nieberl, “AZIMUT Abschlussbericht,” TIB, 2015.
[8] H. Montazerian, R. Sourki, M. Ramezankhani, A. Rashidi, M. Koerber, and A. S. Milani,
“Digital twining of an automated fabric draping process for industry 4.0 applications: Part
I: multi-body simulation and finite element modeling,” in CAMX 2019 - Composites and
Advanced Materials Expo, 2019.
[9] M. Körber and C. Frommel, “Automated Planning and Optimization of a Draping Processes
Within the CATIA Environment Using a Python Software Tool,” Procedia Manufacturing, 2019.
[10] M. Körber and C. Frommell, “Sensor-Supported Gripper Surfaces for Optical Monitoring
of Draping Processes,” SAMPE, 2017.
[11] K. Gupta, M. Körber, F. Krebs, and H. Najjaran, “Vision-based deformation and wrinkle
detection for semi-finished fiber products on curved surfaces,” in 2018 IEEE 14th
International Conference on Automation Science and Engineering (CASE), 2018, pp.
618–623.
[12] I. N. Aizenberg and C. Butakoff, “Frequency domain medianlike filter for periodic and
quasi-periodic noise removal,” in Image Processing: Algorithms and Systems, 2002, vol.
4667, no. May 2002, pp. 181–191.
[13] O. Russakovsky et al., “Imagenet large scale visual recognition challenge,” Int. J.
Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[14] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of
the IEEE International Conference on Computer Vision, 1999, vol. 2, pp. 1150–1157.
[15] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput.
Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[16] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, no. July
2014, 2001.
[17] P. Viola and M. Jones, “Robust real-time object detection,” Int. J. Comput. Vis., vol. 4, no.
34–47, p. 4, 2001.
[18] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning
and an Application to Boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[19] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D
images for object detection and segmentation,” Lect. Notes Comput. Sci. (including
Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8695 LNCS, no. PART
7, pp. 345–360, 2014.
[20] R. Memisevic and C. Conrad, “Stereopsis via deep learning,” in NIPS Workshop on Deep
Learning, 2011, vol. 1, p. 2.
[21] J. Zbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional
neural network,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 1592–1599.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” Proc. IEEE Comput. Soc. Conf. Comput.
Vis. Pattern Recognit., pp. 580–587, 2014.
[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition,” arXiv Prepr. arXiv1409.1556, 2014.
[24] M. Osadchy, Y. Le Cun, and M. L. Miller, “Synergistic face detection and pose estimation
with energy-based models,” J. Mach. Learn. Res., vol. 8, no. May, pp. 1197–1215, 2007.
[25] J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network
and a graphical model for human pose estimation,” Adv. Neural Inf. Process. Syst., vol. 2,
no. January, pp. 1799–1807, 2014.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Advances in neural information processing systems,
2012, pp. 1097–1105.
[27] A. Dosovitskiy et al., “FlowNet: Learning optical flow with convolutional networks,”
Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp. 2758–2766, 2015.
[28] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016,
pp. 5695–5703.
[29] Y. LeCun and others, “Generalization and network design strategies,” in Connectionism in
perspective, vol. 19, Citeseer, 1989.
[30] Y. Lecun et al., “Handwritten digit recognition with a back-propagation network,” in
Advances in neural information processing systems, 1990, pp. 396–404.
[31] Y. Lecun et al., “Backpropagation applied to handwritten zip code recognition,” Neural
Comput., vol. 1, no. 4, pp. 541–551, 1989.
[32] W. Rawat and Z. Wang, “Deep Convolutional Neural Networks for Image Classification:
A Comprehensive Review,” Neural Comput., vol. 29, no. 9, pp. 2352–2449, 2017.
[33] K. Chellapilla, S. Puri, and P. Simard, “High Performance Convolutional Neural
Networks for Document Processing,” in Tenth International Workshop on Frontiers in
Handwriting Recognition, 2006.
[34] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[35] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[36] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of
deep networks,” in Advances in neural information processing systems, 2007, pp. 153–
160.
[37] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, “Unsupervised Learning of Invariant
Feature Hierarchies with Applications to Object Recognition,” in 2007 IEEE Conference
on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[38] R. Girshick, “Fast R-CNN,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp.
1440–1448, 2015.
[39] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
with region proposal networks,” in Advances in neural information processing systems,
2015, pp. 91–99.
[40] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” Proc. IEEE Int. Conf.
Comput. Vis., vol. 2017-Octob, pp. 2980–2988, 2017.
[41] T.-Y. Lin et al., “Microsoft coco: Common objects in context,” in European conference
on computer vision, 2014, pp. 740–755.
[42] R. Anantharaman, M. Velazquez, and Y. Lee, “Utilizing Mask R-CNN for Detection and
Segmentation of Oral Diseases,” in 2018 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), 2018, pp. 2197–2204.
[43] J. Singh and S. Shekhar, “Road Damage Detection And Classification In Smartphone
Captured Images Using Mask R-CNN,” in IEEE International Conference On Big Data
Cup, 2018, vol. abs/1811.0.
[44] X. Chen, R. Girshick, K. He, and P. Dollar, “TensorMask: A foundation for dense object
segmentation,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2019-Octob, pp. 2061–2069,
2019.
[45] W. Liu et al., “SSD: Single Shot MultiBox Detector,” Dec. 2015.
[46] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-
time object detection,” in Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2016, vol. 2016-Decem, pp. 779–788.
[47] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, “Fast YOLO: A Fast You Only Look Once
System for Real-time Embedded Object Detection in Video,” Sep. 2017.
[48] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” Dec. 2016.
[49] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” 2018.
[50] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution, and fully
connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848,
2017.
[51] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for
semantic image segmentation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017.
[52] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal
visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338,
2010.
[53] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with
atrous separable convolution for semantic image segmentation,” in Proceedings of the
European conference on computer vision (ECCV), 2018, pp. 801–818.
[54] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical
image segmentation,” in Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015, vol. 9351, pp.
234–241.
[55] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “ICNet for Real-Time Semantic Segmentation
on High-Resolution Images,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes
Artif. Intell. Lect. Notes Bioinformatics), vol. 11207 LNCS, pp. 418–434, 2018.
[56] M. Osswald, S.-H. Ieng, R. Benosman, and G. Indiveri, “A spiking neural network model
of 3D perception for event-based neuromorphic stereo vision systems,” Sci. Rep., vol. 7, p.
40703, 2017.
[57] P. Ferrara et al., “Wide-angle and long-range real time pose estimation: A comparison
between monocular and stereo vision systems,” J. Vis. Commun. Image Represent., vol.
48, pp. 159–168, 2017.
[58] A. L. Hou, X. Cui, Y. Geng, W. J. Yuan, and J. Hou, “Measurement of safe driving
distance based on stereo vision,” Proc. - 6th Int. Conf. Image Graph. ICIG 2011, pp. 902–
907, 2011.
[59] H. Kim, C.-S. Lin, J. Song, and H. Chae, “Distance measurement using a single camera
with a rotating mirror,” Int. J. Control. Autom. Syst., vol. 3, no. 4, pp. 542–551, 2005.
[60] K. A. Rahman, M. S. Hossain, M. A.-A. Bhuiyan, T. Zhang, M. Hasanuzzaman, and H.
Ueno, “Person to camera distance measurement based on eye-distance,” in 2009 Third
International Conference on Multimedia and Ubiquitous Engineering, 2009, pp. 137–141.
[61] M. N. A. Wahab, N. Sivadev, and K. Sundaraj, “Target distance estimation using
monocular vision system for mobile robot,” in 2011 IEEE Conference on Open Systems,
2011, pp. 11–15.
[62] Y. M. Mustafah, R. Noor, H. Hasbi, and A. W. Azma, “Stereo vision images processing
for real-time object distance and size measurements,” in 2012 International Conference on
Computer and Communication Engineering (ICCCE), 2012, pp. 659–663.
[63] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a
common multi-scale convolutional architecture,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 2650–2658.
[64] G. R. Sangeetha, N. Kumar, P. R. Hari, and S. Sasikumar, “Implementation of a Stereo
vision based system for visual feedback control of Robotic Arm for space manipulations,”
Procedia Comput. Sci., vol. 133, pp. 1066–1073, 2018.
[65] K. McGuire, G. De Croon, C. De Wagter, K. Tuyls, and H. Kappen, “Efficient optical
flow and stereo vision for velocity estimation and obstacle avoidance on an autonomous
pocket drone,” IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 1070–1076, 2017.
[66] M. L. Balter, A. I. Chen, T. J. Maguire, and M. L. Yarmush, “Adaptive kinematic control
of a robotic venipuncture device based on stereo vision, ultrasound, and force guidance,”
IEEE Trans. Ind. Electron., vol. 64, no. 2, pp. 1626–1635, 2017.
[67] L. Ma, J. Stückler, C. Kerl, and D. Cremers, “Multi-view deep learning for consistent
semantic mapping with RGB-D cameras,” in 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), 2017, pp. 598–605.
[68] M. Cordts et al., “The Cityscapes dataset for semantic urban scene understanding,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,
pp. 3213–3223.
[69] R. Mottaghi et al., “The role of context for object detection and semantic segmentation in
the wild,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 891–898,
2014.
[70] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,
pp. 770–778.
[71] S. Zheng et al., “Conditional random fields as recurrent neural networks,” in Proceedings
of the IEEE international conference on computer vision, 2015, pp. 1529–1537.
[72] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic
segmentation,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 3431–3440.
[73] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image
segmentation with deep convolutional nets and fully connected CRFs,” arXiv preprint
arXiv:1412.7062, 2014.
[74] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic
segmentation,” in Proceedings of the IEEE international conference on computer vision,
2015, pp. 1520–1528.
[75] V. Badrinarayanan, A. Handa, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-
Decoder Architecture for Robust Semantic Pixel-Wise Labelling,” 2015.
[76] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv
preprint arXiv:1511.07122, 2015.
[77] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support
inference from RGBD images,” in European Conference on Computer Vision, 2012, pp.
746–760.
[78] S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and recognition of indoor
scenes from RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2013, pp. 564–571.
[79] F. Husain, H. Schulz, B. Dellen, C. Torras, and S. Behnke, “Combining Semantic and
Geometric Features for Object Class Segmentation of Indoor Scenes,” IEEE Robot.
Autom. Lett., vol. 2, no. 1, pp. 49–55, 2017.
[80] J. Wang, Z. Wang, D. Tao, S. See, and G. Wang, “Learning common and specific features
for RGB-D semantic segmentation with deconvolutional networks,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), 2016, vol. 9909, pp. 664–679.
[81] D. Lin, G. Chen, D. Cohen-Or, P.-A. Heng, and H. Huang, “Cascaded feature network for
semantic segmentation of RGB-D images,” in Proceedings of the IEEE International
Conference on Computer Vision, 2017, pp. 1311–1319.
[82] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, “Indoor semantic segmentation using
depth information,” arXiv preprint arXiv:1301.3572, 2013.
[83] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “FuseNet: Incorporating depth into
semantic segmentation via fusion-based CNN architecture,” in Asian conference on
computer vision, 2016, pp. 213–228.
[84] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin, “LSTM-CF: Unifying context
modeling and fusion with LSTMs for RGB-D scene labeling,” in European conference on
computer vision, 2016, pp. 541–557.
[85] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Exploring Context with Deep
Structured Models for Semantic Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 40, no. 6, pp. 1352–1366, 2018.
[86] G. Riegler, A. Osman Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations
at high resolutions,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 3577–3586.
[87] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “SemanticFusion: Dense 3D
semantic mapping with convolutional neural networks,” in 2017 IEEE International
Conference on Robotics and Automation (ICRA), 2017, pp. 4628–4635.
[88] Y. He, W.-C. Chiu, M. Keuper, and M. Fritz, “STD2P: RGBD semantic segmentation
using spatio-temporal data-driven pooling,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2017, pp. 4837–4846.
[89] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a
multi-scale deep network,” in Advances in neural information processing systems, 2014,
pp. 2366–2374.
[90] Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-supervised deep learning for monocular
depth map prediction,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 6647–6655.
[91] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth
estimation: Geometry to the rescue,” in European Conference on Computer Vision, 2016,
pp. 740–756.
[92] K. Wang and S. Shen, “MVDepthNet: real-time multiview depth estimation neural
network,” in 2018 International Conference on 3D Vision (3DV), 2018, pp. 248–257.
[93] L. Jorissen, P. Goorts, G. Lafruit, and P. Bekaert, “Multi-view wide baseline depth
estimation robust to sparse input sampling,” in 2016 3DTV-Conference: The True Vision-
Capture, Transmission and Display of 3D Video (3DTV-CON), 2016, pp. 1–4.
[94] Y. Li, K. Qian, T. Huang, and J. Zhou, “Depth estimation from monocular image and
coarse depth points based on conditional GAN,” in MATEC Web of Conferences, 2018, vol.
175, p. 3055.
[95] A. Wang, Z. Fang, Y. Gao, X. Jiang, and S. Ma, “Depth Estimation of Video Sequences
With Perceptual Losses,” IEEE Access, vol. 6, pp. 30536–30546, 2018.
[96] Y. Chen, W. Li, X. Chen, and L. Van Gool, “Learning semantic segmentation from
synthetic data: A geometrically guided input-output adaptation approach,” in Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
2019, vol. 2019-June, pp. 1841–1850.
[97] A. P. Dempster, “Upper and Lower Probabilities Induced by a Multivalued Mapping,”
Ann. Math. Stat., vol. 38, no. 2, pp. 325–339, 1967.
[98] G. Shafer, A mathematical theory of evidence, vol. 42. Princeton University Press, 1976.
[99] A. Buades, B. Coll, and J.-M. Morel, “Non-Local Means Denoising,” Image Process.
Line, vol. 1, pp. 208–212, 2011.
[100] J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.