
This may be the author’s version of a work that was submitted/accepted for publication in the following source:

Moohialdin, Ammar, Lamari, Fiona, Miska, Marc, & Trigunarsyah, Bambang (2021) A real-time computer vision system for workers’ PPE and posture detection in actual construction site environment. In Wang, Chien Ming, Kitipornchai, Sritawat, & Dao, Vinh (Eds.) EASEC16: Proceedings of The 16th East Asian-Pacific Conference on Structural Engineering and Construction, 2019. Springer, Singapore, pp. 2169-2181.

This file was downloaded from: https://eprints.qut.edu.au/197052/

© 2019 [Please consult the author]

This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons Licence (or other specified licence) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to [email protected]

Notice: Please note that this document may not be the Version of Record (i.e. published version) of the work. Author manuscript versions (as submitted for peer review or as accepted for publication after peer review) can be identified by an absence of publisher branding and/or typeset appearance. If there is any doubt, please refer to the published source.

https://doi.org/10.1007/978-981-15-8079-6_199


16th East Asia-Pacific Conference on Structural Engineering & Construction (EASEC16)

Edited by C.M. Wang, V. Dao and S. Kitipornchai

Brisbane, Australia, December 3-6, 2019

A REAL-TIME COMPUTER VISION SYSTEM FOR WORKERS’ PPE AND POSTURE DETECTION IN ACTUAL CONSTRUCTION SITE ENVIRONMENT

Moohialdin, Ammar*a; Lamari, Fionaa; Miska, Marca; Trigunarsyah, Bambangb

a: School of Civil Engineering and Built Environment, Science and Engineering Faculty, Queensland University of Technology, Brisbane 4000, Australia.

b: Property, Construction and Project Management, Design and Social Context, RMIT University, Melbourne 3000, Australia.

Emails: [email protected]

Abstract. Real-time video detection remains challenging, especially for detecting construction site workers, their PPE (helmet and safety gear) and their postures, since the construction site environment involves multiple complications such as different illumination levels, shadows, complex activities, and a wide range of personal protective equipment (PPE) designs and colours. This paper proposes a novel computer vision (CV) system to detect construction workers’ PPE and postures in real time. Four recording sessions were carried out to build a dataset of 95 videos using a novel design of site cameras. The PPE detection covered eight different types of helmets and gear, and the posture detection comprised nine classes. A Python data-labelling tool was used to annotate the selected datasets, and the labelled datasets were used to build a detection model based on the TensorFlow environment. The proposed method consists of two layers of decision trees, which were tested and validated on two videos of 2000 frames. The proposed model achieves high performance, with identification and recall rates over 83% and 95%, respectively. It also achieved posture classification accuracies over 72% and 64% in model testing and validation, respectively. The proposed model can promote potential improvements in the application of real-time video analysis in actual site conditions.

Keywords: Construction; Worker; Computer Vision; PPE; Posture; Detection; Real-time.

1. INTRODUCTION

Collecting accurate information from construction sites is essential as an input to decision-making processes. This information needs to be available in real time so that it can support safety- and productivity-related decisions as well as proactive actions. Conventional data collection methods such as monitoring sensing technologies are overly intrusive (Wong et al., 2014; Chan et al., 2012a; 2012b), costly and require staff training (Zhou et al., 2013; Liang et al., 2011). It is, therefore, important to collect information from the construction site in ways that fit the actual site conditions (Gatti et al., 2013).

Video recording applications on construction sites can produce massive amounts of information (Dimitrov & Golparvar-Fard, 2014; Memarzadeh et al., 2012; Chi & Caldas, 2011) at minimal cost (Han & Lee, 2013). In the last decade, substantial improvements have been made in computer vision analysis and its algorithms, which have been effectively employed in real industry applications such as detecting construction workers and their movements (Han & Lee, 2013; Memarzadeh et al., 2013). However, CV application is still challenging when considering real-time processing and automated interpretation of the massive amount of video information. Moreover, there is a lack of structured data about site activities in a form that can be transcribed into logical algorithms (Seo, Han, Lee, & Kim, 2015). Besides, CV applications in actual site conditions should be able to process videos in less time and with a lower likelihood of error (Seo et al., 2015; Yang et al., 2015; Gong & Caldas, 2010).

Construction site environments also involve dust, direct sunlight, rain and the movement of heavy equipment, which pose another challenge for off-the-shelf cameras. Therefore, this study aims to build a real-time CV system that can be effectively implemented in actual site conditions to detect site workers’ PPE and postures. It also proposes a novel, practical design for a site camera system suited to the challenging construction site conditions. The results of the proposed system’s implementation show that it can be effectively used in actual site conditions for real-time data analysis.

2. METHODOLOGY

The CV used in this research refers to the process of transferring knowledge of site workers and their postures onto a computer to interpret site videos and retrieve meaningful information. The CV system includes three main parts: a data acquisition unit, a processing and understanding unit, and a reporting unit.

2.1. Data Acquisition and Structure

This research used a 2D camera designed for construction site applications rather than an off-the-shelf camera. The unit included four main parts: a power unit, a processing unit, a data storage unit and a camera, as shown in Figure 1. Construction site environments include dust, direct sunlight, rain and the movement of heavy equipment; therefore, the cameras needed protection to prevent them from being damaged. A plastic box was used for this purpose and designed to sit at the top of a tripod that can reach heights of up to three meters to ensure coverage of a wide area. The recording system included a 2D camera with a framerate of 24 fps and a resolution of 1024 by 720. The camera recorded video clips that were 41 seconds long. It also sent frames to a cloud browser at one frame per second. The camera received 5 V power from the processing unit and sent the footage back to the same unit. The processing unit then saved the footage as videos on the external hard drive and sent images to the cloud browser.

Figure 1. The structure of the CVA measurement units.

It can be challenging to maintain a constant electrical power connection on a construction site, as additional arrangements are needed, and the position of the camera may need to change as the construction work progresses. Therefore, the camera was designed to run on solar power. The CVA system also included a small 12 V battery whose output a transformer converted into 5 V power. The processor unit organised the power connection between the transformer and the camera. It also controlled the storage of the footage on the hard drive and in the cloud browser. Three cameras were deployed on three different construction sites at allocated locations. The orientation of the cameras at the top of the tripods helped to cover a flat area of up to 20 m² at ground level. The CVA system detected multiple objects, such as helmets, gear and workers, as well as the workers’ postures. The distance between the camera and the targeted workstation was around 30 to 40 m.

2.2. Requirements of Real-Time Data Analysis

A sensitivity analysis was conducted to determine the appropriate video resolution and framerate for real-time data analysis. The analysis metric was the average number of processed frames per minute, which was used to compare 138 different combinations of video resolution and framerate. The tested video was 15 seconds long and was converted into six different resolutions: 4096 by 2160 (4K), 2048 by 1080 (2K), 1280 by 720 (720P), 720 by 576 (576P), 720 by 480 (480P) and 1960 by 1080. For each resolution, 23 different framerates between 8 fps and 30 fps were tested.

A human detection code was created using MATLAB. The sensitivity analysis began by reading the video and then setting the values of the video resolution, framerate and time counter. The code then converted the video into frames. In each frame, the detection algorithm identified any humans and plotted bounding boxes around them. The frames were then converted back into videos that include the detection results. Finally, the MATLAB video viewer showed the videos with the bounding boxes. The code outcomes also presented a summary of the processing time and the average number of frames processed per minute. Figure 2 shows the sensitivity analysis process in the form of pseudocode.

1. Initialization;
2. Input1: Request_To_Upload_A_Video_Record;
3. Get_Information_About_The_Video;
4. Output1: Show_Video_Duration, Resolution, FrameRate, Number_Of_Frames;
5. Input2: Request_To_Enter_The_Required<Video_Resolution>;
6. Input3: Request_To_Enter_The_Required<Frame_Rate>;
7. Set_Timer_Start_Point;
8. Create_Output_Folders<Resized_Frames, Segmented_Frames, Frames_With_Bounding_Box>;
9. For i = 1 : Number_Of_Frames;
10.   Extract_Frame_From_Video;
11.   Resize_Frame_As<Input2>;
12.   Output2: Save_Step11_Outcomes_Into_Resized_Frames_Folder<Frame_Name = "frame" + sequential_number_of_three_digits>;
13.   Segment_Moving_Objects_From_Frame;
14.   Output3: Save_Step13_Outcomes_Into_Segmented_Frames_Folder<Frame_Name = "frame" + sequential_number_of_three_digits>;
15.   Set_Filters_For_Human_Detection;
16.   Draw_Bounding_Box_Around_Moving_Objects;
17.   Output4: Save_Step16_Outcomes_Into_Frames_With_Bounding_Box_Folder<Frame_Name = "frame" + sequential_number_of_three_digits>;
18.   Count_Number_Of_Bounding_Boxes;
19. Output5: WriteVideo<For_All_Frames_In_Resized_Frames_Folder>; Video_Name<"Resized_Video">;
20. Output6: WriteVideo<For_All_Frames_In_Segmented_Frames_Folder>; Video_Name<"Segmented_Video">;
21. Output7: WriteVideo<For_All_Frames_In_Frames_With_Bounding_Box_Folder>; Video_Name<"Bounding_Box_Video">;
22. Display<Output5, Output6, Output7>;
23. Set_Timer_End_Point;
24. End;
25. Display<"Number_Of_Workers" = Number_Of_Bounding_Boxes>;
26. Display<"Processing_Time" = Timer_Counts>;
27. End

Figure 2. The Pseudocode of the Video Resolution and Frame Rate Sensitivity Analysis.
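For illustration, the sketch below re-creates the core of this loop in Python with OpenCV. The paper’s implementation was in MATLAB, so this is only a minimal sketch under assumed names (VIDEO_PATH, TARGET_SIZE), and it uses OpenCV’s stock HOG people detector in place of the original detection filters.

import time
import cv2

VIDEO_PATH = "site_video.mp4"    # assumed input file
TARGET_SIZE = (720, 480)         # assumed tested resolution (480P)

# Pedestrian detector: OpenCV's default HOG people detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture(VIDEO_PATH)
start = time.time()
frames, workers = 0, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, TARGET_SIZE)      # step 11: resize to the tested resolution
    rects, _ = hog.detectMultiScale(frame)      # steps 15-16: detect humans
    for (x, y, w, h) in rects:                  # draw bounding boxes around detections
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    workers = max(workers, len(rects))          # step 18: count bounding boxes
    frames += 1

cap.release()
elapsed = time.time() - start
print(f"Workers detected (max per frame): {workers}")
print(f"Average processed frames per minute: {60 * frames / elapsed:.0f}")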

2.3. Video Frames Structure

The videos of the construction site consisted of a series of worker activities, including walking, lifting, bending, carrying and kneeling. As a static video frame cannot show changes in the workers’ activities, a series of consecutive frames was extracted from the video records at a rate of 10 fps to support the real-time data analysis. These frames captured dynamic changes in the workers’ postures and activities, which were defined as foreground. Any static objects in the frames, such as foundations and constructed work, were defined as background. The frames are 2D RGB images, defined as a colour intensity vector at each pixel location:

f(x, y) = [r(x, y), b(x, y), g(x, y)]^T    (1)

Page 5: Moohialdin, Ammar,Lamari, Fiona,Miska, Marc ...Real-Time...site workers and their PPE (helmet and safety gear) and postures, since the construction site environment consists multiple

WANG et al.

4

where f(x, y) is the image function and represents each pixel in the image; x and y are integer variables representing the location of each pixel; and r, b and g are the colour channels. The colour intensity range is an integer between 0 and 255. The images were further defined as a function of pixel location and time, f(x, y, t), where t denotes the time domain. The colour intensity of moving pixels changes from the image at time t to time t + Δt. The RGB image function can thus be formulated as:

f(x, y, t) = [r(x, y, t), b(x, y, t), g(x, y, t)]^T    (2)
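As a concrete reading of these definitions, the NumPy sketch below represents a clip as f(x, y, t) and separates foreground from background by frame differencing; the array shapes and the intensity-change threshold are illustrative assumptions, not the paper’s code.

import numpy as np

# A clip as f(x, y, t): time x height x width x 3 colour channels, intensities 0..255.
T, H, W = 10, 480, 720
video = np.random.randint(0, 256, size=(T, H, W, 3), dtype=np.uint8)

# f(x, y) for one frame: the colour intensity vector at pixel (x, y).
t, x, y = 0, 100, 200
pixel = video[t, y, x]                    # array of the three channel intensities

# Moving (foreground) pixels: intensity changes between time t and t + dt.
diff = np.abs(video[1].astype(int) - video[0].astype(int)).sum(axis=-1)
foreground = diff > 30                    # assumed threshold; static pixels = background
print(pixel, foreground.mean())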

2.4. TensorFlow Model

TensorFlow offers multiple models with different processing speeds and levels of accuracy, such as the SSD-MobileNet, R-CNN, Faster-RCNN and Mask-RCNN models. This research compared TensorFlow models to assess their ability to achieve the required rate of 10 fps. The methodology also considered 0.30 s per 10 frames as a safety margin on processing speed and 25 mAP (mean average precision) as the minimum accepted accuracy level. Only two models met these criteria: SSD-MobileNet-V1-FPN and Faster-RCNN-Inception-V2. In terms of processing speed, the two models are very similar, but Faster-RCNN has been shown to be the more accurate model (Huang et al., 2017). Therefore, Faster-RCNN-Inception-V2 was chosen to build the detection model.
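To make the workflow concrete, the following is a minimal sketch of driving an exported TensorFlow Object Detection API model (such as a Faster-RCNN-Inception-V2 checkpoint) on a single frame; MODEL_DIR, the label map and the confidence threshold are assumptions for illustration, not the paper’s configuration.

import numpy as np
import tensorflow as tf

MODEL_DIR = "exported_model/saved_model"          # assumed path to the exported detector
CLASSES = {1: "helmet", 2: "gear", 3: "worker"}   # assumed label map

detect_fn = tf.saved_model.load(MODEL_DIR)

frame = np.zeros((480, 720, 3), dtype=np.uint8)   # stand-in for one video frame
input_tensor = tf.convert_to_tensor(frame)[tf.newaxis, ...]  # add batch dimension

detections = detect_fn(input_tensor)
boxes = detections["detection_boxes"][0].numpy()  # normalised [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
labels = detections["detection_classes"][0].numpy().astype(int)

for box, score, label in zip(boxes, scores, labels):
    if score >= 0.5:                              # assumed confidence threshold
        print(CLASSES.get(label, "unknown"), float(score), box)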

2.5. Applied Decision Tree of the CVA

The first stage of the video analysis involved detecting whether there were workers in each frame using the highly visible items, i.e., the safety helmets and high-visibility clothing (gear). The worker detection algorithm considered three main classes (helmet, gear and worker’s body) to detect site workers in a real environment. The outcomes of this step provided bounding boxes around helmets and gear, as Figure 3-a illustrates.

[Figure 3 is a flowchart: frames are sampled from the video record at 10 fps; after image cropping, size adjustment and colour enhancement, colour-intensity and shape tests identify safety helmets, safety gear and human bodies; bounding boxes are then initiated and detection counts accumulated for each cycle.]

Figure 3. The Proposed Decision Tree of the Task and Worker Detection Process.

The automated posture identification process provided information on a limited number of worker body postures. It included the identification of the upper and lower parts of the workers’ bodies based on the bounding boxes around the workers. As workers perform different activities, their postures change and their upper and lower body parts form different lines and angles. In this research, the two body parts were assumed to form only two lines, with a single angle connecting them. Based on this simplified definition of the workers’ body postures, the decision tree defined five different posture positions (see Figure 3-b). Initially, the proposed model was designed to detect five postures: standing, sitting, kneeling, bending and overhead. The model was then modified to use these five postures as a base to estimate four more postures: walking, pushing, carrying and climbing. The primary assumption of the posture detection still depends on the number of lines and angles created by the postures. For instance, when a worker is walking, the upper body part creates a similar pattern to the standing posture, while the lower body part creates two intersecting lines with an angle between 30° and 60°.
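A heavily simplified Python sketch of this line-and-angle heuristic is given below; only the 30°-60° walking rule comes from the text, while the point format, the standing threshold and the fallback class are illustrative assumptions.

import math

def angle_deg(a, b, c):
    # Angle at vertex b (degrees) formed by points a-b-c, each an (x, y) pair.
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def classify(head, hip, foot):
    # Upper-body line: head-hip; lower-body line: hip-foot.
    trunk = angle_deg(head, hip, foot)
    if trunk > 160:                 # the two lines are nearly collinear
        return "standing"
    if 120 <= trunk <= 150:         # lower line deviates by 30-60 degrees
        return "walking"
    return "bending"                # assumed fallback for sharper angles

print(classify(head=(0, 0), hip=(0, 100), foot=(60, 180)))   # -> walking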

3. RESULTS AND DISCUSSIONS

3.1. Real-Time Sensitivity Analysis

As described in Section 2.2, the sensitivity analysis compared 138 combinations of video resolution and framerate, using the average number of processed frames per minute as the evaluation metric. The results show that three video resolutions had the lowest numbers of processed frames per minute: 4K, 2K and 1960 by 1080, respectively. For the 4K resolution, there was no significant change in the average number of processed frames per minute as the framerate changed, except at 26 fps, where the average was around seven frames per minute. The best averages were identified as about 1368 and 1224 processed frames per minute, at resolution 480P with 9 fps and 576P with 10 fps, respectively.

The analysis results were then restructured to identify whether the average number of processed frames can keep up with the framerate of the processed video. The processing ratio (PR) indicates the ability of the CV system to process video records in real time and is calculated as:

PR = Video framerate in frames per minute / Average number of processed frames per minute    (3)

If the video framerate equals the average number of processed frames per minute, the PR value is one. The appropriate framerate for real-time video processing was found to be between 8 and 10 fps at the resolutions 576P and 480P. These combinations of video framerate and resolution provide around 50% extra processing capacity in a real-time manner.
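As a worked example of Equation (3), using the averages reported above: at 480P and 9 fps the video supplies 9 × 60 = 540 frames per minute against about 1368 processed frames per minute, so PR = 540/1368 ≈ 0.39; at 576P and 10 fps, PR = 600/1224 ≈ 0.49. A PR below one means the system processes frames faster than the video supplies them, leaving roughly half of the processing capacity spare in the 576P case, consistent with the statement above.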

3.2. Model Training

This research utilised TensorBoard to visualise the model training process and its outcomes. TensorBoard also presents the performance metrics that measure the accuracy of the model against time and training iterations. The first performance indicator is the total loss, which describes the ability of the model to classify each detected object into its assigned class and the extent to which this classification is accurate. As observed in the visualised graph (see Figure 4), there are two continuous plots in different colours, dark and faded orange, indicating the actual and smoothed total loss. Initially, the training process was set to include 120,000 iterations. The gradient update of the prediction accuracy increased, while the total loss rapidly decreased and reached around 0.4 at 37,000 iterations. Thus, the training process was stopped at 37,000 iterations, as there was no further improvement in the total loss. The prediction accuracy pattern also shows consistent changes from 20,000 iterations, which suggests that between 20,000 and 30,000 iterations would give the same detection accuracy level.


Figure 4. Gradient Change in the Model’s Total Loss over Time.

3.3. PPE and Worker Detection

The first measures of the PPE and worker detection are the identification rate (IR) and recall rate (RR), which were calculated for each class: helmet (H), gear (G) and worker (W). Ten percent of the 3,000 frames were used to test the detection model. The training and testing datasets included a variety of helmet and gear colours and shapes. The most common helmet colour is white (93.8%); the dataset also contains other colours such as blue, yellow and green helmets. Similarly, various gear colours and designs were included in the training and testing datasets. For instance, some workers wore yellow-blue gear with grey reflective tape, while others wore orange gear without reflective tape. The most common gear colour and design in the dataset under study is yellow-blue without reflective tape. The variety of helmet and gear colours and shapes helps to broaden the range of PPE that the model can detect effectively in a real-site environment. Table 1 illustrates some examples of the detection outcomes.

Table 1. Illustration Examples of the Detection Outcomes.

Example   H   G   W   Posture
1         ✓   ✓   ✓   Walking
2         ✓   ✓   ✓   Climbing
3         ✓   ✓   ✓   Bending
4         ✓   ✓   ✓   Standing

(Each row corresponds to an example frame in the original figure, with ticks marking the detected helmet (H), gear (G) and worker (W) and the classified posture.)

3.3.1. Workers and PPE Detection Evaluation

To evaluate the performance of the detection model, a video of 1000 frames was used for testing and another video of 1000 frames for validation. IR and RR were calculated per the definition of each detection class: helmet, gear and worker. The validation dataset includes 200 frames from a different construction site and 100 frames from a different capturing angle, which introduce a different site environment, video recording angles, illumination levels and occlusion cases. The results of the model testing and validation are summarised in Table 2. Four main measures were used to calculate the model performance indicators: (1) true positives (TP), cases in which the model correctly predicts and labels objects as positive that actually appear in the scene; (2) true negatives (TN), cases where the model correctly does not label as positive objects that do not appear in the scene; (3) false positives (FP), cases in which the model incorrectly labels objects as positive that do not appear in the scene; and (4) false negatives (FN), cases where the model fails to label objects that actually appear in the scene.
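Given these four counts, IR behaves as precision and RR as recall; the small sketch below reproduces the Table 2 helmet testing figures (TP = 779, FP = 41, FN = 96) as a sanity check.

def identification_rate(tp, fp):
    # IR: share of predicted positives that are correct (precision).
    return tp / (tp + fp)

def recall_rate(tp, fn):
    # RR: share of actual positives that are detected (recall).
    return tp / (tp + fn)

print(f"IR = {identification_rate(779, 41):.2%}")   # 95.00%
print(f"RR = {recall_rate(779, 96):.2%}")           # 89.03%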


Table 2. The IR and RR Results of the Model Testing and Validation.

Testing
Class     TP     FP    FN    TN    IR        RR
Helmet    779    41    96    8     95.00%    89.03%
Gear      1285   42    20    1     96.83%    98.47%
Worker    1602   47    10    1     97.15%    99.38%

Validation
Class     TP     FP    FN    TN    IR        RR
Helmet    1740   280   30    0     86.14%    98.31%
Gear      1941   473   16    0     80.41%    99.18%
Worker    2084   438   4     0     82.63%    99.81%

(Counts are taken from the original predicted-versus-actual matrices: TP = predicted positive/actual positive, FP = predicted positive/actual negative, FN = predicted negative/actual positive, TN = predicted negative/actual negative.)

The IR measures the ability of the model to correctly detect the objects in each frame and assign them to one of the three classes H, G and W. The testing results in Table 2 reveal that the proposed detection model achieved IRs of 95%, 96.83% and 97.15% for H, G and W, respectively. The IR validation results yielded 86.14%, 80.41% and 82.63% for H, G and W, respectively. Compared to the testing results, the validation results show on average a 13.27% decline in the IR. This decline can be explained in part by the significant difference in illumination levels, as different capturing angles were also included in the validation dataset. Nevertheless, the validation IR remained above 80% for all three classes.

3.3.2. Postures Detection Evaluation

The average prediction accuracy measure was employed to examine the performance of the proposed model on two datasets composed of various posture sequences. The datasets included a video of 1000 frames for model testing and another video of 1000 frames for model validation. Since nine postures were included, a multiclass comparison approach was adopted to construct the confusion matrix of the proposed model. Separate Python code was developed to perform the multiclass comparisons based on four main Python libraries: Seaborn, Pandas, Matplotlib and NumPy. The pseudocode in Figure 5-c describes the algorithm applied to build the confusion matrix.
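A minimal sketch of such a confusion-matrix build, using the libraries named above, is shown below; the posture labels are toy data for illustration, not the paper’s frames.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Ground-truth and predicted posture labels per frame (toy example).
actual = pd.Series(["standing", "walking", "walking", "bending"], name="Actual")
predicted = pd.Series(["standing", "standing", "walking", "bending"], name="Predicted")

# Row-normalised confusion matrix: each row sums to one across predicted classes.
cm = pd.crosstab(actual, predicted, normalize="index")

sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues")
plt.title("Posture confusion matrix")
plt.tight_layout()
plt.show()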

Figure 5. The Pseudocode of Constructing the Confusion Matrix.

Figure 5 summarises the results of the confusion matrix for the model testing and validation. These results explain how the proposed model performs when considering similarities between different postures and identify the effects of misclassification among these postures. The results also provide insight into the misclassified detections and, more importantly, into which classes they were misclassified. Overall, the results of the model testing (Figure 5-a) show higher accuracy when classifying postures in which the upper and lower body parts are clearly shown. For instance, the model achieved a higher accuracy (on average 0.89) in classifying the standing, walking, overhead, carrying and pushing postures.

In contrast, the sitting, kneeling, climbing and bending postures have a lower accuracy (on average 0.77) compared to the other classes. The results also revealed that the climbing, bending, kneeling and walking postures have crossed confusions with at least three different classes. Meanwhile, the highest misclassification rates were identified between walking-standing (0.072) and standing-walking, due to strong similarities in the line and angle features extracted from the upper and lower body parts. Interestingly, the model had no misclassification cases in the pushing and sitting classes, which can be attributed to the strong distinctiveness of these postures compared to the other classes.

In the case of the validation results (Figure 5-b), all classes have at minimum one confusion instance. It can also be observed that the model achieved a classification accuracy over 0.75 in the first seven classes, while achieving accuracies of 0.67 and 0.64 in the last two classes, respectively. An analysis of the validation dataset and confusion instances reveals that most of the misclassifications resulted from the strong walking-standing similarities and from the 200 additional validation frames, which include more repeated occlusion instances due to the improper camera position and high illumination levels: the sunlight was directly hitting the camera’s viewfinder, as the workstation was one meter more elevated than the surface on which the camera was mounted and 50 meters away from the camera. In addition, the 200 frames contain different site activities (concreting activities) on which the model has not been trained. The dataset also includes 100 frames from a varying camera angle, with the camera mounted about 8 meters above the work area.

4. CONCLUSION

This paper presented a novel CV system for automated detection of workers’ PPE and postures, equipped with a practical design for site cameras. Two layers of decision algorithms were developed to perform the detection in a real-time manner based on the TensorFlow environment. The proposed system was tested and validated in real site conditions, and the results show average identification rates of 90.57%, 88.62% and 89.89%, and recall rates of 93.67%, 98.83% and 99.60%, for H, G and W, respectively. Meanwhile, the model confusion analysis reveals a higher accuracy when classifying workers’ postures in which the upper and lower body parts are clearly shown in the scene, such as standing, walking, overhead, carrying and pushing. In turn, the results show that the climbing, bending, kneeling and walking postures have higher misclassification rates compared to the other postures. The testing and validation results of the proposed CV system carry the promise of practical, real-time application on construction sites. Ongoing and future work involves applying the proposed system for construction site safety and productivity purposes, as well as further analysis of various types of site occlusions and their effects on model performance, together with methods to overcome these effects.

ACKNOWLEDGEMENTS

The first author thanks the Queensland University of Technology (QUT) for the financial support of this research in the form of a PhD scholarship. The authors would also like to acknowledge Dr Miljenka Perovic and Mr Nathan Sianidis for their support in getting access to the construction site for data collection. The authors also acknowledge QUT’s High-Performance Centre (HPC) for providing access to large data storage and computational resources.


REFERENCES

Chan, A. P. C., Yam, M. C. H., Chung, J. W. Y., & Yi, W. (2012a). Developing a heat stress model for construction workers. Journal of Facilities Management, 10(1), 59–74. https://doi.org/10.1108/14725961211200405

Chan, A. P. C., Yi, W., Wong, D. P., Yam, M. C. H., & Chan, D. W. M. (2012b). Determining an optimal recovery time for construction rebar workers after working to exhaustion in a hot and humid environment. Building and Environment, 58, 163–171. https://doi.org/10.1016/j.buildenv.2012.07.006

Chi, S., & Caldas, C. H. (2011). Automated object identification using optical video cameras on construction sites. Computer-Aided Civil and Infrastructure Engineering, 26(5), 368–380. https://doi.org/10.1111/j.1467-8667.2010.00690.x

Dimitrov, A., & Golparvar-Fard, M. (2014). Vision-based material recognition for automated monitoring of construction progress and generating building information modeling from unordered site image collections. Advanced Engineering Informatics, 28(1), 37–49. https://doi.org/10.1016/j.aei.2013.11.002

Gatti, U., Migliaccio, G., Bogus, S. M., Priyadarshini, S., & Scharrer, A. (2013). Using workforce’s physiological strain monitoring to enhance social sustainability of construction. Journal of Architectural Engineering, 19(3), 179–185. https://doi.org/10.1061/(ASCE)AE.1943-5568.0000110

Gong, J., & Caldas, C. H. (2010). Computer vision-based video interpretation model for automated productivity analysis of construction operations. Journal of Computing in Civil Engineering, 24(3), 252–263. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000027

Han, S., & Lee, S. (2013). A vision-based motion capture and recognition framework for behavior-based safety management. Automation in Construction, 35, 131–141. https://doi.org/10.1016/j.autcon.2013.05.001

Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., … Murphy, K. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310–7311). Retrieved from https://arxiv.org/pdf/1611.10012.pdf

Liang, C., Zheng, G., Zhu, N., Tian, Z., Lu, S., & Chen, Y. (2011). A new environmental heat stress index for indoor hot and humid environments based on Cox regression. Building and Environment, 46(12), 2472–2479. https://doi.org/10.1016/j.buildenv.2011.06.013

Memarzadeh, M., Heydarian, A., Golparvar-Fard, M., & Niebles, J. C. (2012). Real-time and automated recognition and 2D tracking of construction workers and equipment from site video streams. In Computing in Civil Engineering (2012) (pp. 429–436). Reston, VA: American Society of Civil Engineers. https://doi.org/10.1061/9780784412343.0054

Memarzadeh, M., Golparvar-Fard, M., & Niebles, J. C. (2013). Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors. Automation in Construction, 32, 24–37. https://doi.org/10.1016/j.autcon.2012.12.002

Seo, J., Han, S., Lee, S., & Kim, H. (2015). Computer vision techniques for construction safety and health monitoring. Advanced Engineering Informatics, 29, 239–251. https://doi.org/10.1016/j.aei.2015.02.001

Wong, D. P., Chung, J. W., Chan, A. P.-C., Wong, F. K., & Yi, W. (2014). Comparing the physiological and perceptual responses of construction workers (bar benders and bar fixers) in a hot environment. Applied Ergonomics, 45(6), 1705–1711. https://doi.org/10.1016/j.apergo.2014.06.002

Yang, J., Park, M.-W., Vela, P. A., & Golparvar-Fard, M. (2015). Construction performance monitoring via still images, time-lapse photos, and video streams: Now, tomorrow, and the future. Advanced Engineering Informatics, 29(2), 211–224. https://doi.org/10.1016/j.aei.2015.01.011

Zhou, Z., Irizarry, J., & Li, Q. (2013). Applying advanced technology to improve safety management in the construction industry: A literature review. Construction Management and Economics, 31(6), 606–622. https://doi.org/10.1080/01446193.2013.798423