4K Lecture Tracking System: Movement Recognition and Lecturer Tracking
Maximilian Karl Alfred Hahn
Honours, Department of Computer Science, University of Cape Town
Rondebosch, 7701, South Africa
ABSTRACT
This paper is based on a lecture recording project using static 4K cameras set up by UCT's CILT that includes blackboard segmentation, lecturer recognition and smooth virtual panning. We present a lecturer tracking solution built on the OpenCV image processing library. The work presented in this paper focuses on movement detection, object tracking, lecturer recognition and near real-time efficiency. Movement detection is performed using OpenCV's absolute difference background subtraction, thresholding and contour search. We find that this sequence outperforms traditional edge-based approaches such as Canny [2] in terms of run-time while still providing usable results. The success of our solution is evaluated by running a set of 17 lecture-specific test cases. We measure the run-time of each use case as a factor of the video's length, and precision as the implementation's ability to track the lecturer as a percentage of total time. We find that the solution works for all common use cases and most uncommon ones, achieving near real-time processing with an average process time of 1.13 times the length of the video.
CCS Concepts
• Computing methodologies ➝ Computer vision problems
• Computing methodologies ➝ Video segmentation
• Computing methodologies ➝ Tracking

Keywords
Background Segmentation; Movement Detection; Object Tracking; Lecturer Recognition; OpenCV Library
1. INTRODUCTION
Some lecturers fear that lecture recordings will be the end of class attendance. However, lecture recording has become more popular in educational institutions as studies show that students are willing to attend lectures as well as make use of lecture recordings. Often this combination of learning environments yielded the highest marks overall [10][7].
A recording system can come with a variety of features, including single-presenter tracking, multi-presenter tracking, blackboard and projector screen segmentation and real-time processing. It can also be implemented with a variety of cameras, such as a static camera, a panning camera, a tilting camera, a zooming camera or a combination of the above.
The advantage of these systems is that the recorded lectures need minimal to no editing. For example, with blackboard segmentation the frames containing blackboard writing don't need to be manually cropped out and saved to file. Most importantly, a cinematographer doesn't need to be hired to film the lecturer or to edit the lecture video by manually panning a frame. Unfortunately, there aren't any open source systems available that implement a lecture tracking solution that produces visually appealing results.
We were approached by the Centre for Innovation in Learning and
Teaching (CILT) at the University of Cape Town (UCT) to
implement such a system for their new high definition (3840 x 2160
pixels) 4K video cameras (Figure 1). CILT had previously implemented several approaches to lecture tracking, including 1080p static cameras (Figure 2) and Pan-Tilt-Zoom (PTZ) cameras (Figure 2), which cost R5,000-R10,000 and R60,000-R80,000 respectively. The 1080p cameras don't produce a clear enough image, and the PTZ cameras are very expensive with the addition of the Raspberry Pi required to power them. The 4K cameras that CILT has invested in cost R15,000 each and produce very clear video, although streaming them requires a very fast internet connection. The brief from CILT was to provide a cost-effective
software system to segment boards, track the lecturer and output a
smooth-panning 720p frame extracted from the high resolution 4K
stream.
Figure 1 - CILT's 4K Video Camera
Figure 2 – Left: PTZ Camera, Right: Static Camera
The 4K lecture tracking program was divided into three
components: the video pre-processing and blackboard
segmentation system, the movement detection and lecturer tracking
system and lastly the virtual cinematographer.
1. The video pre-processing performs light correction to
smooth light-based changes between frames and uses the
processed frames to segment out blackboards. Lastly a
motion detection mask is created over a constant time
step to record where motion occurs over those frames.
The mask is part of our future work.
2. The movement detection system detects movement using background subtraction between two frames. The movement is then segmented into rectangles of motion. A lecturer is chosen from these rectangles using the intersection of rectangles across frames and the amount of time a given area spends on screen. This can be thought of as similar to a heat map.
3. The virtual cinematographer section focuses on panning
the output frame in a smooth fashion to avoid making the
video uncomfortable or jarring to watch. This attempts to
emulate the way an actual cameraman would film a
lecture.
Our aim is to develop a fully functioning 4K lecture tracking
solution. This includes making the solution both time efficient and
robust enough to track a lecturer precisely across a plethora of
possible cases. This paper addresses the second module of this project.
The paper is laid out as follows. Section 2 covers work related to
image processing techniques used in this system. Section 3 details
the specifics of the system as well as system requirements and
constraints. Section 4 elaborates on how we evaluated the system
including the evaluation results. Section 5 focuses on analysis of
the results found. Section 6 presents concluding statements as well as possible future directions and improvements to this system.
2. BACKGROUND & RELATED WORK
Below we discuss related work that implemented lecture tracking solutions.
Zhang et al. (2005) present a tracking solution based on a pixel
motion histogram that implements virtual panning. In addition to
this they also use a PTZ camera that can pan horizontally to track
the lecturer when they move out of frame. They mention the
performance issues with face detection with regard to run time as
well as robustness. To make their system robust to lighting changes
and make it more efficient the motion pixel histogram is only
calculated around an area that surrounds the detected lecturer. This
means that changes in lighting don’t affect other areas of the screen
as they simply aren't processed. While this solution appears to work well, their frames were 640 x 480, which is far smaller than the 4K frames we need to process, so performance could become a problem at our resolution.
Arseneau et al. (1999) present a tracking solution for classroom environments where the camera isn't mounted and pointed in an optimal position, unlike ours. A background subtraction technique that divides horizontal and vertical maxima into bins is used. The global maxima of these two axes are chosen as the center point of a region of interest. These regions of interest are then processed using a 2:1 height:width ratio rectangle to output the location of the presenter. This approach is more robust to room setup but doesn't account for other movement in the scene; if two humans were to enter the view, the region of interest could jump between successive frames.
Below we present a description and explanation of the image
processing techniques used in this work that are implemented as
part of the OpenCV library.
Mathematical Morphology is a processing technique
implemented on geometrically structured data such as a 2D grid of
binary pixels [5]. It can expand or shrink the data with the
application of dilation or erosion procedures. In our
implementation we make use of it to expand our thresholded image
so contours are easier to find.
Blurring functions reduce image sharpness by blending nearby
values together. This can be done in a variety of ways such as with
a box blur which simply averages all nearby values in a sliding
window [6] or with a Gaussian approximation which similarly
makes use of a sliding window but averages values using a
weighted matrix of values [3]. Blur is used in our implementation
to make contours easier to find.
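As a brief illustration of the two blurs described above, the following snippet applies a box blur and a Gaussian blur with OpenCV. The 15 x 15 kernel mirrors the size used later in Section 3.3; the input file name is purely illustrative.

#include <opencv2/opencv.hpp>

int main() {
    // Illustrative input frame; any single-channel or 3-channel image works.
    cv::Mat src = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
    if (src.empty()) return 1;

    cv::Mat boxBlurred, gaussBlurred;
    // Box blur: unweighted average of all values in a 15 x 15 sliding window [6].
    cv::blur(src, boxBlurred, cv::Size(15, 15));
    // Gaussian blur: weighted average over the same window; sigma derived from the kernel size [3].
    cv::GaussianBlur(src, gaussBlurred, cv::Size(15, 15), 0);
    return 0;
}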
Finding contours of an image can be done using border following
as in OpenCV’s implementation. This implementation follows
Suzuki et al. (1985), which describes a border following algorithm that accepts binary images as input. It creates outline contours and also registers holes within contours. These contours outline and save
the movement detected by a background subtraction algorithm.
3. DESIGN & IMPLEMENTATION
This section includes a description of the constraints set forth by CILT, their requirements, and the assumptions we made that helped in developing our program. We then give an overview of our system and its architecture, and go into more depth on the lecturer recognition portions. Finally, the OpenCV methods we used are discussed, followed by an in-depth walkthrough of our algorithm in Sections 3.4, 3.5 and 3.6.
3.1 System Constraints & Requirements
The basic requirements set forth by CILT are to:
• Segment blackboards, saving clear and high definition images of these segments; these segmented images are to be updated as the board is written on.
• Take a 4K lecture video as input and use computer vision algorithms to track the lecturer.
• Output a 720p video that follows the tracked lecturer positions, acting as a virtual cinematographer.
Important assumptions we made when implementing our system include:
• The camera is positioned such that the lecturer will be around the middle of the screen with regard to the y-axis.
• The camera is positioned such that the lecturer's podium and middle board, or gap between two boards, are in the middle of the screen with regard to the x-axis.
• The lecturer will not always be alone in the frame, and as such occlusion needs to be handled.
• The lecturer can't move very far between successive frame reads.
• The video captured doesn't ever lag, meaning that no frames are dropped or duplicated.
3.1.1 Functional Requirements
CILT requires that the system is able to track a lecturer correctly throughout most normal lecture situations. From this we devised the research aim that our system needs to correctly identify the lecturer for 90% of the video. The lecturer is correctly identified when (in debug mode) the program shows a rectangle that contains at least 70% of the lecturer.
CILT requires the system to be able to run well across a plethora of
lecture scenarios. We therefore devised a set of 17 use cases that
are discussed in the methods section. These cover all realistic
scenarios that would occur in a lecture theatre as well as some edge
case scenarios. Successfully completing this requirement would
provide strong support for the robustness of the system across all
realistic scenarios.
3.1.2 Non-Functional Requirements
CILT currently releases lecture videos within 8 hours of the lecture finishing, and all video post-processing must fall within that period. This doesn't leave much time for our framework, as other processes and steps also form part of this 8-hour time limit. One example of these processes is needing to manually cut the video to the exact start and end of the lecture. CILT recommended this module take at most 3 times the runtime of the actual video. From this we devised the research aim of processing the lecturer tracking section in less than 2 times the length of the input video, as we believe our planned algorithm can meet and beat the requirement outlined.
Our research aims therefore are:
1. Can this section run efficiently enough to be processed in
less than 2 times the length of the video?
This will fulfill the process speed requirement meaning
lecture videos will be available to students faster without
creating a buffer of unedited videos.
2. Does our system correctly segment out all motion and
decide on the correct lecturer 90% of the time for likely
use cases?
This will fulfill the main functional requirements
meaning our solution will be usable without further work
for most lecture recording.
3. Does our system correctly segment out all motion and
decide on the correct lecturer 90% of the time for unlikely
(edge) use cases?
This will fulfill the edge case functional requirements
meaning our solution will work for a variety of odd cases
and is generally quite robust.
3.2 System Overview
The movement detection and lecturer recognition module requires the lecture video as input and reads all frames of the video file during its execution. Not all frames need to be processed; in order to reduce processing time for this module, we implemented the ability to skip frames, which are read but not processed. The processed frames are passed through multiple computer vision algorithms provided by the OpenCV library to yield rectangles that encapsulate an area of motion. A history of these rectangles is created in a class called "Ghost". Once all frames have been read, these ghosts are post-processed to select a lecturer for each frame, and the locations of the ghosts are shifted to more accurately represent the position of the lecturer. This section sends the locations of the lecturer at each processed frame to the virtual cinematographer section.
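To make the read-but-skip behaviour concrete, here is a minimal sketch of the frame loop described above; the file name and the choice of N are illustrative (Section 5.1 reports that we settled on every 4th frame).

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("lecture_4k.mp4");    // illustrative input file
    if (!cap.isOpened()) return 1;

    const int N = 4;                           // process every Nth frame
    cv::Mat frame;
    for (long i = 0; cap.read(frame); ++i) {
        if (i % N != 0) continue;              // skipped frames are read but not processed
        // ... movement detection and ghost tracking on 'frame' (Sections 3.3 to 3.5) ...
    }
    return 0;
}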
We used OpenCV, an open source computer vision library, for
much of our processing. This library provided us with many
complicated image processing methods that allowed us to test a
plethora of approaches to a single problem with minimal effort.
OpenCV makes use of the OpenCL framework to parallelize its methods as much as possible, which means that OpenCV is written to be very efficient and aims to make real-time solutions possible.
Since we needed a very efficient program we also had to decide
between C++, Java and Python implementations of the OpenCV
library. Prechelt (2000) found that C++ runtime for the same
program was on average 2 times faster than Java and almost 2 times
faster than Python. Similarly, C++ also performed best for memory, being roughly two to four times more memory efficient than Java and Python. Given these results it seems logical
to use C++ as our processes need to be completed quickly and
minimizing memory usage is important when processing raw 4K
pixel data. Unfortunately, the study also revealed that development
using Python is two times faster than Java and C++. This meant that
although the code would run quicker with our choice of C++, it
would likely take much longer to develop especially since we were
new to the OpenCV library. A similar study also performed by
Prechelt (1999) mirrored these results between Java and C++.
Figure 3 - Primary Sequence of the MovementDetection Class
while(readNextFrame)
    if(Nth frame)
        1  absoluteDifference(frame a, frame b)
        2  threshold(25, 255)
        3  morphologicalDilate(3x3 structure)
        4  blur(15x15 matrix)
        5  findContours()
        6  Process Rectangles Logic (Section 3.4)
        7  Process Ghosts Logic (Section 3.5)
        8  Store Ghosts and Rectangles
9  findLecturer() (Section 3.6)
10 adjustLecturer() (Section 3.6)
Figure 3 shows our ability to process every Nth frame. This means we can find a balance between the run time of the program and how precisely changes in movement are registered. Sections 3.4, 3.5 and 3.6 all implement many small methods that are based on assumptions about the way a lecturer will behave and that tend towards maximizing the chance that the lecturer is found. Note that OpenCV uses a top-left origin co-ordinate system and all methods assume knowledge of this convention.
A project system architecture diagram is presented in Appendix A
in the form of a class diagram.
3.3 OpenCV Methods
Our first approach was to use Sobel filtering [4], which was slow over many frames and also yielded all edges, which we would then have needed to segment. We ran into similar issues using Canny edge detection [2]. We decided to change the direction of the project to make use of movement detection through background subtraction rather than edge detection.
Our first background subtraction approach made use of OpenCV's MOG2 background subtractor from Zivkovic (2004), which produced good foreground segmentation but resulted in lots of noise all over the scene. It also took noticeably too long to process frames. We decided to go with a simple absolute difference algorithm to segment differences between frames, followed by a threshold function. This provided mostly clean background subtraction with very little noise because we chose very tight limits for our thresholding function. The runtime of this algorithm was also noticeably better.
We then performed a morphological dilation as well as a blur using
a large 15 x 15 matrix to make contours more recognizable.
Because of this order of operation, we found that very little noise
got through and thus the dilation and blur only made areas we
wanted to detect larger. After this we found the contours of our edited frame, the output of which is shown in Figure 4 below.
Figure 4 - Output of Find Contours Including Short Contours.
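The following sketch strings together steps 1-5 of Figure 3 using the OpenCV calls named above. The threshold, kernel and blur sizes mirror those in Figure 3; the function name, the minimum contour length and the assumption of single-channel grayscale input frames are illustrative.

#include <opencv2/opencv.hpp>
#include <vector>

// Returns one bounding rectangle per sufficiently long motion contour
// found between two consecutive processed (grayscale) frames.
std::vector<cv::Rect> motionRectangles(const cv::Mat& prevGray, const cv::Mat& currGray) {
    cv::Mat diff, mask;
    cv::absdiff(prevGray, currGray, diff);                                  // 1: absolute difference
    cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);                  // 2: tight threshold
    cv::dilate(mask, mask,
               cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3))); // 3: morphological dilation
    cv::blur(mask, mask, cv::Size(15, 15));                                 // 4: 15 x 15 blur

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE); // 5: contour search

    std::vector<cv::Rect> rects;
    for (const auto& c : contours) {
        if (c.size() < 20) continue;         // drop short contour chains (Section 3.4); 20 is illustrative
        rects.push_back(cv::boundingRect(c));
    }
    return rects;
}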
3.4 Recognition Algorithm
This section follows Figure 3 Step (6) and explains how contours are converted to rectangles and the various culling and grouping processes that the per-frame rectangles undergo.
The first step in the process removes any contour chains that have very few nodes. This is because any complex shape that we are interested in will have many nodes; it also tends to ignore small movements which can come about as a result of light changes or refocusing of the camera. Very large simple shapes such as a rectangle can be represented by short contour chains, so this removes those as well.
Figure 5 - Upper and Lower Bounds of Movement
A bounding rectangle is created around each contour using
OpenCV’s boundingRect method. Since we assume that the
lecturer is in the middle of the screen with regard to the y-axis any
rectangle’s top that is greater than a maximum y threshold is
deleted. For the same reason any rectangle’s bottom that is smaller
than a minimum y threshold is deleted. Figure 5 shows an example
of these thresholds where green rectangles are valid and red
rectangles are removed.
The next step is to check whether the rectangle's width:height ratio is below a certain threshold. This is because we found that humans being tracked don't move in such a way as to generate very wide rectangles. This check filters out boards and projector screens moving vertically, as those are often detected as very wide rectangles, as in Figure 6.
Figure 6 - Board Top Being Movement Detected
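A sketch of the vertical-threshold and aspect-ratio culling described above is shown below; the threshold values are illustrative, and the y checks rely on OpenCV's top-left origin noted in Section 3.2.

#include <opencv2/core.hpp>
#include <vector>

// Removes rectangles that start too low, end too high, or are too wide
// relative to their height. All threshold parameters are illustrative.
std::vector<cv::Rect> cullRectangles(const std::vector<cv::Rect>& in,
                                     int minY, int maxY, double maxWidthHeightRatio) {
    std::vector<cv::Rect> out;
    for (const cv::Rect& r : in) {
        if (r.height == 0) continue;                // guard against degenerate rectangles
        if (r.y > maxY) continue;                   // top below the maximum y threshold: e.g. seated students
        if (r.y + r.height < minY) continue;        // bottom above the minimum y threshold: e.g. board/screen tops
        double ratio = static_cast<double>(r.width) / r.height;
        if (ratio > maxWidthHeightRatio) continue;  // very wide rectangle: e.g. a board edge moving vertically
        out.push_back(r);
    }
    return out;
}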
Overlapping and nearby rectangles need to be grouped together into
a larger rectangle. This is in an attempt to build a rectangle around
a person that has multiple small movements detected and therefore
has registered multiple unconnected contours in a frame.
Implementation of this is shown in Figure 7.
Figure 7 - Rectangle Overlap and Proximity Evaluation Logic
do
    changeRegistered = false
    for each rectangle R1
        for each rectangle R2
            if(R1 intersects R2)
                replace R1 with new bounding rectangle
                remove R2
                changeRegistered = true
    for each rectangle R1
        for each rectangle R2
            if(minDistance(R1, R2) < proximity threshold)*
                replace R1 with new bounding rectangle
                remove R2
                changeRegistered = true
while(changeRegistered)
*minDistance(rectangle, rectangle) finds the distance between the closest two edge points of the two rectangles
Figure 7 is a rudimentary clustering algorithm for rectangles, and Figure 8 shows its outcome. Each run of the logic in Figure 7 is checked for changes, as a new bounding rectangle could now intersect other rectangles.
Two redundancy checks are then performed. The first checks whether the frame is too cluttered with rectangles, which can occur when there is too much motion or when the camera refocuses. In this case it is difficult to extract useful information, so a default middle bounding rectangle replaces all others. The second check is for when no bounding rectangles are found, in which case a default rectangle is similarly placed. This is necessary because otherwise empty frames are effectively discarded and the virtual cinematographer will begin to lag behind the actual video, as the wrong rectangles are assigned to frames as a result of the skipped empty frames.
Figure 8 - Nearby Rectangles (left) and Their Resulting Bounding
Rectangle (right)
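As an illustration of the grouping loop in Figure 7, the sketch below merges intersecting or nearby rectangles into their common bounding rectangle using cv::Rect's & (intersection) and | (union) operators; minEdgeDistance and the proximity parameter are illustrative stand-ins for the minDistance check, not the names used in our implementation.

#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Distance between the closest edges of two axis-aligned rectangles (0 if they overlap).
static double minEdgeDistance(const cv::Rect& a, const cv::Rect& b) {
    int dx = std::max({0, a.x - (b.x + b.width),  b.x - (a.x + a.width)});
    int dy = std::max({0, a.y - (b.y + b.height), b.y - (a.y + a.height)});
    return std::sqrt(static_cast<double>(dx) * dx + static_cast<double>(dy) * dy);
}

// Repeatedly replaces intersecting or nearby pairs with their bounding rectangle
// until no further change occurs, as in Figure 7.
void mergeRectangles(std::vector<cv::Rect>& rects, double proximity) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < rects.size() && !changed; ++i) {
            for (std::size_t j = i + 1; j < rects.size(); ++j) {
                bool intersects = (rects[i] & rects[j]).area() > 0;
                if (intersects || minEdgeDistance(rects[i], rects[j]) < proximity) {
                    rects[i] = rects[i] | rects[j];   // new bounding rectangle
                    rects.erase(rects.begin() + j);   // remove the merged rectangle
                    changed = true;
                    break;                            // restart the scan after any change
                }
            }
        }
    }
}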
3.5 Ghost Tracking
This section follows Figure 3 Steps (7) and (8) and explains what the "Ghost" class does as well as how ghosts are chosen and used.
A ghost is a rectangle whose size and longevity depend on how it intersects with rectangles from the recognition stage (Section 3.4). The ghost tracks the movement detection rectangles across multiple frames, recording how long an object moves as the number of frames it is tracked in. The ghost takes uncorrelated rectangles in multiple frames and establishes a relationship between them, assigning a group of rectangles across many frames to a single entity.
A new ghost is instantiated when a rectangle is created that doesn’t
intersect with any previous ghosts. This ghost will have the
dimensions of the rectangle as well as a time on screen of 1 frame.
In successive frames, if this new ghost intersects other rectangles,
it will grow outwards towards that rectangle’s extremities as shown
in Figure 9 below.
Figure 9 - A Ghost (white) Updating Position Towards the
Movement Detected (blue)
If no intersection is found or the intersection is below a percentage
of the ghost’s area, then the ghost will shrink inwards. Once the
ghost has become small enough it will be deleted. The assumption
behind this is that people in the view will constantly be moving at
least a small amount. This should ensure that human objects are
constantly tracked whereas objects such as boards are only tracked
when they are moved. The resizing of the ghost based on
intersected rectangles also allows the ghost to track movement
laterally as in Figure 9. This assumes that people do not move
quickly enough to exit the ghost.
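A compact sketch of the ghost bookkeeping described above is given below. The growth rule (expand to the union with the intersecting rectangle), the shrink step, the minimum overlap fraction and the deletion size are all illustrative values, not those used in our implementation.

#include <opencv2/core.hpp>
#include <algorithm>

class Ghost {
public:
    explicit Ghost(const cv::Rect& r) : box(r), onScreenTime(1) {}

    // Called once per processed frame with the motion rectangle (if any) that intersects this ghost.
    void update(const cv::Rect* motion, double minOverlapFraction = 0.2) {
        if (motion != nullptr &&
            (box & *motion).area() > minOverlapFraction * box.area()) {
            box |= *motion;      // grow outwards towards the rectangle's extremities
            ++onScreenTime;      // the object kept moving: extend its time on screen
        } else {
            shrink();            // no (sufficient) intersection: contract inwards
        }
    }

    // A ghost is deleted once it has shrunk below a minimum size.
    bool dead() const { return box.width < 10 || box.height < 10; }

    cv::Rect box;        // current position and size of the ghost
    int onScreenTime;    // number of processed frames this ghost has been tracked in

private:
    void shrink(int step = 4) {
        box.x += step;
        box.y += step;
        box.width  = std::max(0, box.width  - 2 * step);
        box.height = std::max(0, box.height - 2 * step);
    }
};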
Figure 10 shows the logic for ghosts being merged together and
split apart. The merge and split algorithm was introduced to handle
occlusion of two or more tracked objects. This allows the highest
screen time count of the merged ghosts to be recorded and kept.
Figure 11 illustrates the split algorithm running: it shows two rectangles that are further apart than the threshold distance and are therefore split.
In case the system picks up the wrong object as the lecturer, there is a reset function that reduces each ghost's time by two thirds. This operation is performed every 120 frames and acts as a soft reset. We found through testing that every 120 frames was frequent enough to affect any object visible on screen for more than 4 seconds, which is roughly how long it took us on average to walk across the lecture view. The reset therefore only affects objects that are attempting to stay in the view.
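The periodic soft reset can be sketched as follows, reusing the illustrative Ghost sketch from above; reducing the accumulated time by two thirds is modelled as an integer division by three, and the frame counter name is illustrative.

#include <vector>

// Every 120 processed frames, each ghost keeps only a third of its accumulated
// on-screen time, so only objects that remain moving in view retain a high count.
void softReset(std::vector<Ghost>& ghosts, long processedFrameIndex) {
    if (processedFrameIndex == 0 || processedFrameIndex % 120 != 0) return;
    for (Ghost& g : ghosts)
        g.onScreenTime /= 3;   // reduce by two thirds
}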
At the end of the frame read loop, Step (8), all rectangles that
survived until this point as well as all ghosts are saved for post-read
processing.
Figure 10 - Ghost Intersect and Divide Logic
//intersect
for each ghost G1
    for each ghost G2
        if G1 intersects G2
            merge G1 and G2
//divide
for each ghost G
    for each rectangle R
        if G intersects R
            store R index in I
    for each i1 in I and each other i2 in I
        if i1 and i2 don't intersect and are in line on the y-axis
            if distance(i1, i2) > minDistance
                split G into G(i1) and G(i2)
Figure 11 - Ghost (white) Being Split into Two
3.6 Lecturer Selection
This stage follows Figure 3 Steps (9) and (10) and processes all ghosts stored while the video file was processed. It focuses on deciding which ghost is the lecturer and then shifting the boundaries of the ghost to better track the lecturer.
To find the lecturer, we begin by assuming that the lecturer will tend to be in the center of the screen. The lecturer is selected as the ghost with the highest value of a distance ratio operation. The ratio is the distance from the x-sides and y-sides of the frame, scaled from 0 (at the edge) to 1 (at the center). Figure 13 displays an example of this scaling.
val_x = onScreenTime × ratioX²
val_y = onScreenTime × ratioY²
Figure 12 – Positional Importance Formulae
Figure 13 - Positional Importance Visualization
The final decision value is calculated as the average of val_x and val_y from Figure 12. This means that movement low down or high up on the screen can be ignored, as this is likely seated students, a student crossing the venue or a projector screen moving above. Since we assume the camera is pointed at the center of the lecture space, it also makes sense that the lecturer will occupy this central space.
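Putting the formulae of Figure 12 together, a hedged sketch of the per-ghost decision value looks like the following. It assumes that ratioX and ratioY scale the ghost centre's distance from the nearest frame edge to [0, 1] (1 at the centre), and it reuses the illustrative Ghost sketch from Section 3.5.

#include <algorithm>
#include <cmath>

// Decision value for a single ghost: the average of val_x and val_y from Figure 12.
double lecturerScore(const Ghost& g, int frameWidth, int frameHeight) {
    double cx = g.box.x + g.box.width  / 2.0;
    double cy = g.box.y + g.box.height / 2.0;

    // 1.0 at the centre of the frame, falling to 0.0 at the edges.
    double ratioX = std::max(0.0, 1.0 - std::abs(cx - frameWidth  / 2.0) / (frameWidth  / 2.0));
    double ratioY = std::max(0.0, 1.0 - std::abs(cy - frameHeight / 2.0) / (frameHeight / 2.0));

    double valX = g.onScreenTime * ratioX * ratioX;   // val_x = onScreenTime * ratioX^2
    double valY = g.onScreenTime * ratioY * ratioY;   // val_y = onScreenTime * ratioY^2
    return (valX + valY) / 2.0;                       // the ghost with the highest value is the lecturer
}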
The last step, Step (10), focuses on adjusting the selected lecturer's bounding rectangle, as explained in Figure 14. This is done to better match the original rectangles generated by this module. It is necessary because the ghosts used to determine the lecturer tend to lag slightly behind the lecturer's movements.
Figure 14 - Lecturer Rectangle Adjustment Logic
for each lecturer L
    for each rectangle at L.index
        if the rectangle intersects L and isn't 2 x larger than L
            store the rectangle in R
    for each stored rectangle in R
        shift the boundaries of L to the average of all R and L
The locations of the lecturer are saved as a vector of rectangles and
can be accessed later for the virtual cinematography module of the
program.
4. EXPERIMENTAL DATA AND SETUP
To test our system, we devised a set of 17 use cases illustrating how a lecturer might move in a lecture. We then acted them out while recording with a 4K camera, and these videos were processed through this module. The use cases chosen are fairly broad and cover many common cases as well as a couple of edge cases to test the limits of this module. We also analyzed a set of 5 videos of real lectures; for each of our 17 use cases we estimated a likelihood score (1-5), explained in Table 1, based on how often the behavior occurred in these lectures.
Table 1 - Explanation of Use Case Scores
Likelihood | No. | Definition
Very Unlikely | 1 | Occurred once in one of the 5 lecture videos
Unlikely | 2 | Occurred at least once in at least 3 of the lecture videos
Possible | 3 | Occurred at least once in all 5 of the lectures
Likely | 4 | Occurred at least 5 times in all 5 of the lectures
Very Likely | 5 | Occurred more than 5 times in all 5 of the lectures
Table 2 - Lecture Use Cases
No. | Description | Likelihood (1-5)
1 | Basic lecturing – little movement, no pacing or gesturing. | 5
2 | Lots of hand waving – moderate movement, little pacing, lots of gesturing. | 4
3 | Lots of pacing – lots of movement, lots of pacing, little gesturing. | 4
4 | Changing light – moderate movement, lecturing while light conditions change often. | 2
5 | Moving boards – lecturing while moving boards up and down often. | 3
6 | Screens and movement – setting projector screens to go up and down while lecturing, lots of lecturer movement and pacing. | 2
7 | Screens and stationary – setting projector screens to go up and down while lecturing with no movement. | 2
8 | Off and on – move off the screen and then back on. | 3
9 | Student crosses – lecturing in center while student crosses view. | 3
10 | Both move – both student and lecturer walk from one side of view, lecturer halts in middle. | 2
11 | Both move opposite – both student and lecturer approach from different sides of the view. | 1
12 | Running – lecturer running across the view. | 1
13 | Throwing – lecturer throwing an object between themselves and a student. | 1
14 | Multiple students – simulate a 3-student presentation. | 1
15 | Two students cross – students walk past lecturer in middle from either side of the view. | 1
16 | Student chairs – student moving along chairs in bottom part of view. | 3
17 | No movement – no one in the view. | 2
We then constructed Table 2 using this scoring system, which shows the likelihood of the behavior in each use case occurring.
We decided that use cases with likelihood 3 and above (1, 2, 3, 5, 8, 9 and 16) are grouped as likely to occur, whereas use cases 4, 6, 7, 10, 11, 12, 13, 14, 15 and 17 are grouped as unlikely to occur. This division
will help us evaluate the overall success of this module, since core use cases are more important to cover than edge use cases because they occur more often. This distinction is also used to evaluate
research aims 2 and 3 which are the same objective for more and
less likely use cases.
To evaluate the performance of our approach we stepped through the use case videos at a rate of 4 frames per step. We counted the number of times the program incorrectly identified the lecturer. Incorrect identification is counted when the rectangle tracked the wrong person or a moving object, or lagged behind the movement of the lecturer (having the lecturer completely outside of the rectangle but still tracking in their direction). In our final results we simplified the data by rounding frames to the nearest second; this handles the possibility of variable framerates between videos.
To evaluate the runtime of our system, the processing time of each use case is recorded. If the average processing time of all our use
cases takes longer than 2 times the average length of the use case
videos, the research aim is considered failed.
5. RESULTS
All data presented below was obtained using an i7-6700HQ processor with a boost frequency of 3.5GHz, 8 threads and 6MB cache, 8GB of DDR3 memory and a 7200rpm HDD.
Table 3 - Use Case Evaluation Results
No. | Length (s) | Process Time (s) | % Process Time | Mistracked (s) | % Correct Tracked
1 | 57 | 66.192 | 116.13% | 2 | 96.49%
2 | 31 | 32.756 | 105.66% | 0 | 100.00%
3 | 48 | 46.648 | 97.18% | 0 | 100.00%
4 | 85 | 90.575 | 106.56% | 60 | 29.41%
5 | 40 | 44.879 | 112.20% | 3 | 92.50%
6 | 48 | 56.975 | 118.70% | 1 | 97.92%
7 | 46 | 52.147 | 113.36% | 3 | 93.48%
8 | 40 | 45.824 | 114.56% | 3 | 92.50%
9 | 72 | 85.613 | 118.91% | 4 | 94.44%
10 | 17 | 19.67 | 115.71% | 1 | 94.12%
11 | 16 | 21.707 | 135.67% | 0 | 100.00%
12 | 24 | 28.274 | 117.81% | 14 | 41.67%
13 | 32 | 40.115 | 125.36% | 2 | 93.75%
14 | 162 | 156.311 | 96.49% | 62 | 61.73%
15 | 30 | 37.577 | 125.26% | 3 | 90.00%
16 | 32 | 36.397 | 113.74% | 1 | 96.88%
17 | 77 | 76.248 | 99.02% | 0 | 100.00%
5.1 Runtime of Solution
Table 3 shows the time taken to process each use case as a percentage of the video's length. The results are within the parameter of research aim 1, as each falls well below 200%, with some even dipping below 100%. We decided to read every 4th frame because we found this to be a good balance between keeping this module time efficient and preventing objects from moving too far between successive processed frames. These results can be further improved with the use of a solid state drive (SSD). We ran some tests with an SSD and found that this reduced the processing time considerably. As illustrated in Figure 15, 68% of the time taken to process is read operations. This is likely because of the large size of a single 4K frame, which is 7.45MB. It is likely that our fast processing times are due to the efficiency of OpenCV and its heavy parallelism. We were able to verify the effectiveness of OpenCV's parallel implementations by watching the Windows task manager while running the program: it revealed that up to 30 threads were being pooled at any one time. It is also important to note that our processes described in Sections 3.4, 3.5 and 3.6 take up a very small portion of the processing time (less than 13%). Most processing in this project is attributed to the computer vision techniques implemented by the OpenCV library and to the read operations.
Figure 15 - Operation Time Taken
5.2 Likely Use-Cases
Table 3 shows that all of our likely use cases (1, 2, 3, 5, 8, 9 and 16) have a tracking rate of at least 90%. This means that for a typical lecture this system can correctly track the lecturer at least 90% of the time, thus completing our research aim 2.
The solution presented registers large movement well. This is because the first step of our algorithm employs absolute difference background subtraction. Without movement there is nothing for the system to register; therefore the center of the screen will be tracked. One of our stated assumptions is that the 4K cameras are always pointed at the center of the lecture stage. Because of this assumption, use cases with minute or no movement still tended to track the lecturer, as they were stationary at the podium. For use cases where the lecturer was stationary in other areas of the room, slight movements were still recognized and tracked.
The moving boards use case (5) tests lots of lecturer movement and
board movement. In this use case the lecturer is contained within
the greater tracked rectangle of the board being moved. While this
can be considered a failed detection, to our system and specifically
the virtual cinematographer module this has no negative effect as
the lecturer/board will still be tracked correctly horizontally. The
virtual cinematographer doesn’t need to pan vertically as of yet.
One of the likely use cases is that a student crosses the view of the
lecturer. What this means for our module is that the lecturer and
student will be pushed into a single ghost and given the same time
on screen. Because of the setup of a lecture the student will either
be going to a seat or coming from a seat. In either case they will not be in the view long, and when they approach the edges of the view their chance of being tracked is reduced severely. This is because of
our weighted values approach illustrated in Figure 13.
Another use case is a student bending over in the view to seat
themselves in the front row. This is handled as we assume a lecturer
will be vertically in the middle of the frame because of the way in
which the cameras are positioned. The projector screen will be the
only thing moving at the top and students the only things moving
at the bottom of the screen.
The system developed and the stages explained in Section 3.4 aim
to capture as much useful movement data as possible. It then trims
this data to fit our assumptions. This data can then be assumed to
be an object being moved or a person moving themselves. This is
why our system works for all of these likely use cases as in each we
have identified a whole moving form and have made decisions on
those identifications.
5.3 Unlikely Use-Cases
Table 3 shows that 7 out of 10 of our unlikely use cases are correctly tracked at least 90% of the time. Cases 4, 12 and 14 failed the 90% requirement; therefore, we mark our research aim 3 as failed.
When viewing the video for use-case 4 we realized that the
recording wasn’t smooth and often had duplicate frames in
sequence for a couple of frames at a time. In this particular use case
we put an emphasis on switching various lecture lights on and off
and opening and closing the blinds to test using natural lighting as
well. When a shift in light occurred the 4K camera would correct
the light very well going so far as to grayscale the image and add
illumination in very dark settings. We hypothesized that the
camera’s processor couldn’t keep up with this processing and thus
duplicated frames instead of processing new ones. This is possibly
why the video given to us was very laggy and jittery. In any case, this violates one of our original assumptions, namely that we would be processing smooth video with continuous movement. As such we can discount this use case.
Use case 12 represents a lecturer running across the lecturing area.
When processed this video captures the lecturer’s movement but
when they move too quickly (completely out of the original
tracking frame) the lecturer is lost until this frame shrinks to
nothing. This is as a result of how our system moves “Ghosts” in
the scene. Each frame that generates new ghosts searches for
intersections with ghosts recorded in previous frames. If a new
ghost is too far away from a previous ghost it is assumed there is
no correlation. In this case the fact that we skip 3 frames for every
processed frame means that if the lecturer moves very quickly
(which is a very unlikely case) the program might not track the
lecturer properly. This could feasibly be solved by processing more frames; however, this would increase the processing duration of the module.
Use case 14 represents multiple students giving a presentation. This use case only tracked the lecturer correctly 61.7% of the time. Fundamentally this is meant to be a difficult use case for the module to handle. When 3 students are presenting, there is no indication, other than voice and nuanced movement, of who the lecturer is at any one moment. While our system was developed to handle temporary passing and occlusion of students, it still fundamentally only handles one lecturer. With this in mind, the problems we noticed were that students who weren't lecturing but were in the view continued moving and thus retained their screen time count. Additionally, because the role of speaker is passed between the students, the screen time counts are all mixed together. Therefore, the lecturer is often decided by whichever of the 3 is most central in the view.
The use cases that include complicated configurations of students crossing the view (10, 11, 15) behave similarly to use case 9, which is the simplest configuration of this type. The lecturer was tracked sufficiently well even though, when students pass the lecturer, they pick up the lecturer's on-screen time. This occurs when ghosts merge and then split as in Figure 11. These students always move towards an edge, so the actual lecturer was correctly chosen.
Use cases with moving projector screens (6, 7) also tracked the
lecturer well. This is because of the width of the projector screen’s
bottom weight. The entire screen is a uniform colour and material
except for the weight. Because of this only the weight at the bottom
is tracked as movement. The bottom weight's very large width to height ratio means its bounding rectangle is discarded early on. Furthermore, bounding rectangles very high up are also discarded; this covers the case where the projector screen is almost completely rolled up. When the lecturer isn't moving, as in use case 7, the tracking defaults to the center of the room, where the lectern and therefore the lecturer is.
The 2 valid use cases that failed the 90% requirement are both very
unlikely cases. Although this failure means that our tracking
solution can’t handle all the less likely cases, our results show that
the proposed solution is still viable for use in a general lecturing
environment.
6. CONCLUSIONS
In this paper, we presented a software solution for the lecturer tracking module of the 4K lecture recording project. The module correctly identifies the lecturer for most use cases with a run-time well within our defined restrictions. The run-time of the solution is an average of 1.13 times the length of the video. We highly recommend that the software be run from an SSD with a very fast read speed, because read time severely reduces the processing speed. Other possible efficiency improvements are discussed in future work. Although our solution fails on 3 of the 17 use cases, all of them less plausible ones, we still believe it is a viable piece of software to be used in lecture recording. This belief is based on the result that likely use cases will occur much more often than unlikely ones. We have made many assumptions about the layout of a lecture theatre, and it is important to note that a more robust solution is needed to generalize our software. This includes lecture theatres with no podium or with a podium in another location, and a camera that is mounted less centrally and is therefore aimed at an angle.
7. FUTURE WORK
Gesture recognition could provide a new and useful element to our system. Gestures could be used by the virtual cinematographer module to create more informative panning, such as panning sideways in the direction of a gesture. Gesture recognition could be implemented using the contour information that is calculated in an early step. The lecturer's contour would be analyzed to segment out the arms and therefore track whether the lecturer is gesturing or resting their arms.
Another addition that would make this module more robust would
be to analyze contours when a board moves to segment the lecturer
out. This is necessary as often the lecturer and board are given the
same line segment at the contour detection step and are assumed to
be one object. Detection of a board merging with a lecturer could be done by checking when a rectangle's size varies wildly between frames.
A recognition algorithm could be implemented on individual ghosts to detect colour features, perhaps using a form of histogram equalization.
This could then be used to correctly redistribute on-screen time
when two ghosts merge and then split again.
The system for deciding which ghost is the lecturer could be
extended to make use of temporal states. Because we record all
ghost positions and then process them we can watch for ghosts that
disappear off the screen and remove all occurrences of them.
The project could be extended to begin by segmenting the podium
or center of lecturer movement out. This could then be used as the
center of the lecturing area. In this case a robust scaling system
would need to be developed to change the scaling of the formulae
from Figure 12.
A movement mask is calculated by the pre-processing module of
the program. The tracking module could make use of that mask to limit processing in areas where there is no movement by doing a simple binary check on whether an area contains motion. This would likely
require integration with OpenCV’s implementations of absolute
difference, threshold and contour detection.
8. REFERENCES 1. Arseneau, Shawn, and Jeremy R. Cooperstock. "Presenter
tracking in a classroom environment." Industrial
Electronics Society, 1999. IECON'99 Proceedings. The 25th
Annual Conference of the IEEE. Vol. 1. IEEE, 1999.
2. Canny, John. "A computational approach to edge
detection." IEEE Transactions on pattern analysis and
machine intelligence 6 (1986): 679-698.
3. Durand, Frédo, and Julie Dorsey. "Fast bilateral filtering for
the display of high-dynamic-range images." ACM
transactions on graphics (TOG). Vol. 21. No. 3. ACM,
2002.
4. Gao, Wenshuo, et al. "An improved Sobel edge
detection." Computer Science and Information Technology
(ICCSIT), 2010 3rd IEEE International Conference on. Vol.
5. IEEE, 2010.
5. Haralick, Robert M., Stanley R. Sternberg, and Xinhua
Zhuang. "Image analysis using mathematical
morphology." IEEE transactions on pattern analysis and
machine intelligence 4 (1987): 532-550.
6. Jarosz, Wojciech. "Fast image convolutions." SIGGRAPH
Workshop. 2001.
7. Larkin, Helen E. "'But they won't come to lectures...' The impact of audio recorded lectures on student experience and attendance." Australasian Journal of Educational Technology 26.2 (2010): 238-249.
8. Prechelt, Lutz. "An empirical comparison of C, C++, Java,
Perl, Python, Rexx and Tcl." IEEE Computer 33.10 (2000):
23-29.
9. Prechelt, Lutz. "Comparing Java vs. C/C++ Efficiency
Differences to Interpersonal Differences." Commun.
ACM 42.10 (1999): 109-112.
10. Soong, Swee Kit Alan, et al. "Impact of video recorded
lectures among students." Who's learning (2006): 789-793.
11. Suzuki, Satoshi. "Topological structural analysis of
digitized binary images by border following." Computer
Vision, Graphics, and Image Processing 30.1 (1985): 32-46.
12. Zhang, Cha, et al. "Hybrid speaker tracking in an automated
lecture room." 2005 IEEE International Conference on
Multimedia and Expo. IEEE, 2005.
13. Zivkovic, Zoran. "Improved adaptive Gaussian mixture
model for background subtraction." Pattern Recognition,
2004. ICPR 2004. Proceedings of the 17th International
Conference on. Vol. 2. IEEE, 2004.
Appendix A – Track 4K Class Diagram