

Risk-based implementation of COLREGs for autonomous surface vehicles using deep reinforcement learning

Thomas Nakken Larsen^a, Amalie Heiberg^b, Eivind Meyer^c, Adil Rasheed^{a,∗}, Omer San^d, Damiano Varagnolo^a

^a Department of Engineering Cybernetics, Norwegian University of Science and Technology
^b Equinor
^c Institute of Informatics, Technical University of Munich
^d School of Mechanical and Aerospace Engineering, Oklahoma State University

Abstract

Autonomous systems are becoming ubiquitous and gaining momentum within the marine sector. Since the electrification of transport is happening simultaneously, autonomous marine vessels can reduce environmental impact, lower costs, and increase efficiency. Although close monitoring is still required to ensure safety, the ultimate goal is full autonomy. One major milestone is to develop a control system that is versatile enough to handle any weather and encounter that is also robust and reliable. Additionally, the control system must adhere to the International Regulations for Preventing Collisions at Sea (COLREGs) for successful interaction with human sailors. Since the COLREGs were written for the human mind to interpret, they are written in ambiguous prose and therefore not machine-readable or verifiable. Due to these challenges and the wide variety of situations to be tackled, classical model-based approaches prove complicated to implement and computationally heavy. Within machine learning (ML), deep reinforcement learning (DRL) has shown great potential for a wide range of applications. The model-free and self-learning properties of DRL make it a promising candidate for autonomous vessels. In this work, a subset of the COLREGs is incorporated into a DRL-based path following and obstacle avoidance system using collision risk theory. The resulting autonomous agent dynamically interpolates between path following and COLREG-compliant collision avoidance in the training scenario, isolated encounter situations, and AIS-based simulations of real-world scenarios.

Keywords: Deep Reinforcement Learning, Collision Avoidance, Path Following, Collision Risk Indices, Machine Learning Controller, Autonomous Surface Vehicle

1. Introduction

Over the last few years, the promising idea of autonomous ships has gained traction through projects like ReVolt (DNV GL (2020)) and Yara Birkeland (KONGSBERG (2020)). Such research projects are increasingly incentivized as the funding bodies recognize the potential benefits of autonomy at sea. A notable example is the EU-funded four-year project Autoship Horizon 2020, which seeks to speed up the transition towards autonomous ships in the EU (Autoship (2020)). For the first time in history, the promise of lower emissions, higher efficiency, and fewer accidents via autonomy is becoming tangible.

Human error is a leading cause of accidents on the road (Dingus et al. (2016); Thomas et al. (2013)), and reports show that accidents at sea are no different. According to the Annual Overview of Marine Casualties and Incidents published by the European Maritime Safety Agency (EMSA), over 50% of accidental events between 2011 and 2017 were attributed to human error (European Maritime Safety Agency (EMSA) (2018)). In addition to reducing accidents (and thereby fatalities), environmental damage, and costs, autonomous marine operations allow for optimized route planning. This optimization can be done with respect to time spent and fuel costs. Furthermore, autonomous ships can move cargo transport from the road to the sea, leading to less trafficked roads. For instance, the autonomous container ship Yara Birkeland is expected to reduce the number of trips made by diesel trucks by 40,000 a year after its launch in 2020 (Skredderberget (2018)). With the widespread electrification taking place, reduced air pollution is another likely and desirable effect.

∗ Adil Rasheed. Email address: [email protected] (Adil Rasheed)

An overall reduction of errors from introducing autonomy depends on developing robust and reliable systems, which is no trivial task. For autonomous navigation at sea, the vessel’s control system must deal appropriately with a wide range of situations depending on the position of the ownship (OS) and other ships within a certain radius, and on environmental factors such as wind, ocean currents, and waves. Another crucial element is detection, classification, and tracking of objects, which might be challenging in certain weather conditions. The currently proposed solutions generally make significant simplifications and assumptions. Low-level controllers, or autopilots, are already commercially available, but more research on high-level path planning and collision avoidance is needed to ensure safe autonomous navigation in real situations. For collision avoidance, compliance with the International Regulations for Preventing Collisions at Sea (COLREGs) is crucial to ensure safety when encountering other vessels.

Preprint submitted to Elsevier, December 2, 2021. arXiv:2112.00115v1 [cs.RO] 30 Nov 2021

Due to the complex nature of autonomy at sea, classical model-based methods may be challenging to implement for full autonomy. Modern machine learning (ML) methods are proficient in approximating such complex models. Supervised learning approaches are powerful but limited by their dependency on labeled training data. Reinforcement learning (RL) circumvents this by producing the training data as the decision-making agent interacts with its environment. However, there exists limited research on the combined topic of COLREG-compliant RL controllers. Therefore, this work aims to incorporate a subset of the COLREGs, directly related to collision avoidance, into an autonomous path following and collision avoidance system based on deep reinforcement learning (DRL) conditioned on measures of collision risk. Thus, the research questions are as follows:

• Can we use state-of-the-art collision risk theory to define a novel risk-based approach for implementing COLREGs into an autonomous DRL agent?

• Can the risk-based DRL agent learn to intelligently interpolate between path following and collision avoidance while maneuvering in a COLREG-compliant manner?

Section 2 describes the state-of-the-art in collision avoidance for marine guidance, in which the COLREGs are generally ignored. Section 3 introduces the COLREGs relevant for collision avoidance and essential concepts within guidance and control, collision risk theory, and DRL. Section 4 defines the simulation environments, DRL problem formulation, and evaluation methods. Section 5 presents and discusses the COLREG-compliance and general path following and collision avoidance performance of the resulting autonomous controller. Section 6 summarizes the findings and suggests future work.

2. Background

Collision alert systems (CAS) aid the captain and crew on board a marine vessel. Such systems primarily extend exteroceptive sensors, converting raw measurements into more interpretable information. Examples of CAS are Automatic Radar Plotting Aid (ARPA) and Automatic Identification System (AIS) (compared in Lin and Huang (2006)), routinely used for collision risk evaluation (Xu and Wang (2014)). As we move into the fourth industrial revolution, solutions such as digital twins and remote sensing are making their way into the maritime industry (DNV GL (2019)). Decision-making is thus gradually being taken from the cognitive realm and into the digital domain, and the need for highly robust and flexible guidance, navigation, and control (GNC) systems is growing. Since collision avoidance (COLAV) systems are responsible for one of the most safety-critical aspects of a vessel’s operation, any GNC system operating in a dynamic environment requires a robust COLAV strategy (Aniculaesei et al. (2016)). Therefore, reliable and transparent COLAV systems are crucial to reach full autonomy at sea.

All vessels above 300 tonnes engaged on international voyages, all cargo ships above 500 tonnes, and all passenger ships are required to carry an AIS (International Maritime Organization (IMO) (2020)). The AIS transmits and receives information such as identity, position, course, and speed, which can be incorporated into a COLAV system. Such systems can thus enhance the quality of information about other vessels but may also depend on the communication infrastructure. Since one cannot expect complete availability, ships typically utilize additional exteroceptive sensors such as cameras, lidars, and radars. Ideally, an autonomous COLAV system uses AIS information without depending on it.

Before autonomous vessels became a possibility, the International Regulations for Preventing Collisions at Sea (COLREGs) were formulated to prevent collisions between two or more vessels (International Maritime Organization (1972)). Although technological advancement has been significant since their publication in 1972, COLREG-compliance for autonomous vessels is still understudied. One of the main challenges is that the COLREGs were written for humans to interpret and require a translation to a machine-readable and verifiable format. Another potential challenge is the indirect communication that occurs when two vessels meet in a situation with a high risk of collision. For instance, the COLREGs require sharp maneuvers for clear communication between vessels when a high-risk situation is encountered. However, this is often not the optimal behavior from an energy efficiency (or even collision risk) point of view. So long as there may be both human and autonomous operators of marine vessels at sea, the autonomous controller should behave in a way that lets the crew of a human-operated vessel interpret its intent.

In addition to the challenges inherent to the COLREGs, autonomous collision avoidance can be demanding due to the complex dynamics of ships, varying speeds, and changing environmental conditions (Tam et al. (2009)). The majority of the proposed solutions for autonomy make assumptions that do not represent reality. Examples of such assumptions are the constant speed of the OS or other ships, good weather conditions, or that the system only operates while the ship is at open sea. An adequate autonomous vessel must master all the situations the current fleet handles. For instance, given sufficient situational awareness, a full-fledged autonomous COLAV system should be expected to handle situations involving all sorts of moving and stationary objects, from container ships to kayaks. For generalization, the system must track a high number of objects simultaneously and perform well in congested waters.

A plethora of COLAV algorithms and architectures for autonomous control have been, and still are, researched. Here, we distinguish between classical and soft systems (Statheros et al. (2008)). Classical systems find an optimal strategy analytically and numerically from mathematical models and logic, which are typically accompanied by convergence proofs. Model predictive control (MPC) can be used to develop COLAV systems compliant with the primary rules of COLREGs. MPC can also be applied to nonlinear systems with uncertain environmental disturbances (Soloperto et al. (2019)). The Velocity Obstacle (VO) method (Fiorini and Shiller (1998)) models artificial obstacles representing the velocities that would result in a collision, and Kuwata et al. (2014) shows that maritime navigation using the VO method can be COLREGs-compliant. Interval Programming (IvP), a multi-objective optimization approach, has successfully produced COLREGs-compliant COLAV systems (Benjamin et al. (2006); Woernner (2014)). Dynamic Window (DW) is an optimization-based method that has been researched for marine applications (Serigstad et al. (2018)), the strength of which can be found in its focus on fast dynamics through reducing the search space to the reachable velocities within a short time interval (Fox et al. (1997)).

Based on artificial intelligence (AI), soft systems assume that the problem is not readily quantified. Heuristics are experience-based methods for finding an acceptable solution to a problem. The A* heuristic (Hart et al. (1968)) might be the most well-known and widely used soft approach; A* is a best-first search algorithm for finding the shortest path between two nodes in a graph, in which a heuristic estimate of the remaining cost guides the search. It is often used for high-level path and trajectory planning, as was done in Eriksen (2019). Another well-known heuristic is the genetic algorithm (GA) based on evolutionary theory. Smierzchalski (1999) applies a genetic algorithm for trajectory planning in an environment with static and dynamic obstacles. Kim et al. (2015) showed that Distributed Tabu Search, a metaheuristic method, can be used for collision avoidance in highly congested areas. Another group of soft systems is machine learning (ML). ML techniques such as deep learning (DL) and reinforcement learning (RL) have recently received significant attention in the context of autonomous systems and decision-making problems, as they benefit from neural networks’ currently unmatched function approximation capabilities. Model-free RL methods can find a control law even without any mathematical model of the system (Silver et al. (2021)). However, only a limited amount of research has been devoted to autonomous marine vessels compared to driverless cars, for instance. In Xu et al. (2017), a deep convolutional neural network (CNN) is trained for COLREGs-compliant collision avoidance for a crewless surface vehicle. This method is based on image recognition, using the CNN’s ability to process spatially structured data. The Deep Deterministic Policy Gradient (DDPG) algorithm has demonstrated successful path following and simple collision avoidance for marine vessel models (Martinsen (2018); Martinsen and Lekkas (2019); Vallestad (2019)). In addition, Zhao and Roh (2019); Meyer et al. (2020b) showcased the Proximal Policy Optimization (PPO) algorithm for multi-ship collision avoidance.

Alternatively, COLAV systems can be classified as deliberative or reactive systems (Siciliano and Khatib (2008)). Deliberative systems work in a “sense-plan-act” fashion. Intuitively, reactive systems are then considered “sense-act” systems. Hybrid COLAV systems emerge when combining different system categories, e.g., deliberative and reactive systems. This approach is adopted with increasing frequency (Ding et al. (2011)). Multi-layered systems are also being developed, where each subsystem lies on a spectrum between reactive and deliberative. Such hybrid architectures are able to harvest the strengths of several methods, using each where it performs best. Loe (2008) applies a two-layered approach where deliberation is done by a Rapidly-Exploring Random Tree (RRT) algorithm combined with the deliberative A* heuristic, and the reactive component consists of a modified DW algorithm. In Eriksen (2019), A* is combined with a mid-layer and a reactive MPC-based algorithm, forming a three-layered COLAV system. Casalino et al. (2009); Svec et al. (2013) have proposed similar layered architectures.

In summary, a wide range of COLAV systems have been proposed in the literature, generally disregarding the COLREGs. At the same time, the increased focus on autonomous systems in later years requires COLREG-compliance for sufficient safety. This gap, combined with the promise of DRL for autonomous navigation, shapes the objective of the article: to investigate COLREG-compliance in a path following and collision avoidance system based on deep reinforcement learning, conditioned on measures of collision risk.

3. Theory

3.1. Dynamics of a marine vessel

The dynamical model considered in this work is CyberShip II: a 1:70 scale replica of a supply ship (Skjetne et al. (2004b)). This model is simulated in a calm ocean surface environment with the following assumptions.

Assumption 1 (State space restriction). The vessel is always located on the surface, and thus there is no heave motion. Also, there is no pitching or rolling motion.

Assumption 2 (Calm sea). There are no external disturbances to the vessel, such as wind, ocean currents, or waves.

Following SNAME notation (SNAME, The Society of Naval Architecture and Marine Engineers (1950)), the navigation state vector then consists of the generalized coordinates, $\eta = [x^n, y^n, \psi]^T$, where $x^n$ and $y^n$ are the North and East positions, respectively, in the North-East-Down (NED) reference frame $\{n\}$, and $\psi$ is the yaw angle, i.e., the current angle between the vessel’s longitudinal axis $x^b$ and the North axis $x^n$, illustrated by Figure 1. Correspondingly, the translational and angular velocity vector $\nu = [u, v, r]^T$ consists of the surge (i.e., forward) velocity $u$, the sway (i.e., sideways) velocity $v$ and yaw rate $r$.

Figure 1: Illustration of the NED and body coordinate frames.

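3.1.1. Vessel model

Given the established assumptions, the 3-DOF vessel dynamics can be expressed in a compact matrix-vector form

\begin{align*}
\dot{\eta} &= R_{z,\psi}(\eta)\,\nu \\
M\dot{\nu} + C(\nu)\nu + D(\nu)\nu &= Bf,
\end{align*}

where $R_{z,\psi}$ represents a rotation of $\psi$ radians around the $z^n$-axis as defined by

\[
R_{z,\psi} =
\begin{bmatrix}
\cos\psi & -\sin\psi & 0 \\
\sin\psi & \cos\psi & 0 \\
0 & 0 & 1
\end{bmatrix}.
\]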

Furthermore, $M \in \mathbb{R}^{3\times3}$ is the mass matrix and includes the effects of both rigid-body and added mass, $C(\nu) \in \mathbb{R}^{3\times3}$ incorporates centripetal and Coriolis effects, and $D(\nu) \in \mathbb{R}^{3\times3}$ is the damping matrix. Finally, $B \in \mathbb{R}^{3\times2}$ is the actuator configuration matrix. The numerical values of the matrices are found in Skjetne et al. (2004a), where the model parameters were estimated experimentally for CyberShip II in a marine control laboratory.

We disregard the ship’s bow thruster and allow the reinforcement learning (RL) agent to command only the aft thrusters and control surfaces. This omission simplifies the RL agent’s action space and is further motivated by the bow thruster’s limited effectiveness at higher speeds (Sørensen et al. (2017)). Thus, the control vector, $f = [T_u, T_r]^T$, consists of the surge force input, $T_u$, and the yaw moment input, $T_r$.
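To make the structure of the 3-DOF model concrete, the following is a minimal sketch of one forward-Euler integration step. The function names, the integration scheme, and all numerical matrix values are illustrative assumptions; in particular, the placeholder matrices are not the CyberShip II parameters, which are given in Skjetne et al. (2004a).

```python
import numpy as np

def rotation_z(psi):
    """Rotation of psi radians about the z-axis (body frame to NED frame)."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def euler_step(eta, nu, f, M, C, D, B, dt):
    """One forward-Euler step of the 3-DOF model (integration scheme is an assumption).

    eta = [x_n, y_n, psi], nu = [u, v, r], f = [T_u, T_r].
    C and D are callables returning 3x3 matrices for a given nu.
    """
    eta_dot = rotation_z(eta[2]) @ nu
    # Solve M * nu_dot = B f - C(nu) nu - D(nu) nu for nu_dot.
    nu_dot = np.linalg.solve(M, B @ f - C(nu) @ nu - D(nu) @ nu)
    return eta + dt * eta_dot, nu + dt * nu_dot

# Hypothetical placeholder values, NOT the CyberShip II parameters.
M = np.diag([25.0, 30.0, 3.0])
B = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])           # maps [T_u, T_r] to surge force and yaw moment
C = lambda nu: np.zeros((3, 3))      # Coriolis terms omitted in this placeholder
D = lambda nu: np.diag([1.0, 2.0, 0.5])

eta = np.array([0.0, 0.0, np.pi / 2])  # yaw of 90 degrees: bow pointing East
nu = np.array([1.0, 0.0, 0.0])         # pure surge motion
eta_next, nu_next = euler_step(eta, nu, np.array([1.0, 0.0]), M, C, D, B, dt=0.1)
```

With these placeholder values the surge force exactly balances the linear damping, so the velocity stays constant while the vessel advances along the East axis.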

3.1.2. COLREG rules

Among the 41 rules in the International Regulations for Preventing Collisions at Sea (International Maritime Organization (1972)), only the directly relevant rules for COLAV are presented below. The two main takeaways from these rules are that 1) the give-way vessel should take early and substantial action, and 2) safe speed should be ensured at all times, such that course alteration is effective towards avoiding collisions where there is sufficient sea-room. Since rules 6 and 8 are particularly tough to quantify, this work focuses on compliance with rules 14-16.

Rule 6: Safe speed

Every vessel shall at all times proceed at a safe speed so that she can take proper and effective action to avoid collision and be stopped within a distance appropriate to the prevailing circumstances and conditions.

Rule 8: Action to avoid collision

(b) Any alteration of course and/or speed to avoid collision shall, if the circumstances of the case admit, be large enough to be readily apparent to another vessel observing visually or by radar; a succession of small alterations of course and/or speed should be avoided.

(c) If there is sufficient sea-room, alteration of course alone may be the most effective action to avoid a close-quarters situation provided that it is made in good time, is substantial and does not result in another close-quarters situation.

(d) Action taken to avoid collision with another vessel shall be such as to result in passing at a safe distance. The effectiveness of the action shall be carefully checked until the other vessel is finally past and clear.

(e) If necessary to avoid collision or allow more time to assess the situation, a vessel shall slacken her speed or take all way off by stopping or reversing her means of propulsion.

Rule 14: Head-on situation

(a) When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision each shall alter her course to starboard so that each shall pass on the port side of the other.

(b) Such a situation shall be deemed to exist when a vessel sees the other ahead or nearly ahead and by night she could see the masthead lights of the other in a line or nearly in a line and/or both sidelights and by day she observes the corresponding aspect of the other vessel.

(c) When a vessel is in any doubt as to whether such a situation exists she shall assume that it does exist and act accordingly.

Rule 15: Crossing situation

When two power-driven vessels are crossing so as to involve risk of collision, the vessel which has the other on her own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.

Rule 16: Action by give-way vessel

Every vessel which is directed to keep out of the way of another vessel shall, so far as possible, take early and substantial action to keep well clear.

3.2. Measures of collision risk

The rules presented above are intended for human interpretation and contain ambiguities such as “large enough” (Rule 8) and “substantial action” (Rule 16). How can they be translated into a form suitable for reinforcement learning? An essential first step is recognizing the relationship between the COLREGs and collision risk. The COLREGs are in place to reduce collision risk and indirectly affect the risk level by influencing the probable behavior of the target ship (TS). Since there is a correlation between the rules and the risk level, employing a measure of risk as a proxy for the COLREGs may enable the RL agent to learn COLREG-compliant behavior.

By analyzing the historical trends of measuring collision risk, three main developments can be observed (Xu and Wang (2014)): traffic flow theory, ship safety domains, and collision risk indices. The initial efforts to quantify collision risk were based on traffic flow theory, a method built on empirical studies and statistical traffic analysis in specific waters. For instance, Cockcroft (1981) investigated the collision rates for ships of varying tonnage relative to their position in a water area. Goodwin (1978) took it further and studied the rate of dangerous encounters. As statistical analysis of historical data was deemed insufficient for dynamic collision avoidance, ship safety domains were introduced. The ship safety domain defines a region around the ship in question that other ships should not enter. Hence, there is a risk of collision if one ship is inside the safety domain of another, and the ship domain can be said to be a generalization of a safe distance (Szlapczynski and Szlapczynska (2017)). When applying the ship domain to an encounter situation in order to determine risk, one of four safety criteria is normally used: 1) the OS domain should not be violated by a TS, 2) a TS domain should not be violated by the OS, 3) neither of the ship domains should be violated, or 4) ship domains should not overlap, such that they remain mutually exclusive. Rawson et al. (2014) and Wang and Chin (2016) use the latter criterion of non-overlapping ship domains.
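The four safety criteria can be illustrated with a small sketch. It assumes, purely for illustration, that each ship domain is a circle centered on the ship; real ship domains are generally not circular. The function name and dictionary keys are hypothetical.

```python
import math

def domain_criteria(os_pos, ts_pos, r_os, r_ts):
    """Evaluate the four ship-domain safety criteria for circular domains.

    os_pos, ts_pos: (north, east) positions of the OS and TS in metres.
    r_os, r_ts: domain radii in metres (circular domains are an assumption).
    Returns a dict mapping each criterion to True when it is satisfied.
    """
    d = math.dist(os_pos, ts_pos)
    return {
        "os_domain_not_violated": d >= r_os,                  # criterion 1
        "ts_domain_not_violated": d >= r_ts,                  # criterion 2
        "neither_domain_violated": d >= max(r_os, r_ts),      # criterion 3
        "domains_do_not_overlap": d >= r_os + r_ts,           # criterion 4
    }

result = domain_criteria((0.0, 0.0), (0.0, 500.0), r_os=300.0, r_ts=300.0)
```

In this example the ships are 500 m apart with 300 m domains: neither ship is inside the other's domain, yet the domains overlap, so only the fourth and strictest criterion is violated.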

It is important to note that a ship domain is usually defined depending on which situation the ship finds itself in to respect the COLREGs. For instance, the domain used while the OS is overtaking another ship is symmetrical, with its origin coinciding with the center of the OS. Conversely, the origin is shifted to the right in a head-on situation, as close encounters on the starboard side should be avoided.

Davis et al. (1980) expanded the theory of ship safety domains in their well-known work on ship arenas. The ship arena defines the distances around the OS at which action should be taken to avoid a dangerous encounter and is, therefore, larger than the ship safety domains proposed initially. In addition to the OS’s length and velocity, the distance to the closest point of approach (DCPA) and the time to the closest point of approach (TCPA) are used to construct the limits of the ship arena. A geometrical representation of DCPA and TCPA is presented in Figure 2, giving rise to the equations

\[
\mathrm{DCPA} = R \sin(\chi_R - \chi_{OS} - \theta_T - \pi) \tag{1}
\]

and

\[
\mathrm{TCPA} = \frac{R}{V_R} \cos(\chi_R - \chi_{OS} - \theta_T - \pi) \tag{2}
\]

where $R$ is the absolute distance between the OS and TS, and $V_R$ and $\chi_R$ are the relative speed and course between them. In addition, $\chi_{OS}$ is the course of the OS, while $\theta_T$ is the bearing of the TS relative to the OS.
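Equations (1) and (2) translate directly into code. The sketch below is a literal transcription, with a hypothetical function name and argument order; it assumes angles in radians and a strictly positive relative speed, and it does not settle any sign convention beyond what the equations state.

```python
import math

def cpa(R, V_R, chi_R, chi_OS, theta_T):
    """Literal transcription of equations (1) and (2).

    R: distance between OS and TS, V_R: relative speed (must be > 0),
    chi_R: relative course, chi_OS: course of the OS,
    theta_T: bearing of the TS relative to the OS. Angles in radians.
    """
    angle = chi_R - chi_OS - theta_T - math.pi
    dcpa = R * math.sin(angle)
    tcpa = (R / V_R) * math.cos(angle)
    return dcpa, tcpa

# Inputs chosen so that the angle argument is exactly zero.
dcpa, tcpa = cpa(R=1000.0, V_R=10.0, chi_R=math.pi, chi_OS=0.0, theta_T=0.0)
```

When the angle argument is zero, equation (1) gives a DCPA of zero and equation (2) reduces to R divided by V_R, here 100 time units: the ships are on a direct collision course.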

Figure 2: Geometric representation of CPA and DCPA.

This leads to the subsequent development in collision risk evaluation, namely collision risk indices (CRIs), which are primarily based on the DCPA and TCPA. In addition, a CRI can include the absolute distance $R$ from the OS to the TS, the velocity ratio $K$ of two encountering ships, the relative course $\chi_R$, and other key features. More recently, simple CRIs alone have been considered unable to capture the gradual and complex nature of collision risk. As a result, combining the CRI with fuzzy logic or the fuzzy comprehensive evaluation method has become the norm. In fuzzy logic, fuzzy IF-THEN rules are applied to the parameters involved, such as DCPA and TCPA, to determine the risk level. In the fuzzy comprehensive evaluation method, on the other hand, membership functions, $u(\cdot) \in [0, 1]$, are used instead of IF-THEN rules, taking more details into account. The final CRI is then given as the weighted sum of the membership function outputs, as exemplified below:

\begin{align}
\mathrm{CRI} ={}& \alpha_{DCPA} \cdot u_{DCPA}(\mathrm{DCPA}) + \alpha_{TCPA} \cdot u_{TCPA}(\mathrm{TCPA}) + \alpha_{R} \cdot u_{R}(R) \tag{3a} \\
& \alpha_{DCPA} + \alpha_{TCPA} + \alpha_{R} = 1 \tag{3b}
\end{align}
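As a rough illustration of the weighted sum in equation (3), the sketch below combines three membership functions into a single index. The exponential membership functions, their scale parameters, and the weights are hypothetical placeholders chosen only so that each output lies in [0, 1]; they are not the functions used in this work.

```python
import math

# Hypothetical membership functions mapping each quantity to [0, 1].
def u_dcpa(dcpa, d_scale=500.0):
    return math.exp(-abs(dcpa) / d_scale)

def u_tcpa(tcpa, t_scale=300.0):
    # A negative TCPA means the closest point of approach has passed.
    return math.exp(-tcpa / t_scale) if tcpa >= 0.0 else 0.0

def u_r(r, r_scale=1000.0):
    return math.exp(-r / r_scale)

def collision_risk_index(dcpa, tcpa, r, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of membership outputs; weights are placeholders that must sum to 1."""
    a_dcpa, a_tcpa, a_r = weights
    if abs(a_dcpa + a_tcpa + a_r - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return a_dcpa * u_dcpa(dcpa) + a_tcpa * u_tcpa(tcpa) + a_r * u_r(r)

high = collision_risk_index(dcpa=0.0, tcpa=0.0, r=0.0)
low = collision_risk_index(dcpa=2000.0, tcpa=600.0, r=5000.0)
```

Because every membership output lies in [0, 1] and the weights sum to one, the resulting index is also confined to [0, 1], which makes it convenient as a bounded risk signal.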

3.3. Deep reinforcement learning

Model-free reinforcement learning (RL) methods train a decision-making agent through trial and error, where the agent gathers experience from an environment supplying only a situational observation state and a corresponding reward. Applications of RL on high-dimensional, continuous control tasks rely heavily on function approximators to generalize over the state space. Even if classical, tabular solution methods such as Q-learning can be made to work (provided that the continuous action space is discretized), this is not considered an efficient approach for control applications (Lillicrap et al. (2015)). In recent years, given their remarkable generalization ability over high-dimensional input spaces, the dominant approach has been the application of deep neural networks optimized using gradient methods. Several algorithms built on this principle have gained significant traction in the RL research community, most notably Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. (2015)), Asynchronous Advantage Actor Critic (A3C) (Mnih et al. (2016)), Proximal Policy Optimization (PPO) (Schulman et al. (2017)), and Soft Actor-Critic (SAC) (Haarnoja et al. (2017)). For continuous control tasks, this family of policy gradient methods is commonly considered the more efficient approach (Tai et al. (2016)). Based on previous work, where the PPO algorithm significantly outperformed other methods on a learning problem similar to the one covered in this work (Meyer et al. (2020b); Larsen et al. (2021)), we focus our efforts on this method.
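For readers unfamiliar with PPO, the sketch below computes the clipped surrogate objective introduced in Schulman et al. (2017) for a batch of samples. It is an illustrative NumPy version only: the function name is hypothetical, the clipping parameter of 0.2 is an assumed example value, and nothing here reflects the training configuration used in this work.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective averaged over a batch.

    ratio: new-policy probability divided by old-policy probability, per sample.
    advantage: advantage estimate per sample.
    eps: clipping parameter (0.2 is an assumed example value).
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The element-wise minimum removes the incentive to move the ratio
    # outside the interval [1 - eps, 1 + eps].
    return float(np.mean(np.minimum(unclipped, clipped)))

gain_capped = ppo_clip_objective([1.5], [1.0])    # positive advantage, ratio above 1 + eps
loss_kept = ppo_clip_objective([0.5], [-1.0])     # negative advantage, ratio below 1 - eps
```

In the first call the objective is capped at 1.2 rather than 1.5; in the second the more pessimistic clipped value of -0.8 is retained.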

4. Methodology

4.1. Training environment

DRL-based autonomous agents have a remarkable ability to generalize their policy over the observation space, including the domain of unseen observations. Moreover, given the complexity and heterogeneity of the Trondheim Fjord environment, with archipelagos, shorelines, and skerries (see Figure 3), this ability will be fundamental to the agent’s performance. However, the training environment in which the agent evolves from a blank slate to an intelligent vessel controller must be representative, challenging, and unpredictable to facilitate the generalization. If not for the generalization issues associated with this approach (Codevilla et al. (2019)), it would also allow the agent to train via behavior cloning based on historical AIS data. However, given the resolution of our terrain data, the resulting obstacle geometry is typically very complex, leading to overly high computational demands for simulating the functioning of the distance sensor suite. Moreover, the agent’s perceptive observation space (Section 4.2.2) undergoes significant dimensionality reduction, resulting in the agent not benefiting from such high-frequency details in the simulation. Thus, the better choice is to craft an artificial training scenario with simple obstacle geometries. To reflect the dynamics of a real-world marine environment, we let the stochastic initialization method of the training scenario spawn other target vessels with deterministic, linear trajectories. Additionally, circular obstacles scattered around the environment substitute the real-world terrain. Figure 4 illustrates an instantiation of the training environment.

Figure 3: Snapshot of the marine traffic from 01.01.2020 to 06.02.2020 in the Trondheim fjord, based on AIS data. Each red line represents a recorded travel. (Axes: East (km) and North (km).)

Figure 4: Random sample of the stochastically generated path following training scenario with moving obstacles. The circles are static obstacles representing landmasses, and the vessel-shaped objects are moving according to the trajectory lines and velocity vectors. (Axes: East (km) and North (km); the path runs from "Start" to "Goal".)

4.2. Observation vector

To facilitate the learning of a decision-making policy, the RL agent requires an observation vector, $s$, containing sufficient information about the vessel’s state relative to the path in addition to situational sensor information. The complete observation vector is then constructed by concatenating navigation-based and perception-based features, which formally translates to $s = [s_n, s_p]^T$. In the context of this paper, we consider the term navigation as the characterization of the vessel’s state, i.e., its position, orientation, and velocity, with respect to the desired path. On the other hand, perception refers to the observations made via the rangefinder sensor measurements. In the following, the path navigation feature vector, $s_n$, and the perceptive feature vector, $s_p$, are covered in detail.

4.2.1. Navigation features

A sufficiently information-rich path navigation feature vector would be such that it, on its own, could facilitate a satisfactory path-following controller. A few concepts often used in vessel guidance and control are helpful to formalize this.

Figure 5: Illustration of key path-following concepts in vessel guidance and control. The path reference point, $p_d(\omega)$, describes the point on the path with the closest Euclidean distance to the vessel, while the look-ahead reference point, $p_d(\omega + \Delta_{LA})$, is located a distance, $\Delta_{LA}$, further along the path.

First, we introduce the mathematical representation of the parameterized path, which is expressed as

$$ p_d(\omega) = \left[x_d(\omega), y_d(\omega)\right]^T \tag{4} $$

where $x_d(\omega)$ and $y_d(\omega)$ are defined in the NED frame. Navigating the path necessitates a reference point, which is continuously updated based on the vessel’s position. We define this reference point as the point on the path that has the closest Euclidean distance to the vessel, given its current position, as illustrated in Figure 5. To find this, we calculate the corresponding value of the path variable $\omega$ at each time step. This is an equivalent problem formulation because the path is defined implicitly by the value of $\omega$. Formally, this translates to the optimization problem

$$ \omega = \arg\min_{\omega} \; (x^n - x_d(\omega))^2 + (y^n - y_d(\omega))^2, \tag{5} $$

which, using the Newton–Raphson method, can be calculated accurately and efficiently at each time step. We define the corresponding Euclidean distance to the path, i.e., the deviation between the desired path and the current track, as the cross-track error (CTE) $\varepsilon$. Formally, we thus have that

$$ \varepsilon = \left\| [x^n, y^n]^T - p_d(\omega) \right\|. \tag{6} $$
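The closest-point search of Equation (5) and the cross-track error of Equation (6) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the callables `path`, `d_path`, and `dd_path` (the path and its first and second derivatives), the stopping criteria, and the straight test path are all assumptions.

```python
import math


def closest_path_variable(x, y, path, d_path, dd_path, omega0=0.0,
                          max_iter=20, tol=1e-10):
    """Newton-Raphson search for the path variable minimizing Equation (5).

    path, d_path, dd_path are hypothetical callables returning (x_d, y_d)
    and its first and second derivatives with respect to omega.
    """
    omega = omega0
    for _ in range(max_iter):
        xd, yd = path(omega)
        dxd, dyd = d_path(omega)
        ddxd, ddyd = dd_path(omega)
        ex, ey = x - xd, y - yd
        # First and second derivative of the squared distance w.r.t. omega
        grad = -2.0 * (ex * dxd + ey * dyd)
        hess = 2.0 * (dxd ** 2 + dyd ** 2 - ex * ddxd - ey * ddyd)
        if abs(hess) < 1e-12:  # assumed guard against a flat second derivative
            break
        step = grad / hess
        omega -= step
        if abs(step) < tol:
            break
    return omega


def cross_track_error(x, y, path, omega):
    """Euclidean distance from the vessel to the path point, Equation (6)."""
    xd, yd = path(omega)
    return math.hypot(x - xd, y - yd)


# Hypothetical straight path along the first axis, used only for illustration
line = lambda w: (w, 0.0)
d_line = lambda w: (1.0, 0.0)
dd_line = lambda w: (0.0, 0.0)

omega_star = closest_path_variable(3.0, 4.0, line, d_line, dd_line)
cte = cross_track_error(3.0, 4.0, line, omega_star)
```

Warm-starting `omega0` with the value from the previous time step is a natural choice in practice, since the reference point moves only slightly between steps.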

Next, we consider the look-ahead point, $p_d(\omega + \Delta_{LA})$, to be the point that lies a constant distance further along the path from the reference point $p_d(\omega)$. Look-ahead based steering, i.e., setting the look-ahead point direction as the desired course angle, is a commonly used guidance principle (Fossen (2011)). The look-ahead distance, $\Delta_{LA}$, is set by the user and controls how aggressively the vessel should reduce the distance to the path.

We then define the heading error, $\tilde{\psi}$, as the change in heading needed for the vessel to navigate straight towards the look-ahead point from its position, as illustrated in Figure 5. Formally, $\tilde{\psi}$ is defined as

$$ \tilde{\psi} = \operatorname{atan2}\left( y_d(\omega + \Delta_{LA}) - y^n, \; x_d(\omega + \Delta_{LA}) - x^n \right) - \psi, \tag{7} $$

where $\psi$ is the vessel’s heading and $x^n$, $y^n$ are the NED-frame vessel coordinates as defined earlier.

However, even if minimizing the heading error will yield good path adherence, taking into account the path direction at the look-ahead point might improve the smoothness of the resulting vessel trajectory. Referring to the first-order path derivatives as $x'_d(\omega)$ and $y'_d(\omega)$, we have that the path angle, $\gamma_p$, in general, can be expressed as a function of arc-length, $\omega$, such that

$$ \gamma_p(\omega) = \operatorname{atan2}\left( y'_d(\omega), \; x'_d(\omega) \right). \tag{8} $$

As visualized in Figure 5, the path direction at the look-ahead point is then given by $\gamma_p(\omega + \Delta_{LA})$. We then define the look-ahead heading error, which is zero when the vessel is heading in a direction parallel to the path direction at the look-ahead point, as

$$ \tilde{\psi}_{LA} = \gamma_p(\omega + \Delta_{LA}) - \psi \tag{9} $$
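As a rough sketch of how the heading error and the look-ahead heading error of Equations (7) to (9) could be evaluated, consider the snippet below. The function names are hypothetical, and wrapping the result to $[-\pi, \pi)$ is an assumption that the text above does not state.

```python
import math


def wrap_angle(angle):
    """Wrap an angle to [-pi, pi). Assumed here; not stated in the text."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def heading_error(x, y, psi, lookahead_point):
    """Equation (7): course towards the look-ahead point minus the heading."""
    x_la, y_la = lookahead_point
    return wrap_angle(math.atan2(y_la - y, x_la - x) - psi)


def lookahead_heading_error(psi, d_lookahead):
    """Equations (8)-(9): path angle at the look-ahead point minus the heading.

    d_lookahead holds the first-order path derivatives (x'_d, y'_d)
    evaluated at omega + Delta_LA.
    """
    dx, dy = d_lookahead
    gamma_p = math.atan2(dy, dx)
    return wrap_angle(gamma_p - psi)


# Vessel at (0, 1) with zero heading; look-ahead point at (2, 0) on a path
# whose direction at that point is along the first axis.
psi_tilde = heading_error(0.0, 1.0, 0.0, (2.0, 0.0))
psi_tilde_la = lookahead_heading_error(0.0, (1.0, 0.0))
```

In this example the vessel must turn towards the path to reach the look-ahead point, so the heading error is nonzero, while the look-ahead heading error is zero because the vessel is already parallel to the path direction.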

Our assumption is then that the navigation feature vector $s_n$, defined as outlined in Table 1, should provide a sufficient basis for the agent to intelligently adhere to the desired path. The navigation features are then formally defined as

$$ s_n^{(t)} = \left[ u^{(t)}, v^{(t)}, r^{(t)}, \varepsilon^{(t)}, \tilde{\psi}^{(t)}, \tilde{\psi}_{LA}^{(t)} \right]^T. \tag{10} $$

Table 1: Path-following feature vector $s_n$ at timestep $t$.

Feature                     Definition
Surge velocity              $u^{(t)}$
Sway velocity               $v^{(t)}$
Yaw rate                    $r^{(t)}$
Cross-track error           $\varepsilon^{(t)}$
Heading error               $\tilde{\psi}^{(t)}$
Look-ahead heading error    $\tilde{\psi}_{LA}^{(t)}$


4.2.2. Perception features

Using a set of rangefinder sensors as the basis for obstacle avoidance is a natural choice, as it yields a comprehensive yet intuitive representation of any neighboring obstacles. This configuration should also enable a relatively straightforward transition from the simulated environment to a real-world one, given that rangefinder sensors such as lidars, radars, sonars, or depth cameras are commonly used.

In our setup, the vessel is equipped with $N$ distance sensors with a maximum detection range of $S_r$, distributed uniformly with 360° coverage. While the area behind the vessel is obviously of lesser importance, e.g., unnecessary to consider when navigating purely static terrain, the possibility of overtaking situations where the agent must react to another vessel approaching from behind makes full sensor coverage necessary. The most natural approach to constructing the final observation vector would be to concatenate the path information feature vector with the array of sensor outputs. However, initial experiments with this approach resulted in the training process stagnating at an unsatisfactory agent performance level. A likely explanation for this failure is the size of the observation vector, which was fed to the agent’s fully connected policy and value networks; as the input size becomes large, the agent suffers from the well-known curse of dimensionality. Due to the resulting network complexity and the exponential relationship between the dimensionality and volume of the observation space, the agent fails to generalize intelligently to new, unseen observations (Goodfellow et al. (2016)). An obvious solution is to reduce the observation space’s dimensionality significantly. However, simply reducing the sensor resolution is infeasible, as this would accordingly degrade the agent’s situational awareness.

In this work, we partition the sensor suite into $D$ sectors, each of which produces a scalar measurement included in the final observation vector, effectively summarizing the local sensor readings within the sector. However, given our desire to minimize its dimensionality, dividing the sensors into sectors of uniform size is sub-optimal, as obstacles located in front of the vessel are significantly more critical and thus require higher perceptive accuracy than those located at its rear. In order to realize such a non-uniform partitioning, we use a logistic function, a choice that also fulfills our general preference for symmetry. Assuming a counter-clockwise ordering of sensors and sectors starting at the rear of the vessel, we map a given sensor index, $i \in \{1, \ldots, N\}$, to a sector index, $k \in \{1, \ldots, D\}$, according to

$$ \kappa : i \mapsto \kappa(i) = \underbrace{D\,\sigma\!\left( \frac{\gamma_C\, i}{N} - \frac{\gamma_C}{2} \right)}_{\text{Non-linear mapping}} - \underbrace{D\,\sigma\!\left( -\frac{\gamma_C}{2} \right)}_{\text{Constant offset}}, \tag{11} $$

where $\sigma$ is the logistic sigmoid function, and $\gamma_C$ is a scaling parameter controlling the density of the sector distribution such that decreasing it will yield a more evenly distributed partitioning. We can then formally define the distance measurement vector for the $k$th sector, which we denote by $w_k$, according to

$$ w_{k,i} = x_i \quad \text{for } i \in \{1, \ldots, N\} \text{ such that } \kappa(i) = k $$

Figure 6: Rangefinder sensor suite containing $N$ distance sensors, partitioned into $D$ sectors (black dashed lines) according to the mapping function $\kappa$. The dashed edges illustrate the maximum reachable distance in each sector, as calculated by the feasibility pooling algorithm. The perceptive distance component of the RL-agent’s observation space consists of the closeness mapping of these distances.

Next, we select a mapping $f : \mathbb{R}^n \to \mathbb{R}$, which takes the vector of distance measurements $w_k$, for an arbitrary sector index $k$, as input, and outputs a scalar value based on the current sensor readings within that sector. The feasibility pooling procedure, introduced in Meyer et al. (2020b), calculates the maximum reachable distance within each sector, taking into account the obstacle sensor readings’ location and the vessel’s width. This method iterates over the sector’s distance measurements in ascending order and checks whether it is feasible for the vessel to advance beyond this level. As soon as the broadest available opening within a distance level is deemed too narrow given the vessel’s width, the maximum reachable distance has been reached. Formally, we define $f$ as the feasibility pooling algorithm, and the resulting perceptive distance observation is summarized in Figure 6. To finalize the processing of distance measurements, we introduce the concept of closeness. An obstacle’s closeness is zero if it is at a distance further than $S_r$ away from the vessel and unity if the vessel has collided with the obstacle. Furthermore, within this range, it is reasonable to map distance to closeness in a logarithmic fashion, such that, following human intuition, the difference between 10 m and 100 m is more significant than the difference between, for instance, 510 m and 600 m. Formally, the maximum reachable distance, $d$, maps to closeness, $c(d) : \mathbb{R} \to [0, 1]$, according to

$$ c(d) = \operatorname{clip}\left( 1 - \frac{\log(d + 1)}{\log(S_r + 1)}, \; 0, \; 1 \right). \tag{12} $$
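The closeness mapping of Equation (12) is compact enough to illustrate directly. The sketch below uses a sensor range of 1500 m, the value quoted later in this paper; the function name is an assumption, and non-negative distances are assumed.

```python
import math


def closeness(d, sensor_range):
    """Logarithmic closeness of Equation (12), clipped to [0, 1].

    d is the maximum reachable distance in a sector (assumed d >= 0).
    """
    ratio = math.log(d + 1.0) / math.log(sensor_range + 1.0)
    return min(max(1.0 - ratio, 0.0), 1.0)


S_R = 1500.0  # sensor range in metres

# Differences at short range dominate differences at long range
near_gap = closeness(10.0, S_R) - closeness(100.0, S_R)
far_gap = closeness(510.0, S_R) - closeness(600.0, S_R)
```

Here `near_gap` is markedly larger than `far_gap`, which reflects the intuition that a change from 10 m to 100 m matters more than a change from 510 m to 600 m.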


4.2.3. Motion detection

The maximum reachable distance in a sector may equal the maximum sensor range even though there is an obstacle in that sector. Thus, by applying the feasibility pooling algorithm to reduce the dimensionality of the rangefinder suite, the resulting closeness observation may fail to inform the RL agent about nearby obstacles. To make the agent aware of nearby moving obstacles, we incorporate the velocities of the nearest obstacle in each sector into the observation vector. Admittedly, while this implementation is trivial in a simulated environment, a real-world implementation will necessitate a reliable way of estimating obstacle velocities based on sensor data. However, even if this can be challenging due to uncertainty in the sensor readings, object tracking is a well-researched computer vision discipline. We reserve the implementation of such a method to future research but refer the reader to Granstrom et al. (2016) for a comprehensive overview of the current state of this field.

Specifically, the decomposition, which yields the $x$ and $y$ components of the obstacle velocity, considers the coordinate frame in which the $y$-axis is parallel to the centerline of the sensor sector in which the obstacle is present. Thus, we provide the decomposed velocity of the closest moving obstacle within each sector as features for the agent’s observation vector. For each sector $k$, we denote the corresponding decomposed $x$ and $y$ velocities as $v_{x,k}$ and $v_{y,k}$, respectively. Naturally, if no moving obstacles are present within the sector, both components are zero.

4.2.4. Perception state vector

By concatenating the closeness of the maximum reachable distance and the decomposed obstacle velocity for each sector, we then define the perception state vector, $s_p$, as

$$ s_p^{(t)} = \Big[ \underbrace{c\big(f(w_1^{(t)})\big), \; v_{x,1}^{(t)}, \; v_{y,1}^{(t)}}_{\text{First sector}}, \; \ldots \Big]^T. \tag{13} $$

4.3. Risk-based implementation of COLREGs

In model-free RL, the trained agent will assume a policy that maximizes the expected reward. To lead this policy to adhere to the COLREGs, we must incorporate them into the reward function. As previously mentioned, the rules are ambiguous and cannot be implemented explicitly. Instead, we use collision risk indices (CRIs) as analogs, and the following motivates how they are intended to guide the RL agent towards COLREG-compliance.

4.3.1. Risk-based reward function

Building on the theory presented in Section 3.2, a collision risk index (CRI) is calculated using fuzzy evaluation. Here, this translates to a weighted sum of evaluated risk factors, a method described in detail in Section 4.3.2. This method encapsulates collision risk’s continuous and fuzzy nature, making it a convincing choice for translating the COLREGs into a DRL-based framework. Collision risk is typically only applied to encounter situations between two dynamic objects, and the collision risk index to be presented here is no exception. Thus, the reward components for path following, static obstacle avoidance, collision penalty, and living penalty must be defined separately. The corresponding components from a previous approach (Meyer et al. (2020a)) are applied here due to their excellent path following and obstacle avoidance results. The reward components for path following and static obstacle avoidance are given in Equations (14) and (15), while the collision and living penalties are negative constants. As a result, the total reward function, reiterated in Equation (16), has the same structure as in the previous approach, except for a risk-based penalty for dynamic obstacles ($r_{colav,dyn}$).

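$$ r_{path}^{(t)} = \underbrace{\left( \frac{u^{(t)}}{U_{max}} \cos\tilde{\psi}^{(t)} + \gamma_r \right)}_{\text{Velocity-based reward}} \underbrace{\left( \exp\left( -\gamma_\varepsilon |\varepsilon^{(t)}| \right) + \gamma_r \right)}_{\text{CTE-based reward}} - \gamma_r^2 \tag{14} $$

$$ r_{colav,stat}^{(t)} = - \frac{\displaystyle\sum_{i=1}^{N} \frac{\alpha_x}{1 + \gamma_{\theta,stat} |\theta_i|} \exp\left( -\gamma_x x_i \right)}{\displaystyle\sum_{i=1}^{N} \frac{1}{1 + \gamma_{\theta,stat} |\theta_i|}} \tag{15} $$

$$ r = \begin{cases} r_{collision}, & \text{if collision} \\ \lambda r_{path} + (1 - \lambda)\, r_{colav} + r_{exists}, & \text{otherwise} \end{cases} \tag{16} $$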

The penalty for dynamic obstacles forms part of the overall penalty for collision avoidance, denoted $r_{colav}$ and given by

$$ r_{colav} = r_{colav,dyn} + r_{colav,stat}. \tag{17} $$

For every TS within the OS’s sensor range, a collision risk index $CRI \in [0, 1]$ is calculated (see Section 4.3.2). Since the CRI increases proportionally to collision risk, it can be used semi-directly in the reward function. By multiplying the $CRI_i$ of each target vessel, $i$, with a scaling factor $\beta_{CRI} > 0$, the penalty level can be weighted relative to the rest of the reward function:

$$ r_{colav,dyn} = -\sum_{i} \beta_{CRI} \cdot CRI_i \tag{18} $$
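To make the composition of the reward concrete, the following sketch combines the path-following reward, the dynamic-obstacle penalty, and the total reward described in this subsection. It is an illustration under stated assumptions: the function names are hypothetical, the static-obstacle term is passed in as a precomputed value, and the numeric values used in the example ($\gamma_r$, $\gamma_\varepsilon$, $\lambda$, and the collision and living penalties) are placeholders, not the configuration of this work. Only $\beta_{CRI} = 10$ is taken from Table 2.

```python
import math


def path_reward(u, u_max, heading_error, cte, gamma_r, gamma_eps):
    """Path-following reward: velocity-based term times CTE-based term."""
    velocity_term = u / u_max * math.cos(heading_error) + gamma_r
    cte_term = math.exp(-gamma_eps * abs(cte)) + gamma_r
    return velocity_term * cte_term - gamma_r ** 2


def dynamic_colav_reward(cri_values, beta_cri=10.0):
    """Risk-based penalty: scaled sum of the CRI of every detected TS."""
    return -sum(beta_cri * cri for cri in cri_values)


def total_reward(collided, r_path, r_colav_stat, r_colav_dyn,
                 lam, r_collision, r_exists):
    """Total reward: collision penalty, or weighted path and colav terms."""
    if collided:
        return r_collision
    r_colav = r_colav_dyn + r_colav_stat
    return lam * r_path + (1.0 - lam) * r_colav + r_exists


# Placeholder values, chosen only to make the arithmetic easy to follow
r_p = path_reward(u=2.0, u_max=2.0, heading_error=0.0, cte=0.0,
                  gamma_r=1.0, gamma_eps=0.05)
r_dyn = dynamic_colav_reward([0.2, 0.5])
r_total = total_reward(False, r_p, -1.0, r_dyn, lam=0.5,
                       r_collision=-2000.0, r_exists=-1.0)
```

Note that the subtraction of $\gamma_r^2$ makes the path reward vanish when the vessel is both stationary and far from the path, since each bracketed term then reduces to $\gamma_r$.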

4.3.2. Calculating the collision risk index

In order to determine the collision risk in an encounter situation, one must first define what constitutes a collision risk and how much each risk factor contributes to the overall risk. The state-of-the-art methods of computing CRIs generally use fuzzy evaluation (Xu and Wang (2014)), making it a natural choice here too. In short, three steps should be followed:

1. Define individual risk factors.
2. Define membership functions.
3. Design overall CRI as a function of membership functions.

The chosen risk factors and their membership functions are elaborated on in the following, leading up to the CRI function design.

A common starting point for defining risk is looking at the distance and time to the closest point of approach, denoted DCPA and TCPA. As the descriptive name suggests, the closest point of approach (CPA) is the closest point relative to the OS that the TS in question will come, given that the relative course and relative velocity between the two ships stay the same. The DCPA, then, is the distance to the CPA, while the TCPA is the time until the TS arrives at the CPA. Put differently, the DCPA quantifies the severity of a potential collision situation, while the TCPA quantifies its urgency. When determining the risk level associated with them, it is customary to employ upper and lower bounds for these quantities, denoted $d_L$ and $d_U$ for DCPA, and $t_L$ and $t_U$ for TCPA. Doing so, the membership functions $u_{DCPA}$ and $u_{TCPA}$ output unity (highest risk level) whenever $|DCPA| \le d_L$ and $|TCPA| \le t_L$, respectively. Conversely, their outputs are zero when $|DCPA| \ge d_U$ and $|TCPA| \ge t_U$. As was done in Gang et al. (2016), a second-order function is used between the two extremities. Chen et al. (2014) use a sinusoidal function instead. Although the latter has the virtue of being smooth, it was deemed inexpedient due to the large outputs for a wide interval of values, overshadowing other elements of the CRI. Since the sensor range used in this work is relatively short (1500 m), the steeper second-order function improved learning. It is worth noting that the sinusoidal function may be more suited in a setup with fewer obstacles and vessels where AIS data from a larger region is used.
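The text above defines DCPA and TCPA verbally. Under the stated assumption of constant relative course and velocity, they follow from standard straight-line kinematics, as sketched below. The function name and the handling of a vanishing relative velocity are assumptions and are not taken from this work.

```python
import math


def closest_point_of_approach(rel_pos, rel_vel):
    """Return (DCPA, TCPA) for a TS with constant velocity relative to the OS.

    rel_pos: position of the TS relative to the OS (m).
    rel_vel: velocity of the TS relative to the OS (m/s).
    """
    px, py = rel_pos
    vx, vy = rel_vel
    speed_sq = vx * vx + vy * vy
    if speed_sq < 1e-12:
        # Assumed convention: no relative motion, so the distance never changes
        return math.hypot(px, py), 0.0
    # Time at which the relative distance is minimal
    tcpa = -(px * vx + py * vy) / speed_sq
    # Relative distance at that time
    dcpa = math.hypot(px + vx * tcpa, py + vy * tcpa)
    return dcpa, tcpa


# TS 1000 m ahead and 300 m to the side, closing at 10 m/s
dcpa_closing, tcpa_closing = closest_point_of_approach((1000.0, 300.0), (-10.0, 0.0))
# Same geometry but moving away: the TCPA becomes negative
dcpa_receding, tcpa_receding = closest_point_of_approach((1000.0, 300.0), (10.0, 0.0))
```

A negative TCPA indicates that the closest point of approach lies in the past, which is the case the sign-dependent TCPA membership function later in this section is designed to treat differently.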

The values for the lower and upper bounds depend largely on the application. In general, $d_L$ defines the minimal safe encounter distance, and $d_U$ is the absolute safe encounter distance (Gang et al. (2016)). For DCPA, the membership function is defined as

$$ u_{DCPA} = \begin{cases} 1 & \text{if } |DCPA| \le d_L \\ 0 & \text{if } |DCPA| \ge d_U \\ \left( \dfrac{d_U - |DCPA|}{d_U - d_L} \right)^2 & \text{otherwise} \end{cases} \tag{19} $$

with $d_L$ and $d_U$ as positive integers. The DCPA membership function is presented graphically in Figure 7.
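A minimal sketch of the DCPA membership function in Equation (19) is given below, using $d_L = 320$ m and $d_U = 1500$ m as default bounds (the values listed for this work). The function name is an assumption.

```python
def u_dcpa(dcpa, d_lower=320.0, d_upper=1500.0):
    """Second-order DCPA membership function of Equation (19)."""
    magnitude = abs(dcpa)
    if magnitude <= d_lower:
        return 1.0  # inside the minimal safe encounter distance
    if magnitude >= d_upper:
        return 0.0  # beyond the absolute safe encounter distance
    return ((d_upper - magnitude) / (d_upper - d_lower)) ** 2


# Halfway between the bounds the quadratic gives a quarter, not a half
midpoint_value = u_dcpa(910.0)
```

The quadratic decay means that the membership drops quickly once the DCPA exceeds the lower bound, which matches the motivation given above for preferring the steeper second-order function.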

For the bounds on TCPA, the method used in Gang et al. (2016) and presented in Equation (20) is employed. Doing so adjusts the output of $u_{TCPA}$ according to the distance between the OS and TS, accurately presenting the high risk when the distance is below or close to the lower bound $d_L$ and low risk when it is closer to the upper bound $d_U$. It is assumed that DCPA never exceeds $d_U$, meaning that $d_U$ is set to the maximum detectable DCPA.

Figure 7: Membership function for DCPA with $d_L$ = 320 m and $d_U$ = 1500 m. (Axes: |DCPA| [m] versus degree of membership.)

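$$ t_L = \begin{cases} \dfrac{\sqrt{d_L^2 - DCPA^2}}{v_R} & \text{if } |DCPA| \le d_L \\[2ex] \dfrac{d_L - DCPA}{v_R} & \text{if } |DCPA| > d_L \end{cases} \tag{20a} $$

$$ t_U = \frac{\sqrt{d_U^2 - DCPA^2}}{v_R} \tag{20b} $$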

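In Gang et al. (2016), equal importance has been given to positive and negative values of TCPA through the membership function below:

$$ u_{TCPA} = \begin{cases} 1 & \text{if } |TCPA| \le t_L \\ 0 & \text{if } |TCPA| \ge t_U \\ \left( \dfrac{t_U - |TCPA|}{t_U - t_L} \right)^2 & \text{else} \end{cases} \tag{21} $$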

However, noting that negative values of TCPA indicate that the OS and TS have passed each other, it makes sense to pay attention to the sign of TCPA. This is supported by Park et al. (2006), where a fuzzy case-based reasoning system for collision avoidance is proposed. In their work, the TCPA membership function in Figure 8 is applied, indicating the significantly higher risk associated with positive values of TCPA.

Figure 8: Membership function for TCPA employed in Park et al. (2006). SAN = Safe Negative, DAN = Dangerous Negative, VDP = Very Dangerous Positive, DAP = Dangerous Positive, MEP = Medium Positive, SAP = Safe Positive, VSP = Very Safe Positive.

Following this line of reasoning, a distinction between positive and negative values of TCPA is made according to Equation (22). The cut-off value for negative values (negative limit) was chosen as $t_{NL} = d_L / v_R$, such that the degree of membership is larger than zero whenever the OS is less than $t_{NL}$ time steps away from the TS. The membership function for TCPA is plotted in Figure 9.

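$$ u_{TCPA} = \begin{cases} \begin{cases} 1 & \text{if } TCPA \le t_L \\ 0 & \text{if } TCPA \ge t_U \\ \left( \dfrac{t_U - TCPA}{t_U - t_L} \right)^2 & \text{else} \end{cases} & \text{if } TCPA \ge 0 \\[4ex] \begin{cases} 0 & \text{if } TCPA \le -t_{NL} \\ \left( \dfrac{t_{NL} - |TCPA|}{t_{NL}} \right)^2 & \text{else} \end{cases} & \text{if } TCPA < 0 \end{cases} \tag{22} $$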
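The sign-dependent TCPA membership function can be sketched as follows. Several points here are assumptions rather than statements from this work: the lower bound is taken as a square-root expression analogous to Equation (20b), the absolute value of DCPA is used in both branches of the lower bound, and the negative branch is taken to be zero for $TCPA \le -t_{NL}$, which is how the text describes the negative cut-off. The function names are hypothetical, and the default bounds are $d_L = 320$ m and $d_U = 1500$ m.

```python
import math


def tcpa_bounds(dcpa, v_rel, d_lower=320.0, d_upper=1500.0):
    """Return (t_L, t_U, t_NL) for a given DCPA and relative speed v_rel > 0."""
    magnitude = abs(dcpa)
    if magnitude <= d_lower:
        t_lower = math.sqrt(d_lower ** 2 - magnitude ** 2) / v_rel
    else:
        t_lower = (d_lower - magnitude) / v_rel
    # DCPA is assumed not to exceed d_upper; the max() only guards rounding
    t_upper = math.sqrt(max(d_upper ** 2 - magnitude ** 2, 0.0)) / v_rel
    t_negative = d_lower / v_rel
    return t_lower, t_upper, t_negative


def u_tcpa(tcpa, t_lower, t_upper, t_negative):
    """TCPA membership with separate treatment of positive and negative TCPA."""
    if tcpa >= 0.0:
        if tcpa <= t_lower:
            return 1.0
        if tcpa >= t_upper:
            return 0.0
        return ((t_upper - tcpa) / (t_upper - t_lower)) ** 2
    if tcpa <= -t_negative:
        return 0.0
    return ((t_negative - abs(tcpa)) / t_negative) ** 2


# Direct collision course (DCPA = 0) at a relative speed of 1 m/s
t_l, t_u, t_nl = tcpa_bounds(0.0, 1.0)
approaching = u_tcpa(910.0, t_l, t_u, t_nl)
just_passed = u_tcpa(-160.0, t_l, t_u, t_nl)
```

With these settings the membership decays over roughly 1500 s for approaching vessels but over only 320 s once the vessels have passed each other, reflecting the higher risk associated with positive TCPA.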

Figure 9: Membership function for TCPA with $d_L$ = 320 m, $d_U$ = 1500 m and $v_R$ = 1 m/s. (Axes: TCPA [s] versus degree of membership.)

Further, the collision risk depends on the position of the TS relative to the OS, which can be expressed through the absolute distance, $R$, between them and the bearing angle of the TS, $\theta_T$. Since the risk is higher on the starboard side of the OS, as expressed in Rule 14 (head-on situation) of the COLREGs, the membership functions should be designed with a bias on that side. Inspired by Davis et al. (1980), it is customary to introduce a bias of 19° starboard. Davis developed the concept of ship arena, briefly described in Section 3.2, and designed a scaling of the upper bound:

$$ R_D = 1.7 \cos\left( \theta_T \frac{\pi}{180} - 19° \right) + \sqrt{4.4 + 2.89 \cos^2\left( \theta_T \frac{\pi}{180} - 19° \right)}, \tag{23} $$

while the lower bound is usually 12 times the OS length $L_{pp}$ (Gang et al. (2016)) but set to $8 L_{pp}$ here due to the smaller scale. Initially, the upper bound given by $R_D$ was implemented, but it quickly became apparent that adjustments had to be made to ensure that the agent received sufficiently negative reward when approaching TSs, regardless of their bearing angle. The difference in scaling of 4.4 times for ships detected at 19° and 161° (180°−19°) was too large considering the relatively densely populated training and testing scenarios and a restricted sensor range of 1500 m. Through testing, it was observed that the distance membership function could be made uniform while still preserving the correct behavior in head-on situations as long as the membership function for the bearing angle, $\theta_T$, was given enough weight. As a result, the lower and upper bounds for the absolute distance, $R$, were chosen as

$$ R_L = \beta_{RL} L_{pp} \tag{24a} $$

$$ R_U = \beta_{RU} L_{pp} \tag{24b} $$

with $\beta_{RL}$ and $\beta_{RU}$ chosen as appropriate scaling constants.

Figure 10: Membership function for distance to the target ship, with $\theta_T$ = 0°. (Axes: distance [m] versus degree of membership.)

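Following the logic applied to the membership functions for TCPA and DCPA, we arrive at the following membership function for the absolute distance between the OS and TS:

$$ u_R = \begin{cases} 1 & \text{if } R \le R_L \\ 0 & \text{if } R \ge R_U \\ \left( \dfrac{R_U - R}{R_U - R_L} \right)^2 & \text{else} \end{cases} \tag{25} $$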

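To encourage the appropriate behavior in head-on situations, the function for the bearing angle of the TS relative to the OS should be largest on the starboard side. Defining $\theta_{PU}$, $\theta_{PL}$, $\theta_{NU}$, and $\theta_{NL}$ as the positive upper, positive lower, negative upper, and negative lower bounds on $\theta_T$, the membership function for the bearing angle is defined as below and illustrated in Figure 11.

$$ u_{\theta_T} = \begin{cases} \operatorname{clip}\left( \left( \dfrac{\theta_{PU} - \theta_T}{\theta_{PU} - \theta_{PL}} \right)^2, 0, 1 \right) & \text{if } \theta_T \ge 0 \\[3ex] \operatorname{clip}\left( \left( \dfrac{\theta_{NU} - |\theta_T|}{\theta_{NU} - \theta_{NL}} \right)^2, 0, 1 \right) & \text{if } \theta_T < 0 \end{cases} \tag{26} $$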

Figure 11: Membership function for the bearing angle, $\theta_T$, of the target ship, with bounds $\theta_{PU}$ = 180°, $\theta_{PL}$ = 45°, $\theta_{NU}$ = 90°, and $\theta_{NL}$ = 22.5°. (Axes: bearing of target ship [deg] versus degree of membership.)

After implementing a CRI containing the four membership functions introduced so far, it became clear that it was necessary to add an element to the CRI to deter the OS from crossing ahead of a TS. Since the TS’s speed towards the OS can quantify whether the OS is ahead of the TS and is readily available in the observation vector ($v_y$ and $v_x$), an additional membership function is designed. Hence, we define $u_V(\cdot)$ as the ratio of the TS’s speed towards the OS to its absolute speed, as described in Equation (27). Such a ratio was chosen to avoid issues with differences in speed among the TSs, which quickly could have arisen if the numerical value of $v_y$ had been used instead. On the other hand, it might be desirable to distinguish between ships that cross ahead at different speeds, as faster ships naturally pose a higher risk. However, this is considered to be outside the scope of this work. It is worth noting that $u_V(\cdot)$ is negative when $v_y$ is negative, emphasizing the advantage of astern crossings.

$$ u_V = \frac{v_y}{\sqrt{v_x^2 + v_y^2}} \tag{27} $$

Integrating the introduced membership functions into a collision risk index, we have that

$$ CRI = \max\left( 0, \; \alpha_{CPA} \sqrt{u_{DCPA} \cdot u_{TCPA}} + \alpha_{\theta_T} u_{\theta_T} + \alpha_R u_R + \alpha_V u_V \right) \tag{28} $$

where the CPA composite term was designed such that a combination of low values for both DCPA and TCPA gives rise to a high CRI. It also accurately expresses how a high value of either DCPA or TCPA, i.e., a low value of either membership function, significantly reduces the overall risk. The max-function is applied to ensure that the CRI is always greater than or equal to zero.

Finally, values are assigned to the weights such that the sum is equal to unity, giving

$$ \alpha_{CPA} + \alpha_{\theta_T} + \alpha_R + \alpha_V = 1 \tag{29} $$
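The aggregation of the membership values into a single index can be sketched as follows. The default weights are those listed in Table 2; the function names and the treatment of a stationary target ship in the velocity ratio are assumptions.

```python
import math


def u_velocity(v_x, v_y):
    """Ratio of the TS speed towards the OS to its absolute speed."""
    speed = math.hypot(v_x, v_y)
    if speed < 1e-12:
        # Assumed convention: a stationary TS contributes no velocity risk
        return 0.0
    return v_y / speed


def collision_risk_index(u_dcpa, u_tcpa, u_theta, u_r, u_v,
                         a_cpa=0.3, a_theta=0.2, a_r=0.3, a_v=0.2):
    """Weighted sum of membership values, floored at zero."""
    cpa_term = math.sqrt(u_dcpa * u_tcpa)
    weighted = a_cpa * cpa_term + a_theta * u_theta + a_r * u_r + a_v * u_v
    return max(0.0, weighted)


# Every risk factor at its maximum gives the largest possible index
worst_case = collision_risk_index(1.0, 1.0, 1.0, 1.0, 1.0)
# A TS moving directly away with no other risk factors is floored at zero
moving_away = collision_risk_index(0.0, 0.0, 0.0, 0.0, u_velocity(0.0, -2.0))
# Mixed case: moderate CPA risk, TS partly heading towards the OS
mixed = collision_risk_index(0.25, 1.0, 0.5, 0.0, u_velocity(3.0, 4.0))
```

Because the CPA term is a geometric mean, it vanishes as soon as either the DCPA or the TCPA membership is zero, whereas the remaining terms contribute independently.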

In this work, the parameter values specified in Table 2 are used. Initial choices were made based on values suggested in the literature (Chen et al. (2014); Yan (2002)), emphasizing DCPA and TCPA. However, it was discovered that more weight had to be placed on the target bearing angle, absolute distance, and approaching velocity to achieve the desired behavior. The configuration of the path following and static obstacle rewards listed in Meyer et al. (2020a) has been applied in this work.

4.4. Performance evaluation

A three-step evaluation process is employed to assess the performance of the RL agent. First, the agent’s behavior and performance in the training environment are assessed, and snippets from situations relevant to Rules 14-16 of the COLREGs are presented. Next, two-vessel testing scenarios are constructed to test for COLREG-compliance specifically. Lastly, the agent is evaluated in AIS-based environments. These modes of assessment are described individually in the following subsections.

4.4.1. Performance in the training environment

A natural starting point for performance evaluation is assessing the agent’s behavior in its training environment. The overall performance can be evaluated by collecting statistics on the collision rate, level of path completion, and reward. These statistics serve as a guide for when to stop the training and a point of comparison between approaches. Moreover, a qualitative assessment is made by observing the agent’s behavior through video recordings. Snippets are chosen from the videos to highlight the behavior in situations where the COLREGs apply. The COLREGs do not always apply, since the training environment often presents the agent with difficult situations containing various static and dynamic obstacles, which cannot be accurately subjected to the COLREGs.

4.4.2. Testing of COLREG-compliance

The next step in the testing process is subjecting the agent to scenarios specifically designed to capture COLREG-compliance. This is especially useful since it is challenging to find scenarios that perfectly showcase COLREG-compliance in the training environment. However, the agent’s success can easily be quantified through simpler two-vessel scenarios. One scenario to be tested is self-evident, namely the head-on scenario. In addition, two different crossing situations, one from the starboard and one from the port side, were chosen. For each scenario, the TS’s initial angles and path angles are varied slightly within a range of ±5° of the default angles, which allows for an accumulation of statistics on success rate in the respective scenarios. It should be noted that the target ships have been modeled exclusively as large vessels in the testing scenarios to reflect the size of the large ships encountered in the AIS-based scenarios and for visual clarity.

4.4.3. AIS-based testing

Lastly, three environments based on real-world high-fidelity terrain data are used to assess the generalization performance of the agent. These environments were developed by Meyer (2020) using AIS tracking data and terrain data from the Trondheim Fjord area and are distinctly different. A dashed black line represents the desired OS trajectory in the following illustrations. Each TS is drawn at its initial position, and trajectories are drawn as dotted red lines. Note that these are examples of spawned environments and that a set number of target ships is chosen from the AIS database each time an instance of the specific scenario is created. Additionally, the apparent density of TS trajectories does not directly reflect the number of encounters, as this depends on the speed of each vessel.

The first AIS-based scenario is the Trondheim scenario (Figure 15b), in which the agent is required to cross a fjord of width ∼12 km while following a straight path. Doing so, it mainly meets crossing traffic consisting of larger vessels. In the challenging Ørland-Agdenes scenario (Figure 15a), the agent encounters two-way traffic in a narrow fjord entrance region. It must blend into the heavy traffic to complete the path while avoiding head-on collisions. In addition, the ability to overtake other vessels is assessed. As in the Trondheim scenario, the vessels are primarily bigger than the OS. Lastly, the Froan scenario (Figure 15c) offers a demanding terrain with hundreds of small islands. As a result, it tests the ability of the agent to generalize to a challenging environment with a high density of static obstacles in varying shapes and sizes. The area is less trafficked, and the vessels encountered are physically similar to the OS.

Table 2: Reward configuration for the risk-based approach.

Parameter            Interpretation                                            Value
$\beta_{CRI}$        Scaling factor for overall risk level                     10
$\beta_{RL}$         Scaling factor for the lower bound on distance            8
$\beta_{RU}$         Scaling factor for the upper bound on distance            18
$\theta_{PU}$        Positive upper limit for the bearing angle $\theta_T$     180°
$\theta_{PL}$        Positive lower limit for the bearing angle $\theta_T$     45°
$\theta_{NU}$        Negative upper limit for the bearing angle $\theta_T$     90°
$\theta_{NL}$        Negative lower limit for the bearing angle $\theta_T$     22.5°
$\alpha_{CPA}$       Weighting of CPA membership function                      0.3
$\alpha_{\theta_T}$  Weighting of target bearing angle membership function     0.2
$\alpha_R$           Weighting of absolute distance membership function        0.3
$\alpha_V$           Weighting of approaching velocity membership function     0.2
$d_L$                Minimal safe encounter distance                           320 m
$d_U$                Absolute safe encounter distance                          1500 m

5. Results and discussion

In this section, the results from the risk-based implementation of the COLREGs are presented and evaluated. First, the RL agent is evaluated in the synthetic training environment, considering its general path following and collision avoidance performance. Second, the presence and consistency of COLREG-compliant behavior are assessed in isolated, high-risk encounters. Finally, the agent is presented with the simulated real-world AIS-based scenarios to see how the learned policy generalizes to complex and unseen situations.

5.1. Training and testing in the synthetic environment

After training the RL agent in the synthetic environment (Figure 4) for approximately 4000 episodes, its collision rate dropped to near zero, and the progress rate rose to 100%. Snippets from the training environment are included in Figure 12, showcasing training scenarios in which the agent behaves in a COLREG-compliant manner. The COLREGs clearly define the required behavior in these situations: passing on the right in head-on situations, slowing down and passing astern instead of ahead, and allowing space between the OS and the TS during overtaking. Although the training statistics indicate the agent’s ability to navigate and avoid collisions, they do not reveal whether the COLREG-compliance is consistent, which must be evaluated separately.
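The two training statistics quoted above, collision rate and progress rate, can be tracked with a moving window over recent episodes. The sketch below is illustrative only: the `EpisodeResult` record, its fields, the `TrainingMonitor` class, and the window length are assumptions and do not describe the logging used in this work.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    collided: bool    # True if the episode ended in a collision
    progress: float   # fraction of the path completed, in [0, 1]


class TrainingMonitor:
    """Moving-window collision rate and mean progress (hypothetical helper)."""

    def __init__(self, window=100):
        # Only the most recent `window` episodes are kept.
        self.results = deque(maxlen=window)

    def add(self, result):
        self.results.append(result)

    def collision_rate(self):
        if not self.results:
            return 0.0
        return sum(r.collided for r in self.results) / len(self.results)

    def mean_progress(self):
        if not self.results:
            return 0.0
        return sum(r.progress for r in self.results) / len(self.results)
```

A windowed statistic of this kind reflects only recent behavior, so early collisions stop influencing the reported rate once training has moved past them.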

Figure 12: Risk-based agent performing common naval collision avoidance maneuvers in the training environment: (a) head-on situation, (b) astern passing, (c) overtaking, (d) static COLAV. Axes: East (km) and North (km). The agent’s trajectory is drawn as a blue dashed line, and the target ships with trajectories are drawn in red. The dotted vessel outlines show their positions 100 time steps prior.

5.1.1. Testing of COLREG-compliance

The next step in the evaluation process is to test COLREG-compliance through repeated trials in different encounter scenarios. Figure 13 shows how the agent avoids collision in a COLREG-compliant manner. In addition, the agent follows the path well once the encounter has passed. Repeated testing reveals that these results are stable, as the correct behavior was seen in 100% of the episodes for each testing scenario, as summarised in Table 3. These results indicate that the agent can intelligently interpolate between path following and COLREG-compliant collision avoidance in isolated high-risk encounters. However, there is no guarantee that this behavior translates into more complex scenarios.


Table 3: Results from repetitive testing of COLREG-compliance with slightly varying scenarios, 100 episodes.

| Scenario | Success rate |
|---|---|
| Head-on | 100% |
| Crossing from starboard | 100% |
| Crossing from port | 100% |
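A 100% success rate over a finite number of episodes bounds the true failure probability but does not eliminate it. As a worked illustration (an addition here, not an analysis performed in this work), the sketch below computes the exact one-sided lower confidence bound on the success probability when all n episodes succeed. It assumes the episodes are independent and that the 100 episodes in Table 3 apply to each scenario; the function name is hypothetical.

```python
def all_success_lower_bound(n_episodes, alpha=0.05):
    """Exact one-sided (1 - alpha) lower bound on p when n of n trials succeed.

    If the true success probability is p, the chance of n straight
    successes is p**n; solving p**n = alpha for p gives the bound.
    """
    if n_episodes <= 0:
        raise ValueError("n_episodes must be positive")
    return alpha ** (1.0 / n_episodes)


# 100 successful episodes, as reported in Table 3
bound = all_success_lower_bound(100)
print(f"95% lower bound on success probability: {bound:.3f}")
```

With 100 episodes the bound is roughly 0.97, so a per-encounter failure probability of up to about 3% cannot be ruled out at the 95% level from these tests alone.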

Figure 13: Agent behaviour in COLREG-compliance test scenarios: (a) Test scenario 1: Head-on. (b) Test scenario 2: Crossing from starboard. (c) Test scenario 3: Crossing from port. Axes: East (km) and North (km); the start and goal positions are marked in each panel. The agent’s trajectory is drawn as a blue dashed line, and the target ships with trajectories are drawn in red. The dotted vessel outlines show their positions 100 time steps prior to the present time.

5.2. Testing in AIS-based environments

Finally, the risk-based agent is assessed in AIS-based real-world environments to determine how well it generalizes to previously unseen scenarios. As these environments are modeled using real-world terrain mapping and AIS traffic data, the agent will likely encounter complex scenarios where the COLREGs do not clearly define the correct behavior. Therefore, we do not expect the agent always to find a COLREG-compliant solution. The agent’s excellent static obstacle avoidance and COLREG-compliant behavior are highlighted in Figure 14. Note that the static obstacles in Figure 14d are significantly smaller than those encountered in the training scenario.

Lastly, trajectories from each environment are presented in Figure 15. These trajectories illustrate the agent’s ability to dynamically follow a predetermined path in the face of static and moving obstacles. Whether the agent is faced with heavy two-way parallel traffic (Figure 15a), crossing traffic (Figure 15b), or an untraversable path (Figure 15c), it adapts to the situation and finds a suitable solution. Thus, the agent generalizes its decision-making policy from the synthetic and stochastic training environment to previously unseen environments.

Figure 14: Risk-based agent performing common naval collision avoidance maneuvers in the AIS-based environment: (a) head-on situation, (b) astern passing, (c) overtaking, (d) static COLAV. Axes: East (km) and North (km). The agent trajectory is drawn as a blue dashed line, and the target ships are drawn in red. The dotted vessel outlines show their positions 100 time steps prior to the present time.

Figure 15: Trajectories from three different AIS-based environments drawn as blue dashed lines. Target ships and trajectories are drawn in red. Axes: East (km) and North (km); the start and goal positions are marked in each panel.
(a) Risk-based agent’s trajectory in the Ørland-Agdenes test scenario. The agent intelligently merges with the two-way traffic and follows it towards the goal.
(b) Risk-based agent’s trajectory in the Trondheim test scenario. The agent exhibits excellent path following and is seen to maneuver around the crossing traffic before returning to the path and reaching the goal.
(c) Risk-based agent’s trajectory in the Froan test scenario. When presented with an impossible path, the agent intelligently finds another solution for navigating the dense archipelago and merges into the parallel traffic.

6. Conclusion

The primary objective of this work was to investigate whether COLREG-compliance in a path following and collision avoidance system based on model-free DRL is possible using a risk-based approach. In summary, we found that

• using state-of-the-art collision risk theory, the ambiguity of the COLREGs (rules 14-16) can be circumvented by conditioning the decision-making agent on collision risk indices as a proxy for the COLREGs.

• the agent learned complex rules by training in a stochastic and synthetic training environment, which translated well into real-world testing environments. Moreover, the approach produced COLREG-compliant behavior when tested in isolated encounters.

As described in Section 4.2.2, the agent perceives its surroundings by summarizing the information from N distance sensors into D sectors using feasibility pooling. Consequently, the agent directly reads the decomposed velocity of the nearest TS in each sector to ensure their detection. This approach leads to significant information loss, as the agent can observe neither high-frequency details, such as the size of the TS, nor whether there are multiple obstacles in a single sector. The curse of dimensionality, which motivates the feasibility pooling algorithm, can potentially be avoided by applying a convolutional neural network (CNN). Unlike fully connected networks, CNNs directly utilize spatial information in structured data. Thus, a CNN can exploit the strong correlation between neighboring sensor measurements, given that the resolution is high enough. Furthermore, by stacking multiple observations temporally, the observation contains sufficient information for the agent to infer any obstacle’s relative velocity and acceleration from the distance measurements alone, removing the need for object tracking and velocity estimation methods.
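The two ideas in this paragraph, pooling N sensor readings into D sectors and stacking observations over time, can be sketched as follows. This is an illustration under stated assumptions, not the method of Section 4.2.2: plain minimum pooling stands in for feasibility pooling, sectors are assumed to be equally sized, and the function names and the time step `dt` are hypothetical.

```python
def min_pool_sectors(distances, n_sectors):
    """Reduce N range readings to D values by taking each sector's minimum.

    Simplified stand-in for feasibility pooling; N must be divisible by D.
    Taking the minimum discards how many obstacles a sector contains,
    which is the kind of information loss described in the text.
    """
    n = len(distances)
    if n_sectors <= 0 or n % n_sectors != 0:
        raise ValueError("number of sensors must be divisible by n_sectors")
    size = n // n_sectors
    return [min(distances[i * size:(i + 1) * size]) for i in range(n_sectors)]


def range_rate_from_stack(stack, dt):
    """Estimate per-sensor range rate and acceleration by finite differences.

    stack is a list of scans, oldest first, each a list of N distances.
    At least three scans are needed. A negative range rate means the
    measured obstacle is getting closer along that sensor ray.
    """
    if len(stack) < 3:
        raise ValueError("need at least three stacked scans")
    oldest, previous, latest = stack[-3], stack[-2], stack[-1]
    rate = [(c - b) / dt for b, c in zip(previous, latest)]
    accel = [(c - 2.0 * b + a) / dt ** 2
             for a, b, c in zip(oldest, previous, latest)]
    return rate, accel
```

The finite differences show why temporal stacking can be sufficient in principle: two consecutive scans determine a range rate and three determine a range acceleration, which a network with access to the stacked scans could learn to approximate.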

Additionally, the COLREGs must be adapted for machine readability and interpretability, clearly defining the required behavior in different environments and situations. Without doing so, it will be impossible to accurately assess the success of an autonomous vessel and claim that it is fully COLREG-compliant. Nevertheless, the results obtained in this work indicate the potential of DRL to handle COLREG rules through risk-based conditioning. Thus, once the COLREGs are modernized for digital applications, DRL can be expected to produce COLREG-compliant and autonomous collision avoidance systems.

Acknowledgments

The authors acknowledge the financial support from the Norwegian Research Council and the industrial partners DNV GL, Kongsberg, and Maritime Robotics of the Autosit project (Grant No.: 295033).

References

Aniculaesei, A., Arnsberger, D., Howar, F., Rausch, A., 2016. Towards the verification of safety-critical autonomous systems in dynamic environments. Electronic Proceedings in Theoretical Computer Science, EPTCS 232, 79–90. doi:10.4204/EPTCS.232.10.

Autoship, 2020. AUTOSHIP – Autonomous Shipping Initiative for European Waters. https://www.autoship-project.eu/. Accessed: 2020-06-10.

Benjamin, M.R., Leonard, J.J., Curcio, J.A., Newman, P.M., 2006. A Method for Protocol-Based Collision Avoidance between Autonomous Marine Surface Crafts. Journal of Field Robotics 29, 554–575. doi:10.1002/rob.

Casalino, G., Turetta, A., Simetti, E., 2009. A three-layered architecture for real time path planning and obstacle avoidance for surveillance USVs operating in harbour fields. OCEANS ’09 IEEE Bremen: Balancing Technology with Future Needs, 1–8. doi:10.1109/OCEANSE.2009.5278104.

Chen, S., Ahmad, R., Lee, B.G., Kim, D.H., 2014. Composition ship collision risk based on fuzzy theory. Journal of Central South University 21, 4296–4302. doi:10.1007/s11771-014-2428-z.

Cockcroft, A.N., 1981. The estimation of collision risk for marine traffic. Journal of Navigation 34, 145–147. doi:10.1017/S0373463300024310.

Codevilla, F., Santana, E., Lopez, A.M., Gaidon, A., 2019. Exploring the limitations of behavior cloning for autonomous driving. arXiv:1904.08980.

Davis, P.V., Dove, M.J., Stockel, C.T., 1980. A computer simulation of marine traffic using domains and arenas. Journal of Navigation 33, 215–222. doi:10.1017/S0373463300035220.

Ding, J., Gillula, J.H., Huang, H., Vitus, M.P., Zhang, W., Tomlin, C.J., 2011. Hybrid Systems in Robotics: Toward Reachability-Based Controller Design. IEEE Robotics & Automation Magazine.

Dingus, T.A., Guo, F., Lee, S., Antin, J.F., Perez, M., Buchanan-King, M., Hankey, J., 2016. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proceedings of the National Academy of Sciences of the United States of America 113, 2636–2641. doi:10.1073/pnas.1513271113.

DNV GL, 2019. Digital twins and sensor monitoring. https://www.dnvgl.com/expert-story/maritime-impact/Digital-twins-and-sensor-monitoring.html. Accessed: 2020-02-25.

DNV GL, 2020. The ReVolt – A new inspirational ship concept. https://www.dnvgl.com/technology-innovation/revolt/index.html. Accessed: 2020-05-27.

Eriksen, B.O.H., 2019. Collision Avoidance and Motion Control for Autonomous Surface Vehicles. Ph.D. thesis. Norwegian University of Science and Technology.

European Maritime Safety Agency (EMSA), 2018. Annual Overview of Marine Casualties and Incidents 2018. http://www.emsa.europa.eu/news-a-press-centre/external-news/item/3406-annual-overview-of-marine-casualties-and-incidents-2018.html. Accessed: 2019-11-05.

Fiorini, P., Shiller, Z., 1998. Motion planning in dynamic environments using velocity obstacles. International Journal of Robotics Research 17, 760–772. doi:10.1177/027836499801700706.

Fossen, T.I., 2011. Handbook of Marine Craft Hydrodynamics and Motion Control. doi:10.1002/9781119994138.

Fox, D., Burgard, W., Thrun, S., 1997. The Dynamic Window Approach to Collision Avoidance. IEEE Robotics & Automation Magazine 4, 137–146.

Gang, L., Wang, Y., Sun, Y., Zhou, L., Zhang, M., 2016. Estimation of vessel collision risk index based on support vector machine. Advances in Mechanical Engineering 8, 1–10. doi:10.1177/1687814016671250.

Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. The MIT Press.

Goodwin, E.M., 1978. Marine encounter rates. Journal of Navigation 31, 357–369. doi:10.1017/S0373463300041904.

Granstrom, K., Baum, M., Reuter, S., 2016. Extended object tracking: Introduction, overview and applications. arXiv:1604.00970.

Haarnoja, T., Zhou, A., Abbeel, P., Levine, S., 2017. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290. https://arxiv.org/abs/1801.01290. Accessed: 2021-07-07.

Hart, P.E., Nilsson, N.J., Raphael, B., 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. Journal of the Society for Industrial and Applied Mathematics 9, 514–532. doi:10.1137/0109044.

International Maritime Organization, 1972. COLREGS - International Regulations for Preventing Collisions at Sea. URL: http://www.imo.org/en/About/Conventions/ListOfConventions/Pages/COLREG.aspx.

International Maritime Organization (IMO), 2020. AIS Transponders. http://www.imo.org/en/OurWork/Safety/Navigation/Pages/AIS.aspx. Accessed: 2019-09-16.

Kim, D.G., Hirayama, K., Okimoto, T., 2015. Ship Collision Avoidance by Distributed Tabu Search. TransNav, the International Journal on Marine Navigation and Safety of Sea Transportation 9, 23–29. doi:10.12716/1001.09.01.03.

KONGSBERG, 2020. Autonomous ship project, key facts about Yara Birkeland. https://www.kongsberg.com/no/maritime/support/themes/autonomous-ship-project-key-facts-about-yara-birkeland/. Accessed: 2020-05-27.

Kuwata, Y., Wolf, M.T., Zarzhitsky, D., Huntsberger, T.L., 2014. Safe maritime autonomous navigation with COLREGS, using velocity obstacles. IEEE Journal of Oceanic Engineering 39, 110–119. doi:10.1109/JOE.2013.2254214.

Larsen, T.N., Teigen, H.Ø., Laache, T., Varagnolo, D., Rasheed, A., 2021. Comparing deep reinforcement learning algorithms’ ability to safely navigate challenging waters. Frontiers in Robotics and AI 8, 287. doi:10.3389/frobt.2021.738113.

Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement learning. arXiv:1509.02971.

Lin, B., Huang, C.H., 2006. Comparison between ARPA radar and AIS characteristics for vessel traffic services. Journal of Marine Science and Technology 14, 182–189.

Loe, Ø.A.G., 2008. Collision Avoidance for Unmanned Surface Vehicles. Ph.D. thesis. Norwegian University of Science and Technology. URL: http://ntnu.diva-portal.org/smash/record.jsf?pid=diva2:347606.

Martinsen, A.B., 2018. End-to-end training for path following and control of marine vehicles.

Martinsen, A.B., Lekkas, A.M., 2019. Curved Path Following with Deep Reinforcement Learning: Results from Three Vessel Models. OCEANS 2018 MTS/IEEE Charleston, OCEAN 2018. doi:10.1109/OCEANS.2018.8604829.

Meyer, E., 2020. On Course Towards Model-Free Guidance: A Self-Learning Approach To Dynamic Collision Avoidance for Autonomous Surface Vehicles. Master’s thesis. Norwegian University of Science and Technology.

Meyer, E., Heiberg, A., Rasheed, A., San, O., 2020a. COLREG-Compliant Collision Avoidance for Unmanned Surface Vehicle using Deep Reinforcement Learning. arXiv e-prints. arXiv:2006.09540.

Meyer, E., Robinson, H., Rasheed, A., San, O., 2020b. Taming an autonomous surface vehicle for path following and collision avoidance using deep reinforcement learning. IEEE Access 8, 41466–41481.

Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K., 2016. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783.

Organization, I.M., 1972. COLREGS - International Regulations for Preventing Collisions at Sea. https://www.imo.org/en/About/Conventions/Pages/COLREG.aspx. Accessed: 2021-11-15.

Park, G.K., Benedictos, J.L.R.M., Shin, S.C., Nam-Kyun, I., 2006. Design of a Fuzzy-CBR Support System for Ship’s Collision Avoidance. SCIS&ISIS2006, 240–248.

Rawson, A., Rogers, E., Foster, D., Phillips, D., 2014. Practical application of domain analysis: Port of London case study. Journal of Navigation 67, 193–209. doi:10.1017/S0373463313000684.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv:1707.06347.

Serigstad, E., Eriksen, B.O.H., Breivik, M., 2018. Hybrid Collision Avoidance for Autonomous Surface Vehicles. IFAC-PapersOnLine 51, 1–7. doi:10.1016/j.ifacol.2018.09.460.

Siciliano, B., Khatib, O., 2008. Springer Handbook of Robotics. volume 1. Springer.

Silver, D., Singh, S., Precup, D., Sutton, R.S., 2021. Reward is enough. Artificial Intelligence 299, 103535. doi:10.1016/j.artint.2021.103535.

Skjetne, R., Smogeli, Ø.N., Fossen, T.I., 2004a. Modeling, identification, and adaptive maneuvering of CyberShip II: A complete design with experiments. IFAC Proceedings Volumes 37, 203–208. IFAC Conference on Computer Applications in Marine Systems - CAMS 2004, Ancona, Italy, 7-9 July 2004.

Skjetne, R., Smogeli, Ø.N., Fossen, T.I., 2004b. A nonlinear ship manoeuvering model: Identification and adaptive control with experiments for a model ship.

Skredderberget, A., 2018. The first ever zero emission, autonomous ship. https://www.yara.com/knowledge-grows/game-changer-for-the-environment/. Accessed: 2019-11-12.

Smierzchalski, R., 1999. Evolutionary trajectory planning of ships in navigation traffic areas. Journal of Marine Science and Technology 4, 1–6. doi:10.1007/s007730050001.

SNAME, The Society of Naval Architecture and Marine Engineers, 1950. Nomenclature for treating the motion of a submerged body through a fluid.

Soloperto, R., Köhler, J., Allgöwer, F., Müller, M.A., 2019. Collision avoidance for uncertain nonlinear systems with moving obstacles using robust Model Predictive Control. 18th European Control Conference (ECC), 811–817. doi:10.23919/ecc.2019.8796049.

Statheros, T., Howells, G., McDonald-Maier, K., 2008. Autonomous ship collision avoidance navigation concepts, technologies and techniques. Journal of Navigation 61, 129–142. doi:10.1017/S037346330700447X.

Svec, P., Shah, B.C., Bertaska, I.R., Alvarez, J., Sinisterra, A.J., Von Ellenrieder, K., Dhanak, M., Gupta, S.K., 2013. Dynamics-aware target following for an autonomous surface vehicle operating under COLREGs in civilian traffic. IEEE International Conference on Intelligent Robots and Systems, 3871–3878. doi:10.1109/IROS.2013.6696910.

Szlapczynski, R., Szlapczynska, J., 2017. Review of ship safety domains: Models and applications. Ocean Engineering 145, 277–289. doi:10.1016/j.oceaneng.2017.09.020.

Sørensen, M.E.N., Breivik, M., Eriksen, B.H., 2017. A ship heading and speed control concept inherently satisfying actuator constraints, in: 2017 IEEE Conference on Control Technology and Applications (CCTA), pp. 323–330.

Tai, L., Zhang, J., Liu, M., Boedecker, J., Burgard, W., 2016. A survey of deep network solutions for learning control in robotics: From reinforcement to imitation. arXiv:1612.07139.

Tam, C.K., Bucknall, R., Greig, A., 2009. Review of collision avoidance and path planning methods for ships in close range encounters. Journal of Navigation 62, 455–476. doi:10.1017/S0373463308005134.

Thomas, P., Morris, A., Talbot, R., Fagerlind, H., 2013. Identifying the causes of road crashes in Europe. Annals of Advances in Automotive Medicine 57, 13–22.

Vallestad, I.J., 2019. Path Following and Collision Avoidance for Marine Vessels with Deep Reinforcement Learning. Master’s thesis. Norwegian University of Science and Technology.

Wang, Y., Chin, H.C., 2016. An Empirically-Calibrated Ship Domain as a Safety Criterion for Navigation in Confined Waters. Journal of Navigation 69, 257–276. doi:10.1017/S0373463315000533.

Woernner, K., 2014. COLREGs-Compliant Autonomous Collision Avoidance Using Multi-Objective Optimization with Interval Programming. Naval Engineers Journal 126, 64–65. doi:10.21236/ada609415.

Xu, Q., Wang, N., 2014. A survey on ship collision risk evaluation. Promet - Traffic&Transportation 26, 475–486. doi:10.7307/ptt.v26i6.1386.

Xu, Q., Zhang, C., Zhang, L., 2017. Deep convolutional neural network based unmanned surface vehicle maneuvering. Proceedings, 2017 Chinese Automation Congress (CAC), 878–881. doi:10.1109/CAC.2017.8242889.

Yan, Q., 2002. A Model for Estimating the Risk Degrees of Collisions. Journal of Wuhan University of Technology (Transportation Science & Engineering) 26.

Zhao, L., Roh, M.I., 2019. COLREGs-compliant multiship collision avoidance based on deep reinforcement learning. Ocean Engineering. doi:10.1016/j.oceaneng.2019.106436.
