Reinforcement Learning with PPO for Ship Collision Avoidance

Environment Setup for Collision Avoidance Learning

The learning environment is designed to simulate maritime navigation, featuring target ships, a designated waypoint, and a target area representing open sea. In this setup, we assume an unobstructed open sea, devoid of coastlines or buoys, to focus purely on ship interactions. The primary objective for the own ship, acting as a controllable agent, is to reach a predefined waypoint while effectively avoiding collisions with target ships. These target ships are strategically placed within the target area to create various encounter scenarios.

Ship motion within the simulation is governed by Nomoto’s equation [29] for heading dynamics and a first-order delay equation for rudder motion, as detailed in Eq. 8. The coordinate system for ship motion is visualized in Fig. 8.

$$\begin{aligned} \begin{bmatrix} \dot{\psi} \\ \dot{r} \\ \dot{\delta} \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & -1/T & K/T \\ 0 & 0 & -1/T_E \end{bmatrix} \begin{bmatrix} \psi \\ r \\ \delta \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1/T_E \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ \delta_C \end{bmatrix}, \end{aligned}$$

(8)

Here, $\psi$ represents the heading angle, $r$ is the rate of turn, and $\delta$ is the rudder angle, with $\delta_C$ being the command rudder angle. $T$ and $T_E$ are time constants for heading and rudder motion, and $K$ is the gain. For simplicity and consistency, both the own ship and the target ships are configured with identical motion parameters, simulating similar cargo ship types. Ship speeds remain constant throughout the simulation for all vessels. The Runge–Kutta method is employed for numerical integration of the motion equations.
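
As a concrete illustration, the following Python sketch integrates Eq. 8 with a classical fourth-order Runge–Kutta step. The coefficient values and the step size are placeholders for illustration, not the parameters of Table 1.

```python
import numpy as np

def ship_derivative(state, delta_c, K, T, T_E):
    """Right-hand side of Eq. 8; state = [psi, r, delta] in rad, rad/s, rad."""
    psi, r, delta = state
    dpsi = r                            # heading changes with the rate of turn
    dr = (-r + K * delta) / T           # Nomoto first-order heading dynamics
    ddelta = (delta_c - delta) / T_E    # first-order rudder delay
    return np.array([dpsi, dr, ddelta])

def rk4_step(state, delta_c, dt, K, T, T_E):
    """One classical Runge-Kutta (RK4) step of the motion equations."""
    k1 = ship_derivative(state, delta_c, K, T, T_E)
    k2 = ship_derivative(state + 0.5 * dt * k1, delta_c, K, T, T_E)
    k3 = ship_derivative(state + 0.5 * dt * k2, delta_c, K, T, T_E)
    k4 = ship_derivative(state + dt * k3, delta_c, K, T, T_E)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: 10 s of simulation at 1 s steps under a fixed command rudder angle.
state = np.zeros(3)             # [psi, r, delta]
K, T, T_E = 0.1, 50.0, 2.5      # illustrative coefficients only
for _ in range(10):
    state = rk4_step(state, np.deg2rad(10.0), 1.0, K, T, T_E)
```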

Two distinct control methods are utilized to train the reinforcement learning model within continuous action spaces: a rudder control model, which directly outputs command rudder angles, and an autopilot model, which outputs command heading angles. The rudder control model adjusts the command rudder angle within a range of -10° to +10° based on the current policy. The autopilot model, on the other hand, selects changes in the command heading angle, also within a -10° to +10° range, at each time step according to the policy. Table 1 provides a summary of vessel parameters, which are uniform across all ships in the simulation.
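
As a small sketch of how these two modes could be wired to a policy, the snippet below assumes the policy emits a normalized action in [-1, 1]; only the ±10° bounds come from the text, the mapping itself is an assumption.

```python
import numpy as np

MAX_CMD_DEG = 10.0  # +/-10 deg bound stated for both control models

def rudder_command(action):
    """Rudder control model: normalized policy action -> command rudder angle [deg]."""
    return float(np.clip(action, -1.0, 1.0)) * MAX_CMD_DEG

def autopilot_command(action, cmd_heading_deg):
    """Autopilot model: normalized policy action -> change in command heading [deg],
    accumulated onto the current command heading (an assumed convention)."""
    return cmd_heading_deg + float(np.clip(action, -1.0, 1.0)) * MAX_CMD_DEG
```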

The Proximal Policy Optimization (PPO) algorithm's policy and value networks receive a state vector from the environment. This state vector is composed of: (1) Obstacle Zone by Target (OZT) information derived from grid sensor detections, (2) normalized values for the own ship's heading angle, rate of turn, speed, and rudder angle, and (3) normalized values for the azimuth angle and distance to the waypoint, along with the command rudder angle from the autopilot toward the waypoint. A key feature of this approach is that target-ship dynamics are inferred solely from grid sensor detection results: the algorithm does not directly use dynamic information of the target ships or absolute position information of the own ship, which simplifies the input and keeps all observations relative. To reduce computational cost, grid sensor detections and command rudder angle updates are performed every 10 s of simulation time, while the motion equations are integrated at 1 s intervals. Further details of the learning environment settings are given in Table 2. The grid sensor is designed to mimic AIS data, with a range based on the practical communication range of shipborne AIS, approximately 12 nautical miles [30]. The environment is implemented in Python with OpenAI Gym, a standard platform for reinforcement learning development, ensuring compatibility and ease of integration with other deep reinforcement learning (DRL) algorithms.
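
A minimal skeleton of such an environment under the classic OpenAI Gym API is sketched below, reusing `rk4_step` and the coefficients from the dynamics sketch above; the grid resolution, observation sizes, and placeholder methods are assumptions for illustration only.

```python
import numpy as np
import gym
from gym import spaces

class CollisionAvoidanceEnvSketch(gym.Env):
    """Sketch of the environment: 10 s action interval, 1 s motion integration."""

    def __init__(self, grid_shape=(64, 64)):
        super().__init__()
        # Continuous action: normalized command rudder angle (or heading change).
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        # Observation: grid-sensor image plus normalized numerical state
        # (heading, rate of turn, speed, rudder angle, waypoint azimuth and
        # distance, and the autopilot command rudder angle toward the waypoint).
        self.observation_space = spaces.Dict({
            "grid": spaces.Box(0.0, 1.0, shape=grid_shape, dtype=np.float32),
            "numeric": spaces.Box(-1.0, 1.0, shape=(7,), dtype=np.float32),
        })
        self.grid_shape = grid_shape
        self.state = np.zeros(3)  # own ship [psi, r, delta]

    def _observe(self):
        # Placeholder: a full implementation would rasterize OZT detections
        # into the grid and normalize the own-ship and waypoint quantities.
        return {"grid": np.zeros(self.grid_shape, dtype=np.float32),
                "numeric": np.zeros(7, dtype=np.float32)}

    def reset(self):
        self.state = np.zeros(3)          # re-initialize a training scenario here
        return self._observe()

    def step(self, action):
        delta_c = float(action[0]) * np.deg2rad(10.0)
        for _ in range(10):               # 10 x 1 s integration per action update
            self.state = rk4_step(self.state, delta_c, 1.0, K, T, T_E)
        reward, done = 0.0, False         # Costs of Eqs. 9-12 would be added here
        return self._observe(), reward, done, {}
```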

Fig. 8: Illustration of the coordinate system used to model ship motion in the reinforcement learning environment for collision avoidance.

Table 1: Subjects of ships for learning

Table 2: Configurations of the environment

Reward System Design

The reward system is crucial in guiding the learning process and is structured using two types of rewards: basic rewards, applied at every time step, and achievement rewards, granted at the end of each episode. An episode spans from the simulation start to its termination, which occurs either when the own ship comes within a specified distance of the waypoint or when the simulation reaches a predetermined maximum number of steps.

Basic rewards, termed Costs, are calculated using Eqs. 9–12. The waypoint cost $\mathrm{Costs}_\mathrm{wp}$ in Eq. 10 provides a positive reward that grows as the own ship approaches the waypoint, encouraging goal-directed navigation. To incorporate the maritime collision regulations (COLREGs), a small positive reward $\mathrm{Costs}_\mathrm{starboard}$ is given when the own ship navigates on the starboard side of the line connecting its starting position to the waypoint, subtly guiding the agent toward starboard-side avoidance maneuvers. A stability cost $\mathrm{Costs}_\mathrm{stable}$ penalizes excessive turning rates to promote stable heading control by the trained models. Importantly, instead of terminating an episode immediately upon collision, a penalty of -5 is applied, allowing the agent to keep learning from near-collision states without prematurely ending the learning sequence.

$$\begin{aligned} \mathrm{Costs} = \mathrm{Costs}_\mathrm{wp} + \mathrm{Costs}_\mathrm{starboard} + \mathrm{Costs}_\mathrm{stable}, \end{aligned}$$

(9)

$$\begin{aligned} \mathrm{Costs}_\mathrm{wp} = 0.9\tanh(1/d_\mathrm{wp}), \end{aligned}$$

(10)

$$\begin{aligned} \mathrm{Costs}_\mathrm{starboard} = \begin{cases} 0.05, & \mathrm{Az}_\mathrm{wp} \ge 0 \\ 0.0, & \mathrm{Az}_\mathrm{wp} < 0, \end{cases} \end{aligned}$$

(11)

$$\begin{aligned} \mathrm{Costs}_\mathrm{stable} = -0.01\,|r/\pi|, \end{aligned}$$

(12)

Here, $d_\mathrm{wp}$ and $\mathrm{Az}_\mathrm{wp}$ are the distance and azimuth to the waypoint from the own ship. Achievement rewards are set based on episode outcomes: -50 for deviating from the target area, -50 for a collision, and +50 for reaching the waypoint within the specified distance without any collision. The magnitudes of these achievement rewards were determined through preliminary learning experiments so that the subtle influence of $\mathrm{Costs}_\mathrm{starboard}$ is preserved while collision avoidance behavior is still effectively encouraged.
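
A direct transcription of Eqs. 9–12 and of the achievement rewards into Python could look as follows; the per-step collision penalty of -5 is included as an argument, and the azimuth sign convention (non-negative meaning the starboard side of the start-waypoint line) is an assumption.

```python
import numpy as np

def basic_reward(d_wp, az_wp, r, in_collision=False):
    """Per-step Costs of Eqs. 9-12.

    d_wp : distance from the own ship to the waypoint
    az_wp: azimuth to the waypoint (>= 0 taken as the starboard side)
    r    : rate of turn [rad/s]
    """
    costs_wp = 0.9 * np.tanh(1.0 / d_wp)               # Eq. 10
    costs_starboard = 0.05 if az_wp >= 0.0 else 0.0    # Eq. 11
    costs_stable = -0.01 * abs(r / np.pi)              # Eq. 12
    total = costs_wp + costs_starboard + costs_stable  # Eq. 9
    if in_collision:
        total -= 5.0   # penalty applied instead of terminating the episode
    return total

def achievement_reward(reached_waypoint, collided, left_area):
    """End-of-episode rewards described in the text."""
    if left_area or collided:
        return -50.0
    return 50.0 if reached_waypoint else 0.0
```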

Network Structures and Update Methodology

The policy and value functions in the PPO algorithm are realized using deep neural networks. Previous research using discrete action spaces and purely convolutional and fully-connected (FC) networks [17] showed limitations in achieving satisfactory performance. A potential reason is the inability of these network architectures to effectively process temporal dependencies in the environment. To address this, Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, are introduced to handle time-series data inherent in navigation tasks.

Given the diverse nature of the input states (grid sensor detection results, own-ship dynamics, and waypoint information), a segregated input-processing approach is adopted. Grid sensor data, which resembles image data, is processed through convolutional layers, while numerical data such as ship dynamics and waypoint information is fed directly into fully connected layers. This separation leverages the strength of convolutional layers in spatial feature extraction and of fully connected layers in processing low-dimensional numerical data. The outputs of these parallel pathways are then merged into a unified network, as illustrated in Fig. 9. For continuous-action-space learning, the network structure, also shown in Fig. 9, differs from the discrete-action-space networks primarily in the output layers and in the inclusion of an LSTM cell before the output layer to capture temporal dynamics. In this implementation, the policy and value networks do not share parameters. Specifically, for the rudder control model, the policy network uses two convolutional layers, while the value network uses a single convolutional layer. Network updates are performed with the Adam optimizer [31]. Hyperparameters specific to PPO in continuous action spaces are listed in Table 3, while hyperparameters for the earlier discrete-action-space model are available in reference [17].
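
The PyTorch sketch below mirrors the described policy-network layout for the rudder control model (two convolutional layers on the grid-sensor input, fully connected layers on the numerical input, a merge, and an LSTM cell before a Gaussian action head); the layer widths, kernel sizes, and grid resolution are illustrative choices, not the values used in the paper.

```python
import torch
import torch.nn as nn

class PolicyNetworkSketch(nn.Module):
    """Conv branch (grid sensor) + FC branch (numerics) -> merge -> LSTM -> Gaussian head."""

    def __init__(self, grid_size=64, numeric_dim=7, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(                        # spatial features from the OZT grid
            nn.Conv2d(1, 8, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = self.conv(torch.zeros(1, 1, grid_size, grid_size)).shape[1]
        self.fc_numeric = nn.Sequential(nn.Linear(numeric_dim, 32), nn.ReLU())
        self.merge = nn.Sequential(nn.Linear(conv_out + 32, hidden), nn.ReLU())
        self.lstm = nn.LSTMCell(hidden, hidden)           # temporal dependencies
        self.mu = nn.Linear(hidden, 1)                    # mean of the continuous action
        self.log_std = nn.Parameter(torch.zeros(1))       # state-independent std

    def forward(self, grid, numeric, hc):
        z = torch.cat([self.conv(grid), self.fc_numeric(numeric)], dim=-1)
        h, c = self.lstm(self.merge(z), hc)
        return torch.tanh(self.mu(h)), self.log_std.exp(), (h, c)

# One forward pass with a batch of one (hidden state carried between steps).
net = PolicyNetworkSketch()
hc = (torch.zeros(1, 128), torch.zeros(1, 128))
mu, std, hc = net(torch.zeros(1, 1, 64, 64), torch.zeros(1, 7), hc)
```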

Fig. 9: Network architectures used in PPO for continuous action spaces (rudder and autopilot control) and for the discrete action space described in prior work. Note the LSTM layers for temporal data processing in the continuous-action models.

Scenario Design for Training

The scenarios employed during the learning phase strongly influence the effectiveness of the trained collision avoidance model. An ideal scenario set should range from simple one-on-one encounters to complex multi-ship situations. For collision avoidance system testing, Woerner et al. proposed a scenario set [32], and Cai and Hasegawa developed an evaluation method using the Imazu problem as a benchmark [33]. The Imazu problem [19], illustrated in Fig. 10, consists of both basic and complex ship encounter scenarios and is used in this study to train the reinforcement learning model.

Figure 10 visualizes the Imazu problem scenarios; numbers in the top left of each box indicate the scenario case number. Velocity vectors are represented by short bars emanating from triangles (own ship) and circles (target ships). Cai and Hasegawa’s research indicates that collision risk is reduced when target ships are allowed to perform avoidance maneuvers. However, in this study, target ships maintain a straight course without any avoidance actions or waypoint navigation, simplifying the learning environment and focusing the learning on the own ship’s behavior. To enhance the generalization of the learned model, a randomized scenario with three randomly positioned ships is added to the 22 cases of the Imazu problem, resulting in a total of 23 training scenarios.

In each scenario, initial positions and courses are set such that ships would collide at the origin if no avoidance action is taken. During training, one of these 23 scenarios is randomly selected at the start of each episode. Target ships are positioned according to the pre-configured setups for each case. The Time to Closest Point of Approach (TCPA) for each target ship is initially set to 30 minutes. The own ship starts at coordinates (X [NM], Y [NM]) = (-6.0, 0.0), with its heading angle randomly initialized within a range of -5° to +5° at the beginning of each episode to promote generalizability. Appendix 1 provides detailed initial positions and heading angles for target ships in each Imazu problem case.
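
The episode initialization described above might be coded as below; the structure of the `scenarios` table stands in for the Imazu-problem setups listed in Appendix 1 and is purely illustrative.

```python
import numpy as np

def init_episode(rng, scenarios):
    """Pick one of the 23 training scenarios and set the initial own-ship state.

    `scenarios` is assumed to be a list of cases (Imazu cases 1-22 plus one
    randomized three-ship case), each holding target-ship initial positions and
    headings chosen so that all ships would meet at the origin (TCPA = 30 min)
    if no avoidance action were taken.
    """
    case = scenarios[rng.integers(len(scenarios))]   # random scenario per episode
    own_position = np.array([-6.0, 0.0])             # (X, Y) in NM
    own_heading_deg = rng.uniform(-5.0, 5.0)         # randomized initial heading
    return own_position, own_heading_deg, case

# Usage with a seeded NumPy generator and a placeholder scenario table.
rng = np.random.default_rng(0)
scenarios = [[{"position": (6.0, 0.0), "heading_deg": 180.0}]]  # illustrative only
pos, heading, case = init_episode(rng, scenarios)
```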

Fig. 10: Illustration of the Imazu problem scenarios used for training the reinforcement learning model, encompassing various ship encounter configurations from simple to complex.
