Please use a version of Python older than 3.13; 3.12.6 and 3.10.11 worked for me (3.11 would probably work too).
I have created a requirements.txt; installing from it should give you everything you need.
models/ ➡ contains all the trained models
models/best ➡ contains the best-performing models saved during training
models/evals ➡ contains the models evaluated by the script src/eval/evaluate.py
src/eval ➡ contains the scripts to evaluate the models
src/eval/results ➡ contains screenshots of the evaluation to show what it produces
src/run ➡ contains the scripts to run the trained models
src/train ➡ contains the scripts to train the models
src/train/logs ➡ contains all the TensorBoard logs of the trained models
src/train/old ➡ contains old training scripts used during development
src/utils ➡ contains various convenience scripts for the developer (e.g. a script to check whether you have GPU support)
- The aim of this project is to develop a deep reinforcement learning (DRL) agent capable of autonomous driving using the HighwayEnv environment and a custom reward function.
This project customizes the HighwayEnv environment to develop a deep reinforcement learning (DRL) agent for autonomous driving. The primary enhancement is a custom reward function designed to encourage safe, efficient, and smooth driving behaviors.
- Custom Reward Function:
- Lane Change Penalty: Discourages unnecessary lane changes to promote smoother driving by penalizing each change.
- Safe Following Distance: Rewards maintaining a safe distance from other vehicles and penalizes being too close.
- Combined Reward Components: Integrates existing and new rewards such as collision penalties, high-speed rewards, right-lane incentives, and traffic-aware adjustments.
- Normalization: Scales rewards dynamically to maintain consistent values across varying situations.
- Closest Vehicle Distance Calculation:
- A helper method computes the distance to the nearest vehicle in the same lane, ensuring rewards reflect traffic awareness.
- Custom Environment Registration:
- Registers the custom environment using gymnasium, enabling seamless integration and experimentation (see the registration sketch below).
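As a rough illustration, registering the environment could look like the sketch below; the id matches the one used in this project, but the entry-point class name is an assumption rather than the project's exact code:

```python
# Illustrative sketch only: the class name CustomHighwayEnv is assumed, not taken
# from the project's source.
from gymnasium.envs.registration import register

register(
    id="custom-highway-env-v0",
    entry_point="highway_env.envs.custom_highway_env_v0:CustomHighwayEnv",
)
```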
- A2C (Advantage Actor-Critic): Balances policy and value optimization by leveraging advantage estimates for improved performance.
- PPO (Proximal Policy Optimization): Ensures stable learning through policy clipping, allowing for efficient and reliable updates.
- DQN (Deep Q-Network): Utilizes experience replay and target networks to effectively manage discrete action spaces.
- TD3 (Twin Delayed Deep Deterministic Policy Gradient): Enhances performance in continuous action spaces by addressing overestimation bias and introducing delayed policy updates.
- TRPO (Trust Region Policy Optimization): Provides guarantees on the policy updates to maintain a trust region, facilitating safer and more effective learning.
- SAC (Soft Actor-Critic): Combines off-policy learning with maximum entropy reinforcement learning for robust exploration in continuous action spaces.
env.unwrapped.config["lanes_count"] = 4
changing the lanes of the envenv.unwrapped.config["duration"] = 60
making the duration longerenv.unwrapped.config["vehicles_density"] = 1.6
each lane has more carsenv.unwrapped.config["vehicle_count"] = 70
adding more cars to the env
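For context, here is a minimal sketch of how such config tweaks could be combined with training one of the models above using Stable-Baselines3 (PPO as an example); the hyperparameters, log directory, and save path are illustrative assumptions, not the exact values from src/train/train_ppo_highway.py:

```python
# Illustrative training sketch; hyperparameters and paths are assumptions.
import gymnasium
import highway_env  # noqa: F401  (importing highway_env registers its environments)
from stable_baselines3 import PPO

env = gymnasium.make("highway-fast-v0", render_mode="rgb_array")
env.unwrapped.config["lanes_count"] = 4       # apply the overrides shown above
env.unwrapped.config["duration"] = 60
env.unwrapped.config["vehicles_density"] = 1.6
env.reset()  # re-create the scene so the updated config takes effect

model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="logs/ppo/")
model.learn(total_timesteps=100_000)
model.save("../../models/highway_ppo_default_fast")  # hypothetical save path
```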
To see video demos of the models, go to the videos folder.
- reward functions:
  highway_env/envs/custom_highway_env_v0.py ➡ (MAIN) the best-performing one; we used this for training and evaluation
  highway_env/envs/custom_high_env_v1_antonio.py ➡ previous custom environment; didn't perform quite right with PPO
  highway_env/envs/custom_high_env_v1_lukas.py ➡ previous custom environment; didn't perform quite right with PPO
- src/train ➡ contains all the scripts to train the models:
  train_a2c_highway.py ➡ trains the A2C model
  train_dqn_highway.py ➡ trains the DQN model
  train_ppo_highway.py ➡ trains the PPO model
  train_sac_highway.py ➡ trains the SAC model
  train_td3_highway.py ➡ trains the TD3 model
  train_trpo_highway.py ➡ trains the TRPO model
- src/run ➡ contains all the scripts to run the models:
  run_a2c_highway.py ➡ runs the trained A2C model
  run_dqn_highway.py ➡ runs the trained DQN model
  run_ppo_highway.py ➡ runs the trained PPO model
  run_sac_highway.py ➡ runs the trained SAC model
  run_td3_highway.py ➡ runs the trained TD3 model
  run_trpo_highway.py ➡ runs the trained TRPO model
- src/eval/evaluate.py ➡ evaluates all the models in both the default environment (highway-fast-v0) and the custom one.
We have implemented an evaluation callback with a separate environment that saves the best model found during training, evaluating every 2,000 timesteps.
The saved model's name ends with _best.
name_env = "highway-fast-v0"
eval_env = gymnasium.make(name_env, render_mode='rgb_array') # Separate eval environment
eval_env = Monitor(eval_env) # Add the Monitor wrapper for tracking statistics
....
# Callback for saving the best model
eval_callback = EvalCallback(
eval_env,
best_model_save_path=f"../../models/{name_model}_best",
log_path=f"logs/trpo/{name_model}/",
eval_freq=2_000, # Evaluate every 2,000 timesteps
deterministic=True, # Use deterministic actions for evaluation
render=False
)
....
# Train the model with same timesteps as other models
model.learn(total_timesteps=int(100_000), callback=eval_callback)
In the training scripts for SAC and TD3 we experimented with continuous and discrete action spaces, using a wrapper that maps continuous actions onto the environment's discrete actions:
import numpy as np
import gymnasium
from gymnasium import spaces

class DiscreteToBoxWrapper(gymnasium.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)
        n_actions = env.action_space.n
        self.action_space = spaces.Box(low=-1, high=1, shape=(1,), dtype=np.float32)
        self._n_actions = n_actions

    def action(self, action):
        # Convert the continuous action to a discrete one:
        # map [-1, 1] to [0, n_actions - 1]
        action = np.clip(action, self.action_space.low, self.action_space.high)
        scaled_action = (action + 1) * (self._n_actions - 1) / 2
        discrete_action = int(np.round(scaled_action.item()))
        discrete_action = np.clip(discrete_action, 0, self._n_actions - 1)
        return discrete_action
## custom environment
#name_model = "highway_sac_custom_2"
#name_env = "custom-highway-env-v0"
## default environment
name_model = "highway_sac_default_fast"
name_env = "highway-fast-v0"
# Create the environment and wrap it
env = gymnasium.make(name_env, render_mode='rgb_array')
env = DiscreteToBoxWrapper(env)
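With the wrapper in place, an off-policy model with a continuous action space can be trained on the wrapped environment. A minimal sketch of how this could continue (the hyperparameters and save path are illustrative, not the exact values from src/train/train_sac_highway.py):

```python
# Illustrative continuation; hyperparameters and paths are assumptions.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    env,  # the DiscreteToBoxWrapper-wrapped environment created above
    verbose=1,
    tensorboard_log=f"logs/sac/{name_model}/",
)
model.learn(total_timesteps=100_000)
model.save(f"../../models/{name_model}")
```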
You can find it here: highway_env/envs/custom_highway_env_v0.py
The reward function combines multiple elements:
- Penalizes collisions and unnecessary lane changes.
- Rewards high-speed driving, staying in the right lane, and maintaining safe distances.
- Dynamically adjusts reward scaling through normalization if configured.
Tracks the agent’s current lane and penalizes switches to encourage stability.
Calculates and rewards safe spacing from other vehicles to enhance traffic safety.
Registers the customized environment as custom-highway-env-v0, making it easily accessible for training and evaluation.
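A heavily simplified sketch of that structure is shown below; the class name, coefficients, thresholds, and action indices are illustrative assumptions, and the real implementation in highway_env/envs/custom_highway_env_v0.py combines more terms plus optional normalization:

```python
# Simplified, illustrative sketch of the custom reward logic; names and values
# are assumptions, not the project's exact implementation.
import numpy as np
from highway_env.envs.highway_env import HighwayEnv


class CustomHighwayEnv(HighwayEnv):
    def _closest_vehicle_distance(self) -> float:
        """Distance to the nearest other vehicle in the ego vehicle's lane."""
        distances = [
            np.linalg.norm(other.position - self.vehicle.position)
            for other in self.road.vehicles
            if other is not self.vehicle and other.lane_index == self.vehicle.lane_index
        ]
        return min(distances) if distances else np.inf

    def _reward(self, action) -> float:
        # Start from the built-in terms (collision penalty, high speed, right lane).
        reward = super()._reward(action)
        # Penalize lane changes (indices 0 and 2 in the discrete meta-action space).
        if action in (0, 2):
            reward -= 0.1
        # Reward a safe following distance, penalize tailgating.
        reward += 0.1 if self._closest_vehicle_distance() > 25.0 else -0.1
        return reward
```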
highway-fast-v0 vs custom-highway-env-v0
- Custom Environments: Models like SAC, TD3, and TRPO performed exceptionally well, achieving mean episode lengths around 40, indicating effective learning and control.
- Fast Environments: All models showed reduced performance, with lower episode lengths compared to their custom counterparts, suggesting that the fast environment parameters may increase difficulty or reduce the time available for training.
This graph shows the overall episode length during training; we can see that DQN with the custom environment stays alive longer than with the fast default one.
SAC achieved the highest mean reward in both custom and fast environments, with values of approximately 27 and 21 respectively, suggesting it effectively balances exploration and exploitation.
DQN and PPO also performed well, particularly in the custom environments, but their rewards dropped significantly in the fast environments, indicating sensitivity to the environment's dynamics.
Looking at the overall rewards during training, we see a big drop around 20K timesteps for DQN; the custom environment seems more stable in this case.
A2C also tends to perform much better in the custom environment than in the default one here.
The exploration rate for DQN was consistent across environments. This indicates that the model maintained its exploration strategy, which is crucial for learning in different settings.
The value loss for A2C in the custom environment was significantly lower than in the fast environment, indicating better approximation of value functions during training.
PPO and SAC showed consistent losses, with SAC having a notably lower actor loss in the custom environment, which is indicative of more stable training.
The learning rates were consistent across runs for each model type.
Higher entropy losses in A2C for the fast environment indicate that it may struggle with exploration in this configuration, leading to reduced policy diversity.
PPO showed relatively low policy losses, suggesting a stable policy update process.
This comparison is based on what we can see in TensorBoard; in reality, high rewards and long episodes do not always translate into good driving in the highway environment when you watch it in person (sometimes the high reward/episode time is just the model stalling for time to collect reward).
Performed moderately well, especially in custom environments, but struggled in the fast environment both in terms of episode length and rewards.
The custom environment yielded better results compared to the fast one, indicating it may require more tuning for environments with different dynamics.
Demonstrated solid performance across both environments, particularly excelling in custom settings. It maintained stability in training, as evidenced by low policy and value losses.
The standout performer, SAC showed resilience and effectiveness in both environments, suggesting it is well-suited for the highway tasks given its high rewards and episode lengths.
Similar to SAC, TD3 performed well but slightly lagged behind in rewards compared to SAC, indicating it might benefit from further exploration tuning.
Consistent performance in custom environments, but like others, faced challenges in the fast environment. Its ability to maintain line search success is a positive sign for stability.
highway-v0 vs custom-highway-env-v0
- A2C: The default A2C achieved an average episode length of approximately 30.41, which is slightly better than the custom environment's 28.07.
- DQN: The default DQN maintained a length of around 35.20, showing solid performance, though slightly lower than the custom setup (37.00).
- PPO: Performance decreased slightly in the default environment (28.85) compared to the custom version (34.16).
- SAC: SAC's performance was stable, with episode lengths of around 39.23 in the default versus 40 in the custom.
- TD3: Similar to SAC, TD3 also maintained strong performance, with episode lengths of around 40 in the custom environment and only marginally lower in the default.
- TRPO: TRPO performed consistently well, with both environments showing around 40 episode lengths.
- A2C: Mean rewards improved in the default environment (22.81) compared to the custom (19.57).
- DQN: Default DQN yielded a mean reward of 25.64, which is higher than the custom (24.22), indicating effective policy learning.
- PPO: The mean reward showed a decrease in the default environment (22.47) compared to the custom (24.54).
- SAC: SAC's mean reward in the default environment (27.71) was slightly higher than in the custom (26.37).
- TD3: The mean reward in the default environment (27.92) was slightly better than the custom (25.86).
- TRPO: TRPO showed consistent performance with a mean reward of 27.83 in the default environment compared to 26.10 in the custom.
SAC (45.67) and TD3 (34.99) performed well in episode length, indicating their ability to sustain longer runs in the default environment.
SAC again led with a mean reward of 30.75, while A2C lagged behind with 10.70 in the default environment.
A2C's value loss was notably high in the default environment (2.13), suggesting difficulties in approximating value functions.
SAC and TD3 had comparatively lower losses, indicating more stable training dynamics.
PPO's loss was relatively low (0.040), indicating effective updates.
The exploration rate for DQN remained consistent at 0.1 across both environments, suggesting stable exploration strategies.
PPO maintained low entropy losses in both environments, indicating a consistent exploration-exploitation balance.
A2C experienced significant policy loss in the default environment, which may reflect challenges in adapting to the dynamics of the highway environment.
- A2C: While the default environment yielded better episode lengths and rewards compared to the custom one, the high value loss indicates potential overfitting or instability in learning.
- DQN: Showed solid consistency and even improvement in rewards in the default environment, suggesting it may be more robust to changes in environment settings.
- PPO: Maintained stability and lower losses, indicating effective learning dynamics but slightly lower performance in the default environment compared to the custom one.
- SAC: Consistently high performance across metrics in both environments, indicating it is well-suited for the highway tasks.
- TD3: Similar to SAC, TD3 performed well and showed resilience, suggesting effective learning strategies.
- TRPO: Demonstrated stable performance in both environments, with a good balance of exploration and exploitation.
The evaluation can be done through the evaluate.py script, which loads a handful of trained models and runs them in both the default and the custom environment. The end results can be used to compare the performance of each model on a statistical basis.
The evaluation script can be found at src/eval/evaluate.py.
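In essence, this kind of evaluation can be sketched as loading each saved model and measuring its mean reward; the sketch below uses Stable-Baselines3's evaluate_policy with assumed model names, paths, and episode count (the real script may differ; see src/eval/evaluate.py):

```python
# Illustrative evaluation sketch; model names, paths, and episode count are assumptions.
import gymnasium
import highway_env  # noqa: F401  (registers the highway environments)
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Assumes custom-highway-env-v0 has been registered (see the registration sketch above).
for name_env in ("highway-fast-v0", "custom-highway-env-v0"):
    env = Monitor(gymnasium.make(name_env, render_mode="rgb_array"))
    model = PPO.load("../../models/evals/highway_ppo_default_fast", env=env)  # hypothetical path
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"{name_env}: mean reward {mean_reward:.2f} +/- {std_reward:.2f}")
```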
Inside the evaluation script, SAC and TD3 underperform because they rely on the continuous/discrete action-space wrapper, which hadn't been implemented yet when the evaluation script was created (sorry, we forgot; to be fixed in the next release :/).
You can see all of them in the path src/eval/results
Mainly that we didn't have access to a GPU at the time, so the trials with the custom environment took a really long time.
If you look at the custom environment stats, you will see that training took 16+ hours.
With an Nvidia GTX 980 Ti I could train all six models in parallel in around 8 hours.
In this research, we evaluated several reinforcement learning models—A2C, DQN, PPO, SAC, TD3, and TRPO—in the context of the highway environment. Our findings indicate that SAC, A2C, DQN, and PPO consistently demonstrated strong performance during evaluations, effectively navigating the complexities of the environment. In contrast, the other models displayed erratic behavior, often resulting in collisions or inefficient driving strategies. These results highlight the robustness of SAC, A2C, DQN, and PPO, making them preferable choices for applications requiring reliable and effective decision-making in dynamic environments. Future work should focus on fine-tuning the performance of the less successful models to enhance their stability and effectiveness.