Car crash simulator - Highway environment

Image of two cars crashing

Reinforcement Learning - Team 1


Requirements

Use a Python version older than 3.13; 3.12.6 and 3.10.11 worked for me (3.11 would probably work too).

Install the dependencies from requirements.txt; that should be all you need.

You can find all the sources for the assignment in the src/ folder.

Folders:

  • models/ ➡ contains all the models trained
    • models/best ➡ contains the best-performing models saved during training
    • models/evals ➡ contains the models to be evaluated inside the script src/eval/evaluate.py
  • src/eval ➡ contains the scripts to evaluate the models
    • src/eval/results ➡ contains screenshots of the evaluation to show what it produces
  • src/run ➡ contains the scripts to run the trained models
  • src/train ➡ contains the scripts to train the models
    • src/train/logs ➡ contains all the tensorboard logs of the trained models
    • src/train/old ➡ contains old training scripts used during training
  • src/utils ➡ contains various convenience scripts for the developer (e.g. a script to check whether you have GPU support; see the sketch below)
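For example, the GPU check can be as simple as the following (a minimal sketch assuming PyTorch, which the stable-baselines3 tooling used in the training snippets relies on; the actual utility script may differ):

import torch

# Report whether a CUDA-capable GPU is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))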

Overview of the project and objectives

  • The aim of this project is to develop a deep reinforcement learning (DRL) agent capable of autonomous driving using the HighwayEnv environment and a custom reward function.

Methodology

Custom HighwayEnv with Reward Function

Overview

This project customizes the HighwayEnv environment to develop a deep reinforcement learning (DRL) agent for autonomous driving. The primary enhancement is a custom reward function designed to encourage safe, efficient, and smooth driving behaviors.

Key Features

  1. Custom Reward Function:

    • Lane Change Penalty: Discourages unnecessary lane changes to promote smoother driving by penalizing each change.
    • Safe Following Distance: Rewards maintaining a safe distance from other vehicles and penalizes being too close.
    • Combined Reward Components: Integrates existing and new rewards such as collision penalties, high-speed rewards, right-lane incentives, and traffic-aware adjustments.
    • Normalization: Scales rewards dynamically to maintain consistent values across varying situations.
  2. Closest Vehicle Distance Calculation:

    • A helper method computes the distance to the nearest vehicle in the same lane, ensuring rewards reflect traffic awareness (see the sketch after this list).
  3. Custom Environment Registration:

    • Registers the custom environment using gymnasium, enabling seamless integration and experimentation.
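A rough sketch of features 1 and 2, assuming the standard highway-env vehicle API (the class name, helper name, and weights below are illustrative; the real implementation is in highway_env/envs/custom_highway_env_v0.py):

import numpy as np
from highway_env.envs.highway_env import HighwayEnv

class CustomHighwayEnv(HighwayEnv):  # illustrative name
    def _closest_vehicle_distance(self) -> float:
        """Longitudinal gap to the nearest vehicle ahead in the ego lane (inf if none)."""
        ego = self.vehicle
        gaps = [
            v.position[0] - ego.position[0]
            for v in self.road.vehicles
            if v is not ego
            and v.lane_index == ego.lane_index
            and v.position[0] > ego.position[0]
        ]
        return min(gaps) if gaps else np.inf

    def _reward(self, action) -> float:
        reward = super()._reward(action)  # keep collision / high-speed / right-lane terms
        # Lane change penalty: penalize any change of lane index since the last step
        last_lane = getattr(self, "_last_lane", None)
        if last_lane is not None and self.vehicle.lane_index != last_lane:
            reward -= 0.1  # illustrative penalty weight
        self._last_lane = self.vehicle.lane_index
        # Safe following distance: reward a comfortable gap, penalize tailgating
        gap = self._closest_vehicle_distance()
        reward += 0.1 if gap > 20.0 else -0.1  # illustrative threshold and weights
        return reward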

DRL Algorithms Used

  • A2C (Advantage Actor-Critic): Balances policy and value optimization by leveraging advantage estimates for improved performance.
  • PPO (Proximal Policy Optimization): Ensures stable learning through policy clipping, allowing for efficient and reliable updates.
  • DQN (Deep Q-Network): Utilizes experience replay and target networks to effectively manage discrete action spaces.
  • TD3 (Twin Delayed Deep Deterministic Policy Gradient): Enhances performance in continuous action spaces by addressing overestimation bias and introducing delayed policy updates.
  • TRPO (Trust Region Policy Optimization): Provides guarantees on the policy updates to maintain a trust region, facilitating safer and more effective learning.
  • SAC (Soft Actor-Critic): Combines off-policy learning with maximum entropy reinforcement learning for robust exploration in continuous action spaces.
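The implementations come from stable-baselines3 (TRPO presumably from sb3-contrib), in line with the Monitor/EvalCallback usage shown later. Creating and training a model looks roughly like this (a sketch; the exact hyperparameters live in the scripts under src/train):

import gymnasium
import highway_env  # noqa: F401  (registers the highway environments)
from stable_baselines3 import PPO  # A2C, DQN, SAC, TD3 are imported the same way
# from sb3_contrib import TRPO     # TRPO comes from sb3-contrib

env = gymnasium.make("highway-fast-v0", render_mode="rgb_array")

model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="logs/ppo/")  # illustrative settings
model.learn(total_timesteps=100_000)
model.save("../../models/highway_ppo_default_fast")  # illustrative model name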

Details on environment modifications

  • env.unwrapped.config["lanes_count"] = 4 ➡ sets the number of lanes to 4
  • env.unwrapped.config["duration"] = 60 ➡ makes the episodes longer
  • env.unwrapped.config["vehicles_density"] = 1.6 ➡ puts more cars in each lane
  • env.unwrapped.config["vehicle_count"] = 70 ➡ adds more cars to the environment
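In practice these tweaks are applied to the unwrapped env's config dict and take effect on the next reset; roughly (a sketch mirroring the lines above):

import gymnasium
import highway_env  # noqa: F401  (registers the highway environments)

env = gymnasium.make("highway-fast-v0", render_mode="rgb_array")
env.unwrapped.config["lanes_count"] = 4         # four lanes
env.unwrapped.config["duration"] = 60           # longer episodes
env.unwrapped.config["vehicles_density"] = 1.6  # more cars per lane
env.unwrapped.config["vehicle_count"] = 70      # more cars in the env
obs, info = env.reset()                         # config changes apply on reset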

Demo

To see video demos of the models, go to the videos folder.

Implementation

  • reward functions:

    • highway_env/envs/custom_highway_env_v0.py ➡ (MAIN) the best-performing one; we used this for training and evaluation
    • highway_env/envs/custom_high_env_v1_antonio.py ➡ a previous custom environment that didn't perform quite right with PPO
    • highway_env/envs/custom_high_env_v1_lukas.py ➡ a previous custom environment that didn't perform quite right with PPO
  • src/train: ➡ contains all the scripts to train the models:

    • train_a2c_highway.py ➡ train A2C model*
    • train_dqn_highway.py ➡ train DQN model*
    • train_ppo_highway.py ➡ train PPO model*
    • train_sac_highway.py ➡ train SAC model*
    • train_td3_highway.py ➡ train TD3 model*
    • train_trpo_highway.py ➡ train TRPO model*
  • src/run: ➡ contains all the scripts to run the models:

    • run_a2c_highway.py ➡ run the trained A2C model*
    • run_dqn_highway.py ➡ run the trained DQN model*
    • run_ppo_highway.py ➡ run the trained PPO model*
    • run_sac_highway.py ➡ run the trained SAC model*
    • run_td3_highway.py ➡ run the trained TD3 model*
    • run_trpo_highway.py ➡ run the trained TRPO model*
  • src/eval/evaluate.py ➡ evaluates all the models

*Each script works with both the default environment (highway-fast-v0) and the custom one.

Code Highlights

Evaluation function

We have implemented an evaluation callback with a separate environment that evaluates the agent every 2,000 timesteps and saves the best model found during training. The saved model's name ends with _best.

import gymnasium
import highway_env  # registers the highway environments
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback

name_env = "highway-fast-v0"
eval_env = gymnasium.make(name_env, render_mode='rgb_array')  # Separate eval environment
eval_env = Monitor(eval_env)  # Add the Monitor wrapper for tracking statistics

....

# Callback for saving the best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path=f"../../models/{name_model}_best",
    log_path=f"logs/trpo/{name_model}/",
    eval_freq=2_000,           # Evaluate every 2,000 timesteps
    deterministic=True,         # Use deterministic actions for evaluation
    render=False
)

....

# Train the model with same timesteps as other models
model.learn(total_timesteps=int(100_000), callback=eval_callback)
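EvalCallback writes the best checkpoint as best_model.zip inside best_model_save_path, so after training it can be reloaded roughly like this (a sketch; the scripts in src/run presumably do something similar):

from sb3_contrib import TRPO

# Reload the best checkpoint saved by EvalCallback during training
model = TRPO.load(f"../../models/{name_model}_best/best_model")
obs, info = eval_env.reset()
action, _states = model.predict(obs, deterministic=True)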

Action space

In the training scripts for SAC and TD3 we experimented with continuous and discrete action spaces: both algorithms expect continuous actions, so a wrapper maps their continuous output onto the environment's discrete actions.

import gymnasium
import numpy as np
from gymnasium import spaces

class DiscreteToBoxWrapper(gymnasium.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)
        n_actions = env.action_space.n
        self.action_space = spaces.Box(low=-1, high=1, shape=(1,), dtype=np.float32)
        self._n_actions = n_actions

    def action(self, action):
        # Convert continuous action to discrete
        # Map [-1, 1] to [0, n_actions-1]
        action = np.clip(action, self.action_space.low, self.action_space.high)
        scaled_action = (action + 1) * (self._n_actions - 1) / 2
        discrete_action = int(np.round(scaled_action.item()))
        discrete_action = np.clip(discrete_action, 0, self._n_actions - 1)
        return discrete_action

## custom environment
#name_model = "highway_sac_custom_2"
#name_env = "custom-highway-env-v0"

## default environment
name_model = "highway_sac_default_fast"
name_env = "highway-fast-v0"

# Create the environment and wrap it
env = gymnasium.make(name_env, render_mode='rgb_array')
env = DiscreteToBoxWrapper(env)
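With the wrapper in place, SAC (or TD3) can be trained on the now-continuous action space, roughly like this (a sketch; the exact settings are in src/train/train_sac_highway.py):

from stable_baselines3 import SAC

model = SAC("MlpPolicy", env, verbose=1)  # illustrative settings
model.learn(total_timesteps=100_000)
model.save(f"../../models/{name_model}")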

Reward Function Implementation

You can find it in highway_env/envs/custom_highway_env_v0.py.

The reward function combines multiple elements:

  • Penalizes collisions and unnecessary lane changes.
  • Rewards high-speed driving, staying in the right lane, and maintaining safe distances.
  • Dynamically adjusts reward scaling through normalization if configured.

Lane Change Penalty

Tracks the agent’s current lane and penalizes switches to encourage stability.

Safe Distance Rewards

Calculates and rewards safe spacing from other vehicles to enhance traffic safety.

Environment Registration

Registers the customized environment as custom-highway-env-v0, making it easily accessible for training and evaluation.
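Schematically, the registration goes through gymnasium's registry (the entry-point class name below is illustrative; the real one is defined in highway_env/envs/custom_highway_env_v0.py):

from gymnasium.envs.registration import register

register(
    id="custom-highway-env-v0",
    entry_point="highway_env.envs.custom_highway_env_v0:CustomHighwayEnv",  # illustrative class name
)

# afterwards it can be created like any other env:
# env = gymnasium.make("custom-highway-env-v0", render_mode="rgb_array")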

Results from the training logs: fast vs custom environment

highway-fast-v0 vs custom-highway-env-v0

Mean Episode Length:

  • Custom Environments: Models like SAC, TD3, and TRPO performed exceptionally well, achieving mean episode lengths around 40, indicating effective learning and control.
  • Fast Environments: All models showed reduced performance, with lower episode lengths compared to their custom counterparts, suggesting that the fast environment parameters may increase difficulty or reduce the time available for training.

This graph shows the overall episode length during training; we can see that DQN stays alive longer with the custom environment than with the default fast one.

Mean Reward:

SAC achieved the highest mean reward in both custom and fast environments, with values of approximately 27 and 21 respectively, suggesting it effectively balances exploration and exploitation.

DQN and PPO also performed well, particularly in the custom environments, but their rewards dropped significantly in the fast environments, indicating sensitivity to the environment's dynamics.

Overall rewards during training: DQN shows a big dip around 20K steps, and the custom environment seems more stable in this case. A2C also tends to perform much better with the custom environment than with the default one here.

Exploration Rate:

The exploration rate for DQN was consistent across environments. This indicates that the model maintained its exploration strategy, which is crucial for learning in different settings.

Training Losses:

The value loss for A2C in the custom environment was significantly lower than in the fast environment, indicating better approximation of value functions during training.

PPO and SAC showed consistent losses, with SAC having a notably lower actor loss in the custom environment, which is indicative of more stable training.

Learning Rate:

The learning rates were consistent across runs for each model type.

Entropy and Policy Loss:


Higher entropy losses in A2C for the fast environment indicate that it may struggle with exploration in this configuration, leading to reduced policy diversity.
PPO showed relatively low policy losses, suggesting a stable policy update process.

Model Comparisons

This comparison is based on what we can see in TensorBoard; in reality, high rewards and long episodes do not always translate into good performance in the highway environment when you watch the agent drive.

(Sometimes a high reward/episode time is just the model stalling for time to collect more reward.)

A2C:

Performed moderately well, especially in custom environments, but struggled in the fast environment both in terms of episode length and rewards.

DQN:

The custom environment yielded better results compared to the fast one, indicating it may require more tuning for environments with different dynamics.

PPO:

Demonstrated solid performance across both environments, particularly excelling in custom settings. It maintained stability in training, as evidenced by low policy and value losses.

SAC:

The standout performer, SAC showed resilience and effectiveness in both environments, suggesting it is well-suited for the highway tasks given its high rewards and episode lengths.

TD3:

Similar to SAC, TD3 performed well but slightly lagged behind in rewards compared to SAC, indicating it might benefit from further exploration tuning.

TRPO:

Consistent performance in custom environments, but like others, faced challenges in the fast environment. Its ability to maintain line search success is a positive sign for stability.


Results from the training logs: default vs custom environment

highway-v0 vs custom-highway-env-v0

Mean Episode Length:

  • A2C: The default A2C achieved an average episode length of approximately 30.41, which is slightly better than the custom environment's 28.07.
  • DQN: The default DQN maintained a length of around 35.20, showing solid performance, though slightly lower than the custom setup (37.00).
  • PPO: Performance decreased slightly in the default environment (28.85) compared to the custom version (34.16).
  • SAC: SAC's performance was stable, with episode lengths of around 39.23 in the default versus 40 in the custom.
  • TD3: Similar to SAC, TD3 also maintained strong performance, with episode lengths around 40 in both the custom and default environments.
  • TRPO: TRPO performed consistently well, with both environments showing around 40 episode lengths.

Mean Reward:

  • A2C: Mean rewards improved in the default environment (22.81) compared to the custom (19.57).
  • DQN: Default DQN yielded a mean reward of 25.64, which is higher than the custom (24.22), indicating effective policy learning.
  • PPO: The mean reward showed a decrease in the default environment (22.47) compared to the custom (24.54).
  • SAC: SAC's mean reward in the default environment (27.71) was slightly higher than in the custom (26.37).
  • TD3: The mean reward in the default environment (27.92) was slightly better than the custom (25.86).
  • TRPO: TRPO showed consistent performance with a mean reward of 27.83 in the default environment compared to 26.10 in the custom.

Rollout Metrics:

Episode Length:

SAC (45.67) and TD3 (34.99) performed well in episode length, indicating their ability to sustain longer runs in the default environment.

Mean Reward:

SAC again led with a mean reward of 30.75, while A2C lagged behind with 10.70 in the default environment.

Training Dynamics:

Loss Metrics:

A2C's value loss was notably high in the default environment (2.13), suggesting difficulties in approximating value functions.

SAC and TD3 had comparatively lower losses, indicating more stable training dynamics.

PPO's loss was relatively low (0.040), indicating effective updates.

Exploration Rate

The exploration rate for DQN remained consistent at 0.1 across both environments, suggesting stable exploration strategies.

Entropy and Policy Loss:

PPO maintained low entropy losses in both environments, indicating a consistent exploration-exploitation balance.

A2C experienced significant policy loss in the default environment, which may reflect challenges in adapting to the dynamics of the highway environment.

Model Comparisons

  • A2C: While the default environment yielded better episode lengths and rewards compared to the custom one, the high value loss indicates potential overfitting or instability in learning.

  • DQN: Showed solid consistency and even improvement in rewards in the default environment, suggesting it may be more robust to changes in environment settings.

  • PPO: Maintained stability and lower losses, indicating effective learning dynamics but slightly lower performance in the default environment compared to custom.

  • SAC: Consistently high performance across metrics in both environments, indicating it is well-suited for the highway tasks.

  • TD3: Similar to SAC, TD3 performed well and showed resilience, suggesting effective learning strategies.

  • TRPO: Demonstrated stable performance in both environments, with a good balance of exploration and exploitation.

Evaluation

The evaluation can be done through the evaluate.py script, which compares a handful of trained models and runs them in both a default and a custom environment. The end results can be used to compare the performance of each model on a statistical basis. The evaluation script can be found at src/eval/evaluate.py.
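Per model, such statistics can be gathered with SB3's evaluate_policy helper; the script presumably does something along these lines (a sketch, not the exact contents of evaluate.py; the model file name is illustrative):

import gymnasium
import highway_env  # noqa: F401  (registers the highway environments)
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gymnasium.make("highway-fast-v0")
model = PPO.load("../../models/evals/highway_ppo_default_fast")  # illustrative file name
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")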

Note

Inside the evaluation script, SAC and TD3 are underperforming because of the continuous/discrete action-space mismatch; at the time the evaluation script was created, the wrapper for this wasn't implemented there yet (sorry, we forgot, TBC next release :/).

Example of evaluation graphs generated

You can see all of them under src/eval/results.

Challenges faced

Mainly that we didn't have access to a GPU at the time, so the training runs with the custom environment took a really long time.

If you look at the custom environment stats, you will see that it trained for 16+ hours.

With an Nvidia GTX 980 Ti, I could train all six models in parallel in around 8 hours.

Conclusion

In this research, we evaluated several reinforcement learning models—A2C, DQN, PPO, SAC, TD3, and TRPO—in the context of the highway environment. Our findings indicate that SAC, A2C, DQN, and PPO consistently demonstrated strong performance during evaluations, effectively navigating the complexities of the environment. In contrast, the other models displayed erratic behavior, often resulting in collisions or inefficient driving strategies. These results highlight the robustness of SAC, A2C, DQN, and PPO, making them preferable choices for applications requiring reliable and effective decision-making in dynamic environments. Future work should focus on fine-tuning the performance of the less successful models to enhance their stability and effectiveness.
