Evaluating Moral Agency in Reinforcement Learning Agents

University of Toronto, September–December 2023

In this project, I explored how reinforcement learning (RL) agents simulate goal-directed behavior and whether they can be understood as morally intelligent systems. I implemented two core RL algorithms—Sarsa (on-policy) and Q-learning (off-policy)—in a custom gridworld using MATLAB. I then trained and analyzed both model-based and model-free agents in the classic cart-pole balancing task, comparing performance under different learning rates and policy structures.
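The two algorithms differ only in how they bootstrap the value of the next state: Sarsa uses the action the agent will actually take, while Q-learning uses the greedy action. The sketch below is an illustrative Python analogue of the MATLAB gridworld implementation rather than the original code; the 4x4 layout, the -1 step reward, and the hyperparameters are assumptions made for the sake of a runnable example.

```python
# Minimal tabular Sarsa and Q-learning on a small gridworld.
# Illustrative Python analogue of the project's MATLAB implementation;
# the 4x4 layout, -1 step reward, and hyperparameters are assumptions.
import numpy as np

class GridWorld:
    """4x4 grid: start at (0, 0), goal at (3, 3), reward of -1 per step."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        return self.pos, -1.0, self.pos == self.goal

def eps_greedy(Q, s, eps, rng):
    """Epsilon-greedy action selection over the Q-values for state s."""
    return int(rng.integers(len(Q[s]))) if rng.random() < eps else int(np.argmax(Q[s]))

def train(update, episodes=500, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    env, rng = GridWorld(), np.random.default_rng(seed)
    Q = np.zeros((env.size, env.size, len(GridWorld.ACTIONS)))
    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, reward, done = env.step(a)
            a2 = eps_greedy(Q, s2, eps, rng)
            if update == "sarsa":
                # On-policy: bootstrap on the action the agent will actually take next.
                target = reward + gamma * Q[s2][a2] * (not done)
            else:
                # Off-policy (Q-learning): bootstrap on the greedy next action.
                target = reward + gamma * Q[s2].max() * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q

if __name__ == "__main__":
    for algo in ("sarsa", "q-learning"):
        Q = train(algo)
        print(algo, "-> greedy first action:", int(np.argmax(Q[(0, 0)])))
```

In the cart-pole task, the continuous state (cart position and velocity, pole angle and angular velocity) has to be discretized or approximated before tabular updates of this form apply.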

Through detailed experimentation and visual analysis of reward trends and simulation behavior, I found that model-based agents trained faster (250 episodes) but often failed to learn robust policies, while model-free agents trained more slowly (933 episodes) but achieved higher long-term reward. Adjusting the learning rate significantly reduced training time without compromising performance.
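To make the model-based versus model-free contrast concrete, the sketch below compares a plain Q-learning update, which learns only from real experience, with a Dyna-Q-style update that also fits a one-step model and replays simulated transitions from it. This is a conceptual Python illustration rather than the project's MATLAB cart-pole agents; the two-action space, the defaultdict value table, and the planning budget are all assumptions.

```python
# Conceptual contrast: model-free Q-learning vs. Dyna-Q-style model-based planning.
# Not the project's MATLAB agents; states are assumed hashable (e.g. a discretized
# cart-pole state), and the two actions stand for push-left / push-right.
import random
from collections import defaultdict

N_ACTIONS = 2

def model_free_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Update Q from a single real transition (s, a, r, s2)."""
    best_next = max(Q[(s2, b)] for b in range(N_ACTIONS))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def model_based_update(Q, model, s, a, r, s2, n_planning=20, alpha=0.1, gamma=0.99):
    """Dyna-Q-style update: learn from the real transition, then plan with the model."""
    model_free_update(Q, s, a, r, s2, alpha, gamma)
    model[(s, a)] = (r, s2)  # record a deterministic one-step model
    for _ in range(n_planning):
        # Replay simulated experience drawn from the learned model.
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        model_free_update(Q, ps, pa, pr, ps2, alpha, gamma)

# Usage: Q = defaultdict(float); model = {}
# then call one of the update functions after every real environment step.
```

The extra planning loop is what lets a model-based agent finish in far fewer real episodes, while the quality of the final policy still depends on how well the model and the reward signal are exploited.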

This plot visualizes the training performance of a model-based reinforcement learning agent in the cart-pole environment. The agent completed training in only 250 episodes, reflecting the efficiency of planning-based methods, but its average reward was relatively low, indicating that it learned a policy without fully optimizing performance. This highlights the tradeoff in model-based systems between planning speed and reward maximization.

Figure 1: Model-Based Agent Training Graph

In this simulation visual, the trained model-based agent remains in a stable but static position, failing to actively balance the pole. This behavior suggests that although the agent completed training quickly, it did not learn an effective control policy. The visual reinforces the idea that fast convergence does not guarantee high-performance behavior, especially when reward optimization is shallow.

Figure 2: Model-Based Agent Cart-Pole Visualizer

The model-free agent required 933 episodes to complete training—significantly longer than the model-based counterpart—but achieved a higher average reward. This plot illustrates the classic tradeoff in reinforcement learning: model-free agents learn more slowly but can ultimately achieve stronger performance due to experiential learning rather than abstract planning.

Figure 3: Model-Free Agent Training Graph

The visual for the model-free agent shows a similar lack of movement to the model-based agent, suggesting that despite better reward signals, the agent’s learned policy may not effectively generalize in simulation. This emphasizes a recurring challenge in reinforcement learning: bridging the gap between reward function optimization and real-world behavioral competence.

Figure 4: Model-Free Agent Cart-Pole Visualizer

This graph shows that lowering the learning rate to 0.1 drastically reduced the total number of steps needed for convergence (from ~30,000 to ~3,700), demonstrating the importance of hyperparameter tuning in accelerating training. Despite improved training efficiency, behavioral performance remained modest, underlining that tuning alone cannot overcome model limitations.

Figure 5: Model-Based Agent with Adjusted Learning Rate

With the adjusted learning rate, the cart-pole visualizer shows a slight improvement: the pole rests closer to vertical, and the cart maintains a more centered position. However, the agent still falls short of full task mastery. This suggests that while the adjusted learning rate improved training efficiency, it did not meaningfully enhance final policy quality.

Figure 6: Adjusted Model-Based Agent Cart-Pole Visualizer

With a learning rate of 0.5, the model-free agent completed training ten times faster (4 minutes vs. 54 minutes) while maintaining its original performance level. This result demonstrates that model-free methods can be significantly optimized through hyperparameter adjustment, making them more scalable for complex, real-time environments.

Figure 7: Adjusted Model-Free Agent Training Graph
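The learning-rate adjustments above amount to a small hyperparameter sweep: retrain with each candidate value and record episodes and wall-clock time. The harness below sketches that bookkeeping in Python; train_cartpole_agent is a hypothetical stand-in for the actual MATLAB training routine, and the candidate learning rates are only examples.

```python
# Sketch of a learning-rate sweep: retrain the agent once per candidate value
# and record wall-clock time, episodes, and average reward. The training function
# is a hypothetical placeholder, not the project's MATLAB routine.
import time

def sweep_learning_rates(train_fn, alphas=(0.05, 0.1, 0.5)):
    """Run train_fn(alpha) for each learning rate and collect simple statistics."""
    results = []
    for alpha in alphas:
        start = time.perf_counter()
        episodes, avg_reward = train_fn(alpha)
        results.append({
            "alpha": alpha,
            "episodes": episodes,
            "avg_reward": avg_reward,
            "seconds": round(time.perf_counter() - start, 2),
        })
    return results

def train_cartpole_agent(alpha):
    """Hypothetical placeholder: train with learning rate `alpha` and return
    (episodes until convergence, average reward). Replace with the real trainer."""
    return 0, 0.0  # dummy values so the sketch runs end to end

if __name__ == "__main__":
    for row in sweep_learning_rates(train_cartpole_agent):
        print(row)
```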

Building on this, I applied philosophical frameworks from Railton, Dreyfus, and Haas to evaluate whether these agents exhibit moral reasoning or act “for a reason.” I critically examined the assumptions behind rule-governed, context-free computation, concluding that while RL agents can mimic decision patterns, they lack the representational depth and self-aware valuation necessary for genuine moral cognition. I contrasted rationalism, sentimentalism, and valuationism as models of moral agency, arguing (with Haas) that moral valuationism best aligns with how RL agents weigh rewards and make decisions.

This interdisciplinary project bridged machine learning, cognitive science, and moral philosophy, offering insight into how technical systems learn—and whether that learning can be considered truly cognitive or ethical.