2. Reinforcement Learning. (? points)
In this question, you will write Python programs to implement a simulated environment
and build a reinforcement learning agent that discovers the optimal (shortest) path to
a goal. The agent’s environment is a rectangular grid made up of cells and is shown in
Figure 2. Here, the single blue cell is the initial state of the agent, the single yellow cell
is the goal state, and the more numerous turquoise (blue/green) cells are barriers. The
agent must get from the initial state to the goal state without crossing any barriers.
At each time step, the agent may move one cell to the left, right, up, or down (but not
diagonally). The environment does not wrap around.
Any of the four actions is possible at any state, but if the agent tries to move into a
barrier or off the grid, then its location does not change. Thus, if the agent is in the
lower left corner and tries to move left, it will not move and will remain in the same
cell. Likewise, if the agent is in the cell below the initial (blue) state and tries to move
down (into a barrier), it will remain in the same cell.
You should implement the Q-Learning algorithm given on slide 39 of Lecture 8, but
with a modified exploration policy. In particular, instead of choosing the next action
A_t using the ε-greedy policy, you should use the softmax policy given on slide 41. The
program starts with the agent at the initial state, and runs until it reaches the goal
state. The reward is 25 at the goal state and 0 at every other state. However, your
program should be general enough to work for different grid sizes, different barriers
and different rewards.
The actions, the environment and the Q function may be stored in global variables.
You should represent the Q function as an array. You may represent the environment,
actions and locations in any way you like as long as it is reasonably efficient, easy to code
and easy to understand.
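For instance, one simple representation (a sketch only; the grid size, barrier cells
and goal location below are hypothetical stand-ins, since Figure 2 is not reproduced
in this text) stores the grid as a 2-D NumPy array, the actions as row/column
offsets, and Q as a 3-D array indexed by row, column, and action:

```python
import numpy as np

# Hypothetical 5x5 grid: 0 = free cell, 1 = barrier.
# Figure 2's actual size and barrier layout differ.
grid = np.zeros((5, 5), dtype=int)
grid[1, 1:4] = 1                     # a short wall of barrier cells
goal = (0, 4)                        # goal location as a (row, col) tuple

# Actions 0..3 = up, down, left, right, stored as (row, col) offsets.
moves = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])

# Q[r, c, a] = estimated value of taking action a in cell (r, c).
Q = np.zeros(grid.shape + (4,))
```

Indexing Q by location and action this way makes the updates in parts (c) and (d)
one-line array operations, with no loops.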
Hand in your source code, well commented. Also hand in a brief write-up describing
your implementation, including what each function returns, how actions, locations and
the environment are represented, any additional functions you may have implemented,
and any other major implementation decisions. Be brief and clear.
In this question, you will not be dealing with large vectors or matrices, so vectorized
code is less important (but should still be used in place of loops when possible, since
loops in Python are very slow).
(a) Create and display the environment shown in Figure 2. You may store the
environment any way you like, but you should display it as in Figure 2, which
was generated using the function imshow in matplotlib.pyplot. The display
should show the barriers and the goal state, but need not show the initial state.
The colours need not be the same as in Figure 2, but the figure should be
self-explanatory, and the barriers and the goal state should be obvious. Title the
figure, Question 2(a): Grid World.
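A display along these lines might look like the sketch below. The grid contents are
hypothetical placeholders (Figure 2's actual barriers and goal differ); encoding free,
barrier, and goal cells as distinct integers lets imshow colour them automatically:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; remove this line to view
import matplotlib.pyplot as plt

# Hypothetical layout: 0 = free cell, 1 = barrier, 2 = goal.
grid = np.zeros((5, 5), dtype=int)
grid[1, 1:4] = 1               # barrier cells
grid[0, 4] = 2                 # goal cell

plt.imshow(grid)               # distinct values -> distinct, obvious colours
plt.title("Question 2(a): Grid World")
plt.savefig("grid_world.png")
```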
(b) Define a Python function Trans(L,a) that implements the transition function.
Specifically, given the current location L of the agent, and an action a, return the
new location of the agent and the immediate reward. Do not use any loops.
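One loop-free way to implement the transition is to clamp the move with a single
bounds-and-barrier check. This is a sketch: the grid, goal cell, and reward value
below are hypothetical stand-ins for Figure 2.

```python
import numpy as np

grid = np.zeros((5, 5), dtype=int)   # hypothetical: 0 = free, 1 = barrier
grid[1, 1:4] = 1
goal = (0, 4)                        # hypothetical goal cell
GOAL_REWARD = 25

# Actions 0..3 = up, down, left, right as (row, col) offsets.
moves = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])

def Trans(L, a):
    """Return (new location, immediate reward) for taking action a at L."""
    r, c = np.array(L) + moves[a]
    # Stay put if the move leaves the grid or enters a barrier.
    if not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]) or grid[r, c] == 1:
        r, c = L
    reward = GOAL_REWARD if (r, c) == goal else 0
    return (r, c), reward
```

For example, moving left from the left edge returns the same location, and moving
into the goal cell returns the goal reward.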
(c) Define a Python function choose(L,beta) that implements the softmax
exploration policy on slide 41 of Lecture 8. Here, L is the location of the agent, and
beta = β is the softmax parameter on slide 41. The choose function should select
an action probabilistically, and return it. You may find the functions choice or
multinomial in numpy.random useful for this purpose. Do not use any loops.
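A sketch of one such implementation, assuming the Q table is a global 3-D array as
above (the shape here is a hypothetical placeholder):

```python
import numpy as np

Q = np.zeros((5, 5, 4))    # hypothetical Q table: rows x cols x 4 actions

def choose(L, beta):
    """Pick an action with probability proportional to exp(beta * Q(L, a))."""
    q = Q[L[0], L[1]]
    p = np.exp(beta * (q - q.max()))   # subtract the max for numerical stability
    p /= p.sum()
    return np.random.choice(4, p=p)
```

When Q is all zero, every action is chosen with probability 1/4; as Q values
separate, larger beta concentrates the choice on the best-looking action.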
(d) Define a Python function updateQ(L,a,alpha,gamma) that updates the Q
function as specified in the last line on slide 39 of Lecture 8. Here, L = S_t is the
location of the agent, a = A_t is an action, alpha = α is the learning rate, and
gamma = γ is the discount factor. The function may return anything you like. Do
not use any loops.
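A sketch of the update, repeating the hypothetical Trans from part (b) so the block
is self-contained (grid, goal, and reward are again made-up stand-ins for Figure 2).
Returning the next location is a convenience, not a requirement:

```python
import numpy as np

grid = np.zeros((5, 5), dtype=int)   # hypothetical grid: 0 = free, 1 = barrier
goal = (0, 4)
moves = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])
Q = np.zeros(grid.shape + (4,))

def Trans(L, a):                     # part (b)'s transition, repeated here
    r, c = np.array(L) + moves[a]
    if not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]) or grid[r, c] == 1:
        r, c = L
    return (r, c), (25 if (r, c) == goal else 0)

def updateQ(L, a, alpha, gamma):
    """One Q-learning update; returns the agent's next location."""
    L2, reward = Trans(L, a)
    target = reward + gamma * Q[L2[0], L2[1]].max()
    Q[L[0], L[1], a] += alpha * (target - Q[L[0], L[1], a])
    return L2
```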
(e) Define a Python function episode(L,alpha,gamma,beta) that implements the
loop on slide 39. The loop should start with the agent at location L and terminate
when the agent reaches the goal state. This is called one episode of Q learning.
Here, alpha = α, gamma = γ and beta = β, as above. The episode function
should not initialize the Q function. It may return anything you like.
(f) Define a Python function learn(N,L,alpha,gamma,beta) that initializes the Q
function and performs N episodes of Q learning, where each episode starts with
the agent at location L. The Q function should be initialized to be 0 everywhere.
The learn function may return anything you like.
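Putting parts (b) through (f) together, a complete sketch might look like the
following. The grid, goal, start cell, and reward remain hypothetical placeholders
for Figure 2; returning the list of episode lengths from learn is one convenient
choice for part (g), not a requirement:

```python
import numpy as np

grid = np.zeros((5, 5), dtype=int)   # hypothetical: 0 = free, 1 = barrier
grid[1, 1:4] = 1
goal = (0, 4)                        # hypothetical goal and start cells
start = (4, 0)
moves = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])
Q = np.zeros(grid.shape + (4,))

def Trans(L, a):
    r, c = np.array(L) + moves[a]
    if not (0 <= r < 5 and 0 <= c < 5) or grid[r, c] == 1:
        r, c = L
    return (r, c), (25 if (r, c) == goal else 0)

def choose(L, beta):
    q = Q[L[0], L[1]]
    p = np.exp(beta * (q - q.max()))
    return np.random.choice(4, p=p / p.sum())

def updateQ(L, a, alpha, gamma):
    L2, reward = Trans(L, a)
    Q[L[0], L[1], a] += alpha * (reward + gamma * Q[L2[0], L2[1]].max()
                                 - Q[L[0], L[1], a])
    return L2

def episode(L, alpha, gamma, beta):
    """Run one episode from L to the goal; return its length in actions."""
    steps = 0
    while L != goal:
        L = updateQ(L, choose(L, beta), alpha, gamma)
        steps += 1
    return steps

def learn(N, L, alpha, gamma, beta):
    """Reset Q to zero, run N episodes from L, return the episode lengths."""
    Q[:] = 0
    return [episode(L, alpha, gamma, beta) for _ in range(N)]
```

Note that episode and learn use ordinary loops; the "no loops" restriction in the
earlier parts applies to the per-step functions, not to the episode loop itself.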
(g) To test your functions above, use learn to perform 50 episodes of Q learning
starting from the initial state in Figure 2 with α = 1, γ = 0.9 and β = 1. The
first line in your code for this question should set the random seed to 7.
Generate a graph of episode length vs episode number, where episode length is
the number of actions taken in an episode (i.e., to go from the initial state to the
goal state). Label the axes Episode Length and Episode, respectively. Title the
graph, Question 2(g): one run of Q learning. Use the function grid in pyplot to
place a grid over the graph. If everything is working properly, the graph should be
similar to (but different from) the graph in Figure 3, which was generated using
a different random seed.
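The plotting conventions asked for above can be sketched as follows. Placeholder
episode lengths stand in for a real learn run here, since only the labelling, title,
and grid calls are being illustrated:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; remove to display
import matplotlib.pyplot as plt

np.random.seed(7)                  # the question asks for random seed 7
# Placeholder data; in the real program these come from learn(...).
lengths = np.random.randint(8, 100, size=50)

plt.plot(np.arange(1, 51), lengths)
plt.xlabel("Episode")
plt.ylabel("Episode Length")
plt.title("Question 2(g): one run of Q learning")
plt.grid(True)                     # pyplot's grid() overlays a grid on the graph
plt.savefig("episode_lengths.png")
```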