# Python代写｜COMP3106—Introduction to Artificial Intelligence Assignment 3

## Submission through Brightspace:

You need to submit two files:
1. One “A3.py” file that contains the functions you implemented.
2. One “A3-answer.pdf” file that contains your answers/solutions to the questions in this assignment. This PDF file should be no more than 8 pages. Please use a standard page format (i.e. page size 8.5” × 11”, minimum 1/2”margins, minimum font size 1′,no condensed fonts or spacing).

## Note:

• Programming language: Python 3. You may use standard Python libraries and the NumPy package. You must implement the code yourself. Do not copy-and-paste code from other sources. Your implementation must follow the required specifications. Implementations which do not follow the specifications may receive a grade of zero. Please make sure your code is readable. Your implementation will be evaluated on a set of test cases for correctness.
• Please prepare your answers in the PDF file using computer tools.

## Problem

In this assignment, we consider MDPs and reinforcement learning in the world of a 4 × 4 grid. One example of this grid world is shown in Figure 1:
The agent’s state is described by its position in this grid world; a state s is described as [si, sj] where si is the row index and sj is the column index for the position Figure 1: An example of 4 × 4 grid world

of states. This grid world has two terminal states: a victory terminal state s+ and a failure terminal state s−. In the grid world example in Figure 1, the victory terminal state (in green color) is s+ = [0, 3] and the failure terminal state (in red color) is s− = [1, 3].
The immediate reward function for the agent in this environment is For the grid world example in Figure 1, we have R+ = 10 and R− = −10.
At each non-terminal state s, there are four possible deterministic actions, A={‘L’,
‘R’, ‘U’, ‘D’}, such that:
‘L’: move left—if sj > 0, the move left action will reduce sj by 1;
if sj = 0, the move left action will make no change in position.
‘R’: move right—if sj < 3, the move right action will increase sj by 1;
if sj = 3, the move left action will make no change in position.
‘U’: move up—if si > 0, the move up action will reduce si by 1;
if si = 0, the move up action will make no change in position.
‘D’: move down—if si < 3, the move down action will increase si by 1;
if si = 3, the move down action will make no change in position.
Let S(s, a) denotes the resulting state by applying an action a on state s. For example,
we have
S([1, 2], ‘L’) = [1, 1], S([1, 0], ‘L’) = [1, 0], S([2, 2], ‘R’) = [2, 3], S([2, 3], ‘R’) = [2, 3],
S([1, 2], ‘U’) = [0, 2], S([0, 2], ‘U’) = [0, 2], S([2, 2], ‘D’) = [3, 2], and S([3, 2], ‘D’) = [3, 2].

Let N (s) denote the set of neighbor states of s. Note any internal state s (i.e., 0 < si < 3
& 0 < sj < 3) has four neighbor states, such that
N (s) = {[si, sj − 1], [si, sj + 1], [si − 1, sj], [si + 1, sj]};
any corner state s (i.e., [0,0], [0,3], [3,0], or [3,3]) has only two neighbors, such that
N ([0, 0]) = {[0, 1], [1, 0]},
N ([0, 3]) = {[0, 2], [1, 3]},
N ([3, 0]) = {[2, 0], [3, 1]},
and N ([3, 3]) = {[2, 3], [3, 2]};
any non-corner border state s (i.e., [0, x], [3, x], [y, 0], [y, 3] for 0 < x, y < 3) has three neighbors, such that
N ([0, x]) = {[1, x], [0, x − 1], [0, x + 1]},
N ([3, x]) = {[2, x], [3, x − 1], [3, x + 1]},
N ([y, 0]) = {[y, 1], [y − 1, 0], [y + 1, 0]},
and N ([y, 3]) = {[y, 2], [y − 1, 3], [y + 1, 3]}.
Next we define the transition model P (or T) for the grid world. We consider stochastic transitions; i.e., when taking an action a in a state s, the transition model not only will have a large probability (1 − pn) to enter into the resulting state S(s, a), but also will have a small probability pn to enter into the resulting states for two other actions that can be triggered by the given action a. Let G(a) denote the actions that can be triggered by an action a; we specify G as follows:
G(‘L’)= {‘U’, ‘D’}, G(‘R’)= {‘U’, ‘D’}, G(‘U’)= {‘L’, ‘R’}, G(‘D’)= {‘L’, ‘R’}.
Let M(s, G(a)) denote the set of resulting states from s when applying the two actions triggered by action a. Then M(s, G(a)) is the union of S(s, a0) for all a0 ∈ G(a). Since each action a triggers two other actions, each M(s, G(a)) should have two states. Moreover,when s is a corner state, S(s, a) and M(s, G(a)) can have a common state s, when both action a and a triggered action make no change in position. Overall, we define the stochastic transition model as follows: where 0 ≤ pn ≤ 0.3 is the transition noise probability.

With the definitions above, we can provide values for s+, s−, R+, R− and pn to define a 4 × 4 grid world with its specific terminal states, reward function,and transition model.
In this assignment, you will need to implement MDP and reinforcement learning functions to find the corresponding utility values or policies for each specific 4 × 4 grid world.

Representation
We consider deterministic policy π. We represent a policy π as a nested list in Python.
For example, for the grid world environment above with terminal states s+ = [0, 3] and s− = [1, 3], one policy π can be written as:
π =[ [‘R’, ‘R’, ‘R’, ‘X’],
[‘R’, ‘U’, ‘U’, ‘X’],
[‘R’, ‘U’, ‘U’, ‘L’],
[‘U’, ‘U’, ‘U’, ‘U’] ]
where π[si][sj] provides the action for state s under policy π. Note since the policy mapping is for non-terminal states, we put a constant ‘X’ for the terminal states.

For the Q(s, a) function, we represent it as a three-dimensional matrix (i.e., a nested list in Python as well). We consider the four actions in the order of ‘L’, ‘R’, ‘U’, ‘D’. Let Qvalues denote this nested list of Q function values for the 4 × 4 grid world, then
Qvalues[si][sj]: the Q(s, a) value for state s and action a=‘L’.
Qvalues[si][sj]: the Q(s, a) value for state s and action a=‘R’.
Qvalues[si][sj]: the Q(s, a) value for state s and action a=‘U’.
Qvalues[si][sj]: the Q(s, a) value for state s and action a=‘D’.
For the utility function U(s), we represent it as a nested list in Python (a two-dimensional matrix). Let Uvalues denote the nested list of the utility values for for the 4 × 4 grid world,then Uvalues[si][sj] denotes the U(s) value for a non-terminal state s = [si, sj]. (We do not require U values for the terminal states and you can keep them as zeros.) E-mail: vipdue@outlook.com  微信号:vipnxx 