Question 1: Rook Jumping Maze
Starting at the circled square in the upper-left corner, find a path to the goal square marked “G”. From each numbered square, one may move that exact number of squares horizontally or vertically in a straight line. The actions are Up, Down, Left, and Right. One cannot jump off the grid. Moving does not wrap from one side of the grid to another.
For example, the starting state in the upper-left corner has two successor states: the square three down and the square three to the right. The solution to this maze requires 13 jumps: D, R, L, U, D, L, R, U, L, L, R, D, U
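The successor function described above can be sketched as follows. Since the exam's maze figure is not reproduced here, the grid below is a hypothetical stand-in (0 marks the goal square "G"); the logic is the same for any Rook Jumping Maze.

```python
# Hypothetical 5x5 Rook Jumping Maze; the real exam maze is not shown,
# so this grid is only an illustrative stand-in. 0 marks the goal "G".
GRID = [
    [3, 4, 1, 3, 1],
    [3, 3, 2, 2, 4],
    [3, 1, 2, 3, 3],
    [1, 4, 1, 4, 3],
    [4, 2, 2, 1, 0],
]
N = len(GRID)

def successors(state):
    """Return the states reachable from (row, col) in one jump (U, D, L, R)."""
    r, c = state
    d = GRID[r][c]
    if d == 0:          # goal square: no outgoing moves
        return []
    candidates = [(r - d, c), (r + d, c), (r, c - d), (r, c + d)]
    # Moves may not jump off the grid, and there is no wrap-around.
    return [(nr, nc) for nr, nc in candidates if 0 <= nr < N and 0 <= nc < N]
```

With this stand-in grid, the start `(0, 0)` holds a 3, so its only legal moves are three down and three right, mirroring the two-successor example in the text.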
1.a. (1 pt) What is the average branching factor for the maze above? _________________
1.b. (1 pt) Suppose one wanted to solve the problem of moving from the goal (G) to the start (the circled cell). Instead of a successor function, one needs a predecessor function. Give a sentence or two that describes how to compute all the predecessors of any given state s:
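One way to compute predecessors: a square t is a predecessor of s exactly when t lies in the same row or column as s and t's jump value equals its distance to s. A self-contained sketch, again using a hypothetical grid in place of the exam's figure:

```python
# Hypothetical stand-in grid (the exam's actual maze is not reproduced).
GRID = [
    [3, 4, 1, 3, 1],
    [3, 3, 2, 2, 4],
    [3, 1, 2, 3, 3],
    [1, 4, 1, 4, 3],
    [4, 2, 2, 1, 0],
]
N = len(GRID)

def predecessors(state):
    """All squares whose jump value equals their rook distance to state."""
    r, c = state
    preds = []
    for i in range(N):          # scan the same column
        if i != r and GRID[i][c] == abs(i - r):
            preds.append((i, c))
    for j in range(N):          # scan the same row
        if j != c and GRID[r][j] == abs(j - c):
            preds.append((r, j))
    return preds
```

Note that this scans 2(N - 1) squares per state, so unlike the successor function the number of predecessors can range from 0 up to 2(N - 1).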
1.c. (1 pt) What is the branching factor with this predecessor function for the example maze above? ________________
1.d. (1 pt) Is it better to run the search forward (from start state to goal state) or to run the search backward (from goal state to start state)? Why or why not?
Question 2: Consider the following grid world. The agent can move up, down, left, and right, and there is a cost of 1 for every action. The agent starts in the designated cell and must reach the cell marked “goal”, which is a terminal state with a reward of 100.
2.a. (1 pt) If the transition function is deterministic, one would use A* with a heuristic
h(state) = manhattan_distance(state, goal). What is the length of the optimal path from the start to the goal?
2.b. (1 pt) How many states must be visited? Hint: the maximum f-value of any state that can be part of the solution path is 18.
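The A* setup in 2.a can be sketched as below. The exam's grid world is not reproduced, so the wall layout and dimensions are assumptions passed in as parameters; with a unit step cost and the Manhattan-distance heuristic (which is admissible on a 4-connected grid), A* returns the optimal path length.

```python
import heapq

def manhattan(a, b):
    """Admissible heuristic for a 4-connected grid with unit step cost."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(start, goal, walls, width, height):
    """A* on a grid; returns the optimal path length, or None if unreachable.

    `walls`, `width`, and `height` are hypothetical stand-ins for the
    exam's (unshown) grid world.
    """
    frontier = [(manhattan(start, goal), 0, start)]   # (f, g, state)
    best_g = {start: 0}
    while frontier:
        f, g, state = heapq.heappop(frontier)
        if state == goal:
            return g
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (state[0] + dr, state[1] + dc)
            if not (0 <= nxt[0] < height and 0 <= nxt[1] < width):
                continue
            if nxt in walls or best_g.get(nxt, float("inf")) <= g + 1:
                continue
            best_g[nxt] = g + 1
            heapq.heappush(frontier, (g + 1 + manhattan(nxt, goal), g + 1, nxt))
    return None
```

On an empty grid the answer is just the Manhattan distance; walls force detours, which is why the maximum f-value hinted at in 2.b can exceed the straight-line figure.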
2.c. (2 pts) Suppose we didn’t have the transition function and used Q-learning instead. We refer to the fact that there is only a single state, the “goal”, that provides a reward as a “sparse” reward function. With an empty Q-table and with most states returning zero reward, it can take quite a while for the agent to find the goal randomly and to start propagating Q-values to neighboring states. If we had a “dense” reward function, then every state would provide a reward and the agent would learn much faster. A perfect dense reward function would be one in which each state gave a reward proportional to its true utility (i.e., imagine we had a transition function, ran value iteration until convergence, and then copied the utility values into the reward function).
Given that we cannot create a perfect dense reward function, perhaps we can compute in O(n) time (where n is the number of states) a dense reward function that approximates the true utility values for each state. Describe in a few sentences how to compute such a reward function.
Hint: think about how you design admissible heuristics for A* by relaxing the problem and using the actual cost of the relaxed problem as the heuristic cost of a state in the full problem. This is how one gets the Manhattan distance heuristic or the straight-line distance heuristic, for example.
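Following the hint, one possible approximation (not necessarily the intended answer) is to relax the problem by ignoring walls: each state's reward becomes the goal reward minus its Manhattan distance to the goal, which costs O(1) per state and O(n) overall. The grid dimensions and goal reward below are assumptions, since the exam's grid is not reproduced.

```python
def dense_reward(width, height, goal, goal_reward=100):
    """Approximate true utilities via a relaxed problem: ignore walls and
    charge 1 per step of Manhattan distance to the goal. O(n) over the
    n = width * height states. Dimensions and goal_reward are assumed."""
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    return {
        (r, c): goal_reward - manhattan((r, c), goal)
        for r in range(height) for c in range(width)
    }
```

Because the relaxation only removes constraints, the relaxed cost never overestimates the true cost, so this reward function is an optimistic approximation of each state's true utility, in the same spirit as an admissible A* heuristic.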