Create an RL agent for Minirace (level 1), 15 points
The code in minirace.py provides an environment to create an agent that can be
trained with reinforcement learning.
The following is a description of the environment dynamics:
• The square represents the car; it is 2 pixels wide. The car always appears in the
bottom row, and at each step of the simulation the track scrolls by one row below
the car.
• The agent can control the steering of the car, by moving it two pixels to the left
or right. The agent can also choose to do nothing, in which case the car drives
straight. The car cannot be moved outside the boundaries.
• The agent will receive a positive reward at each step where the front part of the
car is still on track.
• An episode is finished when the front of the car hits non-drivable terrain.
In a level 1 version of the game, the observed state consists of one number: dx. It is the relative position of the middle of the track right in front of the car. When the track turns left in front of the car, this value will be negative, and when the track turns right, dx is positive. As the track is six pixels wide, the car can drive either on the left, middle, or right of a piece of track.
For this task, you should initialise the simulation like this:
therace = Minirace(level=1)
When you run the simulation, step() returns dx for the state.
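To get a feel for the environment before writing any learning code, you can run a short random-action rollout. The snippet below is only a minimal sketch: the action encoding (0 = left, 1 = straight, 2 = right) and the assumption that step() returns the new dx together with a reward and an end-of-episode flag are guesses, so check minirace.py for the exact signatures of reset() and step().

import random
from minirace import Minirace

therace = Minirace(level=1)

dx = therace.reset()                      # start a new episode, observe the initial state
done = False
total_reward = 0
while not done:
    action = random.choice([0, 1, 2])     # assumed encoding: 0 = left, 1 = straight, 2 = right
    dx, reward, done = therace.step(action)
    total_reward += reward

print("episode reward:", total_reward)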
Steps
1. Manually create a policy (no RL) that successfully drives the car, selecting
actions based on the state information. The minirace.py code contains a function
my_policy() that you should modify for this task (a sketch is given after this list of steps).
2. How many different values for dx are possible in theory? If you were to create a tabular reinforcement learning agent, what size is your table for this problem?
3. Create a TD agent that learns to drive. If you decide to use ε-greedy action selection, set ε = 1 initially, and reduce it during your training to a minimum of 0.01. Keep your training going until you are either happy with the result or the performance does not improve.¹ (A tabular Q-learning sketch follows the list of steps.)
4. When you run your training, reset the environment after every episode. Store the
sum of rewards. After or during the training, plot the total sum of rewards per
episode. This plot (the Training Reward plot) indicates the extent to which
your agent is learning to improve its cumulative reward. It is your decision when
to stop training. It is not required to submit a perfectly performing agent, but
show how it learns.
5. Once you consider the training to be complete, run 50 test episodes using your
trained policy, but with ε = 0.0 for all 50 episodes. Again, reset the environment
at the beginning of each episode. Calculate the average over sum-of-rewards-per-
episode, and the standard deviation. These values indicate how your trained agent performs (a sketch of this test loop is also included after the list).

¹ This means: do not stop just because ε reached 0.01; you may want to stop earlier, or you may want to keep going, just do not reduce ε any further.
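A possible hand-crafted policy for step 1 is sketched below. It simply steers towards the middle of the track; the signature my_policy(dx) and the action encoding (0 = left, 1 = straight, 2 = right) are assumptions, so adapt them to whatever minirace.py actually uses.

def my_policy(dx):
    # dx < 0: the middle of the track is to the left of the car, so steer left.
    if dx < 0:
        return 0
    # dx > 0: the middle of the track is to the right, so steer right.
    if dx > 0:
        return 2
    # dx == 0: the car is centred on the track, so drive straight.
    return 1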
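For steps 3 and 4, one simple TD approach is tabular Q-learning with ε-greedy exploration. The sketch below stores Q-values in a dictionary keyed by dx (so the exact range of dx does not need to be known in advance), decays ε from 1 down to 0.01, records the sum of rewards per episode, and plots the Training Reward curve. The hyperparameters, the episode budget, and the assumed return values of reset() and step() are illustrative only, not part of the assignment code.

import random
from collections import defaultdict
import matplotlib.pyplot as plt
from minirace import Minirace

ACTIONS = [0, 1, 2]                 # assumed encoding: left, straight, right
ALPHA, GAMMA = 0.1, 0.95            # learning rate and discount factor
EPS_MIN, EPS_DECAY = 0.01, 0.995    # epsilon floor and per-episode decay

therace = Minirace(level=1)
Q = defaultdict(lambda: [0.0] * len(ACTIONS))   # Q[dx][action]
epsilon = 1.0
episode_rewards = []

for episode in range(2000):         # episode budget is arbitrary
    dx = therace.reset()
    done, total = False, 0
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[dx][a])
        next_dx, reward, done = therace.step(action)
        # Q-learning TD update towards the bootstrapped target
        target = reward if done else reward + GAMMA * max(Q[next_dx])
        Q[dx][action] += ALPHA * (target - Q[dx][action])
        dx = next_dx
        total += reward
    episode_rewards.append(total)
    epsilon = max(EPS_MIN, epsilon * EPS_DECAY)   # never reduce epsilon below 0.01

# Training Reward plot: sum of rewards per episode
plt.plot(episode_rewards)
plt.xlabel("Episode")
plt.ylabel("Sum of rewards")
plt.title("Training Reward")
plt.show()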
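For step 5, the trained table can then be evaluated greedily (ε = 0.0) over 50 fresh episodes. The fragment below continues the training sketch above (it reuses therace, Q, and ACTIONS) and reports the average and standard deviation of the per-episode reward sums.

from statistics import mean, stdev

test_rewards = []
for _ in range(50):
    dx = therace.reset()
    done, total = False, 0
    while not done:
        action = max(ACTIONS, key=lambda a: Q[dx][a])   # purely greedy, epsilon = 0
        dx, reward, done = therace.step(action)
        total += reward
    test_rewards.append(total)

print("Test average:", mean(test_rewards))
print("Test standard deviation:", stdev(test_rewards))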
What to submit:
• Submit the Python code of your solutions.
• For your report, describe the solution, mention the Test-Average and
Test-Standard-Deviation, and include the Training Reward plot described above. After how
many episodes did you decide to stop training, and how long did it take?
