PCC. AN APPLICATION TO STUDY RL APPLIED TO CAR RACING.

INTRODUCTION


PCC simulates a car race on a flat track. It was created to test reinforcement learning algorithms in a stochastic, discrete and partially observable environment. It is a multiagent system: two agents each try to control a different car. The other cars on the road are only obstacles, since they "don't think"; they just keep running at a single constant speed, changing rail at random.

The objective of the simulation is to let the agents discover by themselves the knowledge needed to drive properly, based on the bumps they suffer and the time they waste during the race. They must learn to avoid obstacles, to avoid each other and to complete a lap in a reasonable time.

Statistical results can be viewed as 2D plots, which the application generates using gnuplot.

It runs on X11 and was written in C++ using KDevelop and the Qt libraries.

PCC is far simpler than RARS (an excellent robot-racing simulator that inspired the designer) in many respects. But I thought that with PCC I wouldn't have to wait a long time to see a car learn, as might happen with RARS. With PCC nobody has to waste time watching such a slow competition: one can hide the cars (everything) and look at them again after many interactions, spending only a few seconds. It is also possible to save policies to hard disk. Anyway, I would like to join RARS competitions once I consider I've worked enough with PCC.

The first version includes the SARSA algorithm using a lookup table. In the file params.cpp we can change the parameters of the simulation, and in the file agente.cpp it's possible to change the learning method. That's what I've been doing since I finished building PCC.

I'm also working on the design of a second, more ambitious version of PCC, which might be used to test real competitive co-evolution with several RL schemes. I'm thinking of:

  • Including more RL agents, each using an independent learning algorithm.
  • Changing the environment a little: permitting overlapping, enlarging the vision area, increasing the number of speeds and rails, and designing a "less discrete" environment.
  • Using the standard scheme for programming RL applications proposed by Rich Sutton and Juan Carlos Santamaria.
  • Distributing the graphic interface separately from the simulation code, so that the simulation can be easily used (included or called) by any other application (such as a genetic algorithm).

LEARNING SYSTEM DESCRIPTION

Environment

The environment is a straight, flat track with 6 rails in good condition and a predetermined number of cars (obstacles) running on it. Each of the two agents, while being trained, drives an additional vehicle on this track.

Cars move in this (discrete) space vertically (advancing line by line) and horizontally (moving from one rail to another). An agent can increase or decrease its speed, which is given in lines per time-step and can be 0, 1 or 2.

Obstacles run at a constant speed (1 line per time-step), each trying to keep the same separation from the previous and the next obstacle.

When an agent bumps into another car, the environment sets its speed to 0, so the agent has to accelerate again as soon as possible to keep running.

Cars cannot overlap; that is, the height-rail space occupied by a car cannot be touched (partially or totally) by another. All cars (including the ones driven by agents) are two rails wide, which allows only 5 positions on the x-axis (rail). The word height will be used as an integer measure of the distance from the starting point of the track, and we'll assume that every car is 1 height unit long. This means that two cars can touch only when both are at the same height on the track.
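The geometry described above can be sketched in a few lines of C++. This is only an illustration, not PCC's actual code: the names Car, overlaps, kRails, kCarWidth and kRailPositions are hypothetical, and a car's x position is assumed to be the index of the leftmost rail it occupies.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical constants taken from the description above.
constexpr int kRails = 6;                               // rails on the track
constexpr int kCarWidth = 2;                            // every car is two rails wide
constexpr int kRailPositions = kRails - kCarWidth + 1;  // 5 possible x positions

// Assumed representation: a car is identified by the leftmost rail it
// occupies and by its height (distance from the start of the track).
struct Car {
    int rail;    // 0 .. kRailPositions - 1
    int height;  // in lines
};

// Cars are 1 height unit long, so two cars can touch only at equal height,
// and only if their two-rail spans share at least one rail.
bool overlaps(const Car& a, const Car& b) {
    assert(a.rail >= 0 && a.rail < kRailPositions);
    assert(b.rail >= 0 && b.rail < kRailPositions);
    return a.height == b.height && std::abs(a.rail - b.rail) < kCarWidth;
}
```

With this representation, two cars at the same height touch when their rail positions differ by 0 or 1, and fit side by side when they differ by 2 or more.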

The height value following the final point of the track is the beginning of the track (it is a closed track). Each agent runs many laps on this track in order to learn.
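The closed track amounts to modular arithmetic on the height. A minimal sketch, assuming heights are numbered from 0 to trackLength - 1 (the function names and this numbering are assumptions, not taken from PCC's sources):

```cpp
#include <cassert>

// Advance a car by `speed` lines on a closed track of `trackLength` lines.
// The height that follows the last line is the first line again.
int advance(int height, int speed, int trackLength) {
    assert(trackLength > 0);
    assert(height >= 0 && height < trackLength);
    assert(speed >= 0);
    return (height + speed) % trackLength;
}

// True when the same move crosses the end of the track, i.e. a lap is completed.
bool completesLap(int height, int speed, int trackLength) {
    return height + speed >= trackLength;
}
```

For example, on a 100-line track a car at height 98 moving at speed 2 ends up at height 0 and has completed a lap.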

Actions and perceptions

Actions are instructions for the car to change rail (turn left / right), to stay in its current rail, to accelerate or to decelerate: 5 actions in total.
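One possible way to represent the five actions is sketched below. The enum and function names are hypothetical. The sketch also assumes that an action changes the rail or the speed by one unit, and that a request which would leave the track or the 0..2 speed range is simply ignored; the text above does not specify either point.

```cpp
#include <algorithm>

// The five actions listed above (names are illustrative).
enum class Action { TurnLeft, TurnRight, Stay, Accelerate, Decelerate };

constexpr int kRailPositions = 5;  // x positions available to a two-rail-wide car
constexpr int kMaxSpeed = 2;       // speeds are 0, 1 or 2 lines per time-step

struct CarControl {
    int rail;   // 0 .. kRailPositions - 1
    int speed;  // 0 .. kMaxSpeed
};

// Apply one action. Assumption: each action moves rail or speed by one unit,
// and the result is clamped to the valid range.
CarControl applyAction(CarControl c, Action a) {
    switch (a) {
        case Action::TurnLeft:   c.rail = std::max(c.rail - 1, 0); break;
        case Action::TurnRight:  c.rail = std::min(c.rail + 1, kRailPositions - 1); break;
        case Action::Accelerate: c.speed = std::min(c.speed + 1, kMaxSpeed); break;
        case Action::Decelerate: c.speed = std::max(c.speed - 1, 0); break;
        case Action::Stay:       break;
    }
    return c;
}
```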

Perceptions are limited to a frontal vision area that covers only the two lines ahead, the agent's own line and the line behind it: 4 lines in total. This area may contain the nearest obstacle, the car driven by the other agent, both of them or neither.

This picture shows an example of what could be seen in the agent's vision area.

Learning

The agent's state is built from its position, its speed and the positions and speeds of the cars in its vision area, all taken at the current time-step. This configuration is translated into an integer value, which is sent to the agent at each time-step. Usually the agent is punished with a negative reward at each time-step, so that it tries to finish its work quickly. It can be punished heavily when it bumps, so that it tries to drive properly.
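The exact encoding used by PCC is not described here. A common way to turn such a configuration into a single integer is a mixed-radix code, sketched below under that assumption. The Feature struct, the function names and the reward values are all hypothetical, as is the choice to add the bump penalty on top of the per-step penalty.

```cpp
#include <cassert>
#include <vector>

// One discrete component of the state, e.g. the agent's rail (5 values)
// or its speed (3 values).
struct Feature {
    int value;        // 0 .. cardinality - 1
    int cardinality;  // number of distinct values this component can take
};

// Mixed-radix encoding: every distinct combination of feature values
// maps to a distinct integer, usable as a row index in a lookup table.
int encodeState(const std::vector<Feature>& features) {
    int index = 0;
    for (const Feature& f : features) {
        assert(f.value >= 0 && f.value < f.cardinality);
        index = index * f.cardinality + f.value;
    }
    return index;
}

// Placeholder reward values, not taken from PCC.
constexpr double kStepPenalty = -1.0;   // paid at every time-step
constexpr double kBumpPenalty = -10.0;  // extra penalty when the car bumps

double stepReward(bool bumped) {
    return bumped ? kStepPenalty + kBumpPenalty : kStepPenalty;
}
```

For instance, an agent at rail position 4 (of 5) with speed 2 (of 3) and nothing else in view would be encoded as 4 * 3 + 2 = 14.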

With these signals sent by the environment, agents apply SARSA(lambda) (or another RL method) in order to converge to a satisfactory behavior. Agents run many laps to be trained; afterwards, one can study the result of the process by plotting variables such as time-steps per lap and bumps per lap.
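For readers unfamiliar with the method, here is a minimal sketch of tabular SARSA(lambda) with accumulating eligibility traces, following the textbook formulation. It is not the code in agente.cpp; the class name, its interface and the choice of accumulating traces are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tabular SARSA(lambda) over integer states and actions.
class SarsaLambda {
public:
    SarsaLambda(int states, int actions, double alpha, double gamma, double lambda)
        : actions_(actions), alpha_(alpha), gamma_(gamma), lambda_(lambda),
          q_(static_cast<std::size_t>(states) * actions, 0.0),
          e_(q_.size(), 0.0) {}

    double q(int s, int a) const { return q_[index(s, a)]; }

    // One on-policy update for the transition (s, a) -> reward r -> (s2, a2),
    // where a2 is the action the agent will actually take in s2.
    void update(int s, int a, double r, int s2, int a2) {
        const double delta = r + gamma_ * q(s2, a2) - q(s, a);  // TD error
        e_[index(s, a)] += 1.0;                                 // accumulating trace
        for (std::size_t i = 0; i < q_.size(); ++i) {
            q_[i] += alpha_ * delta * e_[i];  // credit recently visited pairs
            e_[i] *= gamma_ * lambda_;        // decay every trace
        }
    }

    // Clear the traces, e.g. when a new training run starts.
    void resetTraces() { std::fill(e_.begin(), e_.end(), 0.0); }

private:
    std::size_t index(int s, int a) const {
        return static_cast<std::size_t>(s) * actions_ + a;
    }

    int actions_;
    double alpha_, gamma_, lambda_;
    std::vector<double> q_;  // lookup table of action values
    std::vector<double> e_;  // eligibility traces, same layout as q_
};
```

The update touches every entry of the table at each step, which is affordable for a small discrete state space; larger tables usually keep a list of the non-zero traces instead.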

Usually, a decision is made using an e-greedy (epsilon-greedy) method, which allows exploration while exploiting the current policy.
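An e-greedy choice can be sketched as follows: with probability epsilon the agent picks a random action, otherwise the action with the highest current value. The function name and signature are illustrative only, and qValues is assumed to hold the lookup-table row for the current state (one entry per action, never empty).

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Returns the index of the chosen action.
// qValues must contain one value per action and must not be empty.
template <class Rng>
int epsilonGreedy(const std::vector<double>& qValues, double epsilon, Rng& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        // Explore: any action, uniformly at random.
        std::uniform_int_distribution<int> pick(0, static_cast<int>(qValues.size()) - 1);
        return pick(rng);
    }
    // Exploit: the first action with the highest value.
    return static_cast<int>(
        std::max_element(qValues.begin(), qValues.end()) - qValues.begin());
}
```

A typical call would pass a generator such as std::mt19937 and a small epsilon (for example 0.1), so that most decisions follow the current policy while some still explore.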

RUNNING PCC

Once started, pcc shows a clean track, ready for the race. Select start race in the race menu. You could watch the initially naive cars for hours until they learn something interesting, or you could select show/hide cars in the race menu to make things much quicker. With this option, cars can be hidden or shown at any time during the race.

Several values are always displayed in the central part of the window, indicating what is happening in the race. The ones labeled *Optimal laps* give the number of laps in which there were no bumps and the number of time-steps was the minimum for a specific agent.

In the plot menu, all options call Gnuplot to show the current results of the race for each agent. If nothing happens when you select these options, perhaps Gnuplot is not properly installed on your system.
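How PCC hands its data to Gnuplot is not described here. As an illustration only, per-lap statistics could be written in the whitespace-separated column format that Gnuplot reads by default, as in this sketch (the LapStats struct, the function name and the column layout are assumptions):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-lap record for one agent.
struct LapStats {
    int timeSteps;  // time-steps needed to complete the lap
    int bumps;      // bumps suffered during the lap
};

// One line per lap: lap number, time-steps, bumps.
// Lines starting with '#' are treated as comments by Gnuplot.
std::string toGnuplotData(const std::vector<LapStats>& laps) {
    std::ostringstream out;
    out << "# lap time_steps bumps\n";
    for (std::size_t i = 0; i < laps.size(); ++i) {
        out << (i + 1) << ' ' << laps[i].timeSteps << ' ' << laps[i].bumps << '\n';
    }
    return out.str();
}
```

If this text were saved to a file named, say, laps.dat, the Gnuplot command plot "laps.dat" using 1:2 with lines would draw time-steps per lap, and using 1:3 would draw bumps per lap.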

In the first option (laps), time is given in obstacle-laps instead of time-steps in order to keep numbers short; but since obstacles run at a constant speed of 1 line per time-step, you can multiply obstacle-laps by LARGOPISTA (TRACK SIZE) to obtain the real elapsed time in time-steps.
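The conversion is a single multiplication. In this sketch the function name is hypothetical, and trackSize stands for whatever value LARGOPISTA has in your build:

```cpp
// Obstacles advance 1 line per time-step, so one obstacle-lap lasts exactly
// trackSize time-steps (trackSize plays the role of LARGOPISTA).
double obstacleLapsToTimeSteps(double obstacleLaps, int trackSize) {
    return obstacleLaps * trackSize;
}
```

For example, with a track size of 100 lines, a lap that took 1.5 obstacle-laps lasted 150 time-steps.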

Policies can be saved to or loaded from hard disk for either of the two agents at any moment, using the buttons located in the central part of the window.



PCC Version 1.0
Copyright (C) 2001 Eduardo Daza Castillo
www.geocities.com/eduardo_daza