Research Focus:
- Investigate the neural network complexity required for optimal Flappy Bird performance.
- Define "perfect performance" as surpassing a handcrafted evaluation agent scoring 900/1000 over 10 runs.
Contributions:
- Development of a high-performing handcrafted agent.
- Analysis of network complexity versus performance.
Methodology Rationale:
- Opt for non-pixel-based learning to explore diverse model types beyond CNNs.
There are lots of interesting ways to extend this project, but I have to start focusing on communicating my findings instead of pursuing more discovery.
- Learns to play Flappy Bird from pixels.
- Uses the DQN from Mnih et al.
- An implementation of that approach, inspired by the work of Chen.
- Implements the Dueling DQN approach on the simplified observation space.
- Implements the PPO approach on the simplified observation space.
- Implements the Dueling DQN approach on two state representations: (1) the simplified observation space and (2) LIDAR measurements of the environment.
- Uses the simplified observation space and a Q-table to implement a perfect agent.
- Further simplifies the simplified observation space and uses a Q-table to implement a perfect agent. Inspired by the work of SarvagaVaish.
- Inspired by yenchenlin. Provides an even more mature implementation.
A winning outcome for this project is a perfect-scoring bot that uses significantly less memory than a Q-table with an entry for every state (see the back-of-envelope comparison below).
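To make that memory comparison concrete, here is a rough sketch. The bin count is an arbitrary assumption on my part; the Q-table prior work avoids this blow-up precisely by simplifying the state far below 12 features.

```python
# Back-of-envelope memory comparison. The 10 bins per feature is an arbitrary
# assumption; the Q-table prior work simplifies the state much further than
# the full 12-feature observation.

def q_table_entries(num_features=12, bins_per_feature=10, num_actions=2):
    """Entries in a dense Q-table if every feature is bucketed into bins."""
    return (bins_per_feature ** num_features) * num_actions

def mlp_parameters(layers=(12, 64, 64, 2)):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(i * o + o for i, o in zip(layers, layers[1:]))

print(f"{q_table_entries():,} Q-table entries")    # 2,000,000,000,000
print(f"{mlp_parameters():,} network parameters")  # 5,122
# At 4 bytes per value: roughly 8 TB for the table versus ~20 KB for the network.
```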
- Mobile game from 2013 that went viral
- Game mechanics: tap to flap upward, gravity pulls the bird down, and the goal is to pass through gaps between pipes without colliding.
- Simple game mechanics make it an ideal candidate for reinforcement learning tasks
- An RL environment exists: flappy-bird-gymnasium.
- Simplified observation space: 12 features, each normalized to [0, 1] (see the sketch below).
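A minimal sketch of loading the environment. I am assuming the "FlappyBird-v0" environment id and the `use_lidar` flag exposed by recent versions of flappy-bird-gymnasium; check your installed version if either differs.

```python
import flappy_bird_gymnasium  # noqa: F401 -- importing registers the env with Gymnasium
import gymnasium

# Assumption: the package registers "FlappyBird-v0" and accepts use_lidar.
env = gymnasium.make("FlappyBird-v0", use_lidar=False)
obs, info = env.reset(seed=0)
print(env.observation_space)  # expected: a 12-dimensional Box in [0, 1]
print(env.action_space)       # expected: Discrete(2) -- 0 = idle, 1 = flap

obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```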
Reinforcement learning techniques are roughly divided into two categories (see the sketch below):
- Value-based methods
  - Implicitly assume the optimal action is not a mixed strategy.
  - More sample efficient, because the value function is computed per action.
- Policy-based methods
  - Assume that there is a mixed strategy for each state that is optimal.
  - More stable, because they naturally explore more in each state instead of relying on hard-coded exploration strategies.
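A toy sketch of how the two families pick actions; this is illustrative only and not taken from any of the implementations above.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_based_action(q_values: np.ndarray, epsilon: float = 0.05) -> int:
    """Value-based control: act greedily on Q(s, a), with hard-coded
    epsilon-greedy exploration bolted on top."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def policy_based_action(action_probs: np.ndarray) -> int:
    """Policy-based control: the policy itself is a distribution over actions
    (a mixed strategy), so exploration comes from sampling it."""
    return int(rng.choice(len(action_probs), p=action_probs))

# Flappy Bird has two actions: 0 = idle, 1 = flap.
print(value_based_action(np.array([0.2, 0.7])))   # almost always 1
print(policy_based_action(np.array([0.6, 0.4])))  # 0 about 60% of the time
```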
- Best achievable average score with a score limit of 1000.
- Sample inefficiency does not count against the agent.
- How close does it get to the handcrafted agent?
- I made a handcrafted evaluation function.
- It averages approximately 900 over 10 evaluation runs with a score limit of 1000.
- I can measure the success of the reinforcement learning agents by how close they come to this handcrafted baseline (see the evaluation sketch after this list).
- How well do the other benchmarks perform?
- How well does my agent perform?
- Parameters that impacted the performance of my agent
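A sketch of an evaluation loop matching the protocol above (N runs, score capped at 1000). The `reward >= 1` check for counting a passed pipe is an assumption about the environment's reward scheme, not something verified here.

```python
import numpy as np

def evaluate(agent_act, env, num_runs=10, score_limit=1000, seed=0):
    """Run num_runs episodes, stop an episode once its score hits score_limit,
    and report the mean and standard deviation of the scores.
    agent_act(obs) -> action can be the handcrafted agent or a learned one."""
    scores = []
    for run in range(num_runs):
        obs, info = env.reset(seed=seed + run)
        score, done = 0, False
        while not done and score < score_limit:
            obs, reward, terminated, truncated, info = env.step(agent_act(obs))
            done = terminated or truncated
            if reward >= 1:  # assumption: the env rewards roughly +1 per pipe passed
                score += 1
        scores.append(score)
    return float(np.mean(scores)), float(np.std(scores))

# Example: a do-nothing baseline (always idle) scores near zero.
# mean, std = evaluate(lambda obs: 0, env)
```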
- Rough idea from Mnih et al. (2013), but incorporates improvements:
- Uses a network architecture of [12, 64, 64, 2] (see the sketch after this list).
- Double DQN
- Dueling DQN
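A sketch of what a [12, 64, 64, 2] Q-network could look like in PyTorch, with an optional dueling head. This is illustrative and not necessarily the exact architecture or code used in training.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """[12, 64, 64, 2]: 12 input features, two hidden layers of 64 units,
    and one Q-value per action (idle, flap)."""
    def __init__(self, obs_dim=12, hidden=64, num_actions=2, dueling=False):
        super().__init__()
        self.dueling = dueling
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        if dueling:
            # Dueling DQN: separate value and advantage streams recombined into Q.
            self.value = nn.Linear(hidden, 1)
            self.advantage = nn.Linear(hidden, num_actions)
        else:
            self.q_head = nn.Linear(hidden, num_actions)

    def forward(self, obs):
        h = self.trunk(obs)
        if self.dueling:
            v, a = self.value(h), self.advantage(h)
            return v + a - a.mean(dim=-1, keepdim=True)
        return self.q_head(h)

# q = QNetwork(dueling=True)
# print(q(torch.rand(1, 12)))  # Q-values for [idle, flap]
```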
Name | Mean Score (1000 runs) | Std Score (1000 runs) |
---|---|---|
Handcrafted Agent | <mean> | <std> |
dqn_flappybird_v1_1300000_steps | 20.82 | 15.83 |
Note: Change the v1 to something more descriptive; right now the name only makes sense to me.
- Original training run was 30M learning steps.
- Catastrophic forgetting occurred around 1.2M learning steps, with erratic jumps in score after that.
- Tweaked parameters, partly by random chance, to reach the 900 average score.
<Include tensorboard chart here>
The Cartesian product of the following options (enumerated in the sketch after this list):
- Double DQN
- Dueling DQN
- {Prioritized Experience Replay, Hindsight Experience Replay}
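A sketch of enumerating that Cartesian product; the config keys are placeholders, not the training script's actual arguments.

```python
from itertools import product

# Placeholder config keys -- not the training script's actual arguments.
double_dqn_options = [False, True]
dueling_options = [False, True]
replay_options = ["prioritized", "hindsight"]

configs = [
    {"double_dqn": dd, "dueling": du, "replay": rp}
    for dd, du, rp in product(double_dqn_options, dueling_options, replay_options)
]
print(len(configs))  # 2 * 2 * 2 = 8 training runs
```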
- Learn directly from pixels
- Experiment with transformer models
- Experiment with partially observable environments
- Try Hindsight Experience Replay
- Try policy-based methods
- Summary of findings
- Comparative analysis
- Future work
- Research implications
One of my biggest mistakes was starting by trying to reimplement my own Q-table approach. This wasted a lot of time and effort on problems that prior work had already solved. I thought that looking at their code would be "cheating," and I did not see that I was not adding anything new to the field by doing this. It would have been better to review all prior work and try to replicate it before building my own Q-table approach.