DataLab Cup 5: Deep Reinforcement Learning

Shan-Hung Wu & DataLab
Fall 2017

Task: Flappy Bird

Given a screenshot of Flappy Bird, your task is to select the action that maximizes the total reward.


We will use Flappy Bird in the PyGame Learning Environment as our training environment; please install it before going through this notebook. The environment provides some useful functions that make it easy to get the screen at each step.

To make things easier, we replace the background with black and make all pipes and birds identical. Please unzip the file on Kaggle and overwrite the folder PyGame-Learning-Environment\ple\games\flappybird\assets.

In [1]:
# import package needed
%matplotlib inline
import matplotlib.pyplot as plt
import os
os.environ["SDL_VIDEODRIVER"] = "dummy" # make window not appear
import tensorflow as tf
import numpy as np
import skimage.color
import skimage.transform
from ple.games.flappybird import FlappyBird
from ple import PLE
game = FlappyBird()
env = PLE(game, fps=30, display_screen=False) # environment interface to game
env.act(0) # dummy input to get screen correct

# get rgb screen
screen = env.getScreenRGB()
print(screen.shape)

# get grayscale screen
screen = env.getScreenGrayscale()
print(screen.shape)
plt.imshow(screen, cmap='gray')
(288, 512, 3)
(288, 512)

Input/Output format

We will give you a list of previous screens, including the current screen, where each frame is 288 x 512 (grayscale). The output of your agent is the action to select (0 or 1).
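Since the agent receives a growing list of screens, a common trick (an assumption here, not mandated by the task) is to stack the most recent k frames into one array so the network can see motion. A minimal numpy sketch, with the frame count and function name chosen for illustration:

```python
import numpy as np

FRAME_HISTORY = 4  # hypothetical choice: stack the last 4 frames

def stack_recent(input_screens, k=FRAME_HISTORY):
    """Stack the k most recent preprocessed screens into one (H, W, k) array.

    If fewer than k screens exist yet, the earliest screen is repeated.
    """
    frames = list(input_screens[-k:])
    while len(frames) < k:
        frames = [frames[0]] + frames  # pad the front with the first frame
    return np.stack(frames, axis=-1)

# a fake episode with only two 80x80 grayscale frames so far
screens = [np.zeros((80, 80)) for _ in range(2)]
state = stack_recent(screens)
print(state.shape)  # (80, 80, 4)
```

The stacked array can then be fed to a convolutional network as a single multi-channel input.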


Your agent needs to implement the following functions:

class YourAgent:
    def select_action(self, input_screens, sess):
        """
        input_screens: list of frames preprocessed by the preprocess function
        sess: TensorFlow Session
        returns action: int (0 or 1)
        """

    def preprocess(self, screen):
        """
        Preprocess a screen; the result is what select_action receives.

        screen: the screen to preprocess
        """
In [2]:
# E.g.
class Agent:
    def select_action(self, input_screens, sess):
        # epsilon-greedy
        if np.random.rand() < self.exploring_rate:
            action = np.random.choice(num_action)  # select a random action
            # stack the four most recent screens into shape (80, 80, 4)
            screens = np.array(input_screens[-4:]).transpose([1, 2, 0])
            feed_dict = {
                self.input_screen: screens[None, :],
            action =, feed_dict=feed_dict)[0]  # select the action with the highest Q-value
        return action

    def preprocess(self, screen):
        screen = skimage.transform.resize(screen, [80, 80])
        return screen


To achieve the task, you can use a DQN, a Policy Network, or a combination of both to train your agent. Please refer to lab16-2_DQN & Policy Network.
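At the heart of DQN training is the one-step TD target r + γ·max_a' Q(s', a'), with the bootstrap term zeroed at terminal states. A minimal numpy sketch of computing these targets for a batch, where the discount factor GAMMA and the function name are illustrative assumptions:

```python
import numpy as np

GAMMA = 0.99  # hypothetical discount factor

def dqn_targets(rewards, q_next, done):
    """One-step TD targets: r + gamma * max_a' Q(s', a'), no bootstrap at terminal states.

    rewards: (B,) rewards, q_next: (B, num_actions) target-network Q-values for s',
    done: (B,) 1.0 where the episode ended.
    """
    return rewards + GAMMA * q_next.max(axis=1) * (1.0 - done)

rewards = np.array([1.0, -5.0])            # passed a pipe / died
q_next = np.array([[0.5, 2.0], [0.1, 0.2]])
done = np.array([0.0, 1.0])
targets = dqn_targets(rewards, q_next, done)
print(targets)  # [2.98, -5.0]
```

The network is then trained to regress Q(s, a) for the taken action toward these targets.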


[Figure: Policy Network]
We will use the same scenes (fixed random seeds) to evaluate the performance of your agent.

Evaluate sample code

The following sample code shows how we will evaluate your agent.

In [ ]:
def evaluate_step(agent, seed, sess):
    game = FlappyBird()
    env = PLE(game, fps=30, display_screen=False, rng=np.random.RandomState(seed))
    env.act(0) # dummy input
    # grayscale input screens for this episode
    input_screens = [agent.preprocess(env.getScreenGrayscale())]
    t = 0
    while not env.game_over():
        # feed the previous screens, select an action
        action = agent.select_action(input_screens, sess)
        # execute the action and get the reward
        reward = env.act(env.getActionSet()[action])  # reward = +1 when passing a pipe, -5 when dying
        if reward > 0:
            t += reward # accumulate score
        # observe the result
        screen_plum = env.getScreenGrayscale()  # get next screen
        # append the grayscale screen for this episode
        input_screens.append(agent.preprocess(screen_plum))
        if t >= 1000: # maximum score to prevent running forever
            break
    return t
def evaluate(agent, sess):
    scores = []
    for seed in [...some hidden seed...]:
        score = evaluate_step(agent, seed, sess)
        scores.append(score)
    return scores

The above code with the hidden seeds is compiled into evaluate.pyc; to use it, run the following code.

In [ ]:
from evaluate import evaluate
agent = YourAgent() # init your agent, load checkpoint...
scores = evaluate(agent, sess) # evaluate


Submit the scores.csv generated by the following code to DataLabCup: Deep Reinforcement Learning.

In [ ]:
import pandas as pd
df = pd.DataFrame({
    'scores': scores
df.to_csv('scores.csv', index=False)

Here are some papers that may be useful for this task.

Training: Dueling Network Architectures for Deep Reinforcement Learning

The paper proposes a model architecture that learns a state value for each state, which serves as a baseline for the action values. This speeds up training.
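The dueling aggregation combines the state value V(s) and the action advantages A(s, a) as Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)); subtracting the mean makes the decomposition identifiable. A minimal numpy sketch of this aggregation layer (the function name is illustrative):

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine V(s) and A(s, a) into Q(s, a).

    value: (B,) state values, advantage: (B, num_actions) advantages.
    Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
    """
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)

value = np.array([1.0])             # V(s) for a batch of one state
advantage = np.array([[2.0, 0.0]])  # A(s, a) for two actions
print(dueling_q(value, advantage))  # mean A = 1.0, so [[1+1, 1-1]] = [[2.0, 0.0]]
```

In a network, value and advantage would come from two separate heads sharing the convolutional trunk.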

Training: Asynchronous Methods for Deep Reinforcement Learning

The paper runs asynchronous agents in different environments, which collect diverse experiences and therefore stabilize training.

Training: Prioritized Experience Replay

The paper assigns priorities to the experiences in the replay buffer: the higher the loss, the higher the priority, which gives rarely seen scenes more chances to be trained on.
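In the proportional variant, the sampling probability of each transition is proportional to |TD error|^α. A minimal numpy sketch of such sampling (function name and the α value are illustrative assumptions; a real implementation would use a sum-tree for efficiency):

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, rng=None):
    """Sample replay-buffer indices with probability proportional to |TD error|^alpha."""
    rng = np.random.default_rng(0) if rng is None else rng
    priority = (np.abs(td_errors) + 1e-6) ** alpha  # small epsilon so no transition has zero chance
    prob = priority / priority.sum()
    return rng.choice(len(td_errors), size=batch_size, p=prob)

td_errors = np.array([0.1, 5.0, 0.2, 3.0])
idx = sample_indices(td_errors, batch_size=8)
print(idx)  # indices 1 and 3 (high-error transitions) dominate
```

The paper also corrects the resulting sampling bias with importance-sampling weights when computing the loss.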

Training: Deep Reinforcement Learning with Double Q-learning

The paper shows that the original Q-learning overestimates Q-values and proposes a mitigation, which stabilizes training.
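The mitigation decouples action selection from action evaluation: the online network picks the argmax action at s', but the target network supplies its value. A minimal numpy sketch of the double-Q target (GAMMA and the function name are illustrative assumptions):

```python
import numpy as np

def double_q_targets(rewards, q_online_next, q_target_next, done, gamma=0.99):
    """Double-DQN targets: r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    a_star = q_online_next.argmax(axis=1)                   # select with the online net
    bootstrap = q_target_next[np.arange(len(a_star)), a_star]  # evaluate with the target net
    return rewards + gamma * bootstrap * (1.0 - done)

rewards = np.array([1.0])
q_online_next = np.array([[0.2, 0.9]])  # online net picks action 1
q_target_next = np.array([[0.5, 0.4]])  # target net values action 1 at 0.4
done = np.array([0.0])
print(double_q_targets(rewards, q_online_next, q_target_next, done))  # [1 + 0.99*0.4] = [1.396]
```

Compared with the plain DQN target, using max over q_target_next alone would have bootstrapped with 0.5 here, illustrating the overestimation the paper addresses.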

Training: Playing FPS Games with Deep Reinforcement Learning

The paper adds an auxiliary layer to the model and claims that it speeds up training.