Q-learning agent in Python

In this tutorial I create a Q-learning agent to solve the CartPole problem from OpenAI Gym. Q-learning is a form of active reinforcement learning: the agent does not need a map of the whole environment, and it learns an action-utility representation from temporal differences (TD).

Q-learning is an off-policy algorithm: its update uses the best Q-value (the quality of an action) in the next state and does not need to pay attention to the policy the agent is actually following. This makes Q-learning more flexible than SARSA, because a Q-learning agent can learn how to behave regardless of the policy that guides it.
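
The difference between the two algorithms can be sketched by comparing the targets they bootstrap from. This is a minimal illustration, not code from the agent below; the function names and the toy Q-table values are assumptions made up for the example.

```python
def q_learning_target(q_table, next_state, reward, gamma):
    # Off-policy: bootstrap from the best action in the next state,
    # regardless of which action the agent will actually take there.
    return reward + gamma * max(q_table[next_state])

def sarsa_target(q_table, next_state, next_action, reward, gamma):
    # On-policy: bootstrap from the action the agent really chose next.
    return reward + gamma * q_table[next_state][next_action]

# Toy table with 2 states and 2 actions (values invented for illustration)
q_table = [[0.0, 0.0],
           [1.0, 5.0]]

# The agent explores and picks action 0 in state 1, although action 1 is better
q_target = q_learning_target(q_table, next_state=1, reward=1.0, gamma=0.9)
s_target = sarsa_target(q_table, next_state=1, next_action=0, reward=1.0, gamma=0.9)
print('Q-learning target:', q_target)  # uses the value 5.0
print('SARSA target:', s_target)       # uses the value 1.0
```

With an exploratory next action, the SARSA target is pulled towards the value of that action, while the Q-learning target ignores it.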

Q-learning does not require a model of the environment's dynamics, and the algorithm can find an optimal policy for any finite Markov decision process (FMDP) if the agent has infinite exploration time. The algorithm must explore the environment in order to maximize the total reward.

Markov decision process

A decreasing exploration rate (epsilon) makes the agent more likely to explore at the beginning and more likely to follow its policy towards the end. The agent updates its policy/model by applying a learning rate (alpha) and a discount factor (gamma). The learning rate determines how strongly new information overrides old information: a learning rate of 0 means no learning, and a learning rate of 1 means that only the most recent information counts. The discount factor determines how important future rewards are: a discount factor of 0 means that only the immediate reward matters, while a discount factor close to 1 means that long-term rewards weigh heavily.
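
The effect of the two parameters can be seen by applying a single temporal-difference update with different settings. This is a small sketch; the helper name td_update and the numbers are assumptions chosen for illustration, and the formula is the standard Q-learning update.

```python
def td_update(old_value, reward, best_next_value, alpha, gamma):
    """Return the new Q-value after one temporal-difference update."""
    target = reward + gamma * best_next_value
    return old_value + alpha * (target - old_value)

# Illustrative numbers: current estimate, immediate reward, best next Q-value
old, reward, best_next = 10.0, 1.0, 20.0

# alpha = 0: nothing is learned, the old estimate is kept (10.0)
no_learning = td_update(old, reward, best_next, alpha=0.0, gamma=0.9)

# alpha = 1: the old estimate is replaced by the target (1 + 0.9 * 20 = 19.0)
full_replace = td_update(old, reward, best_next, alpha=1.0, gamma=0.9)

# gamma = 0: the target is only the immediate reward (1.0)
myopic = td_update(old, reward, best_next, alpha=1.0, gamma=0.0)

print(no_learning, full_replace, myopic)
```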

Problem and libraries

A pole is attached to a cart that moves along a frictionless track. The pole starts upright and the goal is to prevent it from falling over. The agent can push the cart to the left or to the right, and a reward of 1 is given for every timestep that the pole remains upright. An episode ends when the pole is more than 15 degrees from vertical (the Gym source code itself uses a limit of 12 degrees), when the cart has moved more than 2.4 units from the center, or after 200 timesteps. The CartPole problem is part of the gym library; I also use the following libraries: os, math, random, pickle and numpy.

Code

CartPole-v0 (version 0) is considered solved if the average reward is 195 or more over 100 consecutive trials. The agent can be trained and evaluated, and the model is saved to disk after each training run. The complete code for the Q-learning agent is shown below.

# Import libraries
import os
import math
import random
import pickle
import gym
import numpy as np

# Discretize observation (continuous to discrete)
def discretize(env, buckets, state):
    upper_bounds = [env.observation_space.high[0], 0.5, env.observation_space.high[2], math.radians(50)]
    lower_bounds = [env.observation_space.low[0], -0.5, env.observation_space.low[2], -math.radians(50)]
    ratios = [(state[i] + abs(lower_bounds[i])) / (upper_bounds[i] - lower_bounds[i]) for i in range(len(state))]
    next_state = [int(round((buckets[i] - 1) * ratios[i])) for i in range(len(state))]
    next_state = [min(buckets[i] - 1, max(0, next_state[i])) for i in range(len(state))]
    return tuple(next_state)

# Get an action (0:Left, 1:Right)
def get_action(model, state, epsilon):
    return random.randint(0, 1) if (random.random() <= epsilon) else np.argmax(model[state])

# Update model
def update(model, current_state, next_state, action, reward, alpha=0.85, gamma=0.95):
    model[current_state][action] += alpha * (reward + gamma * np.max(model[next_state]) - model[current_state][action])

# Exploration rate
def get_epsilon(t, min_epsilon, divisor=25):
    return max(min_epsilon, min(1, 1.0 - math.log10((t + 1) / divisor)))

# Learning rate
def get_alpha(t, min_alpha, divisor=25):
    return max(min_alpha, min(1.0, 1.0 - math.log10((t + 1) / divisor)))

# Get a model
def get_model(env):

    # Load a model if we have saved one
    if os.path.isfile('models\\cartpole.q'):
        with open('models\\cartpole.q', 'rb') as fp:
            return pickle.load(fp)

    # Set buckets
    buckets = (1, 1, 6, 12)

    # Return an empty model (Q-table) and buckets
    return (np.zeros(buckets + (env.action_space.n, )), buckets)

# Train a model
def train():

    # Variables
    episodes = 300
    timesteps = 200
    total_score = 0

    # Create an environment
    env = gym.make('CartPole-v0')

    # Get a model (Q table) and buckets
    model, buckets = get_model(env)

    # Loop episodes
    for episode in range(episodes):

        # Start episode and get initial observation (discretized in a tuple)
        current_state = discretize(env, buckets, env.reset())

        # Get learning rate and exploration rate
        alpha = get_alpha(episode, 0.1)
        epsilon = get_epsilon(episode, 0.1)

        # Reset score
        score = 0

        # Loop timesteps
        for t in range(timesteps):

            # Get an action (0:Left, 1:Right)
            action = get_action(model, current_state, epsilon)

            # Perform a step
            # next_state (position, velocity, angle and angular velocity)
            next_state, reward, done, info = env.step(action)

            # Discretize the state to buckets
            next_state = discretize(env, buckets, next_state)

            # Update the model
            update(model, current_state, next_state, action, reward, alpha, 1.0)
  
            # Update the state
            current_state = next_state

            # Update score
            score += reward
            total_score += reward

            # Check if we are done (game over)
            if done:
                print('Episode {0}, Score: {1}, Timesteps: {2}, Epsilon: {3}'.format(episode+1, score, t+1, epsilon))
                break
       
    # Close the environment
    env.close()

    # Save the model (create the models folder first if it is missing)
    os.makedirs('models', exist_ok=True)
    with open('models\\cartpole.q', 'wb') as fp:
        pickle.dump((model, buckets), fp)

    # Print final score
    print()
    print('--- Evaluation ---')
    print('Average score: {0}'.format(total_score / episodes))
    print('Episodes: {0}'.format(episodes))
    print()

    # Print model
    print('--- Model (Q-table) ---')
    print(model)
    print()

# Evaluate a model
def evaluate():

    # Variables
    episodes = 100
    timesteps = 200
    total_score = 0

    # Create an environment
    env = gym.make('CartPole-v0')

    # Get a model (Q table) and buckets
    model, buckets = get_model(env)

    # Loop episodes
    for episode in range(episodes):

        # Start episode and get initial observation (discretized in a tuple)
        state = discretize(env, buckets, env.reset())

        # Reset score
        score = 0

        # Loop timesteps
        for t in range(timesteps):

            # Render the environment
            env.render(mode='human')

            # Get an action (0:Left, 1:Right)
            action = np.argmax(model[state])

            # Perform a step
            # next_state (position, velocity, angle and angular velocity)
            state, reward, done, info = env.step(action)

            # Discretize the state to buckets
            state = discretize(env, buckets, state)

            # Update score
            score += reward
            total_score += reward

            # Check if we are done (game over)
            if done:
                print('Episode {0}, Score: {1}, Timesteps: {2}'.format(episode+1, score, t+1))
                break
       
    # Close the environment
    env.close()

    # Print final score
    print()
    print('--- Evaluation ---')
    print('Average score: {0}'.format(total_score / episodes))
    print('Episodes: {0}'.format(episodes))
    print()

# The main entry point for this module
def main():

    # Train the model
    train()

    # Evaluate the model
    #evaluate()

# Tell python to run main method
if __name__ == "__main__":
    main()
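
The decay schedule can be checked in isolation. The sketch below copies get_epsilon from the listing and prints the exploration rate for a few episodes; the chosen episode numbers are only examples. With the default divisor of 25, epsilon stays at 1 for the first 25 episodes, then decays logarithmically, and reaches the floor of 0.1 at around episode 199.

```python
import math

def get_epsilon(t, min_epsilon, divisor=25):
    # Same formula as in the listing above
    return max(min_epsilon, min(1, 1.0 - math.log10((t + 1) / divisor)))

# t is the zero-based episode index, so the printed episode number is t + 1
for t in (0, 24, 85, 150, 198, 299):
    print('Episode {0}: epsilon = {1:.4f}'.format(t + 1, get_epsilon(t, 0.1)))
```

get_alpha uses the same formula, so the learning rate follows the same curve.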

Training

Episode 86, Score: 28.0, Timesteps: 28, Epsilon: 0.46344155742846993
Episode 87, Score: 65.0, Timesteps: 65, Epsilon: 0.45842075605341903
Episode 88, Score: 13.0, Timesteps: 13, Epsilon: 0.4534573365218689
Episode 89, Score: 118.0, Timesteps: 118, Epsilon: 0.4485500020271248
Episode 90, Score: 200.0, Timesteps: 200, Epsilon: 0.44369749923271273
Episode 91, Score: 91.0, Timesteps: 91, Epsilon: 0.43889861635094396
Episode 92, Score: 35.0, Timesteps: 35, Epsilon: 0.43415218132648237
Episode 93, Score: 38.0, Timesteps: 38, Epsilon: 0.4294570601181025
Episode 94, Score: 29.0, Timesteps: 29, Epsilon: 0.42481215507233894
Episode 95, Score: 152.0, Timesteps: 152, Epsilon: 0.4202164033831899
Episode 96, Score: 200.0, Timesteps: 200, Epsilon: 0.4156687756324692
Episode 97, Score: 49.0, Timesteps: 49, Epsilon: 0.41116827440579273
Episode 98, Score: 129.0, Timesteps: 129, Epsilon: 0.4067139329795427
Episode 99, Score: 57.0, Timesteps: 57, Epsilon: 0.4023048140744877
Episode 100, Score: 140.0, Timesteps: 140, Epsilon: 0.3979400086720376
Episode 101, Score: 90.0, Timesteps: 90, Epsilon: 0.3936186348893951
Episode 102, Score: 54.0, Timesteps: 54, Epsilon: 0.3893398369101201
Episode 103, Score: 103.0, Timesteps: 103, Epsilon: 0.38510278396686537
Episode 104, Score: 51.0, Timesteps: 51, Epsilon: 0.38090666937325723
Episode 105, Score: 140.0, Timesteps: 140, Epsilon: 0.37675070960209955
Episode 106, Score: 35.0, Timesteps: 35, Epsilon: 0.3726341434072673
Episode 107, Score: 96.0, Timesteps: 96, Epsilon: 0.368556230986828
Episode 108, Score: 200.0, Timesteps: 200, Epsilon: 0.36451625318508785
Episode 109, Score: 80.0, Timesteps: 80, Epsilon: 0.3605135107314139
Episode 110, Score: 70.0, Timesteps: 70, Epsilon: 0.3565473235138126
Episode 111, Score: 39.0, Timesteps: 39, Epsilon: 0.35261702988538013
Episode 112, Score: 36.0, Timesteps: 36, Epsilon: 0.348721986001856
Episode 113, Score: 22.0, Timesteps: 22, Epsilon: 0.3448615651886179
Episode 114, Score: 51.0, Timesteps: 51, Epsilon: 0.341035157335565
Episode 115, Score: 27.0, Timesteps: 27, Epsilon: 0.3372421683184259
Episode 116, Score: 200.0, Timesteps: 200, Epsilon: 0.3334820194451191
Episode 117, Score: 84.0, Timesteps: 84, Epsilon: 0.329754146925876
Episode 118, Score: 128.0, Timesteps: 128, Epsilon: 0.32605800136591223
Episode 119, Score: 26.0, Timesteps: 26, Epsilon: 0.3223930472795069
Episode 120, Score: 48.0, Timesteps: 48, Epsilon: 0.31875876262441283
Episode 121, Score: 16.0, Timesteps: 16, Epsilon: 0.31515463835558755
Episode 122, Score: 53.0, Timesteps: 53, Epsilon: 0.31158017799728943
Episode 123, Score: 34.0, Timesteps: 34, Epsilon: 0.30803489723263966
Episode 124, Score: 74.0, Timesteps: 74, Epsilon: 0.30451832350980257
Episode 125, Score: 59.0, Timesteps: 59, Epsilon: 0.30102999566398114
Episode 126, Score: 57.0, Timesteps: 57, Epsilon: 0.29756946355447467
Episode 127, Score: 61.0, Timesteps: 61, Epsilon: 0.2941362877160807
Episode 128, Score: 45.0, Timesteps: 45, Epsilon: 0.2907300390241693
Episode 129, Score: 175.0, Timesteps: 175, Epsilon: 0.2873502983727886
Episode 130, Score: 140.0, Timesteps: 140, Epsilon: 0.2839966563652008
Episode 131, Score: 88.0, Timesteps: 88, Epsilon: 0.2806687130162734
Episode 132, Score: 80.0, Timesteps: 80, Epsilon: 0.2773660774661877
Episode 133, Score: 108.0, Timesteps: 108, Epsilon: 0.2740883677049518
Episode 134, Score: 82.0, Timesteps: 82, Epsilon: 0.2708352103072299
Episode 135, Score: 90.0, Timesteps: 90, Epsilon: 0.2676062401770315
Episode 136, Score: 45.0, Timesteps: 45, Epsilon: 0.26440110030182007
Episode 137, Score: 172.0, Timesteps: 172, Epsilon: 0.2612194415156308
Episode 138, Score: 42.0, Timesteps: 42, Epsilon: 0.25806092227080113
Episode 139, Score: 75.0, Timesteps: 75, Epsilon: 0.25492520841794253
Episode 140, Score: 200.0, Timesteps: 200, Epsilon: 0.25181197299379965
Episode 141, Score: 113.0, Timesteps: 113, Epsilon: 0.2487208960166577
Episode 142, Score: 74.0, Timesteps: 74, Epsilon: 0.2456516642889811
Episode 143, Score: 31.0, Timesteps: 31, Epsilon: 0.24260397120697585
Episode 144, Score: 200.0, Timesteps: 200, Epsilon: 0.23957751657678794
Episode 145, Score: 127.0, Timesteps: 127, Epsilon: 0.23657200643706278
Episode 146, Score: 103.0, Timesteps: 103, Epsilon: 0.23358715288760057
Episode 147, Score: 172.0, Timesteps: 172, Epsilon: 0.2306226739238615
Episode 148, Score: 134.0, Timesteps: 134, Epsilon: 0.22767829327708022
Episode 149, Score: 40.0, Timesteps: 40, Epsilon: 0.22475374025976358
Episode 150, Score: 107.0, Timesteps: 107, Epsilon: 0.22184874961635637
Episode 151, Score: 50.0, Timesteps: 50, Epsilon: 0.21896306137886812
Episode 152, Score: 200.0, Timesteps: 200, Epsilon: 0.2160964207272651
Episode 153, Score: 200.0, Timesteps: 200, Epsilon: 0.21324857785443885
Episode 154, Score: 200.0, Timesteps: 200, Epsilon: 0.2104192878355745
Episode 155, Score: 200.0, Timesteps: 200, Epsilon: 0.2076083105017461
Episode 156, Score: 200.0, Timesteps: 200, Epsilon: 0.204815410317576
Episode 157, Score: 200.0, Timesteps: 200, Epsilon: 0.2020403562628038
Episode 158, Score: 200.0, Timesteps: 200, Epsilon: 0.19928292171761497
Episode 159, Score: 200.0, Timesteps: 200, Epsilon: 0.19654288435158607
Episode 160, Score: 200.0, Timesteps: 200, Epsilon: 0.1938200260161128
Episode 161, Score: 200.0, Timesteps: 200, Epsilon: 0.19111413264018784
Episode 162, Score: 200.0, Timesteps: 200, Epsilon: 0.1884249941294066
Episode 163, Score: 200.0, Timesteps: 200, Epsilon: 0.18575240426807982
Episode 164, Score: 200.0, Timesteps: 200, Epsilon: 0.18309616062433975
Episode 165, Score: 200.0, Timesteps: 200, Epsilon: 0.18045606445813134
Episode 166, Score: 200.0, Timesteps: 200, Epsilon: 0.1778319206319825
Episode 167, Score: 200.0, Timesteps: 200, Epsilon: 0.1752235375244543
Episode 168, Score: 200.0, Timesteps: 200, Epsilon: 0.17263072694617476
Episode 169, Score: 200.0, Timesteps: 200, Epsilon: 0.17005330405836405
Episode 170, Score: 200.0, Timesteps: 200, Epsilon: 0.16749108729376372
Episode 171, Score: 200.0, Timesteps: 200, Epsilon: 0.16494389827988376
Episode 172, Score: 200.0, Timesteps: 200, Epsilon: 0.16241156176448868
Episode 173, Score: 200.0, Timesteps: 200, Epsilon: 0.15989390554324223
Episode 174, Score: 200.0, Timesteps: 200, Epsilon: 0.1573907603894379
Episode 175, Score: 200.0, Timesteps: 200, Epsilon: 0.1549019599857432
Episode 176, Score: 200.0, Timesteps: 200, Epsilon: 0.15242734085788778
Episode 177, Score: 200.0, Timesteps: 200, Epsilon: 0.149966742310231
Episode 178, Score: 200.0, Timesteps: 200, Epsilon: 0.14752000636314366
Episode 179, Score: 200.0, Timesteps: 200, Epsilon: 0.1450869776921444
Episode 180, Score: 200.0, Timesteps: 200, Epsilon: 0.14266750356873148
Episode 181, Score: 200.0, Timesteps: 200, Epsilon: 0.14026143380285305
Episode 182, Score: 200.0, Timesteps: 200, Epsilon: 0.13786862068696282
Episode 183, Score: 200.0, Timesteps: 200, Epsilon: 0.13548891894160808
Episode 184, Score: 200.0, Timesteps: 200, Epsilon: 0.1331221856625011
Episode 185, Score: 200.0, Timesteps: 200, Epsilon: 0.13076828026902376
Episode 186, Score: 200.0, Timesteps: 200, Epsilon: 0.12842706445412122
Episode 187, Score: 200.0, Timesteps: 200, Epsilon: 0.12609840213553858
Episode 188, Score: 200.0, Timesteps: 200, Epsilon: 0.1237821594083578
Episode 189, Score: 200.0, Timesteps: 200, Epsilon: 0.12147820449879354
Episode 190, Score: 200.0, Timesteps: 200, Epsilon: 0.11918640771920863
Episode 191, Score: 200.0, Timesteps: 200, Epsilon: 0.11690664142431006
Episode 192, Score: 200.0, Timesteps: 200, Epsilon: 0.11463877996848804
Episode 193, Score: 200.0, Timesteps: 200, Epsilon: 0.11238269966426384
Episode 194, Score: 200.0, Timesteps: 200, Epsilon: 0.11013827874181159
Episode 195, Score: 200.0, Timesteps: 200, Epsilon: 0.10790539730951965
Episode 196, Score: 200.0, Timesteps: 200, Epsilon: 0.10568393731556158
Episode 197, Score: 200.0, Timesteps: 200, Epsilon: 0.1034737825104447
Episode 198, Score: 200.0, Timesteps: 200, Epsilon: 0.10127481841050645
Episode 199, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 200, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 201, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 202, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 203, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 204, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 205, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 206, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 207, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 208, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 209, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 210, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 211, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 212, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 213, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 214, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 215, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 216, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 217, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 218, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 219, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 220, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 221, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 222, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 223, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 224, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 225, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 226, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 227, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 228, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 229, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 230, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 231, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 232, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 233, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 234, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 235, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 236, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 237, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 238, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 239, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 240, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 241, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 242, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 243, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 244, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 245, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 246, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 247, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 248, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 249, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 250, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 251, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 252, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 253, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 254, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 255, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 256, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 257, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 258, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 259, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 260, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 261, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 262, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 263, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 264, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 265, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 266, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 267, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 268, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 269, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 270, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 271, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 272, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 273, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 274, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 275, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 276, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 277, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 278, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 279, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 280, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 281, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 282, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 283, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 284, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 285, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 286, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 287, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 288, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 289, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 290, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 291, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 292, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 293, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 294, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 295, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 296, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 297, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 298, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 299, Score: 200.0, Timesteps: 200, Epsilon: 0.1
Episode 300, Score: 200.0, Timesteps: 200, Epsilon: 0.1

--- Evaluation ---
Average score: 125.72666666666667
Episodes: 300

--- Model (Q-table) ---
[[[[[  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]]

   [[ 26.63994576  28.49148154]
    [ 23.20222149  26.46118573]
    [ 21.13854075  26.53317167]
    [ 17.30265287  26.27408546]
    [ 16.72772951  23.51441257]
    [  0.          23.83321618]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]]

   [[328.05320439  95.25088858]
    [553.20592401 162.97711459]
    [537.45198849 255.21496651]
    [577.1094048  502.60218571]
    [566.48400495 383.81600409]
    [578.25566476 563.78102189]
    [577.65832768 554.68481   ]
    [528.49528721 578.19611935]
    [508.65127267 578.01982338]
    [241.71554661 530.27294801]
    [165.98990398 507.88389605]
    [ 79.41272507 169.88054431]]

   [[230.61126595  71.25934918]
    [519.82742333 129.4517015 ]
    [531.26507526 192.56401542]
    [578.50717041 516.87577006]
    [577.38442447 532.41981888]
    [558.66474821 578.55925508]
    [563.71522864 578.50416538]
    [363.28307807 565.98936464]
    [496.11067625 576.50651204]
    [264.02898773 527.26034577]
    [238.63787799 554.40185893]
    [202.70950775 371.15281394]]

   [[  0.           0.        ]
    [  0.           0.        ]
    [  0.          38.52296094]
    [  0.           0.        ]
    [ 22.9402602   63.74988989]
    [ 52.03343446  23.82415025]
    [ 45.87911801  23.97587393]
    [ 23.58355396  49.89961519]
    [ 47.69941954  40.08997247]
    [ 46.96344845  33.3477977 ]
    [ 49.27767993  40.11629936]
    [ 50.15477103  46.12943125]]

   [[  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]
    [  0.           0.        ]]]]]
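
The printed table has the shape (1, 1, 6, 12, 2), because buckets is (1, 1, 6, 12) and there are two actions. Position and velocity each collapse into a single bucket, so each of the six printed blocks is one angle bucket and each row within a block is one angular-velocity bucket. Rows of zeros are state-action pairs that were never updated during training. The sketch below reads the greedy action from a few rows copied from the output; the dictionary layout and the helper greedy_action are assumptions for illustration, standing in for np.argmax(model[state]).

```python
# Rows copied from the printed Q-table, keyed by the discretized state
# (position, velocity, angle, angular velocity); values are [left, right].
q_rows = {
    (0, 0, 2, 0): [328.05320439, 95.25088858],
    (0, 0, 2, 11): [79.41272507, 169.88054431],
    (0, 0, 3, 0): [230.61126595, 71.25934918],
    (0, 0, 3, 11): [202.70950775, 371.15281394],
}

def greedy_action(q_values):
    # Index of the largest Q-value (0: Left, 1: Right); ties go to the first
    return max(range(len(q_values)), key=lambda a: q_values[a])

for state, q_values in q_rows.items():
    name = 'Left' if greedy_action(q_values) == 0 else 'Right'
    print('{0} -> {1}'.format(state, name))
```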

Evaluation

Episode 1, Score: 200.0, Timesteps: 200
Episode 2, Score: 200.0, Timesteps: 200
Episode 3, Score: 200.0, Timesteps: 200
Episode 4, Score: 200.0, Timesteps: 200
Episode 5, Score: 200.0, Timesteps: 200
Episode 6, Score: 200.0, Timesteps: 200
Episode 7, Score: 200.0, Timesteps: 200
Episode 8, Score: 200.0, Timesteps: 200
Episode 9, Score: 200.0, Timesteps: 200
Episode 10, Score: 200.0, Timesteps: 200
Episode 11, Score: 200.0, Timesteps: 200
Episode 12, Score: 200.0, Timesteps: 200
Episode 13, Score: 200.0, Timesteps: 200
Episode 14, Score: 200.0, Timesteps: 200
Episode 15, Score: 200.0, Timesteps: 200
Episode 16, Score: 200.0, Timesteps: 200
Episode 17, Score: 200.0, Timesteps: 200
Episode 18, Score: 200.0, Timesteps: 200
Episode 19, Score: 200.0, Timesteps: 200
Episode 20, Score: 200.0, Timesteps: 200
Episode 21, Score: 200.0, Timesteps: 200
Episode 22, Score: 200.0, Timesteps: 200
Episode 23, Score: 200.0, Timesteps: 200
Episode 24, Score: 200.0, Timesteps: 200
Episode 25, Score: 200.0, Timesteps: 200
Episode 26, Score: 200.0, Timesteps: 200
Episode 27, Score: 200.0, Timesteps: 200
Episode 28, Score: 200.0, Timesteps: 200
Episode 29, Score: 200.0, Timesteps: 200
Episode 30, Score: 200.0, Timesteps: 200
Episode 31, Score: 200.0, Timesteps: 200
Episode 32, Score: 200.0, Timesteps: 200
Episode 33, Score: 200.0, Timesteps: 200
Episode 34, Score: 200.0, Timesteps: 200
Episode 35, Score: 200.0, Timesteps: 200
Episode 36, Score: 200.0, Timesteps: 200
Episode 37, Score: 200.0, Timesteps: 200
Episode 38, Score: 200.0, Timesteps: 200
Episode 39, Score: 200.0, Timesteps: 200
Episode 40, Score: 200.0, Timesteps: 200
Episode 41, Score: 200.0, Timesteps: 200
Episode 42, Score: 200.0, Timesteps: 200
Episode 43, Score: 200.0, Timesteps: 200
Episode 44, Score: 200.0, Timesteps: 200
Episode 45, Score: 200.0, Timesteps: 200
Episode 46, Score: 200.0, Timesteps: 200
Episode 47, Score: 200.0, Timesteps: 200
Episode 48, Score: 200.0, Timesteps: 200
Episode 49, Score: 200.0, Timesteps: 200
Episode 50, Score: 200.0, Timesteps: 200
Episode 51, Score: 200.0, Timesteps: 200
Episode 52, Score: 200.0, Timesteps: 200
Episode 53, Score: 200.0, Timesteps: 200
Episode 54, Score: 200.0, Timesteps: 200
Episode 55, Score: 200.0, Timesteps: 200
Episode 56, Score: 200.0, Timesteps: 200
Episode 57, Score: 200.0, Timesteps: 200
Episode 58, Score: 200.0, Timesteps: 200
Episode 59, Score: 200.0, Timesteps: 200
Episode 60, Score: 200.0, Timesteps: 200
Episode 61, Score: 200.0, Timesteps: 200
Episode 62, Score: 200.0, Timesteps: 200
Episode 63, Score: 200.0, Timesteps: 200
Episode 64, Score: 200.0, Timesteps: 200
Episode 65, Score: 200.0, Timesteps: 200
Episode 66, Score: 200.0, Timesteps: 200
Episode 67, Score: 200.0, Timesteps: 200
Episode 68, Score: 200.0, Timesteps: 200
Episode 69, Score: 200.0, Timesteps: 200
Episode 70, Score: 200.0, Timesteps: 200
Episode 71, Score: 200.0, Timesteps: 200
Episode 72, Score: 200.0, Timesteps: 200
Episode 73, Score: 200.0, Timesteps: 200
Episode 74, Score: 200.0, Timesteps: 200
Episode 75, Score: 200.0, Timesteps: 200
Episode 76, Score: 200.0, Timesteps: 200
Episode 77, Score: 200.0, Timesteps: 200
Episode 78, Score: 200.0, Timesteps: 200
Episode 79, Score: 200.0, Timesteps: 200
Episode 80, Score: 200.0, Timesteps: 200
Episode 81, Score: 200.0, Timesteps: 200
Episode 82, Score: 200.0, Timesteps: 200
Episode 83, Score: 200.0, Timesteps: 200
Episode 84, Score: 200.0, Timesteps: 200
Episode 85, Score: 200.0, Timesteps: 200
Episode 86, Score: 200.0, Timesteps: 200
Episode 87, Score: 200.0, Timesteps: 200
Episode 88, Score: 200.0, Timesteps: 200
Episode 89, Score: 200.0, Timesteps: 200
Episode 90, Score: 200.0, Timesteps: 200
Episode 91, Score: 200.0, Timesteps: 200
Episode 92, Score: 200.0, Timesteps: 200
Episode 93, Score: 200.0, Timesteps: 200
Episode 94, Score: 200.0, Timesteps: 200
Episode 95, Score: 200.0, Timesteps: 200
Episode 96, Score: 200.0, Timesteps: 200
Episode 97, Score: 200.0, Timesteps: 200
Episode 98, Score: 200.0, Timesteps: 200
Episode 99, Score: 200.0, Timesteps: 200
Episode 100, Score: 200.0, Timesteps: 200

--- Evaluation ---
Average score: 200.0
Episodes: 100
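
The "solved" criterion for CartPole-v0 (an average of at least 195 over 100 consecutive trials) can be checked with a sliding window over the episode scores. This is a minimal sketch; the function name is_solved and its parameters are assumptions and not part of the gym library.

```python
def is_solved(scores, window=100, threshold=195.0):
    """Return True if any `window` consecutive scores average at least `threshold`."""
    if len(scores) < window:
        return False
    # Sum of the first window, then slide it one episode at a time
    running_sum = sum(scores[:window])
    if running_sum / window >= threshold:
        return True
    for i in range(window, len(scores)):
        running_sum += scores[i] - scores[i - window]
        if running_sum / window >= threshold:
            return True
    return False

# The evaluation run above scored 200.0 in each of its 100 episodes
evaluation_scores = [200.0] * 100
print(is_solved(evaluation_scores))
```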
