1. Policy Evaluation (Monte Carlo based state-value estimation): The instructions for
Blackjack are given below.
• Beginning of the game
o 2 cards dealt to player
o 2 cards dealt to dealer, one face down
• Goal: Have your card sum be greater than the dealer's without exceeding 21.
• States (200 of them):
o Your current sum
o Dealer’s showing card
o Do you have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: Stick (stand, stop receiving cards), hit (receive another card)
• Discount factor: γ = 1
• Policy: Stick if your sum is 20 or 21, else hit
To estimate the value of each state using the Monte Carlo method, you (the agent) initialize the
value of each state to 0, and then play the game for the following two episodes.
(Your Sum, Dealer's card, Ace?)   Action   Next State     Reward
(19, 10, no)                      Hit      (22, 10, no)   -1
The final cards on the table at the end of episode 1 are shown below.
At the end of episode 1, update the value of state (19, 10, no) by taking the average of the
sample returns.
(Your Sum, Dealer's card, Ace?)   Action   Next State     Reward
(13, 10, no)                      Hit      (16, 10, no)    0
(16, 10, no)                      Hit      (19, 10, no)    0
(19, 10, no)                      Hit      (21, 10, no)   +1
Final cards on the table at the end of the episode 2
At the end of episode 2, update the values of states (13, 10, no), (16, 10, no), and (19, 10, no)
by taking the average of the sample returns.
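The update described above can be sketched in code. This is a minimal every-visit Monte Carlo sketch that replays exactly the two episodes listed (the state tuples, episode lists, and the `returns` table are illustrative names, not part of the original assignment):

```python
# Every-visit Monte Carlo value estimation over the two given episodes.
# State = (your sum, dealer's showing card, usable ace?).
from collections import defaultdict

gamma = 1.0  # discount factor from the problem statement

# Each episode is a list of (state, action, reward) steps.
episode_1 = [((19, 10, False), "Hit", -1)]
episode_2 = [((13, 10, False), "Hit", 0),
             ((16, 10, False), "Hit", 0),
             ((19, 10, False), "Hit", +1)]

returns = defaultdict(list)  # state -> list of sampled returns
for episode in (episode_1, episode_2):
    G = 0.0
    # Walk backwards so G accumulates the return from each visited state.
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        returns[state].append(G)

# Value estimate = average of the sampled returns for each state.
V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V[(13, 10, False)])  # 1.0
print(V[(16, 10, False)])  # 1.0
print(V[(19, 10, False)])  # 0.0  (average of -1 from episode 1 and +1 from episode 2)
```

Note that state (19, 10, no) appears in both episodes, so its estimate averages two sample returns, while the other two states have only one sample each.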
2. Policy Evaluation (Temporal Difference based State value estimation): The instructions for
Random Walk are given below.
• States: A, B, C, D, E, and both ends:
o Starting state: C
o Terminal states: left end, and right end
• Reward: +1 on the right termination, 0 otherwise
• Actions: go left or go right
• Discount factor: γ = 1
• Policy π: 50% go left, and 50% go right
a) List the Bellman expectation equations for each non-terminal state.
The solutions to the above equations should be
vπ(A) = 1/6, vπ(B) = 2/6, vπ(C) = 3/6, vπ(D) = 4/6, vπ(E) = 5/6
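These solutions can be verified by writing the five Bellman expectation equations as a linear system and solving it directly. The sketch below assumes the standard layout (left terminal, A, B, C, D, E, right terminal) with reward +1 only on exiting E to the right:

```python
# Each equation has the form v(s) = 0.5 * v(left neighbor) + 0.5 * v(right neighbor),
# with v(left terminal) = 0 and an extra reward of +1 for leaving E rightward.
import numpy as np

A = np.array([
    [ 1.0, -0.5,  0.0,  0.0,  0.0],  # v(A) - 0.5 v(B)            = 0
    [-0.5,  1.0, -0.5,  0.0,  0.0],  # v(B) - 0.5 v(A) - 0.5 v(C) = 0
    [ 0.0, -0.5,  1.0, -0.5,  0.0],  # v(C) - 0.5 v(B) - 0.5 v(D) = 0
    [ 0.0,  0.0, -0.5,  1.0, -0.5],  # v(D) - 0.5 v(C) - 0.5 v(E) = 0
    [ 0.0,  0.0,  0.0, -0.5,  1.0],  # v(E) - 0.5 v(D)            = 0.5
])
b = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

v = np.linalg.solve(A, b)
print(v)  # approximately [1/6, 2/6, 3/6, 4/6, 5/6]
```

The right-hand side is 0.5 only in the last row because the +1 reward is received with probability 0.5 (when the agent moves right from E).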
b) To estimate the value of each state using Temporal Difference Method, you (the agent)
initialize the value of each non-terminal state as 0.5 and run the temporal difference learning
algorithm with a constant step-size parameter (learning rate) α = 0.1.
The value update equation and the corresponding values learned after various numbers of
episodes are given in the figure below. The final estimate is about as close as the estimates
ever get to the true values. The values fluctuate indefinitely in response to the outcomes of the
most recent episodes.
In the above figure, (0,1,10, 100) are the episode indices. It appears that the first episode results in a
change in only vπ(A). What does this tell you about what happened on the first episode? Why was
only the estimate for this one state changed? By exactly how much was it changed?
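A tabular TD(0) sketch for this random walk is given below; it can be used to reproduce the learning curves in the figure. The function name, seed, and episode counts are illustrative choices, not part of the assignment:

```python
# Tabular TD(0) on the random walk: left terminal, A, B, C, D, E, right terminal.
# Reward is +1 only on right termination; all other transitions give 0.
import random

ALPHA = 0.1   # constant step size from the problem statement
GAMMA = 1.0   # discount factor

STATES = ["A", "B", "C", "D", "E"]

def run_td0(num_episodes, seed=0):
    rng = random.Random(seed)
    V = {s: 0.5 for s in STATES}   # non-terminal values initialized to 0.5
    for _ in range(num_episodes):
        i = STATES.index("C")      # every episode starts in C
        while True:
            j = i + (1 if rng.random() < 0.5 else -1)
            if j < 0:                       # left termination: r = 0, v(terminal) = 0
                target = 0.0
            elif j >= len(STATES):          # right termination: r = +1
                target = 1.0
            else:                           # non-terminal step: r = 0, bootstrap
                target = GAMMA * V[STATES[j]]
            # TD(0) update: V(s) <- V(s) + alpha * (target - V(s))
            V[STATES[i]] += ALPHA * (target - V[STATES[i]])
            if j < 0 or j >= len(STATES):
                break
            i = j
    return V

V = run_td0(100)
print({s: round(val, 2) for s, val in V.items()})
```

Note that on any step between two non-terminal states, the TD target is 0 + γ·V(next) = 0.5, which equals the current estimate, so the update is zero; only transitions into a terminal state can change an estimate while all values are still at 0.5. This observation is the key to the question above.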