HW4
Name: ID:
1. Policy Evaluation (Monte-Carlo based State value estimation): The instructions for
BLACKJACK are given below.
• Beginning of the game
o 2 cards dealt to player
o 2 cards dealt to dealer, one face down
• Goal: Have your card sum be greater than the dealer's without exceeding 21.
• States (200 of them):
o Your current sum
o Dealer’s showing card
o Do you have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: Stick (stand, stop receiving cards), hit (receive another card)
• Discount factor: γ = 1
• Policy: Stick if your sum is 20 or 21, else hit
To estimate the value of each state using the Monte Carlo method, you (the agent) initialize the value of each state to 0 and then play the game for the following two episodes.
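For reference, a minimal Python sketch of first-visit Monte Carlo state-value estimation is given below. The function name, the episode representation as (state, reward) pairs, and the state encoding are illustrative assumptions, not part of the assignment.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """First-visit Monte Carlo state-value estimation (illustrative sketch).

    `episodes` is a list of episodes; each episode is a list of
    (state, reward) pairs in time order, where `state` is a tuple such as
    (player_sum, dealer_card, usable_ace) and `reward` is the reward
    received on leaving that state.  V(s) is the average of the sample
    returns observed from the first visit to s in each episode.
    """
    returns = defaultdict(list)   # state -> list of sample returns
    V = defaultdict(float)        # state -> value estimate, initialised to 0

    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the return-to-go G_t = r_{t+1} + gamma * G_{t+1}.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: record G only for the earliest occurrence of the state.
            if all(s != state for s, _ in episode[:t]):
                returns[state].append(G)

    for state, gs in returns.items():
        V[state] = sum(gs) / len(gs)
    return V

# Hypothetical usage: build the episodes from the tables below, e.g.
# episode_1 = [((19, 10, False), -1)], then call
# mc_policy_evaluation([episode_1, episode_2]).
```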
Episode 1:
State (Your Sum, Dealer's Card, Ace?)   Action   Next State      Reward
(19, 10, no)                            Hit      (22, 10, no)    -1
The final cards on the table at the end of episode 1 are given below.
At the end of episode 1, update the value of state (19, 10, no) by taking the average of the sample returns.
V(19, 10, no) = ____________
Episode 2:
State (Your Sum, Dealer's Card, Ace?)   Action   Next State      Reward
(13, 10, no)                            Hit      (16, 10, no)     0
(16, 10, no)                            Hit      (19, 10, no)     0
(19, 10, no)                            Hit      (21, 10, no)    +1
The final cards on the table at the end of episode 2 are given below.
At the end of episode 2, update the values of states (13, 10, no), (16, 10, no), and (19, 10, no) by taking the average of the sample returns.
V(13, 10, no) = ____________
V(16, 10, no) = ____________
V(19, 10, no) = ____________
2. Policy Evaluation (Temporal Difference based State value estimation): The instructions for
Random Walk are given below.
• States: A, B, C, D, E, and both ends:
o Starting state: C
o Terminal states: left end, and right end
• Reward: +1 on termination at the right end, 0 otherwise
• Actions: go left or go right
• Discount factor: γ = 1
• Policy π: 50% go left, and 50% go right
a) List the Bellman expectation equations for each non-terminal state.
V_π(A) = _________________
V_π(B) = _________________
V_π(C) = _________________
V_π(D) = _________________
V_π(E) = _________________
The solutions to the above equations should be
V_π(A) = 1/6, V_π(B) = 2/6, V_π(C) = 3/6, V_π(D) = 4/6, V_π(E) = 5/6
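As a quick sanity check on these values (not required by the assignment), the five Bellman expectation equations can be written as the linear system v = R + γPv and solved directly; the sketch below assumes Python with NumPy.

```python
import numpy as np

# Solve v = R + gamma * P v for the non-terminal states A..E under the
# 50/50 policy (gamma = 1).  Terminal states have value 0, so they are
# folded into P and R rather than represented explicitly.
gamma = 1.0
P = np.zeros((5, 5))              # transition probabilities among A..E
for i in range(5):
    if i > 0:
        P[i, i - 1] = 0.5         # 50% chance to step left to the neighbouring state
    if i < 4:
        P[i, i + 1] = 0.5         # 50% chance to step right to the neighbouring state
# Steps that leave A to the left or E to the right reach a terminal state
# (value 0), so they add nothing to P; only E's right step earns reward +1.
R = np.zeros(5)                   # expected immediate reward from each state
R[4] = 0.5                        # 0.5 probability * (+1) reward from state E

v = np.linalg.solve(np.eye(5) - gamma * P, R)
print(dict(zip("ABCDE", np.round(v, 4))))   # A..E -> approximately 1/6, 2/6, 3/6, 4/6, 5/6
```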
b) To estimate the value of each state using the Temporal Difference method, you (the agent) initialize the value of each non-terminal state to 0.5 and run the temporal difference learning algorithm with a constant step-size parameter (learning rate) α = 0.1.
The value update equation and the corresponding values learned after various numbers of episodes are given in the figure below. The final estimate is about as close as the estimates ever get to the true values. Because the step size is constant, the values fluctuate indefinitely in response to the outcomes of the most recent episodes.
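A minimal sketch of the TD(0) procedure on this task is shown below, assuming Python; the function name, the "L"/"R" labels for the terminal ends, and the seeding are illustrative choices, not part of the assignment.

```python
import random

def td0_random_walk(num_episodes, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) prediction for the 5-state random walk under the 50/50 policy.

    Update rule:  V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    Non-terminal values start at 0.5; the terminal values are fixed at 0.
    """
    rng = random.Random(seed)
    order = ["L", "A", "B", "C", "D", "E", "R"]   # "L"/"R" are the terminal ends
    V = {s: 0.5 for s in "ABCDE"}
    V["L"] = V["R"] = 0.0

    for _ in range(num_episodes):
        i = order.index("C")                       # every episode starts in C
        while order[i] not in ("L", "R"):
            j = i + rng.choice((-1, 1))            # 50% left, 50% right
            reward = 1.0 if order[j] == "R" else 0.0
            s, s_next = order[i], order[j]
            V[s] += alpha * (reward + gamma * V[s_next] - V[s])
            i = j
    return {s: V[s] for s in "ABCDE"}

# Example: after 100 episodes the estimates hover near the true values 1/6 .. 5/6,
# but with a constant step size they keep fluctuating from episode to episode.
print(td0_random_walk(100))
```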
In the above figure, (0, 1, 10, 100) are the episode indices. It appears that the first episode results in a change in only V(A). What does this tell you about what happened on the first episode? Why was only the estimate for this one state changed? By exactly how much was it changed?