

{"id":1796,"date":"2015-03-13T14:50:02","date_gmt":"2015-03-13T13:50:02","guid":{"rendered":"https:\/\/project.inria.fr\/plasma-lab\/?page_id=1796"},"modified":"2015-03-19T10:41:50","modified_gmt":"2015-03-19T09:41:50","slug":"reward-estimation","status":"publish","type":"page","link":"https:\/\/project.inria.fr\/plasma-lab\/reward-estimation\/","title":{"rendered":"Reward estimation for MDPs"},"content":{"rendered":"<p><\/p>\n<p style=\"text-align: justify;\">In addition to <a href=\"...\/nondeterminisitic-algorithms\">estimating minimum and maximum probabilities<\/a>, PLASMA Lab implements an algorithm to estimate minimum and maximum expected reward, as defined by the rewards properties described below. The algorithms are demonstrated on a number of standard <a title=\"Case studies\" href=\"https:\/\/project.inria.fr\/plasma-lab\/examples\/\">case studies<\/a>.<\/p>\n<hr \/>\n<h3>Types of reward<\/h3>\n<p style=\"text-align: justify;\">Rewards or costs are additional variables assigned to states or transitions. 
The RML language uses a specific <strong>rewards<\/strong> construction described below and on the <a href=\"http:\/\/www.prismmodelchecker.org\/manual\/ThePRISMLanguage\/CostsAndRewards\">PRISM website<\/a>:<\/p>\n<p style=\"text-align: justify;\"><div class=\"alert alert-success\" role=\"alert\"><p class=\"printonly\"><strong>Important!<\/strong><\/p><br \/>\nrewards &#8220;<em>rewardName<\/em>&#8220;<\/p>\n<p style=\"text-align: justify; padding-left: 30px;\" class=\"first-p\"><em>guard1<\/em> : <em>value1;<\/em><br \/>\n[] <em>guard2<\/em>: <em>value2<\/em>;<br \/>\n[<em>channel<\/em>] <em>guard3<\/em> : <em>value3<\/em>;<\/p>\n<p style=\"text-align: justify;\">endrewards<br \/>\n<\/div><\/p>\n<p style=\"text-align: justify;\">This construction associates the following values to the reward:<\/p>\n<ol>\n<li>The first command defines a <strong>state reward<\/strong> that assigns the <em>value1<\/em> to each state of the model that satisfies <em>guard1.<\/em><\/li>\n<li>The second command defines a <strong>transition reward<\/strong> that assigns the <em>value2<\/em> to each non synchronized transition that satisfies <em>guard2.<\/em><\/li>\n<li>The third command also defines a<strong> transition reward<\/strong> but it assigns the <em>value3<\/em> only to the transitions labeled with <em>channel<\/em> and that satisfy <em>guard3<\/em>.<\/li>\n<\/ol>\n<p>Three types of <strong>reward properties<\/strong> define the value of the reward associated to a path that will be used by the algorithm:<\/p>\n<ul>\n<li style=\"text-align: justify;\"><strong>Reachability rewards:\u00a0<\/strong>The reward is associated to a BLTL property and the value of the reward for a path is the cumulative value (the sum of state rewards and transition rewards) along the path up to the state which satisfies the property (or the last one if the property is not satisfied). 
Note that this is different from PRISM that only computes reachability rewards for LTL properties that are always satisfied (otherwise the reward is infinite).<\/li>\n<li style=\"text-align: justify;\"><strong> Cumulative rewards: <\/strong>The reward of a path is the cumulative reward up to time T. This type of reward is easily encoded with PLASMA Lab using a reachability reward with a BLTL property like G&lt;=# T true.<\/li>\n<li><strong>Instantaneous rewards:\u00a0<\/strong>The reward of a path is the state reward of the state at time T. To specify this reward in PLASMA Lab we use the name <em>rewardName<\/em>[I] instead of\u00a0<em>rewardName<\/em> and a BLTL property like G&lt;=# T true to encode the timing.<\/li>\n<\/ul>\n<hr \/>\n<h3>Smart reward estimation algorithm<\/h3>\n<p style=\"text-align: justify;\">The algorithm is somehow similar to the <a title=\"Probability estimation for MDPs\" href=\"https:\/\/project.inria.fr\/plasma-lab\/documentation\/algorithms\/nondeterministic-algorithms\/\">probability estimation algorithm<\/a>. However the set of schedulers is directly initialized at the begin of the algorithm using all the simulation <em>budget<\/em>, and the first iteration performs one simulation for each scheduler. For each scheduler the algorithm evaluates the expected reward at the time the BLTL property is satisfied. 
After each iteration, the best schedulers are kept and the number of simulations per remaining scheduler is increased according to the <em>reduction factor</em>.</p>
<div id="attachment_1926" style="width: 257px" class="wp-caption aligncenter"><a href="https://project.inria.fr/plasma-lab/files/2015/03/algo_reward_mdp.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1926" class="size-medium wp-image-1926" alt="Parameters of the reward algorithm for MDPs" src="https://project.inria.fr/plasma-lab/files/2015/03/algo_reward_mdp-247x300.png" width="247" height="300" /></a><p id="caption-attachment-1926" class="wp-caption-text">Parameters of the reward algorithm for MDPs</p></div>
<p style="text-align: justify;">For <strong>reachability rewards</strong> an additional computation is performed. Since the algorithm only considers bounded properties, we evaluate the hypothesis that the probability of satisfying the BLTL property is at least a given <em>threshold</em>. This is computed with an SPRT test with confidence parameter <em>alpha</em>. If the hypothesis is accepted, the result displays the property indicator in green; otherwise it stays red.</p>
<div id="attachment_1927" style="width: 621px" class="wp-caption alignnone"><a href="https://project.inria.fr/plasma-lab/files/2015/03/algo_reward_sprt_true.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1927" class="wp-image-1927" alt="Hypothesis accepted by the SPRT test." src="https://project.inria.fr/plasma-lab/files/2015/03/algo_reward_sprt_true.png" width="611" height="77" /></a><p id="caption-attachment-1927" class="wp-caption-text">Hypothesis accepted by the SPRT test.</p></div>
<div id="attachment_1928" style="width: 617px" class="wp-caption alignnone"><a href="https://project.inria.fr/plasma-lab/files/2015/03/algo_reward_sprt_false.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1928" class="wp-image-1928" alt="Hypothesis rejected by the SPRT test." src="https://project.inria.fr/plasma-lab/files/2015/03/algo_reward_sprt_false.png" width="607" height="77" /></a><p id="caption-attachment-1928" class="wp-caption-text">Hypothesis rejected by the SPRT test.</p></div>
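<p style="text-align: justify;">As a rough illustration of the iterative scheme described above, the sketch below follows the same idea: draw a pool of candidate schedulers with the whole simulation budget, score each one by simulation, keep the best fraction, and rerun the survivors with more simulations per scheduler. This is only a minimal sketch, not PLASMA Lab's actual implementation: the <code>simulate</code> function, the representation of schedulers as integer seeds, and the exact pruning rule are all illustrative assumptions.</p>

```python
def smart_reward_estimation(simulate, budget, reduction_factor, iterations, maximize=True):
    """Sketch of an elite-based scheduler search (illustrative, not the tool's API).

    simulate(scheduler) -> reward of one simulated path under that scheduler.
    """
    schedulers = list(range(budget))  # candidate schedulers, identified here by seeds
    sims_per_sched = 1                # first iteration: one simulation per scheduler
    best = None
    for _ in range(iterations):
        # Estimate each scheduler's expected reward by averaging its path rewards.
        scores = []
        for sched in schedulers:
            rewards = [simulate(sched) for _ in range(sims_per_sched)]
            scores.append((sum(rewards) / sims_per_sched, sched))
        # Rank schedulers (descending for maximum reward, ascending for minimum).
        scores.sort(key=lambda s: s[0], reverse=maximize)
        best = scores[0][0]
        # Keep the best schedulers and give each survivor more simulations.
        keep = max(1, len(schedulers) // reduction_factor)
        schedulers = [sched for _, sched in scores[:keep]]
        sims_per_sched *= reduction_factor
    return best
```

<p style="text-align: justify;">With a deterministic toy <code>simulate</code>, the search converges to the extremal scheduler; in the real setting each call samples a random path of the MDP under the given scheduler.</p>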
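<p style="text-align: justify;">The hypothesis test on the satisfaction probability can be sketched with a generic Wald SPRT. The page only names the <em>threshold</em> and the confidence parameter <em>alpha</em>; the second error bound <code>beta</code> and the indifference-region width <code>delta</code> below are assumptions added to make the sketch complete.</p>

```python
import math

def sprt(samples, threshold, alpha=0.01, beta=0.01, delta=0.01):
    """Generic Wald SPRT sketch: is the satisfaction probability above `threshold`?

    Tests H1: p >= threshold + delta against H0: p <= threshold - delta,
    with an indifference region of width 2 * delta around the threshold.
    Returns True (accepted), False (rejected) or None (undecided).
    """
    p_hi = threshold + delta
    p_lo = threshold - delta
    accept_bound = math.log((1 - beta) / alpha)
    reject_bound = math.log(beta / (1 - alpha))
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    for sat in samples:  # sat is truthy if the BLTL property held on the trace
        llr += math.log(p_hi / p_lo) if sat else math.log((1 - p_hi) / (1 - p_lo))
        if llr >= accept_bound:
            return True   # hypothesis accepted: indicator shown in green
        if llr <= reject_bound:
            return False  # hypothesis rejected: indicator stays red
    return None  # undecided: more samples would be needed
```

<p style="text-align: justify;">The test stops as soon as the accumulated evidence crosses one of the two bounds, so it typically needs far fewer traces than a fixed-sample test at the same error levels.</p>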