Abstract
We consider the multiarmed bandit problem (the problem of Markov bandits) with switching penalties and no discounting in the case where the state spaces of all bandits are finite. An optimal strategy should attain the largest average reward per unit time over an infinite time horizon. For this problem, it is shown that an optimal strategy can be specified by a Gittins index under the natural assumption that the switching penalties are nonnegative.
Original language | English |
---|---|
Pages (from-to) | 355-364 |
Number of pages | 10 |
Journal | Theory of Probability and its Applications |
Volume | 64 |
Issue number | 3 |
DOI | |
Status | Published - 1 Jan 2019 |