TY - THES
ID - 137463560
TI - Investigating the Limits of Learning Non-Markovian Reward Models
AU - Boone, Ilio
AU - De Raedt, Luc
AU - KU Leuven. Faculteit Ingenieurswetenschappen. Opleiding Master of Artificial Intelligence (Leuven)
PY - 2021
PB - Leuven KU Leuven. Faculteit Ingenieurswetenschappen
DB - UniCat
UR - https://www.unicat.be/uniCat?func=search&query=sysid:137463560
AB - This thesis investigates the learnability of Markov Decision Processes (MDPs) with added complexity along multiple dimensions. A theoretical investigation into the implications of these complexity dimensions is performed, and it is discussed how existing algorithms deal with them. Three complexity dimensions are defined: temporal dependence, partial observability, and dynamism. The temporal dependence and dynamism dimensions are closely related, but are distinguished because there is a subtle difference regarding the learnability of MDPs with these complexities. In the Reinforcement Learning literature, frameworks exist for solving Decision Processes with non-Markovian Rewards (NMRDPs), based on the use of a Mealy Reward Machine (MRM), and for Partially Observable MDPs, based on the use of belief states. Both methods transform the original problem into a process that behaves in a Markovian way, allowing traditional model-solving techniques to be used. The limitations of these techniques for dynamic PO-NMRDPs, which combine the three dimensions of complexity, are investigated, and it is proposed how they can be extended to fit dynamic PO-NMRDPs. Consequently, two methods are presented to handle PO-NMRDPs, one based on an MRM and the other on the use of belief states. An important distinction between the two concerns their policies: the former yields a deterministic policy, whereas the latter results in a stochastic action plan. Following the theoretical discussion of complex MDPs, a practical framework is proposed that simultaneously learns an MRM to represent the reward structure and derives an optimal deterministic policy to maximize rewards in dynamic PO-NMRDPs. This framework is an extension of the ARM framework, which was developed to deal with standard NMRDPs. The issues of learning dynamic PO-NMRDPs using the standard ARM framework are discussed, and it is explained in detail how these issues are resolved. The learning phase is based on Angluin's L* algorithm, which constructs an MRM; this MRM is combined with the original state space to create a synchronized product that behaves in a Markovian manner. Traditional model-checking techniques are employed on this synchronized product in order to derive an optimal deterministic, reactive policy.
ER -