E-book: Learning Representation and Control in Markov Decision Processes: New Frontiers (English)

# Computers # Networking # Sequential decision problems
Size: 1.27M | Pages: 163 | Listed: 2022-03-03 | Language: English

Type: E-book

Uploader: 二一

Publication date: 2022-03-03

Abstract:

From the series Foundations and Trends in Machine Learning, NOWPress, 2008, 163 pp.

This paper describes a novel machine learning framework for solving sequential decision problems called Markov decision processes (MDPs) by iteratively computing low-dimensional representations and approximately optimal policies. A unified mathematical framework for learning representation and optimal control in MDPs is presented, based on a class of singular operators called Laplacians, whose matrix representations have nonpositive off-diagonal elements and zero row sums. Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the Drazin inverse. A generic algorithm called representation policy iteration (RPI) is presented, which interleaves computing low-dimensional representations and approximately optimal policies. Two approaches for dimensionality reduction of MDPs are described, based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by diagonalization or dilation of Laplacian operators. Model-based and model-free variants of the RPI algorithm are presented and compared experimentally on discrete and continuous MDPs. Finally, some directions for future work are outlined.

Introduction
Sequential Decision Problems
Laplacian Operators and MDPs
Approximating Markov Decision Processes
Dimensionality Reduction Principles in MDPs
Basis Construction: Diagonalization Methods
Basis Construction: Dilation Methods
Model-Based Representation Policy Iteration
Basis Construction in Continuous MDPs
Model-Free Representation Policy Iteration
Related Work and Future Challenges
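
The abstract characterizes Laplacians as singular operators whose matrices have nonpositive off-diagonal elements and zero row sums. As a minimal sketch of that property only (not the book's RPI algorithm or its Drazin-inverse solution), the Python snippet below builds L = I - P from a row-stochastic transition matrix P and checks both conditions. The 3-state chain and the helper names `laplacian` and `is_laplacian` are hypothetical, chosen purely for illustration.

```python
# Hypothetical 3-state chain; the transition probabilities are made up.
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]


def laplacian(P):
    """Return L = I - P for a row-stochastic matrix P (list of rows)."""
    n = len(P)
    return [
        [(1.0 if i == j else 0.0) - P[i][j] for j in range(n)]
        for i in range(n)
    ]


def is_laplacian(L, tol=1e-12):
    """Check the two properties named in the abstract:
    zero row sums and nonpositive off-diagonal elements."""
    n = len(L)
    rows_ok = all(abs(sum(row)) <= tol for row in L)
    offdiag_ok = all(
        L[i][j] <= tol for i in range(n) for j in range(n) if i != j
    )
    return rows_ok and offdiag_ok


L = laplacian(P)

# Zero row sums mean L maps the constant vector to zero, so L is singular.
ones = [1.0] * len(L)
L_times_ones = [sum(L[i][j] * ones[j] for j in range(len(L))) for i in range(len(L))]

print(is_laplacian(L))
print(L_times_ones)
```

Because each row of P sums to one, every row of I - P sums to zero, which is why the constant vector lies in the null space and an ordinary inverse does not exist; this is the reason the abstract refers to a generalized inverse.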
