Pushing Thirty

Good news: I had my wedding photos taken

Created 2023-06-05 | Life

Huge news!

Good news: I'm engaged!

Created 2023-05-06

Huge news!

11-24

Created 2022-11-26 | Journal

In deep mourning for our compatriots who died in the November 24 fire in Xinjiang during the pandemic lockdown

On Policy Approximation

Created 2022-11-20 | Research

A review of the discussion of on-policy approximation in Chapter 9 of Sutton's book

cs234-4: A simple take on SARSA, Q-learning, on-policy and off-policy

Created 2022-11-16 | Notes

Revisiting the basic algorithms

Policy gradient method

Created 2022-11-16 | Research

A fresh summary of the REINFORCE method from Chapter 13 of Sutton's "Reinforcement Learning: An Introduction"

Review: A Survey of Learning in Multiagent Environments Dealing with Non-Stationarity

Created 2022-11-11

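TFT starts by cooperating and then repeats the opponent's choice from the previous round: if the opponent cooperated in the previous round, it cooperates in this round; if the opponent defected, it defects. Pavlov cooperates if both players chose the same action in the previous round and defects otherwise. New framework: policy generating function; belief $\beta_j$; influence function $\theta$. I wonder whether the authors proposed these three notions themselves? I have not seen them in other papers. Best response, the optimal response in multiagent learning: $$BR_{i}(\hat{\theta})=\pi_{i}^{*}(s, a, \hat{\theta})=BR_{i}\left(\boldsymbol{\pi}_{-i} \mid \pi_{j} \sim \beta_{j}\left(\tau \mid h_{j}\right), h_{j} \sim p\left(h_{j} \mid h_{i}\right)\right)$$ Five ways to deal with non-stationarity: behavior...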
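
To make the two iterated prisoner's dilemma strategies in this excerpt concrete, here is a minimal Python sketch. It assumes the usual win-stay, lose-shift reading of Pavlov (cooperate when both players chose the same action in the previous round, defect otherwise) and assumes Pavlov opens with cooperation. The names `tit_for_tat`, `pavlov`, `always_defect` and `play` are hypothetical helpers for illustration, not taken from the survey.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"  # cooperate, defect

# A strategy maps (own history, opponent history) to the next action.
Strategy = Callable[[List[str], List[str]], str]


def tit_for_tat(my_history: List[str], opp_history: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous action."""
    if not opp_history:
        return C
    return opp_history[-1]


def pavlov(my_history: List[str], opp_history: List[str]) -> str:
    """Win-stay, lose-shift (assumed definition): cooperate if both players
    chose the same action last round, defect otherwise.
    Opening with cooperation is an assumption."""
    if not opp_history:
        return C
    return C if my_history[-1] == opp_history[-1] else D


def always_defect(my_history: List[str], opp_history: List[str]) -> str:
    """Baseline opponent that never cooperates."""
    return D


def play(a: Strategy, b: Strategy, rounds: int) -> Tuple[List[str], List[str]]:
    """Run both strategies against each other and return their action histories."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    for _ in range(rounds):
        # Both players act simultaneously on the histories so far.
        act_a = a(hist_a, hist_b)
        act_b = b(hist_b, hist_a)
        hist_a.append(act_a)
        hist_b.append(act_b)
    return hist_a, hist_b


if __name__ == "__main__":
    print(play(tit_for_tat, always_defect, 4))
    print(play(pavlov, always_defect, 4))
```

Against a fixed defector, TFT locks into defection after the first round, while Pavlov under this definition alternates between defecting and cooperating. That difference is one way to see how the same opponent can induce different dynamics depending on the learner's rule.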