Sunday, December 15, 2013

PPO AI algorithm on PyTorch

1. Algorithm
https://spinningup.openai.com/en/latest/algorithms/ppo.html
https://zhuanlan.zhihu.com/p/108034550

2. PyTorch Python source
https://github.com/fatalfeel/PPO_Pytorch (fully commented)

6. DD-PPO (Decentralized PPO)

!!!PPO Lessons!!!
Takes you from beginner to experienced in 30 days:
(a) First neural AI system: the Probabilistic Neural Network (PNN), 1988
http://www.mediafire.com/file/94sg5skfxa7hgvz/PNN.zip
(b) Probability Mathematics
#derivative of log
https://socratic.org/questions/how-do-you-differentiate-y-log-x-2-2
#variance and standard deviation
https://www.mathsisfun.com/data/standard-deviation.html
#entropy
http://webservices.itcs.umich.edu/mediawiki/lingwiki/index.php/Entropy
http://www.bituzi.com/2015/12/entropy.html
#basic probability
http://www.mediafire.com/file/b9z4urhvbolys3s/mathematics.zip
#expected value
https://www.statlect.com/fundamentals-of-probability/expected-value
#conditional expected value
https://www.youtube.com/watch?v=_MSdrWxi2mw
#categorical distribution
https://pytorch.org/docs/stable/distributions.html
http://www.mediafire.com/file/f2gt6x6ah7gd41m/categorical_distribution.py 
#multivariate normal distribution
http://www.mediafire.com/file/gv9ocq7m4tviq24/multivariatenormal_distribution.py
#Monte Carlo approximation
https://www.statlect.com/asymptotic-theory/Monte-Carlo-method
#Importance sampling (important)
https://www.youtube.com/watch?v=V8f8ueBc9sY
https://www.math.arizona.edu/~tgk/mc/book_chap6.pdf
http://www.mediafire.com/file/s6cf3utr8kvvnyy/Importance-Sampling.py

Importance sampling example:
E[x] denotes the expected value of x.
Dice Q is fair: each face has probability 1/6 ≈ 0.166.
Dice P is loaded: each face has a different probability.

Q:
Eq[xi] = Σxi*q(xi), i = 1~6, q(xi) = 1/6 ≈ 0.166
= 1*0.166 + 2*0.166 + 3*0.166 + 4*0.166 + 5*0.166 + 6*0.166 = 3.486 ≈ 3.5
also
1/n*Σxi, n is 6
= 0.166*(1+2+3+4+5+6) = 3.486 ≈ 3.5
so, when q(xi) is uniform,
Eq[xi] = 1/n*Σxi
and when q(xi) is only slightly non-uniform,
Eq[xi] ≈ 1/n*Σxi

P:
Ep[xi] = Σxi*p(xi), i = 1~6
= 1*0.50 + 2*0.25 + 3*0.15 + 4*0.05 + 5*0.035 + 6*0.015 = 1.915

Eq[xi] = 1/n*Σxi
Ep[xi] = Σxi*p(xi) = Σxi*(p(xi)/q(xi))*q(xi)
Substituting xi → xi*p(xi)/q(xi) into (Eq[xi] = 1/n*Σxi) gives:
Eq[xi*p(xi)/q(xi)] = 1/n*Σxi*p(xi)/q(xi), n is 6
= 0.166*((1*0.50/0.166)+(2*0.25/0.166)+(3*0.15/0.166)+(4*0.05/0.166)+(5*0.035/0.166)+(6*0.015/0.166))
= 1.915  #same expected value as Ep[xi]

Conclusion:
when q(xi) is uniform
Ep[xi] = Σxi*p(xi) = Σxi*(p(xi)/q(xi))*q(xi) = Eq[xi*p(xi)/q(xi)] = 1/n*Σxi*p(xi)/q(xi)
when q(xi) is only slightly non-uniform
Ep[xi] = Σxi*p(xi) = Σxi*(p(xi)/q(xi))*q(xi) = Eq[xi*p(xi)/q(xi)] ≈ 1/n*Σxi*p(xi)/q(xi)

(c) Mean squared error
https://medium.com/uxai/%E5%BE%9E%E5%AF%A6%E4%BE%8B%E9%80%90%E6%AD%A5%E4%BA%86%E8%A7%A3%E5%9F%BA%E6%9C%AC-pytorch-%E5%A6%82%E4%BD%95%E9%81%8B%E4%BD%9C-653a7843323e
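A minimal MSE sketch in plain Python (the prediction and target values are made-up illustration numbers; torch.nn.MSELoss with the default mean reduction computes the same thing):

```python
# MSE = (1/n) * sum((prediction - target)^2)
preds   = [2.5, 0.0, 2.0, 8.0]
targets = [3.0, -0.5, 2.0, 7.0]

n = len(preds)
mse = sum((yhat - y) ** 2 for yhat, y in zip(preds, targets)) / n
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```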

(d) Loss function
https://www.youtube.com/watch?v=fegAeph9UaA

(e) Forward pass and Back propagation (important)
https://zhuanlan.zhihu.com/p/40378224
https://www.cnblogs.com/charlotte77/p/5629865.html
http://www.mediafire.com/file/czit8q113fwv7pt/backpropagation.py
http://www.mediafire.com/file/vtm391yxzpksxnx/torch_backward.py
https://software.intel.com/content/www/cn/zh/develop/articles/step-by-step-explaination-on-neural-network-backward-propagation-process.html
https://www.youtube.com/watch?v=ibJpTrp5mcE
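A minimal backpropagation sketch for a one-weight network y = w*x with squared-error loss, checking the chain-rule gradient against a finite difference (all numbers are made-up illustration values):

```python
# Forward pass computes the loss; backward pass applies the chain rule:
# L = (y - t)^2, y = w*x  =>  dL/dw = dL/dy * dy/dw = 2*(y - t) * x
w, x, t = 0.5, 2.0, 3.0

y = w * x                      # forward pass
loss = (y - t) ** 2
grad_w = 2.0 * (y - t) * x     # backward pass (chain rule)

# Numerical check: nudge w by eps and measure the loss change
eps = 1e-6
loss_plus = ((w + eps) * x - t) ** 2
num_grad = (loss_plus - loss) / eps

print(grad_w, num_grad)  # both close to -8.0
```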

(f) Markov Reward and Decision Processes
https://zhuanlan.zhihu.com/p/35231424  #MRP
https://zhuanlan.zhihu.com/p/35354956  #MDP
https://blog.techbridge.cc/2018/10/27/intro-to-mdp-and-app
https://blog.techbridge.cc/2018/12/22/intro-to-mdp-program
http://www.cs.cmu.edu/afs/andrew/course/15/381-f08/www/lectures/HandoutMDP.pdf
http://www.mediafire.com/file/911yt4ttaot7vyp/mkdp.py
https://www.youtube.com/watch?v=9g32v7bK3Co
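The discounted return G = Σ γ^t * r_t that an MRP/MDP value function estimates in expectation can be sketched in plain Python (γ and the reward sequence are made-up illustration values, not taken from the linked mkdp.py):

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0]

# Forward form: G = sum over t of gamma^t * r_t
g_forward = sum((gamma ** t) * r for t, r in enumerate(rewards))

# Backward recursion G_t = r_t + gamma * G_{t+1},
# the form typically used when processing a rollout in reverse
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

print(g_forward, g)  # 1 + 0.9*0 + 0.81*2 = 2.62 either way
```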

(g) Reinforcement Learning
https://bluesmilery.github.io/blogs/e4dc3fbf
https://pylessons.com/A2C-reinforcement-learning
https://www.youtube.com/watch?v=W8XF3ME8G2I

(h) Policy Gradient
https://www.janisklaise.com/post/rl-policy-gradients
https://medium.com/change-the-world-with-technology/policy-gradient-181d43a24cf5
https://www.youtube.com/watch?v=y8UPGr36ccI
https://www.youtube.com/watch?v=z95ZYgPgXOY
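The quantity at the core of policy gradients is ∇θ log π(a|s). For a two-action softmax policy the gradient has the closed form 1[k == a] − π(k); a sketch with made-up logits, checked numerically:

```python
import math

theta = [0.2, -0.4]  # logits of a two-action softmax policy

def policy(th):
    z = [math.exp(v) for v in th]
    s = sum(z)
    return [v / s for v in z]

pi = policy(theta)
a = 0  # the sampled action

# Analytic gradient of log pi(a) w.r.t. each logit: 1[k == a] - pi(k)
grad_log = [(1.0 if k == a else 0.0) - pi[k] for k in range(2)]

# Numerical check on theta[0]
eps = 1e-6
num = (math.log(policy([theta[0] + eps, theta[1]])[a]) - math.log(pi[a])) / eps
print(grad_log, num)
```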

(i) A3C and the advantage function A(s,a)
https://www.cnblogs.com/wangxiaocvpr/p/8110120.html
https://www.youtube.com/watch?v=O79Ic8XBzvw 
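A one-step advantage estimate, A(s,a) ≈ r + γV(s') − V(s), sketched in plain Python with made-up numbers (it measures how much better the taken action was than the critic's baseline V(s)):

```python
gamma = 0.99
r = 1.0        # reward observed after taking action a in state s
v_s = 0.5      # critic's value estimate for current state s
v_next = 0.8   # critic's value estimate for next state s'

advantage = r + gamma * v_next - v_s
print(advantage)  # 1 + 0.792 - 0.5 = 1.292
```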

(j) PPO
https://pylessons.com/PPO-reinforcement-learning
http://www.mediafire.com/file/rp0skd2kwpc78it/PPO.pdf
https://www.youtube.com/watch?v=OAKAZhFmYoI
(Strongly recommend using PyCharm to step through the source. It is made by JetBrains, which also builds Android Studio with substantial support from Google.)
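The PPO clipped surrogate objective from the paper linked above can be sketched for a single sample (ε = 0.2 is the commonly used default; the ratios and advantages here are made-up illustration values):

```python
# PPO clipped surrogate for one (state, action) sample:
# L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
# where ratio = pi_new(a|s) / pi_old(a|s) and A is the advantage.
def ppo_clip(ratio, advantage, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped at (1 + eps) * A,
# which limits how far one update can push the policy.
print(ppo_clip(1.5, 2.0))  # 2.4, not 3.0
# A ratio inside the clip range passes through unchanged.
print(ppo_clip(1.1, 2.0))  # 2.2
```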

2D/3D formula plotting tools:
https://www.desmos.com/calculator
https://www.geogebra.org/3d

Tesla, in the New York Herald: "I prefer to be remembered as the inventor who succeeded in abolishing war. That will be my highest pride."
http://www.teslacollection.com/tesla_articles/1898/new_york_herald/f_l_christman/tesla_declares_he_will_abolish_war (in middle section)

Albert Einstein: "The release of atom power has changed everything except our way of thinking... the solution to this problem lies in the heart of mankind. If only I had known, I should have become a watchmaker."
https://atomictrauma.wordpress.com/the-scientists/albert-einstein
