You can find the lecture notes in Markdown here. Feel free to contribute and improve them further.

Lecture 1 notes for the UCL RL course, taught by David Silver.

Reinforcement Learning: it sits at the intersection of many fields of science. It is the science of decision making: a general method for understanding optimal decisions.

Ex: machine learning (computer science), optimal control (engineering), the reward system (neuroscience), conditioning (psychology), operations research (mathematics), bounded rationality (economics).

So it is common to many branches of science, and it is a general approach to solving reward-based problems.

RL vs Other Learning: there is no supervisor, only a reward signal; feedback is delayed, not instantaneous; time matters (sequential, non-i.i.d. data); and the agent's actions affect the subsequent data it receives.

Real life use cases: flying stunt manoeuvres in a helicopter, playing Atari games better than humans, managing an investment portfolio, controlling a power station, making a humanoid robot walk.

RL Problem

(Informal) Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward.

Different goals can be compared and traded off against each other, so they can all be placed on a single scale of comparison; hence the reward can ultimately be expressed as a scalar.
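To make "cumulative reward" concrete, here is a sketch in standard notation (the symbol G(t) for the return is my addition, not from these notes); the discounted version appears later with gamma:

```latex
% Cumulative (undiscounted) reward collected from time step t onwards;
% R_{t+1}, R_{t+2}, ... are the scalar rewards received after step t.
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots
```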

By definition, a goal may be intermediate, a final goal, time based, et cetera.

The first step is understanding the reward signal.

Sequential Decision Making

Formalism

Agent

We control the brain here: the brain is the agent.

Environment:

The environment exists outside the agent.

At every step, the agent receives observations generated by the environment, and the agent in turn influences the environment by taking actions.

The machine learning problem of RL is defined over the stream of data coming from this trial-and-error interaction.
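A minimal sketch of this interaction loop in Python; `CoinFlipEnv` and `RandomAgent` are made-up toy placeholders for illustration, not something from the course:

```python
import random

class CoinFlipEnv:
    """Toy environment: reward +1 if the action matches a hidden coin flip, else 0."""
    def step(self, action):
        coin = random.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        observation = coin          # the agent gets to see the outcome of the flip
        return observation, reward

class RandomAgent:
    """Toy agent: ignores its observations and acts at random."""
    def act(self, observation, reward):
        return random.choice([0, 1])

env, agent = CoinFlipEnv(), RandomAgent()
observation, reward = None, 0.0
total_reward = 0.0
for t in range(100):                          # each time step t
    action = agent.act(observation, reward)   # agent selects an action
    observation, reward = env.step(action)    # environment emits observation and reward
    total_reward += reward
print("total reward:", total_reward)
```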

History and State

The history is the stream of experience: the sequence of observations, actions and rewards seen so far.

H(t) denotes the history up to time t, written out below.
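In the lecture's notation (a sketch of the standard definition):

```latex
% The history up to time t: everything observable the agent has seen so far.
H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t
```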

State: it is a summary of the information used to determine what happens next. Formally, the state is a function of the history: S(t) = f(H(t)).

Note: For a multi-agent problem, an agent can consider other agents as part of the Environment.

Information State: Markov State

An information state contains all useful information from the history.

Markov Property: a state is Markov if and only if the probability of the next state given the current state is the same as the probability of the next state given all previous states. In other words, only the current state determines the next state; the rest of the history is not relevant.
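In symbols, a sketch of the standard definition:

```latex
% A state S_t is Markov if and only if the next state depends only on the
% current state, not on the rest of the history:
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]
```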

If the Markov property holds, the future is independent of the past given the present, since the state characterises everything relevant about the history.

Another definition: the state is a sufficient statistic of the future.

By definition, the environment state is Markov.

The entire history is also a Markov state (though not a useful one).

  1. Fully Observable Environment: the agent gets to see the complete environment state, so Agent State = Environment State = Information State. This is known as an MDP (Markov Decision Process).
  2. Partially Observable Environment: the agent observes the environment indirectly, so Agent State != Environment State. This is known as a Partially Observable MDP (POMDP). Possible agent state representations: the complete history, beliefs about the environment state, or a recurrent neural network; see the sketch after this list.
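A rough sketch of the notation the lecture uses (S^a for agent state, S^e for environment state, O for observation); treat the POMDP representations as reproduced from the slides, not from these notes:

```latex
% Full observability (MDP): the agent directly observes the environment state.
O_t = S^a_t = S^e_t

% Partial observability (POMDP): the agent builds its own state, for example
% the complete history, beliefs over the environment state, or a recurrent network:
S^a_t = H_t
\qquad
S^a_t = \bigl(\mathbb{P}[S^e_t = s^1], \ldots, \mathbb{P}[S^e_t = s^n]\bigr)
\qquad
S^a_t = \sigma\bigl(S^a_{t-1} W_s + O_t W_o\bigr)
```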

Inside an RL Agent

An RL agent may (or may not) include one or more of these components:

Policy:

It is a map from state to action: it determines what the agent will do when it is in a given state.
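As a minimal sketch, a policy can be deterministic (a = pi(s)) or stochastic (pi(a | s)); the states, actions and probabilities below are made up for illustration:

```python
import random

# Deterministic policy: a = pi(s), represented here as a plain dict.
deterministic_policy = {
    "s1": "left",
    "s2": "right",
}

# Stochastic policy: pi(a | s), a probability distribution over actions per state.
stochastic_policy = {
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.2, "right": 0.8},
}

def act(state):
    """Sample an action from the stochastic policy in the given state."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])  # always "left"
print(act("s2"))                   # "right" about 80% of the time
```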

Value Function:

It is a prediction of expected future reward. We choose between actions by going for the highest expected reward, and the value function gives us an estimate of this.

The value function depends on the way in which we are behaving, i.e. on the policy. It tells us how much reward we expect to get if we follow that policy, and thus helps us optimise our behaviour.

Gamma: the discount factor. It determines how much we care about immediate versus later rewards, and it sets the horizon for evaluating the future (the horizon being how far ahead we need to account for the outcomes of our actions).
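Putting the value function and gamma together, a sketch of the standard definition (the value of state s under policy pi, with gamma between 0 and 1):

```latex
% Expected total discounted reward when starting from state s and following policy pi.
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]
```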

Model:

It is the agent's representation of the environment: it predicts what the environment will do next. It is not necessary to build a model of the environment, but it is often useful when we do.

It can be divided into two parts:

  1. State Transition Model: predicts the next state, given the current state and action.
  2. Reward Model: predicts the expected immediate reward, given the current state (and action).
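A small tabular sketch of both parts; the states, actions and numbers are made up for illustration:

```python
import random

# State transition model: P[s][a][s'] ~ probability of landing in s' after taking a in s.
P = {
    "s1": {"go": {"s1": 0.1, "s2": 0.9}},
    "s2": {"go": {"s1": 0.5, "s2": 0.5}},
}

# Reward model: R[s][a] ~ expected immediate reward for taking a in s.
R = {
    "s1": {"go": 0.0},
    "s2": {"go": 1.0},
}

def simulate_step(state, action):
    """Use the model (instead of the real environment) to imagine one step ahead."""
    next_states, probs = zip(*P[state][action].items())
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[state][action]

print(simulate_step("s1", "go"))  # e.g. ("s2", 0.0)
```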

RL Agents:

We categorise our agents based on which of the above three components they use. For example, a value-based agent stores a value function, and its policy is implicit (see the sketch below).
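A minimal sketch of why the policy is implicit in the value-based case: given a table of action values Q(s, a) (made-up numbers below), acting greedily with respect to Q already defines a policy.

```python
# Value-based agent: stores only action values Q(s, a); the policy is implicit,
# because acting greedily with respect to Q tells the agent what to do.
Q = {
    "s1": {"left": 0.2, "right": 0.7},
    "s2": {"left": 0.9, "right": 0.1},
}

def greedy_policy(state):
    """Implicit policy: pick the action with the highest estimated value."""
    return max(Q[state], key=Q[state].get)

print(greedy_policy("s1"))  # -> "right"
print(greedy_policy("s2"))  # -> "left"
```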

Policy based: explicitly maintains a representation of the policy (what to do in every state) without storing a value function.

Actor Critic: combines both a policy and a value function.

RL agents can further be categorised as model free (a policy and/or value function, but no model) or model based (a policy and/or value function, plus a model of the environment).

Problems within RL

Learning and Planning:

There are two fundamental problems in sequential decision making.

Learning (the reinforcement learning problem):

  1. The environment is initially unknown.
  2. The agent interacts with the environment (trial and error).
  3. The agent improves its policy.

Planning:

  1. A model of the environment is known to the agent.
  2. The agent performs computations with its model (without any external interaction) and improves its policy.

Exploration Vs Exploitation

Another key aspect of RL.

Exploration: choosing to give up some known reward in order to find out more about the environment.

Exploitation: Exploits known information to maximise reward.

There is an Exploration Vs Exploitation Tradeoff.
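One common way to balance the two (a standard technique, not something specific to this lecture) is epsilon-greedy action selection; the Q table below is a made-up toy example:

```python
import random

# Epsilon-greedy: with probability epsilon we explore (random action);
# otherwise we exploit (greedy action with respect to our current estimates).
Q = {"s1": {"left": 0.2, "right": 0.7}}

def epsilon_greedy(state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(Q[state]))   # explore: try something at random
    return max(Q[state], key=Q[state].get)     # exploit: use what we already know

print(epsilon_greedy("s1"))  # usually "right", occasionally "left"
```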

Prediction and Control

Prediction: estimate the future, given the current policy (how well will we do if we follow it?).

Control: Find the Best policy.

In RL, solving the control problem usually involves solving the prediction problem first: we evaluate policies in order to find the best one.
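In the value-function notation from above, a sketch of the two problems:

```latex
% Prediction: given a policy pi, estimate how well it does, i.e. compute its value function.
\text{Prediction:}\quad \text{evaluate } v_\pi(s) \text{ for a fixed policy } \pi
% Control: find the best possible value function and a policy that achieves it.
\text{Control:}\quad v_*(s) = \max_\pi v_\pi(s), \qquad \pi_* = \operatorname*{arg\,max}_\pi v_\pi(s)
```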

Subscribe to the newsletter for a weekly curated list of deep learning and computer vision reads.

You can find me on Twitter @bhutanisanyam1