1. Reinforcement Learning 101 (Bandit Problem)
Reinforcement Learning
- The agent learns the optimal behavior by interacting with the environment on its own.
Types of Machine Learning
- Supervised Learning
- The answer values are labeled → Equivalent to knowing the optimal solution in advance.
- Unsupervised Learning
- Mainly used to find structures or patterns in data without answer labels.
- Clustering, feature extraction, dimensionality reduction, etc.
The agent, which is the subject of action, is placed in an environment, observes the state of the environment, and takes an action suitable for the state. As a result of the action, the environment changes, and at this time, it receives a reward and simultaneously observes a new state. In this process, the agent learns how to maximize the total sum of rewards it receives.
Repetition of State → Action → Reward
Reinforcement learning receives rewards as feedback from the environment, which differ in nature from the answer labels of supervised learning. In supervised learning, the label itself is the optimal answer; in reinforcement learning, a single reward does not tell the agent whether its action was optimal.
- Situation: There are multiple slot machines with fixed probabilities, and the agent does not know the probability of obtaining a coin. At this time, it needs to figure out which slot machine has the best winning rate and acquire the maximum amount of coins.
- At this time, the state information about the environment does not change. (Because using a slot machine once does not change the environment (probability of the slot machine).)
- Following the highest probability is different from following the highest expected value. (The probabilities of a slot machine returning 0, 1, 5, or 10 coins can vary. We must follow the highest expected value.)
- Highest expected value → the value, or action value
- Action value → the expected reward when action a is selected: q(a) = E[R | A = a]
- Q(a) → estimate / q(a) → actual action value
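To make the probability-versus-expected-value distinction concrete, here is a small sketch with made-up payout tables (the probabilities 0.9 and 0.2 and the coin amounts are illustrative, not from the book):

```python
# Illustrative (made-up) payout tables: map coin amount -> probability.
machine_a = {1: 0.9, 0: 0.1}    # wins often, small payout
machine_b = {10: 0.2, 0: 0.8}   # wins rarely, large payout

def expected_value(payouts):
    """Expected number of coins per play."""
    return sum(coins * p for coins, p in payouts.items())

print(expected_value(machine_a))  # 0.9
print(expected_value(machine_b))  # 2.0
```

Machine A has the higher win probability, but machine B has the higher expected value, so following the highest win probability would pick the wrong machine.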
We must estimate the slot machine as accurately as possible (through learning).
- If we estimate the value by the sample mean, we can compute it with the incremental update Q_n = Q_{n-1} + (R_n - Q_{n-1}) / n. The reason for using this formula is to avoid repeating many additions (re-summing every past reward) when n becomes sufficiently large.
R_n is the reward (number of coins) obtained on the nth play.
Here, 1/n plays the role of the learning rate, because it controls the size of the update itself. As n increases, the learning rate decreases.
This can be expressed in code as follows → also called an incremental implementation.
Q = Q + (reward - Q) / n
# The formula is the same above and below
Q += (reward - Q) / n
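The incremental update can be checked against the batch mean with a minimal sketch; the win probability 0.3 of the simulated machine is an illustrative assumption, not a value from the book:

```python
import random

# Estimate a single slot machine's action value incrementally.
# The machine pays 1 coin with (assumed) probability 0.3.
random.seed(0)
Q = 0.0
rewards = []
for n in range(1, 1001):
    reward = 1.0 if random.random() < 0.3 else 0.0  # simulated payout
    rewards.append(reward)
    Q += (reward - Q) / n  # incremental implementation of the sample mean
```

After the loop, Q equals sum(rewards) / len(rewards) up to floating-point error, without ever re-summing the whole reward history.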
Player's Policy
- Greedy policy: always exploiting only the best choice based on play so far. → In this case, if the best choice is estimated incorrectly early on, the actual optimal action may be missed (problems arise when the reliability of the estimates is low).
- Exploitation / Exploration [They are in a trade-off relationship]
- To avoid missing the actual optimal solution, the exploration process is performed in parallel → ε-greedy algorithm
- Explore with probability ε and exploit with probability 1-ε, maximizing the expected value while not missing the optimal action. [Choosing an appropriate ε is very important.]
- Since the results obtained from each learning session are different, multiple attempts must be made to average them.
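The ε-greedy idea above can be sketched as follows. This is not the book's code; the machine win probabilities (0.1, 0.3, 0.5), ε = 0.1, and the step count are illustrative assumptions:

```python
import random

def run(epsilon=0.1, steps=1000, probs=(0.1, 0.3, 0.5), seed=0):
    """One ε-greedy play session over len(probs) slot machines."""
    rng = random.Random(seed)
    Q = [0.0] * len(probs)   # action-value estimate per machine
    N = [0] * len(probs)     # play count per machine
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                      # explore
            a = rng.randrange(len(probs))
        else:                                           # exploit
            a = max(range(len(probs)), key=lambda i: Q[i])
        reward = 1.0 if rng.random() < probs[a] else 0.0
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]  # incremental sample mean
        total += reward
    return total, Q

total, Q = run()
```

Since each session's result differs, one would average `total` over many runs with different seeds, as noted above.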
Refer to the book for the actual code implementation and results.
Non-stationary problem
Until now, the probabilities of the slot machines in the multi-armed bandit problem have not changed.
If the probability changes over time, the importance of past reward data should decrease. Therefore, there is a need to adjust the learning rate.
- With a fixed learning rate α, the update becomes Q_n = Q_{n-1} + α(R_n - Q_{n-1}). Expanding it gives Q_n = (1-α)^n Q_0 + α R_n + α(1-α) R_{n-1} + ... + α(1-α)^{n-1} R_1, so rewards received a long time ago have lower weights as factors of (1-α) accumulate.
- At this time, bias may occur due to the initial value Q_0. (Its influence fades as (1-α)^n shrinks.)
- Since past rewards decrease exponentially, it is called exponential moving average.
Stationary problem vs Non-stationary problem
- Stationary problem - Sample Mean
- Non-stationary problem - Exponential moving average
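The fixed-α update and its exponential-moving-average expansion can be checked numerically; α = 0.1, the initial value Q0 = 0, and the short reward list are arbitrary illustrative values:

```python
# Fixed learning rate for non-stationary problems (illustrative values).
alpha = 0.1
Q0 = 0.0                      # initial estimate: the source of the bias
rewards = [1.0, 0.0, 1.0, 1.0]

Q = Q0
for r in rewards:
    Q += alpha * (r - Q)      # exponential-moving-average update

# Equivalent closed form: each past reward's weight decays by (1 - alpha),
# and Q0 keeps weight (1 - alpha)^n.
n = len(rewards)
closed = (1 - alpha) ** n * Q0 + sum(
    alpha * (1 - alpha) ** (n - 1 - i) * r for i, r in enumerate(rewards)
)
```

The loop and the closed form agree, which shows that recent rewards dominate the estimate while old rewards (and Q0) decay exponentially.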
This content was translated from Korean to English using GPT.