The Q-learning training, based upon the Bellman equation, updates the Q value as:
\begin{equation*} Q(s,u) \leftarrow Q(s,u)+\alpha \left( reward+\gamma \max_{u_{next}} Q(s_{next},u_{next})-Q(s,u) \right) \end{equation*}where
$s$ is the state, such as the count of existing stock.
$u$ is the action, such as the count of items to be purchased.
$reward$ is the cost of storing the remaining stock plus the lost sales revenue incurred
when the existing stock is less than the sales demand.
$\gamma$ is the discount factor of future Q values.
$\alpha$ is the learning rate of Q value update.
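As a minimal sketch, the update rule above can be written as a small Python helper operating on a Q table stored as a dict of dicts (the names here are illustrative):

```python
def q_update(q, state, action, reward, next_state, alpha=0.05, gamma=0.9):
    """One Bellman update of a tabular Q function.

    `q` is a dict of dicts: q[state][action] -> Q value.
    """
    best_next = max(q[next_state].values())  # max over u_next of Q(s_next, u_next)
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]
```

The Q table is mutated in place; the returned value is only a convenience for inspection.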
Each Q value Q(state, action) is associated with one state and one action taken at that state.
For the inventory control, the state is composed of the existing stock and the delivered order from last action.
state=(stock, delivery). The total stock=stock+delivery.
action=new stock to order, which will become delivery of next state.
demand=sales that are withdrawn from the existing stock, without exceeding the existing stock.
new_state=stock+delivery-demand
The boundary condition is that there is a maximum capacity to store the stock.
The action to order new stock shall not exceed (max_capacity-(stock+delivery))
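The state transition and the boundary condition can be sketched as follows (the function names are illustrative):

```python
def valid_actions(stock, delivery, max_capacity):
    """Feasible orders at state (stock, delivery): 0 up to the remaining capacity."""
    return list(range(max_capacity - (stock + delivery) + 1))

def transition(stock, delivery, action, demand):
    """Demand is served from stock + delivery (capped at availability);
    the newly ordered quantity becomes the next state's delivery."""
    new_stock = max(stock + delivery - demand, 0)
    return (new_stock, action)
```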
Create a Q table with all possible combinations of states and actions.
For example, if the maximum capacity is 5, then the total stock+delivery can take six values (0 through 5) and, at each level, there are (max_capacity-stock-delivery+1) actions.
If stock+delivery=3, then the action list is [0,1,2]. That means the order from the action can be quantities 0, 1, or 2.
There are a total of 6+5+4+3+2+1=21 possible Q values. Initially, set each Q value to a small random number.
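One way to build such a table, assuming the Q table is keyed by the total stock+delivery (which fully determines both the feasible action set and the transition):

```python
import random

def build_q_table(max_capacity, seed=0):
    """Q table keyed by total stock level (stock + delivery).

    Each feasible action at a level starts with a small random Q value.
    """
    rng = random.Random(seed)
    return {
        total: {action: rng.uniform(0.0, 0.01)
                for action in range(max_capacity - total + 1)}
        for total in range(max_capacity + 1)
    }
```

For max_capacity=5 this yields 6+5+4+3+2+1=21 Q values, matching the count above.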
Before the training begins, define the number of training iterations.
For each iteration, perform the following steps:
(a) Stock and Delivery: Identify the stock and delivery at the beginning of this iteration.
For the very first state, randomly select the stock and delivery according to the boundary of max capacity.
(b) Best next_action: Use the epsilon-greedy strategy to select the best next_action.
The current state(stock, delivery) contains the information of stock and delivery.
Draw a uniform random number in [0,1). If it is less than epsilon,
Set next_action to a random integer in [0, max_capacity-(stock+delivery)], chosen uniformly among the feasible orders.
Otherwise,
Set next_action=argmax over the feasible actions of Q(state, action).
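A minimal sketch of this epsilon-greedy selection, assuming the Q values for the current state are held in a dict mapping each feasible action to its value:

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Select an order quantity from `q_row` (action -> Q value)."""
    if rng.random() < epsilon:            # explore: uniform over feasible orders
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)      # exploit: argmax_a Q(state, a)
```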
(c) Create a training batch (a sequence of events for training)
Define the number of events in a sequence.
Make the sequence long enough to generate all required rewards of the entire list of states in the Q Table.
Derive (state, next_action, reward, next_state) for each event.
Store the event elements (state, next_action, reward, next_state) in a batch to be used for training.
The details of the procedure to derive (state, next_action, reward, next_state) are as follows:
(i) Draw the sales demand from a normal distribution, clipped to the range [0, max_capacity].
new_stock=max((stock+delivery-demand),0)
(ii) Calculate the stock_cost=new_stock $\times unitStockCost$.
If demand>(stock+delivery),
calculate the lost_revenue=(demand-stock-delivery)$\times unitLostRevenue$; otherwise lost_revenue=0.
(iii) Reward=$-$(stock_cost+lost_revenue). The cost is negated so that maximizing the Q value minimizes the total cost.
(iv) next_state=(new_stock, next_action)
(v) Store the event elements (state, next_action, reward, next_state)
Repeat these five steps for the total number of events to complete the training batch.
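The five sub-steps can be sketched as one function. The mean and spread of the demand distribution are illustrative assumptions, and the reward is negated so that maximizing Q minimizes cost:

```python
import random

def make_event(total, action, max_capacity,
               unit_stock_cost=1, unit_lost_revenue=10, rng=random):
    """Steps (i)-(v) for one event; `total` is stock + delivery."""
    # (i) demand ~ normal, clipped to the integer range [0, max_capacity]
    demand = round(rng.gauss(max_capacity / 2, max_capacity / 4))
    demand = min(max(demand, 0), max_capacity)
    new_stock = max(total - demand, 0)
    # (ii) holding cost, plus lost revenue when demand exceeds availability
    stock_cost = new_stock * unit_stock_cost
    lost_revenue = max(demand - total, 0) * unit_lost_revenue
    # (iii) costs are negated so that maximizing Q minimizes total cost
    reward = -(stock_cost + lost_revenue)
    # (iv) next total stock = leftover stock + the newly ordered delivery
    next_total = new_stock + action
    # (v) the event tuple stored in the training batch
    return (total, action, reward, next_total)
```

Calling this once per event, for the defined number of events, produces the training batch.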
(d) Update Q Table using the training batch
(i) Find next_action: For each row of (state, next_action, reward, next_state), use the [next_state] in the current Q Table to identify the best next_action associated with this [next_state].
(ii) Update the Q(state, action) using
Q(state,action)=Q(state,action)+$\alpha \times$ (reward+$\gamma \times$ Q(next_state, next_action)$-$Q(state, action)), where next_action is the best action found in Step (i).
(iii) Repeat Steps (i) & (ii) for the entire training batch. Now the Q Table is updated.
(iv) Use the latest Q Table to collect the optimal policy (best next_action) for each state.
At the end of each iteration,
compare the latest optimal policy with the previous version. If it no longer changes, stop the training; the Q Table is now fully trained.
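Steps (d)(i)-(iv) can be sketched as follows, again with the Q table stored as a dict of dicts:

```python
def update_from_batch(q, batch, alpha=0.05, gamma=0.9):
    """Step (d): replay each stored event through the Bellman update."""
    for state, action, reward, next_state in batch:
        best_next = max(q[next_state].values())       # (i) value of the best next_action
        q[state][action] += alpha * (reward + gamma * best_next
                                     - q[state][action])  # (ii)

def greedy_policy(q):
    """(iv) the best action at every state under the current Q table."""
    return {state: max(row, key=row.get) for state, row in q.items()}
```

Training stops when greedy_policy(q) returns the same mapping in two consecutive iterations.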
Consider one product's inventory. Its initial stock is 4 and the delivery from previous order is 0.
max_capacity=6
unitStockCost=1
unitLostRevenue=10
$\gamma$=0.9
$\alpha$=0.05
$\epsilon$=0.2
trainingIteration=50
eventLength=(max_capacity+1) $\times $ (max_capacity+2)/2
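Collecting the example's hyperparameters, the event length evaluates to 28, one event per (total stock, action) pair in the Q Table:

```python
# Hyperparameters from the worked example above
max_capacity = 6
unit_stock_cost = 1
unit_lost_revenue = 10
gamma, alpha, epsilon = 0.9, 0.05, 0.2
training_iterations = 50

# One event per (total stock, action) pair:
# 7 + 6 + 5 + 4 + 3 + 2 + 1 entries for totals 0..6
event_length = (max_capacity + 1) * (max_capacity + 2) // 2
print(event_length)  # 28
```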