The Upper Confidence Bound (UCB) algorithm is a popular approach to the multi-armed bandit problem, a decision-making setting in which an agent repeatedly chooses among several options (arms) to maximize its total reward. UCB balances exploration (trying less-known arms) and exploitation (favoring the arm with the best observed reward so far) by assigning each arm a score: its average reward plus an uncertainty term that shrinks as that arm is pulled more often. In the common UCB1 variant, the score for arm $i$ can be expressed as:

$$\text{UCB}_i = \bar{x}_i + \sqrt{\frac{2 \ln n}{n_i}}$$

where $\bar{x}_i$ is the average reward of arm $i$, $n$ is the total number of pulls so far, and $n_i$ is the number of times arm $i$ has been pulled. Each arm is typically pulled once at the start so that every $n_i$ is nonzero. By selecting the arm with the highest UCB score at each step, the algorithm keeps revisiting rarely chosen arms, whose uncertainty term is large, while still capitalizing on the best-performing ones. For rewards bounded in $[0, 1]$, the expected regret of UCB1 grows only logarithmically with the number of pulls, which makes it a widely used strategy in adaptive learning scenarios.
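The selection rule described above can be illustrated with a minimal Python sketch. It assumes the UCB1 form of the score (average reward plus the square root of 2 ln n / n_i) and simulated Bernoulli arms; the function names `ucb1_scores` and `run_ucb1`, the arm probabilities, and the seed are hypothetical choices made for illustration, not part of any particular library.

```python
import math
import random


def ucb1_scores(counts, means, total_pulls):
    """Return the UCB1 score of every arm (infinite for arms never pulled)."""
    return [
        float("inf") if n_i == 0
        else mean + math.sqrt(2.0 * math.log(total_pulls) / n_i)
        for n_i, mean in zip(counts, means)
    ]


def run_ucb1(arm_probs, horizon, seed=0):
    """Play `horizon` rounds on simulated Bernoulli arms (an assumption).

    Returns the number of pulls and the average observed reward per arm.
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k    # n_i: how often each arm has been pulled
    means = [0.0] * k   # running average reward of each arm
    for t in range(1, horizon + 1):
        # t - 1 pulls have been made so far. Unpulled arms score infinity,
        # so every arm is tried once before the bonus term is used.
        scores = ucb1_scores(counts, means, t - 1)
        arm = max(range(k), key=scores.__getitem__)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the average reward for the chosen arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means
```

Calling, for example, `run_ucb1([0.2, 0.8], horizon=2000)` should concentrate most pulls on the second arm while still sampling the first one occasionally, because its uncertainty term grows slowly as the total pull count increases.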