Multiplicative Weight Update Method

The multiplicative weight update method is a meta-algorithm: an algorithmic technique which "maintains a distribution on a certain set of interest, and updates it iteratively by multiplying the probability mass of elements by suitably chosen factors based on feedback obtained by running another algorithm on the distribution".[1] It was discovered repeatedly in very diverse fields such as machine learning (AdaBoost, Winnow, Hedge), optimization (solving LPs), theoretical computer science (devising fast algorithms for LPs and SDPs), and game theory.

Name

"Multiplicative weights" implies the iterative rule used in algorithms derived from the Multiplicative Weight Update Method.[2] It is given with different names in the different fields where it was discovered or rediscovered.

Background

An algorithm named "fictitious play", similar to the multiplicative weight update method, was proposed in game theory in the early 1950s. Grigoriadis and Khachiyan[3] applied a randomized variant of fictitious play to solve two-player zero-sum games efficiently using the multiplicative weights algorithm: the player allocates higher weights to the actions that had a better outcome and chooses his strategy relying on these weights. In machine learning, Littlestone applied the earliest form of the multiplicative weights update rule in his famous Winnow algorithm, which is similar to Minsky and Papert's earlier perceptron learning algorithm.

Motivation

A binary decision needs to be made based on n experts' opinions to attain an associated payoff. In the first round, all experts' opinions have the same weight, and the decision maker makes the first decision based on the majority of the experts' predictions. Then, in each successive round, the decision maker updates the weight of each expert's opinion depending on the correctness of his prior predictions.

Algorithm Analysis

Weighted Majority Algorithm[4][5]

"Given the setting in prediction from expert advice. Denote the experts as . Initialize the set of experts who have not made a mistake to .

   Initialization: 
      Fix an ≤1/2. For each expert, associate the weight ≔1.
   For  = , ,…,
      1. Make the prediction that is the weighted majority of the experts’ predictions based on the weights . That is, making a binary choice depending on which prediction has a higher total weight of experts advising it (breaking ties arbitrarily). 
      2. For every expert i who predicts wrongly, decrease his weight for the next round by multiplying it by a factor of (1-η):
                                                             =            (update rule)
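
The update rule above can be implemented in a few lines. The following is a minimal Python sketch of the weighted majority algorithm under the assumption that predictions and outcomes are binary values in {0, 1}; the function name weighted_majority and its streaming interface are illustrative, not taken from the source.

    # Minimal sketch of the weighted majority algorithm (assumptions: binary
    # predictions/outcomes in {0, 1}; eta <= 1/2). Names are illustrative.
    def weighted_majority(expert_predictions, outcomes, eta=0.5):
        """expert_predictions: list of T rounds, each a list of n binary predictions.
        outcomes: list of T true binary outcomes. Returns the number of mistakes made."""
        n = len(expert_predictions[0])
        weights = [1.0] * n                    # w_i^1 := 1 for every expert
        mistakes = 0
        for preds, outcome in zip(expert_predictions, outcomes):
            # Weighted vote: total weight behind predicting 1 vs. predicting 0.
            weight_for_one = sum(w for w, p in zip(weights, preds) if p == 1)
            weight_for_zero = sum(weights) - weight_for_one
            decision = 1 if weight_for_one >= weight_for_zero else 0
            if decision != outcome:
                mistakes += 1
            # Multiplicative update: penalize every expert who predicted wrongly.
            weights = [w * (1 - eta) if p != outcome else w
                       for w, p in zip(weights, preds)]
        return mistakes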

Analysis

After $T$ steps, let $m_i^T$ be the number of mistakes of expert $i$ and $M^T$ be the number of mistakes our algorithm has made. Then we have the following bound for every expert $i$:

                             $M^T \le 2(1+\eta)\, m_i^T + \frac{2\ln n}{\eta}$.

In particular, this holds for the best expert $i$, that is, the one with the least $m_i^T$.

Multiplicative Weights Algorithm[2]

Initialization: Fix an $\eta \le 1/2$. For each decision $i$, associate the weight $w_i^1 := 1$.

  For $t = 1, 2, \ldots, T$:
   1. Choose decision $i$ with probability proportional to its weight $w_i^t$. I.e., use the distribution over decisions
                       $p^t = \{ w_1^t / \Phi^t, \ldots, w_n^t / \Phi^t \}$
      where $\Phi^t = \sum_i w_i^t$.
   2. Observe the costs of the decisions, $m^t$. 
   3. Penalize the costly decisions by updating their weights as follows: for every decision $i$, set 
                               $w_i^{t+1} = w_i^t (1 - \eta\, m_i^t)$.
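
As an illustration, here is a minimal Python sketch of the multiplicative weights algorithm under the assumption that each round's costs form a vector with entries in [-1, 1]; the function name multiplicative_weights and the interface taking all cost vectors up front are illustrative assumptions, not from the source.

    # Minimal sketch of the multiplicative weights algorithm (assumptions:
    # each round's costs m_i^t lie in [-1, 1] and eta <= 1/2).
    def multiplicative_weights(cost_vectors, eta=0.5):
        """cost_vectors: list of T cost vectors, each of length n.
        Returns the total expected cost sum_t p^t . m^t incurred by the algorithm."""
        n = len(cost_vectors[0])
        weights = [1.0] * n                    # w_i^1 := 1
        total_expected_cost = 0.0
        for costs in cost_vectors:
            phi = sum(weights)
            p = [w / phi for w in weights]     # distribution p^t over decisions
            total_expected_cost += sum(pi * ci for pi, ci in zip(p, costs))
            # Update rule: w_i^{t+1} = w_i^t * (1 - eta * m_i^t)
            weights = [w * (1 - eta * c) for w, c in zip(weights, costs)]
        return total_expected_cost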

Analysis

Similar to the analysis of the weighted majority algorithm, assume that all costs $m_i^t \in [-1, 1]$ and $\eta \le 1/2$. Then the multiplicative weights algorithm guarantees that after $T$ rounds, for any decision $i$, we have

                                           $\sum_{t=1}^{T} m^t \cdot p^t \;\le\; \sum_{t=1}^{T} m_i^t + \eta \sum_{t=1}^{T} |m_i^t| + \frac{\ln n}{\eta}$

Randomized Weighted Majority Algorithm[6]

In a situation where the total weights of the experts predicting positive and negative are both close to 1/2, the deterministic weighted majority vote can be forced to err on almost every round, so the predictions made by the algorithm are randomized instead. The algorithm computes the fractions of the total weight held by experts predicting positive and negative, and then makes a random decision based on the computed fraction: predict

               $f(x) = \begin{cases} 1 & \text{with probability } \frac{q_1}{W} \\ 0 & \text{otherwise} \end{cases}$

where

               $W = \sum_i w_i$, and $q_0$ and $q_1$ are the total weights of the experts predicting 0 and 1, respectively.

Analysis

The expected number of mistakes made by the randomized weighted majority algorithm is bounded as:

                $E[\#\text{mistakes of the learner}] \;\le\; \alpha_\beta \,(\#\text{mistakes of the best expert}) + c_\beta \ln n$

where $\alpha_\beta = \frac{\ln(1/\beta)}{1-\beta}$ and $c_\beta = \frac{1}{1-\beta}$, and $\beta \in [0, 1)$ is the factor by which a wrongly predicting expert's weight is multiplied.
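
For concreteness, a minimal Python sketch of one round of the randomized prediction and update rule follows; it assumes binary predictions and outcomes in {0, 1} and a penalty factor beta in [0, 1), and the function name randomized_weighted_majority_round is illustrative.

    import random

    # One round of the randomized weighted majority algorithm (assumptions:
    # binary predictions and outcome in {0, 1}; penalty factor 0 <= beta < 1).
    def randomized_weighted_majority_round(weights, preds, outcome, beta=0.5):
        W = sum(weights)
        q1 = sum(w for w, p in zip(weights, preds) if p == 1)
        prediction = 1 if random.random() < q1 / W else 0   # predict 1 with probability q1/W
        # Multiply the weight of every expert that predicted wrongly by beta.
        new_weights = [w * beta if p != outcome else w
                       for w, p in zip(weights, preds)]
        return prediction, new_weights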

Applications

The multiplicative weights method is usually used to solve constrained optimization problems. Each expert corresponds to a constraint in the problem, and the events represent points in the area of interest. The penalty of an expert corresponds to how well its constraint is satisfied by the point represented by an event.[1]

Solving Zero-Sum Games Approximately (Oracle Algorithm):[1][4][7]

Suppose we are given the distribution $P$ on experts. Let $A$ be the payoff matrix of a finite 2-player zero-sum game, with $n$ rows.

When the row player $p_r$ uses plan $i$ and the column player $p_c$ uses plan $j$, the payoff of player $p_c$ is $A(i, j) := A_{ij}$, assuming $A(i, j) \in [0, 1]$.

If player $p_r$ chooses action $i$ from a distribution $P$ over the rows, then the expected payoff for player $p_c$ selecting action $j$ is $A(P, j) = E_{i \sim P}[A(i, j)]$.

Hence, in order to maximize $A(P, j)$, player $p_c$ should choose the plan $j$ maximizing it. Similarly, when the column player plays from a distribution $Q$ over the columns, the expected payoff when the row player picks plan $i$ is $A(i, Q) = E_{j \sim Q}[A(i, j)]$; choosing the plan $i$ that minimizes it is best for the row player. By John von Neumann's min-max theorem, we obtain:

                                          $\min_P \max_j A(P, j) \;=\; \max_Q \min_i A(i, Q)$

where $P$ and $i$ range over the distributions over rows, and $Q$ and $j$ range over the columns.

Then, let $\lambda^*$ denote the common value of the above quantities, also called the "value of the game". Let $\delta > 0$ be an error parameter. To solve the zero-sum game up to an additive error of $\delta$, it suffices to find mixed strategies $p$ over the rows and $q$ over the columns such that

                                                 $\lambda^* - \delta \;\le\; \min_i A(i, q)$
                                                 $\max_j A(p, j) \;\le\; \lambda^* + \delta$
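
A minimal Python sketch of how the multiplicative weights algorithm can be used to approximately solve such a game is given below. It assumes payoffs $A(i, j) \in [0, 1]$ and uses, as the "oracle", the column player's best response to the current row distribution; the function name approx_zero_sum, the parameter choices, and returning the averaged row strategy are illustrative assumptions.

    # Minimal sketch: approximately solving a zero-sum game with payoffs in [0, 1]
    # via multiplicative weights. The row player maintains weights over the rows
    # and the column player best-responds each round (the "oracle").
    def approx_zero_sum(A, T=1000, eta=0.1):
        """A: payoff matrix (list of rows) with entries in [0, 1], payoffs to the
        column player. Returns an approximately optimal mixed row strategy."""
        n, m = len(A), len(A[0])
        weights = [1.0] * n
        avg_p = [0.0] * n
        for _ in range(T):
            phi = sum(weights)
            p = [w / phi for w in weights]
            # Oracle: the column player picks the column maximizing expected payoff.
            j = max(range(m), key=lambda c: sum(p[i] * A[i][c] for i in range(n)))
            # The cost to row i this round is its payoff A(i, j) to the column player.
            weights = [w * (1 - eta * A[i][j]) for i, w in enumerate(weights)]
            avg_p = [a + pi / T for a, pi in zip(avg_p, p)]
        return avg_p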

Machine Learning

In machine learning, Littlestone and Warmuth generalized the Winnow algorithm to the weighted majority algorithm.[8] Later, Freund and Schapire generalized it in the form of the Hedge algorithm.[9] The AdaBoost algorithm formulated by Yoav Freund and Robert Schapire also employs the multiplicative weight update method.[4]

Winnow Algorithm [4]

Based on current knowledge in algorithms, the multiplicative weight update method was first used in Littlestone's Winnow algorithm. In machine learning, it is used to solve a linear program.

Given $m$ labeled examples $(a_1, l_1), \ldots, (a_m, l_m)$ where $a_j \in \mathbb{R}^n$ are feature vectors and $l_j \in \{-1, +1\}$ are their labels.

The aim is to find non-negative weights $x$ such that, for all examples, the sign of the weighted combination of the features matches its label. That is, require that $l_j\,(a_j \cdot x) \ge 0$ for all $j$. Without loss of generality, assume the total weight is 1 so that the weights form a distribution. Thus, for notational convenience, redefine $a_j$ to be $l_j a_j$; the problem reduces to finding a solution to the following LP:

                     $a_j \cdot x \ge 0, \qquad 1 \le j \le m,$
                     $\textstyle\sum_{i=1}^{n} x_i = 1,$
                     $x_i \ge 0, \qquad 1 \le i \le n.$

This is the general form of an LP.
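
For reference, the classical Winnow update rule (in its standard textbook form) looks roughly as follows in Python; Boolean features in {0, 1}, a promotion/demotion factor alpha, and the threshold theta = n are assumptions of this sketch, not parameters given in the source.

    # Minimal sketch of Littlestone's Winnow update rule (standard textbook form;
    # assumptions: Boolean features x in {0,1}^n, labels in {0,1}, promotion /
    # demotion factor alpha > 1, threshold theta = n).
    def winnow(examples, n, alpha=2.0):
        theta = n
        w = [1.0] * n
        for x, label in examples:              # examples: iterable of (x, label)
            prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if prediction == label:
                continue
            if label == 1:                     # mistake on a positive example: promote
                w = [wi * alpha if xi == 1 else wi for wi, xi in zip(w, x)]
            else:                              # mistake on a negative example: demote
                w = [wi / alpha if xi == 1 else wi for wi, xi in zip(w, x)]
        return w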

Hedge Algorithm [1]

The Hedge algorithm is similar to the weighted majority algorithm; however, it uses an exponential rather than a linear update rule.

Analysis

Assume $\eta > 0$ and that, for $t \in [T]$, the distribution $p^t$ is picked by Hedge. Then for all experts $i$,

                                $\sum_{t=1}^{T} p^t \cdot m^t \;\le\; \sum_{t=1}^{T} m_i^t + \frac{\ln n}{\eta} + \eta T$

Initialization: Fix an $\eta > 0$. For each expert, associate the weight $w_i^1 := 1$. For $t = 1, 2, \ldots, T$:

      1. Pick the distribution $p_i^t = \frac{w_i^t}{\Phi^t}$, where $\Phi^t = \sum_i w_i^t$.
      2. Observe the cost of the decision, $m^t$. 
      3. Set 
                               $w_i^{t+1} = w_i^t \exp(-\eta\, m_i^t)$.
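
A minimal Python sketch of Hedge, under the assumption that costs lie in [0, 1], follows; the function name hedge and its interface are illustrative.

    import math

    # Minimal sketch of the Hedge algorithm (assumption: costs m_i^t lie in [0, 1]).
    def hedge(cost_vectors, eta=0.1):
        """cost_vectors: list of T cost vectors of length n. Returns the total
        expected cost and the final distribution over experts."""
        n = len(cost_vectors[0])
        weights = [1.0] * n
        total_expected_cost = 0.0
        for costs in cost_vectors:
            phi = sum(weights)
            p = [w / phi for w in weights]                     # distribution p^t
            total_expected_cost += sum(pi * ci for pi, ci in zip(p, costs))
            # Exponential update: w_i^{t+1} = w_i^t * exp(-eta * m_i^t)
            weights = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
        phi = sum(weights)
        return total_expected_cost, [w / phi for w in weights]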

Solving Linear Programs Approximately[10]

Problem

Given an $m \times n$ matrix $A$ and $b \in \mathbb{R}^m$, is there an $x$ such that $Ax \ge b$?

                                    $\exists ?\, x : Ax \ge b$                    (1)

Assumption

Using the oracle algorithm from the zero-sum game setting, with an error parameter $\epsilon > 0$, the output is either a point $x$ such that $Ax \ge b - \epsilon \mathbf{1}$ or a proof that no such $x$ exists, i.e., that there is no solution to this linear system of inequalities.

Solution

Given a vector $p \in \Delta_m$ (a probability distribution over the $m$ constraints), solve the following relaxed problem:

                                                  $\exists ?\, x : p^{\mathsf{T}} A x \ge p^{\mathsf{T}} b$                    (2)

If there exists an $x$ satisfying (1), then that $x$ satisfies (2) for all $p \in \Delta_m$; the contrapositive of this statement is also true. Suppose that whenever the oracle returns a feasible solution for a given $p$, the solution $x$ it returns has bounded width, $\max_i |(Ax)_i - b_i| \le 1$. Then, if there is a solution to (1), there is an algorithm whose output $x$ satisfies the system (1) up to an additive error of $2\epsilon$ and that makes at most $O\!\left(\frac{\ln m}{\epsilon^2}\right)$ calls to a width-bounded oracle for problem (2). The contrapositive stands true as well. The multiplicative updates are applied in this algorithm to the distribution over the constraints.
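
A minimal Python sketch of this scheme follows. It assumes a caller-supplied oracle(p) that solves the relaxed problem (2) for a given distribution p (returning None when (2) is infeasible), that the widths |(Ax)_i - b_i| are at most 1 for the points returned, and a rough constant in the choice of the round count T; all names are illustrative.

    import math

    # Minimal sketch of LP feasibility via multiplicative weights (assumptions:
    # oracle(p) solves the relaxed problem (2) or returns None; widths are <= 1).
    def lp_feasibility(A, b, oracle, eps=0.1):
        m, n = len(A), len(A[0])
        T = int(math.ceil(4 * math.log(m) / eps ** 2))
        weights = [1.0] * m                      # one weight per constraint
        xs = []
        for _ in range(T):
            phi = sum(weights)
            p = [w / phi for w in weights]
            x = oracle(p)                        # solve the relaxed problem (2)
            if x is None:
                return None                      # certificate that (1) is infeasible
            xs.append(x)
            # The "cost" of constraint i is its slack (Ax)_i - b_i, so satisfied
            # constraints lose weight and violated ones gain weight.
            costs = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
            weights = [w * (1 - (eps / 2) * c) for w, c in zip(weights, costs)]
        # The average of the oracle's answers approximately satisfies (1).
        return [sum(x[j] for x in xs) / len(xs) for j in range(n)]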

Other Applications

Operations Research and On-line Statistical Decision Making [4]

In operations research and on-line statistical decision making, the weighted majority algorithm and its more complicated versions have been found independently.

Computational Geometry [4]

The multiplicative weights algorithm is also widely applied in computational geometry, for example in Clarkson's algorithm for linear programming (LP) with a bounded number of variables in linear time.[11][12] Later, Brönnimann and Goodrich employed analogous methods to find set covers for hypergraphs with small VC dimension.[13]

References

  1. "Efficient Algorithms Using The Multiplicative Weights Update Method" (PDF). November 2007.
  2. "The Multiplicative Weights Algorithm*" (PDF). Retrieved 2016-11-09.
  3. "A sublinear-time randomized approximation algorithm for matrix games" (PDF). 1995.
  4. "The Multiplicative Weights Update Method: A Meta-Algorithm and Applications" (PDF). 2012.
  5. "Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm" (PDF). 2013.
  6. "COS 511: Foundations of Machine Learning" (PDF). 2006-03-20.
  7. "An Algorithmist's Toolkit" (PDF). 2009-12-08. Retrieved 2016-11-09.
  8. Dean P. Foster and Rakesh Vohra (1999). Regret in the on-line decision problem, p. 29. Games and Economic Behaviour.
  9. Yoav Freund and Robert E. Schapire (1996). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, p. 55. Journal of Computer and System Sciences.
  10. "Fundamentals of Convex Optimization" (PDF). Retrieved 2016-11-09.
  11. Kenneth L. Clarkson. A Las Vegas algorithm for linear programming when the dimension is small. In Proc. 29th FOCS, pp. 452–456. IEEE Comp. Soc. Press, 1988. doi:10.1109/SFCS.1988.21961.
  12. Kenneth L. Clarkson. A Las Vegas algorithm for linear and integer programming when the dimension is small. J. ACM, 42:488–499, 1995. doi:10.1145/201019.201036.
  13. H. Brönnimann and M. T. Goodrich. Almost optimal set covers in finite VC-dimension. Discrete Comput. Geom., 14:463–479, 1995. Preliminary version in 10th Ann. Symp. Comp. Geom. (SCG'94). doi:10.1007/BF02570718.