When presented with the above plot, a human would instinctively detect the pattern present and, if looking for the lowest point, would be able to make an intelligent guess about where to begin. Most would not choose to evenly divide the space into a grid and test every point, yet this is precisely what grid search does. Humans have a fundamental intuition, just from looking at that image, that there are areas where the minimum is more likely to lie. By exhaustively searching the space, you're wasting time on areas where the function is obviously (excluding an improbable fluke) not going to be at its minimum, and ignoring the information you already have from the points you've evaluated.
The next most common method is random search, which is exactly what it sounds like. Given the same plot, I doubt anybody would decide to pick random points. Yet random search is not as foolish as it sounds - anybody who has studied statistics knows the power of randomness from techniques like bootstrapping or Monte Carlo methods. In fact, randomly picking parameters often outperforms grid search. Random hyperparameters can find points that grid search would either skip over (if the granularity of the grid is too coarse) or cost a tremendous amount to find (as the grid becomes finer). It can similarly outperform manual search in some situations: a human will generally focus on the domain in which they have seen the lowest points, whereas random search finds new domains neglected by intuition.
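To make the comparison concrete, here is a minimal sketch that pits the two strategies against each other on a toy objective (the function and its optimum here are invented stand-ins for an expensive model-training run). With the same budget of evaluations, the grid's granularity caps how close it can get, while random search's best point varies run to run:

```python
import random

# Toy stand-in for an expensive objective: a bowl with its minimum at (0.37, 0.91).
def objective(x, y):
    return (x - 0.37) ** 2 + (y - 0.91) ** 2

budget = 16

# Grid search: 4 x 4 evenly spaced points over [0, 1]^2.
grid_points = [(i / 3, j / 3) for i in range(4) for j in range(4)]
best_grid = min(objective(x, y) for x, y in grid_points)

# Random search: the same budget of 16 uniformly random points.
random.seed(0)
random_points = [(random.random(), random.random()) for _ in range(budget)]
best_random = min(objective(x, y) for x, y in random_points)

print(f"grid best:   {best_grid:.4f}")
print(f"random best: {best_random:.4f}")
```

Because the true minimum sits between grid lines, the grid can never do better than its nearest lattice point, no matter how the function behaves there; random search at least has a chance of landing arbitrarily close.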
Outside of cases where finding the absolute global minimum is required, and an exhaustive search is therefore necessary, grid search is only really reasonable when the cost of evaluating the objective function is so low it can be considered non-existent - and even then, the only excuse for implementing it is laziness (i.e. it costs more to implement and run a sophisticated method than to perform the grid search). In these cases grid search is preferred to random search because, with a fine enough grid, you are guaranteed to find a near-optimal point, whereas random search offers no such guarantee.
But what if you are in a situation where evaluating every point is too costly? Training models is often expensive, both in time and computational power, and this expense skyrockets as data and model complexity grow. What we need is an intelligent way to traverse our parameter space while searching for a minimum.
Upon first impression, this might seem like an easy problem. Finding the minimum of an objective function is pretty much all we ever do in machine learning, right? Well, here's the rub: you don't know this function. When fitting a regression, you choose your cost function; when training a neural net, you choose your activation function. If you know what these functions are, then you know their derivatives (assuming you picked differentiable functions); this means you know which direction points down. This knowledge is fundamental to most optimization techniques, like momentum or stochastic gradient descent.
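That "which direction points down" knowledge is exactly what gradient descent exploits. A minimal sketch on a hypothetical one-dimensional objective (the function and learning rate here are illustrative choices, not anything from the text above):

```python
# Known, differentiable objective: f(x) = (x - 3)^2, with minimum at x = 3.
def f(x):
    return (x - 3) ** 2

# Because we know f, we also know its derivative analytically.
def grad_f(x):
    return 2 * (x - 3)

x = 0.0   # arbitrary starting point
lr = 0.1  # learning rate

for _ in range(100):
    x -= lr * grad_f(x)  # step opposite the gradient, i.e. downhill
```

After a hundred steps, `x` has converged to the minimizer at 3. None of this is available in hyperparameter search: there, each evaluation of the objective is a full training run, and no derivative is on offer.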
There are other techniques, like binary search or golden-section search, which don't require the derivative directly but do require knowing that your objective is unimodal - any local, non-global minimum can render the search entirely ineffective. Yet other optimization methods depend on no knowledge of the underlying function at all (simulated annealing, coordinate descent), but require a large number of samples from the objective function to find an approximate minimum.
So the question is: what do we do when the cost of evaluation is high? How do we intelligently guess where the minima are likely to be in a high-dimensional space? We need a method which doesn't waste time on points with no expected return, but also won't get caught in local minima.
Now that we know how bad grid search and random search are, the question is: how can we do better? Here we discuss one technique for intelligent hyperparameter search, known as Bayesian optimization. We will cover the concept of how this technique can be used to traverse hyperparameter space; there is an associated IPython Notebook which further illustrates the practice.
The basic idea is this: you assume there is a smooth response surface (i.e. the curve of your objective function) connecting all your known hyperparameter-objective function points, and then you model all of these possible surfaces. Averaging over these surfaces, we obtain an expected objective function value for any given point, along with an associated variance (in more mathematical terms, we are modeling our response as a Gaussian process). The point where the lower edge of this uncertainty band dips lowest is the point of highest expected improvement. This is a black-box model; we need no actual knowledge of the underlying process producing the response curve.
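The mechanics can be sketched in a few lines of numpy. This is a toy, from-scratch version (not MOE's implementation): the objective is a made-up stand-in for an expensive training run, the RBF length scale is an arbitrary choice, and the acquisition here is a simple lower confidence bound (mean minus two standard deviations) standing in for a full Expected Improvement computation:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.5):
    # Squared-exponential covariance between two sets of 1-D points.
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Hypothetical expensive objective, observed at only a few settings.
def objective(x):
    return np.sin(3 * x) + 0.5 * x

X_obs = np.array([0.2, 1.0, 2.5])
y_obs = objective(X_obs)

X_cand = np.linspace(0.0, 3.0, 200)  # candidate hyperparameter values

# GP posterior conditioned on the observations (zero-mean prior, tiny noise).
K = rbf_kernel(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))
K_s = rbf_kernel(X_cand, X_obs)
alpha = np.linalg.solve(K, y_obs)
mean = K_s @ alpha                                        # expected surface
cov = rbf_kernel(X_cand, X_cand) - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))           # uncertainty band

# Evaluate next wherever the lower edge of the band (mean - 2*std) dips lowest:
# the candidate where the surface could most plausibly attain its minimum.
next_x = X_cand[np.argmin(mean - 2 * std)]
```

In a real loop you would evaluate the objective at `next_x`, append the result to the observations, refit, and repeat - each new point simultaneously sharpens the surface and shrinks the uncertainty around it.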
This concept is illustrated clearly in Yelp's MOE documentation for the case of a single hyperparameter. On top, you see our objective function response surface when we have no previous data points. It is flat, as we don't yet know anything, and the variance (the shaded region) is uniform. In the bottom plot we see the maximum Expected Improvement (the lowest the variance dips). This plot is also flat, so our first point of highest Expected Improvement is effectively chosen at random.