Open source packages make it quick and easy to start using powerful machine learning tools, but tuning these models is often a non-intuitive, time-consuming process. The tunable parameters (hyperparameters) of the models themselves can greatly affect their accuracy. While all of these tools attempt to set reasonable default hyperparameters for you, the defaults often fail to provide optimal results on real-world datasets. When every model evaluation can take hours or days on powerful clusters and the model fit can have a large impact on your overall system, it is important to find the best hyperparameters as quickly as possible.
In this post we’ll show you how different hyperparameter optimization strategies like using model defaults, grid search, random search, and Bayesian Optimization (SigOpt) can change the model fit for various classifiers and famous datasets.
The hyperparameter tuning methods described below can be used for any dataset and any classifier. The code for running these examples is available on GitHub. You can easily modify it to use your data or classifier of choice.
Each classifier attempts to build a model from the training data that will have the best model fit on the testing data. The GBC, RFC, and SVC classifiers each have several tunable hyperparameters that can greatly affect the model fit.

| GBC hyperparameter | Value range |
| --- | --- |
| learning_rate | 0.01-1.0 |
| n_estimators | 20-500 (int) |
| min_samples_split | 1-4 (int) |
| min_samples_leaf | 1-3 (int) |

| RFC hyperparameter | Value range |
| --- | --- |
| n_estimators | 3-20 (int) |
| min_samples_split | 1-4 (int) |
| min_samples_leaf | 1-3 (int) |
| max_features | 0.1-1.0 |

| SVC hyperparameter | Value range |
| --- | --- |
| C | 0.01-10.0 |
| gamma | 0.0001-1.0 |
| kernel | rbf or poly or sigmoid |
Scikit-learn makes it very easy to get these classifiers up and running and provides default values for the hyperparameters that try to fit a wide variety of datasets. Because these defaults are not tuned for any specific dataset, they often produce a sub-optimal fit for your specific problem. In these experiments, Bayesian Optimization (via SigOpt) achieved a better model fit than the default hyperparameters for every classifier and dataset tested.

| Classifier, Dataset | Fit: SigOpt vs Default Hyperparameters |
| --- | --- |
| GBC, connect-4 | +11.4% |
| GBC, poker | +28.6% |
| GBC, usps | +2.1% |
| GBC, satimage | +5.9% |
| RFC, satimage | +0.01% |
| SVC, satimage | +14.7% |
Grid search divides the space of possible hyperparameters into evenly spaced points along each dimension and samples at every intersection of the resulting grid. This provides a uniform search over the space, but the number of points grows exponentially with the number of dimensions being searched. SigOpt finds better hyperparameters than grid search with fewer function evaluations. While grid search is exponential in the number of hyperparameters, we have found in practice that SigOpt finds optima in a number of evaluations that is linear in the number of hyperparameters. The speed gains are already large in these 3 and 4 dimensional spaces. For many complex machine learning tasks, a single evaluation can take hours or even days on supercomputers, so every evaluation is precious.

| Classifier, Dataset | Speed: SigOpt vs Grid | Fit: SigOpt vs Grid | Fit: SigOpt vs Exhaustive Grid |
| --- | --- | --- | --- |
| GBC, connect-4 | +1914% | +27.7% | +0.1% |
| GBC, poker | +379% | +26.6% | +0.5% |
| GBC, usps | +838% | +15.1% | +0.4% |
| GBC, satimage | +6000% | +18.6% | +0.0% |
| RFC, satimage | +1700% | +2.4% | +0.0% |
| SVC, satimage | +635% | +12.2% | +0.0% |
Column 1 shows the classifier and dataset being compared. "Speed: SigOpt vs Grid" denotes how much faster SigOpt found an optimum than grid search, measured in model evaluations. "Fit: SigOpt vs Grid" shows how much better SigOpt's optimum was than the best point grid search had found after the same number of model evaluations. "Fit: SigOpt vs Exhaustive Grid" shows the gain SigOpt achieved, in that much smaller number of evaluations, over an exhaustive grid search of the space (192 evaluations).
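To make the exponential growth concrete, here is a minimal sketch of an exhaustive grid over the GBC ranges listed earlier, using only the Python standard library. The specific grid points are assumptions chosen so that the total comes to 192, the exhaustive grid size mentioned above; the grid actually used in the experiments may differ. `toy_objective` is a hypothetical stand-in for a real, expensive model evaluation such as a cross-validated score.

```python
import itertools

# Hypothetical grid over the GBC ranges; the points chosen in each
# dimension are assumptions for illustration only.
GBC_GRID = {
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
    "n_estimators": [20, 180, 340, 500],
    "min_samples_split": [1, 2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
}


def grid_points(grid):
    """Yield every combination of grid values as a dict of hyperparameters."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(objective, grid, budget=None):
    """Evaluate grid points in order, optionally stopping after `budget` evaluations."""
    best_score, best_params = float("-inf"), None
    for count, params in enumerate(grid_points(grid), start=1):
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
        if budget is not None and count >= budget:
            break
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for model fit; higher is better."""
    return (
        -(params["learning_rate"] - 0.1) ** 2
        - abs(params["n_estimators"] - 340) / 1000.0
    )


# The grid size is the product of the points per dimension: 4 * 4 * 4 * 3.
n_points = sum(1 for _ in grid_points(GBC_GRID))
best_params, best_score = grid_search(toy_objective, GBC_GRID)
print(n_points, best_params, best_score)
```

Adding a fifth hyperparameter with four candidate values would multiply the number of evaluations by four again, which is why the `budget` argument matters when each evaluation is expensive.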
Random search picks random hyperparameters from the space and observes how they change the model fit. After some fixed number of iterations, the best values observed are used. While this method lets the user potentially stumble upon the best hyperparameters, in these experiments it fared worse than SigOpt, which found hyperparameters that were at least as good in the same number of evaluations or fewer.

| Classifier, Dataset | Speed: SigOpt vs Random | Fit: SigOpt vs Random |
| --- | --- | --- |
| GBC, connect-4 | +1342% | +1.6% |
| GBC, poker | +3% | +0.5% |
| GBC, usps | +137.5% | +0.6% |
| GBC, satimage | +0% | +0.0% (both perfect) |
| RFC, satimage | +433.3% | +1.5% |
| SVC, satimage | +62% | +1.2% |
Column 1 shows the classifier and dataset being compared. "Speed: SigOpt vs Random" denotes how much faster SigOpt found an optimum than a uniform random search, measured in the total number of model evaluations before finding it. "Fit: SigOpt vs Random" shows how much better SigOpt's optimum was than the best point random search had found after the same number of model evaluations.
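The random search procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used for the experiments: the sampler draws uniformly from the GBC ranges listed earlier, and `toy_objective` is a hypothetical stand-in for a real model evaluation.

```python
import random


def sample_gbc_params(rng):
    """Draw one point uniformly at random from the GBC ranges."""
    return {
        "learning_rate": rng.uniform(0.01, 1.0),
        "n_estimators": rng.randint(20, 500),      # inclusive bounds
        "min_samples_split": rng.randint(1, 4),
        "min_samples_leaf": rng.randint(1, 3),
    }


def random_search(objective, sampler, n_iter, seed=0):
    """Evaluate `n_iter` random points and return the best one observed."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = sampler(rng)
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for model fit; higher is better."""
    return (
        -(params["learning_rate"] - 0.1) ** 2
        - abs(params["n_estimators"] - 340) / 1000.0
    )


best_params, best_score = random_search(toy_objective, sample_gbc_params, n_iter=50)
print(best_params, best_score)
```

Because the best score can only improve as more points are drawn, the cost of random search is set entirely by `n_iter`; the method has no way to use earlier results to decide where to sample next.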
Evaluating different hyperparameters for a model is very time-consuming and expensive as you train on more data. Model fit for things like CTR prediction or user recommendations can have a large impact on your overall system and bottom line. Using the right tools to train your models can save you time and money. SigOpt provides a simple API and web interface for quickly and easily leveraging cutting edge optimization research to solve this problem for you. Check us out for free.