To build a model:
Click the Assist Me! button and select buildModel
or
Click the Assist Me! button, select getFrames, then click the Build Model… button below the parsed .hex data set
or
Click the View button after parsing data, then click the Build Model button
or
Click the drop-down Model menu and select the model type from the list
The Build Model… button can be accessed from any page containing the .hex key for the parsed data (for example, getJobs > getFrame).
In the Build a Model cell, select an algorithm from the drop-down menu:
The available options vary depending on the selected model. If an option is only available for a specific model type, the model type is listed. If no model type is specified, the option is applicable to all model types.
Model_ID: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates an ID containing the model type (for example, gbm-6f6bdc8b-ccbc-474a-b590-4579eea44596).
Training_frame: (Required) Select the dataset used to build the model.
NOTE: If you click the Build Model… button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to drop columns in which more than 20% of the values are missing (i.e., 0 or NA).
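The 20% rule can be illustrated with a short sketch in plain Python. This is not H2O's implementation: the function name columns_to_drop is hypothetical, and representing a missing value as None is an assumption made for the example.

```python
def columns_to_drop(table, threshold=0.20):
    """Return the names of columns whose missing fraction exceeds `threshold`.

    `table` maps a column name to a list of values. None marks a missing
    value (an assumption for this sketch, not H2O's internal representation).
    """
    dropped = []
    for name, values in table.items():
        if not values:
            continue  # an empty column has no defined missing fraction
        missing = sum(1 for v in values if v is None)
        if missing / len(values) > threshold:
            dropped.append(name)
    return dropped


data = {
    "age": [31, None, 45, 52, 38],               # 1 of 5 missing (20%): kept
    "income": [None, None, 40000, 52000, None],  # 3 of 5 missing (60%): dropped
}
print(columns_to_drop(data))
```

Note that a column with exactly 20% missing values is kept in this sketch, because the description says "over 20%".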
User_points: (K-Means, PCA) For K-Means, specify the number of initial cluster centers. For PCA, specify the initial Y matrix. Note: The PCA User_points parameter should only be used by advanced users for testing purposes.
Transform: (PCA) Select the transformation method for the training data: None, Standardize, Normalize, Demean, or Descale. The default is None.
Response_column: (Required for GLM, GBM, DL, DRF, NaiveBayes) Select the column to use as the dependent variable.
Solver: (GLM) Select the solver to use (IRLSM, L_BFGS, or auto). IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty, while L_BFGS scales better for datasets with many columns. The default is IRLSM.
Ntrees: (GBM, DRF) Specify the number of trees. The default value is 50.
Max_depth: (GBM, DRF) Specify the maximum tree depth. For GBM, the default value is 5. For DRF, the default value is 20.
Min_rows: (GBM, DRF) Specify the minimum number of observations for a leaf (“nodesize” in R). For Grid Search, use comma-separated values. The default value is 10.
Nbins: (GBM, DRF) Specify the number of bins for the histogram. The default value is 20.
Mtries: (DRF) Specify the number of columns to randomly select at each level. To use the square root of the number of columns, enter -1. The default value is -1.
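As a rough illustration of how the special value -1 might be resolved, the sketch below maps -1 to the square root of the column count. The helper name resolve_mtries is hypothetical, and rounding the square root down (with a minimum of 1) is an assumption; the text above does not say how H2O rounds.

```python
import math


def resolve_mtries(mtries, n_columns):
    """Translate the Mtries setting into a concrete number of columns.

    -1 means "use the square root of the number of columns". Rounding down
    is an assumption of this sketch.
    """
    if mtries == -1:
        return max(1, math.isqrt(n_columns))
    if not 1 <= mtries <= n_columns:
        raise ValueError("mtries must be -1 or between 1 and the number of columns")
    return mtries


print(resolve_mtries(-1, 16))  # square root of 16 columns
print(resolve_mtries(5, 10))   # an explicit value is used unchanged
```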
Sample_rate: (DRF) Specify the sample rate. The range is 0 to 1.0 and the default value is 0.6666667.
Build_tree_one_node: (DRF) To run on a single node, check this checkbox. This is suitable for small datasets as there is no network overhead but fewer CPUs are used. The default setting is disabled.
Learn_rate: (GBM) Specify the learning rate. The range is 0.0 to 1.0 and the default is 0.1.
Distribution: (GBM) Select the distribution type from the drop-down list. The options are auto, bernoulli, multinomial, or gaussian and the default is auto.
Loss: (DL) Select the loss function. For DL, the options are Automatic, MeanSquare, CrossEntropy, Huber, or Absolute and the default value is Automatic. Absolute, MeanSquare, and Huber are applicable for regression or classification, while CrossEntropy is only applicable for classification. Huber can improve results for regression problems with outliers.
Score_each_iteration: (K-Means, DRF, NaiveBayes, PCA, GBM) To score during each iteration of the model training, check this checkbox.
K: (K-Means, PCA) For K-Means, specify the number of clusters. For PCA, specify the rank of matrix approximation. The default for K-Means and PCA is 1.
Gamma: (PCA) Specify the regularization weight for PCA. The default is 0.
Max_iterations: (K-Means, PCA, GLM) Specify the number of training iterations. For K-Means and PCA, the default is 1000. For GLM, the default is -1.
Beta_epsilon: (GLM) Specify the beta epsilon value. If the L1 norm of the current beta change is below this threshold, the model is considered to have converged.
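The Beta_epsilon test can be sketched as a comparison between the coefficient vectors of two consecutive iterations. This is an interpretation of the description above (the L1 norm of the change in beta), not H2O's code; the function name has_converged is hypothetical.

```python
def has_converged(beta_old, beta_new, beta_epsilon):
    """Return True when the L1 norm of the coefficient change is below the threshold."""
    l1_change = sum(abs(new - old) for old, new in zip(beta_old, beta_new))
    return l1_change < beta_epsilon


# A tiny change in one coefficient counts as converged for a 1e-4 threshold.
print(has_converged([1.0, 2.0], [1.0, 2.00001], beta_epsilon=1e-4))
# A large change does not.
print(has_converged([1.0, 2.0], [1.5, 2.0], beta_epsilon=1e-4))
```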
Init: (K-Means, PCA) Select the initialization mode. For K-Means, the options are Furthest, PlusPlus, Random, or User. For PCA, the options are PlusPlus, User, or None.
Note: If PlusPlus is selected, the initial Y matrix is chosen by the final cluster centers from the K-Means PlusPlus algorithm.
Family: (GLM) Select the model type (Gaussian, Binomial, Poisson, or Gamma).
Activation: (DL) Select the activation function (Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout, MaxoutWithDropout). The default option is Rectifier.
Hidden: (DL) Specify the hidden layer sizes (e.g., 100,100). For Grid Search, use comma-separated values: (10,10),(20,20,20). The default value is [200,200]. The specified value(s) must be positive.
Epochs: (DL) Specify the number of times to iterate (stream) the dataset. The value can be a fraction. The default value for DL is 10.0.
Variable_importances: (DL) Check this checkbox to compute variable importance. This option is not selected by default.
Laplace: (NaiveBayes) Specify the Laplace smoothing parameter. The default value is 0.
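For readers unfamiliar with Laplace smoothing, the sketch below shows the textbook form for a categorical predictor: the smoothing value is added to every level count so that a level never seen with a class does not get a probability of exactly zero. The function name and the exact formula are assumptions for illustration; H2O's internal computation may differ.

```python
def smoothed_probability(level_count, class_count, n_levels, laplace=0.0):
    """Estimate P(level | class) with additive (Laplace) smoothing.

    level_count: rows of this class that have the given level
    class_count: total rows of this class
    n_levels:    number of distinct levels of the predictor
    """
    return (level_count + laplace) / (class_count + laplace * n_levels)


# A level never observed with the class: zero without smoothing...
print(smoothed_probability(0, 10, 3, laplace=0))
# ...and a small positive probability with laplace = 1.
print(smoothed_probability(0, 10, 3, laplace=1))
```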
Min_sdev: (NaiveBayes) Specify the minimum standard deviation to use for observations without enough data. The default value is 0.001.
Eps_sdev: (NaiveBayes) Specify the threshold for standard deviation. If this threshold is not met, the min_sdev value is used. The default value is 0.
Min_prob: (NaiveBayes) Specify the minimum probability to use for observations without enough data. The default value is 0.001.
Eps_prob: (NaiveBayes) Specify the threshold for probability. If this threshold is not met, the min_prob value is used. The default value is 0.
Standardize: (K-Means, GLM) To standardize the numeric columns to have mean of zero and unit variance, check this checkbox. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is selected by default.
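The effect of standardization can be seen in a small sketch: two columns on very different scales become directly comparable once each has mean zero and unit variance. This is plain Python for illustration only; the function name is hypothetical, and the use of the population standard deviation is an assumption (the text above does not say which estimator H2O uses).

```python
import statistics


def standardize(values):
    """Rescale a numeric column to mean zero and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        return [0.0 for _ in values]  # a constant column carries no scale
    return [(v - mean) / sd for v in values]


income = [30000, 50000, 70000]  # large raw variance
age = [20, 40, 60]              # small raw variance

# After standardization both columns have the same spread.
print(standardize(income))
print(standardize(age))
```

Without this step, a distance-based method such as K-Means would be driven almost entirely by the income column in this example, purely because of its units.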
Beta_constraints: (GLM) To use beta constraints, select a dataset from the drop-down menu. The selected frame is used to constrain the coefficient vector by providing upper and lower bounds.
Advanced Options
Checkpoint: (DL) Enter a model key associated with a previously-trained Deep Learning model. Use this option to build a new model as a continuation of a previously-generated model (e.g., by a grid search).
Use_all_factor_levels: (GLM, DL) Check this checkbox to use all factor levels in the possible set of predictors; if you enable this option, sufficient regularization is required. By default, the first factor level is skipped. For Deep Learning models, this option is useful for determining variable importances and is automatically enabled if the autoencoder is selected.
Train_samples_per_iteration: (DL) Specify the number of global training samples per MapReduce iteration. To specify one epoch, enter 0. To specify all available data (e.g., replicated training data), enter -1. To use the automatic values, enter -2. The default is -2.
Adaptive_rate: (DL) Check this checkbox to enable the adaptive learning rate (ADADELTA). This option is selected by default. If this option is enabled, the following parameters are ignored: rate, rate_decay, rate_annealing, momentum_start, momentum_ramp, momentum_stable, and nesterov_accelerated_gradient.
Input_dropout_ratio: (DL) Specify the input layer dropout ratio to improve generalization. Suggested values are 0.1 or 0.2. The range is >= 0 to <1 and the default value is 0.
L1: (DL) Specify the L1 regularization to add stability and improve generalization; sets the value of many weights to 0. The default value is 0.
L2: (DL) Specify the L2 regularization to add stability and improve generalization; sets the value of many weights to smaller values. The default value is 0.
Score_interval: (DL) Specify the shortest time interval (in seconds) to wait between model scoring. The default value is 5.
Score_training_samples: (DL) Specify the number of training set samples for scoring. To use all training samples, enter 0. The default value is 10000.
Score_validation_samples: (DL) (Requires selection from the Validation_Frame drop-down list) Specify the number of validation set samples for scoring. To use all validation set samples, enter 0. The default value is 0. This option is applicable to classification only.
Score_duty_cycle: (DL) Specify the maximum duty cycle fraction for scoring. A lower value results in more training and a higher value results in more scoring. The default value is 0.1.
Autoencoder: (DL) Check this checkbox to enable the Deep Learning autoencoder. This option is not selected by default. Note: This option requires a loss function other than CrossEntropy. If this option is enabled, use_all_factor_levels must be enabled.
Balance_classes: (GLM, GBM, DRF, DL, NaiveBayes) Oversample the minority classes to balance the class distribution. This option is not selected by default. This option is only applicable for classification. Majority classes can be undersampled to satisfy the Max_after_balance_size parameter.
Max_confusion_matrix_size: (DRF, NaiveBayes, GBM) Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.
Max_hit_ratio_k: (DRF, NaiveBayes) Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
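To make the term concrete, the sketch below computes hit ratios under the usual definition: for each K, the fraction of rows whose actual class is among the K classes with the highest predicted probability. The function name and the input layout (one dictionary of class probabilities per row) are assumptions for this example and do not reflect H2O's data structures.

```python
def hit_ratios(class_probs, actual, max_k):
    """Return [hit ratio for K=1, ..., hit ratio for K=max_k].

    class_probs: one dict per row mapping class label -> predicted probability
    actual:      the true class label for each row
    """
    if not actual:
        raise ValueError("at least one row is required")
    hits = [0] * max_k
    for probs, truth in zip(class_probs, actual):
        ranked = sorted(probs, key=probs.get, reverse=True)
        for k in range(1, max_k + 1):
            if truth in ranked[:k]:
                hits[k - 1] += 1
    return [h / len(actual) for h in hits]


rows = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.4, "b": 0.5, "c": 0.1},
    {"a": 0.2, "b": 0.3, "c": 0.5},
    {"a": 0.1, "b": 0.6, "c": 0.3},
]
truth = ["a", "a", "a", "b"]
print(hit_ratios(rows, truth, max_k=3))  # [0.5, 0.75, 1.0]
```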
Link: (GLM) Select a link function (Identity, Family_Default, Logit, Log, or Inverse).
Alpha: (GLM) Specify the regularization distribution between L1 and L2. The default value is 0.5.
Lambda: (GLM) Specify the regularization strength. There is no default value.
Lambda_search: (GLM) Check this checkbox to enable lambda search, starting with lambda max. The given lambda is then interpreted as lambda min.
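The interaction between Alpha and Lambda can be sketched with the common elastic net penalty, in which Lambda sets the overall strength and Alpha splits it between the L1 and L2 terms. The exact formula below is the standard elastic net form and is given as an assumption; the function name is hypothetical.

```python
def elastic_net_penalty(beta, alpha, lambda_):
    """Elastic net penalty: lambda * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lambda_ * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


beta = [1.0, -2.0]
print(elastic_net_penalty(beta, alpha=1.0, lambda_=0.1))  # pure L1 (lasso)
print(elastic_net_penalty(beta, alpha=0.0, lambda_=0.1))  # pure L2 (ridge)
print(elastic_net_penalty(beta, alpha=0.5, lambda_=0.1))  # even mix
```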
Rate: (DL) Specify the learning rate. Higher rates result in less stable models and lower rates result in slower convergence. The default value is 0.005. Not applicable if adaptive_rate is enabled.
Rate_annealing: (DL) Specify the learning rate annealing. The formula is rate / (1 + rate_annealing * samples). The default value is 0.000001 (1e-6). Not applicable if adaptive_rate is enabled.
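The annealing formula above can be evaluated directly to see how the effective learning rate shrinks as more training samples are processed. The function name is hypothetical, and the annealing value of 1e-6 used below is only an illustrative choice.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Effective learning rate after `samples` training samples."""
    return rate / (1 + rate_annealing * samples)


for samples in (0, 1_000_000, 10_000_000):
    print(samples, annealed_rate(0.005, 1e-6, samples))
```

With these values the rate is unchanged at the start of training and has roughly halved after one million samples.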
Momentum_start: (DL) Specify the initial momentum at the beginning of training. A suggested value is 0.5. The default value is 0. Not applicable if adaptive_rate is enabled.
Momentum_ramp: (DL) Specify the number of training samples for increasing the momentum. The default value is 1000000. Not applicable if adaptive_rate is enabled.
Momentum_stable: (DL) Specify the final momentum value reached after the momentum_ramp training samples. Not applicable if adaptive_rate is enabled.
Nesterov_accelerated_gradient: (DL) Check this checkbox to use the Nesterov accelerated gradient. This option is recommended and selected by default. Not applicable if adaptive_rate is enabled.
Hidden_dropout_ratios: (DL) Specify the hidden layer dropout ratios to improve generalization. Specify one value per hidden layer, each value between 0 and 1 (exclusive). There is no default value. This option is applicable only if TanhWithDropout, RectifierWithDropout, or MaxoutWithDropout is selected from the Activation drop-down list.
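The three momentum parameters describe a schedule: momentum begins at momentum_start, increases over momentum_ramp training samples, and then stays at momentum_stable. The sketch below assumes the increase is linear, which the text does not state explicitly; the function name and the stable value of 0.99 are illustrative choices.

```python
def momentum_at(samples, momentum_start, momentum_ramp, momentum_stable):
    """Momentum after `samples` training samples, assuming a linear ramp."""
    if momentum_ramp <= 0 or samples >= momentum_ramp:
        return momentum_stable
    fraction = samples / momentum_ramp
    return momentum_start + (momentum_stable - momentum_start) * fraction


for samples in (0, 500_000, 1_000_000, 2_000_000):
    print(samples, momentum_at(samples, 0.5, 1_000_000, 0.99))
```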
Expert Options
Keep_cross_validation_splits: (DL) Check this checkbox to keep the cross-validation frames. This option is not selected by default.
Overwrite_with_best_model: (DL) Check this checkbox to overwrite the final model with the best model found during training. This option is selected by default.
Target_ratio_comm_to_comp: (DL) Specify the target ratio of communication overhead to computation. This option is only enabled for multi-node operation and if train_samples_per_iteration equals -2 (auto-tuning). The default value is 0.02.
Rho: (DL) Specify the adaptive learning rate time decay factor. The default value is 0.99. This option is only applicable if adaptive_rate is enabled.
Epsilon: (DL) Specify the adaptive learning rate time smoothing factor to avoid dividing by zero. The default value is 1.0E-8. This option is only applicable if adaptive_rate is enabled.
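To show where Rho and Epsilon enter the computation, the sketch below implements one update of the published ADADELTA rule for a single weight. Rho controls how quickly the running averages forget older gradients, and Epsilon keeps the ratio well defined when those averages are zero. This follows the general algorithm and is not taken from H2O's source; the function and variable names are hypothetical.

```python
import math


def adadelta_step(grad, avg_sq_grad, avg_sq_delta, rho=0.99, epsilon=1e-8):
    """One ADADELTA update for a single weight.

    Returns (delta, new_avg_sq_grad, new_avg_sq_delta); add delta to the weight.
    """
    # Decaying average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad
    # Step size is the ratio of the two running RMS values; epsilon avoids 0/0.
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * grad
    # Decaying average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    return delta, avg_sq_grad, avg_sq_delta


# Three steps on f(w) = w**2, whose gradient is 2 * w.
w = 1.0
avg_sq_grad = avg_sq_delta = 0.0
for _ in range(3):
    delta, avg_sq_grad, avg_sq_delta = adadelta_step(2 * w, avg_sq_grad, avg_sq_delta)
    w += delta
print(w)
```

No learning rate appears in the update, which is consistent with the rate-related parameters being ignored when adaptive_rate is enabled.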
Max_W2: (DL) Specify the constraint for the squared sum of the incoming weights per unit (e.g., for Rectifier). The default value is infinity.
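One common way to enforce such a constraint is to rescale a unit's incoming weights whenever their squared sum exceeds the limit. The sketch below illustrates that idea; it is an assumption about how the constraint could be applied, not a description of H2O's code, and the function name is hypothetical.

```python
import math


def apply_max_w2(weights, max_w2=math.inf):
    """Rescale incoming weights so that their squared sum does not exceed max_w2."""
    squared_sum = sum(w * w for w in weights)
    if squared_sum <= max_w2:
        return list(weights)  # constraint already satisfied (always true for infinity)
    scale = math.sqrt(max_w2 / squared_sum)
    return [w * scale for w in weights]


print(apply_max_w2([3.0, 4.0]))              # default of infinity: unchanged
print(apply_max_w2([3.0, 4.0], max_w2=1.0))  # squared sum of 25 is scaled down to 1
```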
Initial_weight_distribution: (DL) Select the initial weight distribution (Uniform Adaptive, Uniform, or Normal). The default is Uniform Adaptive. If Uniform Adaptive is used, the initial_weight_scale parameter is not applicable.
Initial_weight_scale: (DL) Specify the initial weight scale of the distribution function for Uniform or Normal distributions. For Uniform, the values are drawn uniformly from initial weight scale. For Normal, the values are drawn from a Normal distribution with the standard deviation of the initial weight scale. The default value is 1.0. If Uniform Adaptive is selected as the initial_weight_distribution, the initial_weight_scale parameter is not applicable.
Classification_stop: (DL) (Applicable to discrete/categorical datasets only) Specify the stopping criterion for classification error fractions on training data. To disable this option, enter -1. The default value is 0.0.
Max_hit_ratio_k: (DL, GLM) (Classification only) Specify the maximum number (top K) of predictions to use for hit ratio computation (for multi-class only). To disable this option, enter 0. The default value is 10.
Regression_stop: (DL) (Applicable to real value/continuous datasets only) Specify the stopping criterion for regression error (MSE) on the training data. To disable this option, enter -1. The default value is 0.000001.
Diagnostics: (DL) Check this checkbox to compute the variable importances for input features (using the Gedeon method). For large networks, selecting this option can reduce speed. This option is selected by default.
Fast_mode: (DL) Check this checkbox to enable fast mode, a minor approximation in back-propagation. This option is selected by default.
Ignore_const_cols: (DL) Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.
Force_load_balance: (DL) Check this checkbox to force extra load balancing to increase training speed for small datasets and use all cores. This option is selected by default.
Single_node_mode: (DL) Check this checkbox to force H2O to run on a single node for fine-tuning of model parameters. This option is not selected by default.
Replicate_training_data: (DL) Check this checkbox to replicate the entire training dataset on every node for faster training on small datasets. This option is not selected by default. This option is only applicable for clouds with more than one node.
Shuffle_training_data: (DL) Check this checkbox to shuffle the training data. This option is recommended if the training data is replicated and the value of train_samples_per_iteration is close to the number of nodes times the number of rows. This option is not selected by default.
Missing_values_handling: (DL) Select how to handle missing values (Skip or MeanImputation). The default value is MeanImputation.
Quiet_mode: (DL) Check this checkbox to display less output in the standard output. This option is not selected by default.
Sparse: (DL) Check this checkbox to use sparse iterators for the input layer. This option is not selected by default as it rarely improves performance.
Col_major: (DL) Check this checkbox to use a column major weight matrix for the input layer. This option can speed up forward propagation but may reduce the speed of backpropagation. This option is not selected by default.
Average_activation: (DL) Specify the average activation for the sparse autoencoder. The default value is 0. If Rectifier is selected as the Activation type, this value must be positive. For Tanh, the value must be in (-1,1).
Sparsity_beta: (DL) Specify the sparsity regularization. The default value is 0.
Max_categorical_features: (DL) Specify the maximum number of categorical features enforced via hashing. The default is unlimited.
Reproducible: (DL) To force reproducibility on small data, check this checkbox. If this option is enabled, the model takes more time to generate, since it uses only one thread.
Export_weights_and_biases: (DL) To export the neural network weights and biases as H2O frames, check this checkbox.
Class_sampling_factors: (GLM, DRF, NaiveBayes, GBM, DL) Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance. There is no default value. This option is only applicable for classification problems and when Balance_Classes is enabled.
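As an illustration of what per-class sampling ratios look like, the sketch below derives one factor per class, in lexicographical order of the class labels, so that every class is oversampled up to the size of the largest class. This is only one simple way to obtain balance and is an assumption; it does not show how H2O computes its automatic ratios. The function name is hypothetical.

```python
from collections import Counter


def balancing_factors(labels):
    """Return (class, oversampling factor) pairs in lexicographical class order."""
    counts = Counter(labels)
    largest = max(counts.values())
    return [(cls, largest / counts[cls]) for cls in sorted(counts)]


labels = ["no"] * 8 + ["yes"] * 2
print(balancing_factors(labels))  # [('no', 1.0), ('yes', 4.0)]
```

In this example the minority class "yes" would be sampled four times as often as it appears, while the majority class is left unchanged.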
Seed: (K-Means, GBM, DL, DRF) Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
Prior: (GLM) Specify the prior probability for y == 1. Use this parameter for logistic regression if the data has been sampled and the mean of the response does not reflect reality. The default value is -1.
Max_active_predictors: (GLM) Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors.