------------------------------------------------------------------------------------------------------------ BASE
random generator seed, default=2018
------------------------------------------------------------------------------------------------------------ REGULARIZATION
l2 regularization, default=0.0
l1 regularization, default=0.0
l2 bias term, default=0.0
l1 bias term, default=0.0
whether to use weight noise (drop connect), default=false
weight retain probability for the weight noise (drop-connect), default=1 (no drop-connect)
whether to apply the weight noise (drop-connect) to biases, default=false
------------------------------------------------------------------------------------------------------------------ OPTIMIZATION
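The drop-connect parameters above keep each weight with the retain probability during training and zero it otherwise. A minimal sketch (the class name and the inference-time scaling by the retain probability are illustrative assumptions, not DL4J's implementation):

```java
import java.util.Random;

// Drop-connect (weight noise): during training each weight is kept with
// probability retainProb and zeroed otherwise.
class DropConnect {
    static double[] train(double[] weights, double retainProb, Random rng) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++)
            out[i] = (rng.nextDouble() < retainProb) ? weights[i] : 0.0;
        return out;
    }

    // A common inference-time approximation: scale weights by the retain probability.
    static double[] inference(double[] weights, double retainProb) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) out[i] = weights[i] * retainProb;
        return out;
    }
}
```

With the default retain probability of 1, the weights pass through unchanged, which is why the default amounts to no drop-connect.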
optimization algorithm (default=STOCHASTIC_GRADIENT_DESCENT)
STOCHASTIC_GRADIENT_DESCENT: //StochasticGradientDescent.java
LINE_GRADIENT_DESCENT: //LineGradientDescent.java
CONJUGATE_GRADIENT: //ConjugateGradient.java
LBFGS: //LBFGS.java
whether to use mini-batch, default=true
learning rate, default=0.1
gradient moving avg decay rate, default=0.9
gradient sqrt decay rate, default=0.999
epsilon, default=1E-8
NESTEROVS momentum, default=0.9
RMSPROP decay rate, default=0.95
ADADELTA decay rate, default=0.95
weights updater (default = NESTEROVS). Options:
SGD: //Sgd.java
  learningRate: learning rate (default = 1E-3)
ADAM: //Adam.java
  learningRate: learning rate, DEFAULT_ADAM_LEARNING_RATE = 1e-3
  beta1: gradient moving avg decay rate, DEFAULT_ADAM_BETA1_MEAN_DECAY = 0.9
  beta2: gradient sqrt decay rate, DEFAULT_ADAM_BETA2_VAR_DECAY = 0.999
  epsilon: DEFAULT_ADAM_EPSILON = 1e-8
  //Adam: A Method for Stochastic Optimization
ADAMAX: //AdaMax.java
  learningRate: learning rate, DEFAULT_ADAMAX_LEARNING_RATE = 1e-3
  beta1: gradient moving avg decay rate, DEFAULT_ADAMAX_BETA1_MEAN_DECAY = 0.9
  beta2: gradient sqrt decay rate, DEFAULT_ADAMAX_BETA2_VAR_DECAY = 0.999
  epsilon: DEFAULT_ADAMAX_EPSILON = 1e-8
  //Adam: A Method for Stochastic Optimization
NADAM: //Nadam.java
  learningRate: learning rate, DEFAULT_NADAM_LEARNING_RATE = 1e-3
  beta1: gradient moving avg decay rate, DEFAULT_NADAM_BETA1_MEAN_DECAY = 0.9
  beta2: gradient sqrt decay rate, DEFAULT_NADAM_BETA2_VAR_DECAY = 0.999
  epsilon: DEFAULT_NADAM_EPSILON = 1e-8
  //An overview of gradient descent optimization algorithms
AMSGRAD: //AMSGrad.java
  learningRate: learning rate, DEFAULT_AMSGRAD_LEARNING_RATE = 1e-3
  beta1: DEFAULT_AMSGRAD_BETA1_MEAN_DECAY = 0.9
  beta2: DEFAULT_AMSGRAD_BETA2_VAR_DECAY = 0.999
  epsilon: DEFAULT_AMSGRAD_EPSILON = 1e-8
ADAGRAD: vectorized learning rate, one per connection weight //AdaGrad.java
  learningRate: learning rate, DEFAULT_ADAGRAD_LEARNING_RATE = 1e-1
  epsilon: DEFAULT_ADAGRAD_EPSILON = 1e-6
  //Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
  //Adagrad - eliminating learning rates in stochastic gradient descent
NESTEROVS: tracks the previous update (velocity) and uses it when applying the current gradient //Nesterovs.java
  learningRate: learning rate, DEFAULT_NESTEROV_LEARNING_RATE = 0.1
  momentum: DEFAULT_NESTEROV_MOMENTUM = 0.9
RMSPROP: //RmsProp.java
  learningRate: learning rate, DEFAULT_RMSPROP_LEARNING_RATE = 1e-1
  epsilon: DEFAULT_RMSPROP_EPSILON = 1e-8
  rmsDecay: decay rate, DEFAULT_RMSPROP_RMSDECAY = 0.95
  //Neural Networks for Machine Learning
ADADELTA: //AdaDelta.java
  rho: decay rate controlling the decay of the previous parameter updates, DEFAULT_ADADELTA_RHO = 0.95
  epsilon: DEFAULT_ADADELTA_EPSILON = 1e-6
  (no need to manually set the learning rate)
  //ADADELTA: AN ADAPTIVE LEARNING RATE METHOD
NONE: no updates //NoOp.java
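The ADAM entry above can be made concrete with a scalar sketch of its update rule, using the listed defaults (learningRate = 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8). This is a minimal illustration of the algorithm, not DL4J's Adam.java:

```java
// m is the gradient moving average (decayed by beta1), v the squared-gradient
// average (decayed by beta2); both are bias-corrected by 1 - beta^t before the step.
class AdamSketch {
    double m = 0.0, v = 0.0;
    int t = 0;

    double step(double param, double grad) {
        double lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
        t++;
        m = beta1 * m + (1 - beta1) * grad;
        v = beta2 * v + (1 - beta2) * grad * grad;
        double mHat = m / (1 - Math.pow(beta1, t));        // bias-corrected mean
        double vHat = v / (1 - Math.pow(beta2, t));        // bias-corrected variance
        return param - lr * mHat / (Math.sqrt(vHat) + eps);
    }
}
```

Note that on the very first step the bias correction makes mHat equal the raw gradient, so the parameter moves by roughly the learning rate in the gradient's direction.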
gradient normalization, default=None. Options: GradientNormalization.X
ClipElementWiseAbsoluteValue: g <- sign(g) * min(maxAllowedValue, |g|), i.e., each element is clipped to [-threshold, threshold]
ClipL2PerLayer: GOut = G if l2Norm(G) < threshold (i.e., no change); GOut = threshold * G / l2Norm(G) otherwise
ClipL2PerParamType: conditional renormalization; very similar to ClipL2PerLayer, but clipping is done on each parameter type separately instead of per layer
None: no gradient normalization
RenormalizeL2PerLayer: rescale gradients by dividing by the L2 norm of all gradients for the layer
RenormalizeL2PerParamType: GOut_weight = G_weight / l2(G_weight); GOut_bias = G_bias / l2(G_bias)
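Two of the clipping options above can be sketched directly from their formulas, with the threshold passed in (default 0.5, per the gradient threshold parameter). The class and method names are illustrative:

```java
class GradClip {
    // ClipElementWiseAbsoluteValue: clip each element to [-maxAbs, maxAbs]
    static double[] clipElementWise(double[] g, double maxAbs) {
        double[] out = new double[g.length];
        for (int i = 0; i < g.length; i++)
            out[i] = Math.signum(g[i]) * Math.min(maxAbs, Math.abs(g[i]));
        return out;
    }

    // ClipL2PerLayer: renormalize only when the L2 norm exceeds the threshold
    static double[] clipL2(double[] g, double threshold) {
        double norm = 0;
        for (double x : g) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm < threshold) return g.clone();
        double[] out = new double[g.length];
        for (int i = 0; i < g.length; i++) out[i] = g[i] * threshold / norm;
        return out;
    }
}
```

After clipL2, a gradient that exceeded the threshold has its L2 norm exactly equal to the threshold; smaller gradients are untouched.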
gradient threshold, default=0.5
------------------------------------------------------------------------------------------------------------------------------------------ INPUT LAYER
input size, required
input type, default=InputType.Type.CNN. Options:
InputType.Type.FF: standard feed-forward data (2d minibatch, 1d per example)
InputType.Type.CNN: 2D convolutional neural network (4d minibatch, [miniBatchSize, channels, height, width])
InputType.Type.CNN3D: 3D convolutional neural network (5d minibatch, [miniBatchSize, channels, depth, height, width])
InputType.Type.CNNFlat: flattened 2D conv net data (2d minibatch, [miniBatchSize, height * width * channels])
InputType.Type.RNN: recurrent neural network time series data (3d minibatch)
height of input, default=10
width of input, default=10
depth of input, default=10
number of channels, default=3
------------------------------------------------------------------------------------------------------------------------------------------ OUTPUT LAYER
output size, required
loss function for the output layer, required (y: true label, yHat: prediction). Options:
L2: Sum of Squared Errors //LossL2.java
  L = sum_i (y_i - yHat_i)^2
MSE (or SQUARED_LOSS): Mean Squared Error //LossMSE.java
  L = 1/(2N) sum_i sum_j (y_{i,j} - yHat_{i,j})^2
L1: Sum of Absolute Errors //LossL1.java
  L = sum_i |y_i - yHat_i|
MEAN_ABSOLUTE_ERROR: Mean Absolute Error //LossMAE.java
  L = 1/(2N) sum_i sum_j |y_{i,j} - yHat_{i,j}|
MEAN_ABSOLUTE_PERCENTAGE_ERROR: Mean Absolute Percentage Error //LossMAPE.java
  L = 1/N sum_i |y_i - yHat_i| * 100 / |y_i|
MEAN_SQUARED_LOGARITHMIC_ERROR: Mean Squared Logarithmic Error //LossMSLE.java
  L = 1/N sum_i (log(1 + y_i) - log(1 + yHat_i))^2
POISSON (or EXPLL): Exponential Log Likelihood Loss (Poisson Loss) //LossPoisson.java
  L = 1/N sum_i (yHat_i - y_i * log(yHat_i))
XENT: Binary Cross Entropy Loss //LossBinaryXENT.java
  L = - 1/N sum_i (y_i * log(yHat_i) + (1 - y_i) * log(1 - yHat_i)) (label: scalar of 0/1 binary classes)
MCXENT: Multiclass Cross Entropy Loss //LossMCXENT.java
  L = - 1/N sum_i sum_k y_{i,k} * log(yHat_{i,k}) (label: vector of 0/1 indicators)
NEGATIVELOGLIKELIHOOD: Negative Log Likelihood //LossNegativeLogLikelihood.java
  L = - 1/N sum_i sum_k y_{i,k} * log(yHat_{i,k}) (negative log likelihood is mathematically equivalent to cross entropy)
KL_DIVERGENCE (or RECONSTRUCTION_CROSSENTROPY): Kullback-Leibler Divergence Loss //LossKLD.java
  L = 1/N sum_i y_i * log(y_i / yHat_i) = 1/N (sum_i y_i * log(y_i) - sum_i y_i * log(yHat_i)) = cross-entropy - entropy
COSINE_PROXIMITY: //LossCosineProximity.java
  L = - (sum_i y_i * yHat_i) / (sqrt(sum_i y_i * y_i) * sqrt(sum_i yHat_i * yHat_i)) (negated cosine similarity, so minimizing the loss maximizes the proximity)
HINGE: Hinge Loss //LossHinge.java
  L = 1/N sum_i max(0, 1 - yHat_i * y_i) (label: scalar of -1/+1)
SQUARED_HINGE: Squared Hinge Loss //LossSquaredHinge.java
  L = 1/N sum_i (max(0, 1 - yHat_i * y_i))^2 (label: scalar of -1/+1)
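Three of the losses above, implemented directly from their formulas (a minimal sketch; the class name is illustrative, not DL4J's loss classes):

```java
// y: true labels, yHat: predictions, N = y.length
class Losses {
    // MSE: L = 1/(2N) * sum_i (y_i - yHat_i)^2
    static double mse(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += (y[i] - yHat[i]) * (y[i] - yHat[i]);
        return s / (2.0 * y.length);
    }

    // XENT: L = -1/N * sum_i (y_i*log(yHat_i) + (1 - y_i)*log(1 - yHat_i)), y_i in {0, 1}
    static double xent(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++)
            s += y[i] * Math.log(yHat[i]) + (1 - y[i]) * Math.log(1 - yHat[i]);
        return -s / y.length;
    }

    // HINGE: L = 1/N * sum_i max(0, 1 - yHat_i * y_i), y_i in {-1, +1}
    static double hinge(double[] y, double[] yHat) {
        double s = 0;
        for (int i = 0; i < y.length; i++) s += Math.max(0.0, 1 - y[i] * yHat[i]);
        return s / y.length;
    }
}
```

Note the 1/(2N) factor in MSE, matching the formula listed above rather than the more common 1/N convention.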
output layer activation functions, required. Options:
CUBE: //ActivationCube.java
  f(x) = x^3
ELU: //ActivationELU.java
  f(x) = alpha * (exp(x) - 1), x < 0; x, x >= 0 // alpha defaults to 1 if not specified
HARDSIGMOID: //ActivationHardSigmoid.java
  f(x) = min(1, max(0, 0.2*x + 0.5))
HARDTANH: //ActivationHardTanH.java
  f(x) = 1, x > 1; -1, x < -1; x, otherwise
IDENTITY: //ActivationIdentity.java
  f(x) = x
LEAKYRELU: //ActivationLReLU.java
  f(x) = max(0, x) + alpha * min(0, x) // alpha defaults to 0.01
RRELU: //ActivationRReLU.java
  f(x) = max(0, x) + alpha * min(0, x) // alpha is drawn from uniform(l, u) during training and set to (l + u)/2 during test; l and u default to 1/8 and 1/3 respectively
  // Empirical Evaluation of Rectified Activations in Convolutional Network
RATIONALTANH: //ActivationRationalTanh.java
  f(x) = 1.7159 * tanh(2x/3), where tanh is approximated as tanh(y) ~ sgn(y) * (1 - 1/(1 + |y| + y^2 + 1.41645*y^4)) //Reference
RELU: //ActivationReLU.java
  f(x) = max(0, x)
//RELU6: //ActivationReLU6.java
//  f(x) = min(max(x, 0), 6)
RECTIFIEDTANH: //ActivationRectifiedTanh.java
  f(x) = max(0, tanh(x))
SELU: //ActivationSELU.java
  f(x) = lambda * x, x > 0; lambda * (alpha * exp(x) - alpha), x <= 0 //Reference
SIGMOID: //ActivationSigmoid.java
  f(x) = 1 / (1 + exp(-x))
SOFTPLUS: //ActivationSoftPlus.java
  f(x) = log(1 + exp(x))
SOFTSIGN: //ActivationSoftSign.java
  f_i(x) = x_i / (1 + |x_i|)
SOFTMAX: //ActivationSoftmax.java
  f_i(x) = exp(x_i - shift) / sum_j exp(x_j - shift), where shift = max_i x_i
SWISH: //ActivationSwish.java
  f(x) = x * sigmoid(x)
TANH: //ActivationTanH.java
  f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
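A few of the activations above, implemented from their formulas (a minimal sketch; the class name is illustrative, not DL4J's activation classes):

```java
class Activations {
    static double relu(double x) { return Math.max(0.0, x); }

    // LEAKYRELU; alpha defaults to 0.01 per the list above
    static double leakyRelu(double x, double alpha) {
        return Math.max(0.0, x) + alpha * Math.min(0.0, x);
    }

    static double hardSigmoid(double x) {
        return Math.min(1.0, Math.max(0.0, 0.2 * x + 0.5));
    }

    // SOFTMAX with the max-shift for numerical stability, as in the formula above
    static double[] softmax(double[] x) {
        double shift = Double.NEGATIVE_INFINITY;
        for (double v : x) shift = Math.max(shift, v);
        double sum = 0;
        double[] e = new double[x.length];
        for (int i = 0; i < x.length; i++) { e[i] = Math.exp(x[i] - shift); sum += e[i]; }
        for (int i = 0; i < x.length; i++) e[i] /= sum;
        return e;
    }
}
```

The shift by max_i x_i leaves the softmax output unchanged mathematically but prevents exp from overflowing for large inputs.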
output layer weight initialization, default=XAVIER. Options:
ZERO: all 0s
ONES: all 1s
SIGMOID_UNIFORM: U(-r, r) with r = 4*sqrt(6/(fanIn + fanOut)), a version of XAVIER_UNIFORM for sigmoid activation functions
NORMAL: N(0, sigma^2) with sigma = 1/sqrt(fanIn)
LECUN_UNIFORM: U[-a, a] with a = 3/sqrt(fanIn)
UNIFORM: U[-a, a] with a = 1/sqrt(fanIn)
XAVIER: N(0, sigma^2) with sigma = sqrt(2.0/(fanIn + fanOut))
XAVIER_UNIFORM: U(-s, s) with s = sqrt(6/(fanIn + fanOut))
XAVIER_FAN_IN: N(0, sigma^2) with sigma = sqrt(1/fanIn)
RELU: N(0, sigma^2) with sigma = sqrt(2.0/nIn)
RELU_UNIFORM: U(-s, s) with s = sqrt(6/fanIn)
IDENTITY: I_{nIn, nOut}, an identity matrix; only applicable to square weight matrices
VAR_SCALING_NORMAL_FAN_IN: N(0, sigma^2) with sigma = sqrt(1.0/fanIn)
VAR_SCALING_NORMAL_FAN_OUT: N(0, sigma^2) with sigma = sqrt(1.0/fanOut)
VAR_SCALING_NORMAL_FAN_AVG: N(0, sigma^2) with sigma = sqrt(1.0/((fanIn + fanOut)/2))
VAR_SCALING_UNIFORM_FAN_IN: U[-a, a] with a = 3.0/fanIn
VAR_SCALING_UNIFORM_FAN_OUT: U[-a, a] with a = 3.0/fanOut
VAR_SCALING_UNIFORM_FAN_AVG: U[-a, a] with a = 3.0/((fanIn + fanOut)/2)
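Sampling sketches for two of the schemes above, XAVIER and XAVIER_UNIFORM, written directly from the listed distributions (the class name is illustrative):

```java
import java.util.Random;

class WeightInitSketch {
    // XAVIER: N(0, sigma^2) with sigma = sqrt(2/(fanIn + fanOut))
    static double xavier(int fanIn, int fanOut, Random rng) {
        return rng.nextGaussian() * Math.sqrt(2.0 / (fanIn + fanOut));
    }

    // XAVIER_UNIFORM: U(-s, s) with s = sqrt(6/(fanIn + fanOut))
    static double xavierUniform(int fanIn, int fanOut, Random rng) {
        double s = Math.sqrt(6.0 / (fanIn + fanOut));
        return -s + 2 * s * rng.nextDouble();
    }
}
```

Both scale the spread of the initial weights down as fanIn + fanOut grows, keeping the variance of layer activations roughly constant across layers.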
output layer bias initialization, default=0.0
instance weights based on classes, applicable for weighted classification, default=Array[Double]()
-------------------------------------------------------------------------------------------------------------------------------------------------- BASE FOR LAYERS
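The class-based instance weights above scale each example's contribution to the loss by the weight assigned to its true class. A minimal sketch (names are illustrative, not the builder's API):

```java
// Weighted classification: scale each example's loss by its class weight,
// then average over the minibatch.
class WeightedLoss {
    static double average(double[] perExampleLoss, int[] labels, double[] classWeights) {
        double sum = 0;
        for (int i = 0; i < perExampleLoss.length; i++)
            sum += perExampleLoss[i] * classWeights[labels[i]];
        return sum / perExampleLoss.length;
    }
}
```

With the default empty array, no reweighting is applied; supplying, e.g., a larger weight for a rare class makes its misclassifications cost more.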
whether to pretrain, default=false
whether to use backprop, default=true
whether to set a listener, default=true
listener type, default="console". Options:
console: print to the console
ui: display in the UI
file: save to a file
listener frequency to track the score, default=1
file path for saving the stats when listenType="file" is set; default="" (unused otherwise)
whether to enable remote listening, default=false
-------------------------------------------------------------------------------------------- Fully Connected Dense Layers
number of dense layers, default=1
sizes of dense layers, default=Array(1), for index beyond the boundary, use the last one
activations of dense layers, default=Array(Activation.RELU)
weight initializer of dense layers, default=Array(WeightInit.XAVIER)
bias initializer of dense layers, default=Array(0.0)
drop out retaining probabilities, default=Array(1.0), no dropout
-------------------------------------------------------------------------------- Convolutional Layers
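The per-layer arrays in these layer sections (sizes, activations, kernels, and so on) fall back to their last element when a layer index runs past the array boundary. A minimal sketch of that lookup rule (the class name is illustrative):

```java
// Per-layer configuration lookup: indices beyond the array boundary
// resolve to the last element.
class PerLayer {
    static double get(double[] values, int layerIndex) {
        return values[Math.min(layerIndex, values.length - 1)];
    }
}
```

This is why a single-element default like Array(1.0) applies uniformly to any number of layers.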
number of convolutional layers, default=1
sizes of convolutional layers, default=Array(1), for index beyond the boundary, use the last one
activations of convolutional layers, default=Array(Activation.LEAKYRELU)
weight initializer of convolutional layers, default=Array(WeightInit.XAVIER)
bias initializer of convolutional layers, default=Array(0.0)
kernels of convolutional layers, default=Array(Array(2,2))
strides of convolutional layers, default=Array(Array(1,1))
paddings of convolutional layers, default=Array(Array(0,0))
drop out retaining probabilities, default=Array(1.0), no dropout
------------------------------------------------------------------------ Pooling Layers
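Given the kernel, stride, and padding parameters above, the spatial output size of a convolutional (or pooling) layer follows the standard formula out = (in + 2*pad - kernel) / stride + 1; a small sketch:

```java
// Spatial output size of a convolution or pooling operation along one dimension.
class ConvShape {
    static int outSize(int in, int kernel, int stride, int pad) {
        return (in + 2 * pad - kernel) / stride + 1;
    }
}
```

Note that a unit kernel (1, 1) with stride 1 and no padding preserves the input size, which is the trick mentioned later for disabling a pooling layer.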
number of pooling layers, default=1
types of pooling layers, default=Array(PoolingType.MAX), for index beyond the boundary, use the last one
kernels of pooling layers, default=Array(Array(2,2))
strides of pooling layers, default=Array(Array(1,1))
paddings of pooling layers, default=Array(Array(0,0))
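The default pooling configuration above is a 2x2 MAX pool; a minimal single-channel sketch (the class name is illustrative, and even input dimensions are assumed):

```java
// 2x2 max pooling with stride 2: each output cell is the maximum of a
// non-overlapping 2x2 window of the input.
class MaxPool {
    static double[][] pool2x2(double[][] in) {
        int h = in.length / 2, w = in[0].length / 2;
        double[][] out = new double[h][w];
        for (int i = 0; i < h; i++)
            for (int j = 0; j < w; j++)
                out[i][j] = Math.max(
                    Math.max(in[2 * i][2 * j], in[2 * i][2 * j + 1]),
                    Math.max(in[2 * i + 1][2 * j], in[2 * i + 1][2 * j + 1]));
        return out;
    }
}
```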
------------------------------------------------------------------------------------------------------------ CODE STRUCTURE: Convolutional Neural Network (CNN)
Base Configuration Builder
Base Configurations for Layers
Build the MLNN and Start the Listener (if applicable)
  returns the initiated multi-layer network
Chain all configurations: Base -> Reg -> Opt -> Layer
Convolutional Layer Configuration Builder
  the convolutional layers and the pooling layers are stacked alternately; if a pooling layer is not needed, one trick is to set its pooling kernel to the unit kernel (1, 1)
Fully Connected Dense Layer
Layer Configuration Builder
Optimization Configuration Builder
Pooling Layer Configuration Builder
Print Parameters
Regularization Configuration Builder
Start Listener