1. Duality
1.1 Lagrangian and Dual Function
- Lagrangian for a primal problem $\min f_0(x)$ s.t. $f_i(x) \le 0$ ($i=1,\dots,m$), $h_j(x) = 0$ ($j=1,\dots,p$):
  $L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x)$.
- Dual function: $g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$.
- Always concave (regardless of the convexity of the primal).
- Lower bound property: If $\lambda \succeq 0$, then $g(\lambda, \nu) \le p^\star$ (worked example below).
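As a concrete illustration, consider the standard equality-constrained least-norm example $\min_x\, x^\top x$ s.t. $Ax = b$:

$$
L(x, \nu) = x^\top x + \nu^\top (Ax - b), \qquad
g(\nu) = \inf_x L(x, \nu) = -\tfrac{1}{4}\nu^\top A A^\top \nu - b^\top \nu,
$$

which is concave in $\nu$ and lower-bounds $p^\star$ for every $\nu$.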
1.2 Dual Problem
- Dual problem: $\max_{\lambda, \nu}\; g(\lambda, \nu)$ s.t. $\lambda \succeq 0$.
- Always a concave maximization problem (even if the primal is non-convex).
- Optimal dual value: $d^\star = \sup_{\lambda \succeq 0,\, \nu} g(\lambda, \nu)$.
1.3 Weak and Strong Duality
- Weak duality: $d^\star \le p^\star$ (always holds).
- Strong duality: $d^\star = p^\star$ (holds under Slater’s condition for convex problems).
1.4 KKT Conditions
For primal-dual optimal $(x^\star, \lambda^\star, \nu^\star)$ under strong duality:
- Primal feasibility: $f_i(x^\star) \le 0$, $h_j(x^\star) = 0$.
- Dual feasibility: $\lambda^\star \succeq 0$.
- Complementary slackness: $\lambda_i^\star f_i(x^\star) = 0$ for all $i$.
- Stationarity: $\nabla f_0(x^\star) + \sum_i \lambda_i^\star \nabla f_i(x^\star) + \sum_j \nu_j^\star \nabla h_j(x^\star) = 0$.
- For convex problems, KKT is necessary and sufficient for optimality.
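A quick worked check of the conditions on a small problem, $\min\, x_1^2 + x_2^2$ s.t. $x_1 + x_2 \ge 1$ (i.e. $f_1(x) = 1 - x_1 - x_2 \le 0$):

$$
\text{stationarity: } 2x_i - \lambda = 0 \;\; (i = 1, 2), \qquad
\text{compl. slackness: } \lambda\,(1 - x_1 - x_2) = 0.
$$

Taking $\lambda > 0$ forces the constraint active, so $x^\star = (\tfrac12, \tfrac12)$, $\lambda^\star = 1 \ge 0$, and $p^\star = \tfrac12$; all four conditions hold.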
2. Applications in Machine Learning
2.1 Logistic Regression
- Convex optimization formulation: $\min_w \sum_{i=1}^n \log\!\big(1 + \exp(-y_i\, w^\top x_i)\big)$ with labels $y_i \in \{-1, +1\}$ (optionally plus a regularizer such as $\tfrac{\lambda}{2}\|w\|^2$).
- Unconstrained convex problem, so gradient-based methods apply directly (see the sketch below).
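A minimal numpy sketch of this objective and plain gradient descent on it; the toy data, step size, and iteration count are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                         # toy features
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=200))   # labels in {-1, +1}

def logistic_loss_and_grad(w, X, y):
    margins = y * (X @ w)                           # y_i * w^T x_i
    loss = np.mean(np.log1p(np.exp(-margins)))      # (1/n) sum_i log(1 + exp(-y_i w^T x_i))
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
    return loss, grad

w = np.zeros(X.shape[1])
for _ in range(500):                                # plain gradient descent on the convex objective
    _, g = logistic_loss_and_grad(w, X, y)
    w -= 0.1 * g
```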
2.2 Support Vector Machines (SVM)
- Primal QP (hard margin): $\min_{w, b}\; \tfrac{1}{2}\|w\|^2$ s.t. $y_i(w^\top x_i + b) \ge 1$ for all $i$.
- Dual problem: $\max_\alpha\; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$.
- Support vectors: points with $\alpha_i > 0$.
- Kernel trick: Replace $x_i^\top x_j$ with $k(x_i, x_j)$ (see the kernel sketch below).
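A minimal numpy sketch of the kernel substitution and the dual objective; the names `rbf_kernel` and `svm_dual_objective` are illustrative, and maximizing the dual itself would still require a QP solver.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), substituted for x_i^T x_j
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def svm_dual_objective(alpha, K, y):
    # sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j k(x_i, x_j)
    Q = (y[:, None] * y[None, :]) * K
    return alpha.sum() - 0.5 * alpha @ Q @ alpha
```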
2.3 Non-Separable Case (Soft Margin)
- Introduce slack variables $\xi_i \ge 0$: $\min_{w, b, \xi}\; \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i$.
- Dual constraints become: $0 \le \alpha_i \le C$.
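A soft-margin example, assuming scikit-learn is available; `SVC`'s `C` parameter is exactly the box constraint on the dual variables above.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # toy 2-class data
clf = SVC(C=1.0, kernel="rbf", gamma=1.0)   # C caps the dual variables: 0 <= alpha_i <= C
clf.fit(X, y)
print(clf.support_)                         # indices of the support vectors (alpha_i > 0)
```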
3. Neural Network Training & Optimization
3.1 Backpropagation
- Chain rule applied layer-by-layer.
- Local gradients for operations:
- Add gate: gradient distributor.
- Multiply gate: gradient swapper.
- Max gate: gradient router.
- Copy gate: gradient adder.
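A minimal sketch of these local-gradient rules, assuming scalar inputs and an upstream gradient `dout`.

```python
def mul_backward(a, b, dout):
    # Multiply gate "swaps" its inputs: d(ab)/da = b, d(ab)/db = a.
    return dout * b, dout * a

def add_backward(dout):
    # Add gate distributes the upstream gradient unchanged to both inputs.
    return dout, dout

def max_backward(a, b, dout):
    # Max gate routes the gradient to whichever input was larger.
    return (dout, 0.0) if a >= b else (0.0, dout)
```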
3.2 Stochastic Gradient Descent (SGD)
- Update rule (single sample $(x_i, y_i)$): $\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \ell(\theta_t; x_i, y_i)$.
- Mini-batch SGD: average the gradient over a batch $B$: $\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla_\theta \ell(\theta_t; x_i, y_i)$.
- Batching reduces gradient variance and improves hardware parallelism (see the sketch below).
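A minimal mini-batch SGD loop, assuming a user-supplied `grad_fn(params, X_batch, y_batch)`; `grad_fn` is a hypothetical helper (for instance, the logistic-regression gradient sketched earlier).

```python
import numpy as np

def sgd(params, grad_fn, X, y, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: reshuffle each epoch, then take one gradient step per batch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            params = params - lr * grad_fn(params, X[idx], y[idx])
    return params
```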
3.3 Advanced Optimizers
- Momentum: $v_{t+1} = \mu v_t - \eta \nabla_\theta L(\theta_t)$, $\theta_{t+1} = \theta_t + v_{t+1}$.
- Adam: Combines momentum (first-moment average) with adaptive per-parameter learning rates (second-moment average).
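One-step update sketches for momentum and Adam in numpy; the hyperparameter defaults are the commonly used values and are shown only for illustration.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    # v <- mu * v - lr * grad;  theta <- theta + v
    v = mu * v - lr * grad
    return theta + v, v

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square, with bias
    # correction, then a per-parameter scaled step. t is the 1-based step count.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```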
3.4 Activation Functions
- ReLU: Most common, but can cause “dead neurons”.
- Sigmoid/Tanh: Suffer from vanishing gradients.
- Leaky ReLU/Maxout: Alternatives to mitigate issues.
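Minimal numpy definitions of the activations discussed above, with the gradient issues noted in comments.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # zero gradient for x < 0 ("dead neurons" risk)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope keeps gradients alive

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # saturates for large |x|: vanishing gradients
```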
3.5 Initialization
- He initialization (for ReLU): $W \sim \mathcal{N}\!\big(0,\, \tfrac{2}{n_{\text{in}}}\big)$.
- Xavier initialization (for sigmoid/tanh): $W \sim \mathcal{N}\!\big(0,\, \tfrac{1}{n_{\text{in}}}\big)$.
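A sketch of both schemes for a fully connected layer of shape `(n_in, n_out)`.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He: variance 2 / n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def xavier_init(n_in, n_out):
    # Xavier/Glorot: variance 1 / n_in, suited to sigmoid/tanh layers
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
```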
3.6 Regularization & Techniques
- Dropout: Randomly deactivate neurons during training.
- Batch Normalization: Normalize layer inputs to speed up training.
- Gradient Clipping: Prevent exploding gradients.
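Minimal sketches of inverted dropout and norm-based gradient clipping; batch normalization is omitted here since it also needs running statistics for test time.

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    # Inverted dropout: zero units with probability p at train time and rescale
    # by 1/(1-p), so the forward pass needs no change at test time.
    if not train:
        return h
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

def clip_gradient(g, max_norm=5.0):
    # Rescale the gradient if its norm exceeds max_norm (guards against exploding gradients).
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g
```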
4. Model Selection
4.1 Cross-Validation
- K-fold CV: Partition the data into $K$ subsets, train on $K-1$ of them, validate on the remaining one; repeat so each fold serves as the validation set once.
- Choose model with lowest cross-validation error.
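A minimal K-fold index splitter; `fit_and_score` in the usage comment is a hypothetical helper standing in for whatever training and evaluation routine is used.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# errors = [fit_and_score(X[tr], y[tr], X[va], y[va]) for tr, va in kfold_indices(len(y), 5)]
# Pick the model with the lowest mean validation error across folds.
```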
4.2 Bayesian Model Selection
- BIC: $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$ ($k$ = number of parameters, $n$ = sample size, $\hat{L}$ = maximized likelihood).
- AIC: $\mathrm{AIC} = 2k - 2 \ln \hat{L}$.
- Occam’s Razor: Prefer simpler models if performance is similar.
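Direct translations of the two criteria; with the conventions above, smaller values are preferred.

```python
import numpy as np

def aic(log_likelihood, k):
    # AIC = 2k - 2 ln L_hat, k = number of fitted parameters
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k ln n - 2 ln L_hat, n = number of observations
    return k * np.log(n) - 2 * log_likelihood
```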
5. Statistical Learning Theory
5.1 Generalization Error
- Generalization error: $R(h) = \mathbb{E}_{(x,y)\sim \mathcal{D}}[\ell(h(x), y)]$; training (empirical) error: $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$.
- Overfitting: Small training error, large generalization error.
- Underfitting: Large training and generalization error.
5.2 PAC Learnability
- A hypothesis class $\mathcal{H}$ is PAC-learnable if there is an algorithm that, for every $\epsilon, \delta \in (0,1)$, returns $\hat{h}$ with
  $\Pr\big[R(\hat{h}) \le \min_{h \in \mathcal{H}} R(h) + \epsilon\big] \ge 1 - \delta$ for a sufficiently large sample size $n \ge n_{\mathcal{H}}(\epsilon, \delta)$.
- Finite $\mathcal{H}$: Always PAC-learnable (see the bound below).
- Infinite $\mathcal{H}$: Use the VC dimension to characterize learnability.
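For finite $\mathcal{H}$, the standard Hoeffding-plus-union-bound argument gives a sample-complexity sketch (agnostic setting, bounded loss in $[0,1]$):

$$
n \;\ge\; \frac{1}{2\epsilon^2}\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)
\quad\Longrightarrow\quad
\Pr\!\left[\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}(h)\big| \le \epsilon\right] \ge 1 - \delta .
$$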
5.3 VC Dimension
- Definition: The largest size of a set that can be shattered by $\mathcal{H}$.
- Examples:
- Linear classifiers in $\mathbb{R}^d$: $\mathrm{VCdim} = d + 1$.
- Axis-aligned rectangles in $\mathbb{R}^2$: $\mathrm{VCdim} = 4$.
6. Summary
- Duality provides lower bounds and optimality conditions.
- Convex optimization underpins many ML models (SVM, logistic regression).
- Neural networks rely on backpropagation and advanced optimizers.
- Model selection balances complexity and performance.
- Learning theory formalizes generalization and model capacity.