information gain
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm with quite a few effective implementations, such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, their efficiency and scalability are still unsatisfactory when the feature dimension is high and the data size is large. A major reason is that, for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain a quite accurate estimate of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve a quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much).
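The sampling step that the abstract describes for GOSS can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not LightGBM's actual implementation: the function name `goss_sample` and the parameters `top_rate`, `other_rate`, and `seed` are hypothetical names chosen here, and the re-weighting factor `(1 - top_rate) / other_rate` for the sampled small-gradient instances follows the scheme described in the LightGBM paper rather than anything stated in the abstract itself.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) for a GOSS-style subsample.

    Hypothetical helper for illustration only.
    gradients:  per-instance gradient values
    top_rate:   fraction of instances kept for having the largest |gradient|
    other_rate: fraction of instances randomly drawn from the remainder
    """
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    rng = random.Random(seed)

    # Order instance indices by descending absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:n_top]      # always kept
    rest_idx = order[n_top:]     # small-gradient instances

    # Randomly keep only a fraction of the small-gradient instances.
    sampled_idx = rng.sample(rest_idx, min(n_other, len(rest_idx)))

    # Up-weight the sampled small-gradient instances so that their total
    # contribution approximates that of the full small-gradient set.
    amplify = (1.0 - top_rate) / other_rate if other_rate > 0 else 0.0

    indices = top_idx + sampled_idx
    weights = [1.0] * len(top_idx) + [amplify] * len(sampled_idx)
    return indices, weights


if __name__ == "__main__":
    grads = [(-1) ** i * i for i in range(100)]
    idx, w = goss_sample(grads)
    print(len(idx), "of", len(grads), "instances kept")
```

The returned weights would then multiply the gradients of the selected instances when split gains are estimated, so that only the subsample needs to be scanned for each feature.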
- North America > United States (0.04)
- Europe > Spain > Andalusia > Cádiz Province > Cadiz (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- North America > United States > Texas (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New Jersey (0.04)
- Instructional Material (0.67)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- North America > United States > Washington (0.04)
- Europe > France > Brittany > Ille-et-Vilaine > Rennes (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
Color-Oriented Redundancy Reduction in Dataset Distillation
In this paper, we propose AutoPalette, a framework that minimizes color redundancy at both the individual image level and the overall dataset level. At the image level, we employ a palette network, a specialized neural network, to dynamically allocate colors from a reduced color space to each pixel. The palette network identifies the areas of synthetic images that are essential for model training and consequently assigns more unique colors to them. At the dataset level, we develop a color-guided initialization strategy to minimize redundancy among images.
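To make the idea of a "reduced color space" concrete, the sketch below maps each pixel to a coarser color by uniform bit-depth quantization. This is deliberately not the palette network from the abstract, which is learned and allocates colors adaptively; it is a fixed, hand-written stand-in, and the function name `quantize_colors` and the parameter `bits_per_channel` are hypothetical names introduced here for illustration.

```python
def quantize_colors(pixels, bits_per_channel=2):
    """Uniformly quantize 8-bit RGB pixels to fewer bits per channel.

    Hypothetical helper for illustration only.
    pixels: list of (r, g, b) tuples with values in 0..255
    """
    shift = 8 - bits_per_channel
    # Dropping the low bits of each channel collapses nearby colors
    # onto the same representative color.
    return [tuple((c >> shift) << shift for c in px) for px in pixels]


if __name__ == "__main__":
    image = [(255, 128, 7), (250, 130, 0), (10, 20, 30)]
    reduced = quantize_colors(image)
    print(len(set(image)), "unique colors before,",
          len(set(reduced)), "after")
```

A learned approach such as the one the abstract describes would instead decide per pixel which colors to keep, spending more of the palette on regions that matter for training.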
- Asia > Singapore (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Lancashire > Lancaster (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Japan > Kyūshū & Okinawa > Kyūshū > Fukuoka Prefecture > Fukuoka (0.04)
- North America > United States (0.27)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Data Science > Data Mining (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)