Proofs and Additional Numerical Experiments for "Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data"

Neural Information Processing Systems

Slutsky's theorem together with (S.3) and (S.5) implies the result in Theorem 1. Now we check the Lindeberg-Feller condition. ... 's are non-negative and E ... S.4 Derivation of corrected model (4): Note that π(x, 1) = 1 and π(x, 0) = π(x). ... Slutsky's theorem together with (S.15) and (S.17) implies the result in Theorem 1. ... 's, whose distribution depends on N. ... From (S.27) and (S.28), Chebyshev's inequality implies that ... For sampled data, (5) tells us that the joint density w.r.t. the product counting measure of the responses ... The outline of the proof is similar to that of the proof of Theorem 2. Write ... Markov's inequality shows that they are both o ... The outline of the proof is similar to that of the proof of Theorem 4. The estimator ... Slutsky's theorem together with (S.38) and (S.40) implies the result in Theorem 1.


Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog

Yao, Bingkun, Wang, Ning, Zhou, Jie, Wang, Xi, Gao, Hong, Jiang, Zhe, Guan, Nan

arXiv.org Artificial Intelligence

Bug localization in Verilog code is a crucial and time-consuming task during the verification of hardware design. Since their introduction, Large Language Models (LLMs) have shown strong programming capabilities. However, no work has yet considered using LLMs for bug localization in Verilog code. This paper presents Location-is-Key (LiK), an open-source LLM solution to locate functional errors in Verilog snippets. LiK achieves high localization accuracy, with a pass@1 localization accuracy of 93.3% on our test dataset based on RTLLM, surpassing GPT-4's 77.9% and comparable to Claude-3.5's 90.8%. Additionally, the bug location obtained by LiK significantly improves GPT-3.5's bug repair efficiency (functional pass@1 increased from 40.39% to 58.92%), highlighting the importance of bug localization in LLM-based Verilog debugging. Compared to existing methods, LiK requires only the design specification and the erroneous code snippet, without the need for testbenches, assertions, or any other EDA tools. This research demonstrates the feasibility of using LLMs for Verilog error localization, thus providing a new direction for automatic Verilog code debugging.


Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

Wang, HaiYing, Zhang, Aonan, Wang, Chong

arXiv.org Machine Learning

We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is only tied to the relatively small number of positive instances, which justifies the use of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, there is information loss. To retain more information, we derive the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance. To further improve the estimation efficiency over the IPW method, we propose a likelihood-based estimator by correcting log odds for the sampled data and prove that the improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. We validate our approach on simulated data as well as a real click-through rate dataset with more than 0.3 trillion instances, collected over a period of a month. Both theoretical and empirical results demonstrate the effectiveness of our method.
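
The idea of correcting log odds for sampled data can be illustrated with a small numerical check. The sketch below assumes the sampling scheme in which every positive instance is kept and a negative instance with features x is kept with probability π(x); under that assumption, Bayes' rule gives sampled log odds equal to the full-data log odds minus log π(x). The function names and the numbers are hypothetical and chosen only for illustration; this is not the authors' estimator, only the identity that motivates such a correction.

```python
import math


def sigmoid(z):
    """Logistic function mapping log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def sampled_positive_prob(p_full, pi):
    """P(y = 1 | x, instance kept), assuming positives are always kept
    and negatives are kept with probability pi (an assumed scheme)."""
    return p_full / (p_full + (1.0 - p_full) * pi)


def corrected_logit(logit_full, pi):
    """Log odds on the sampled data: full-data log odds minus log(pi)."""
    return logit_full - math.log(pi)


# Hypothetical rare-event example: very negative full-data log odds,
# negatives kept with probability 1%.
logit_full = -6.0
pi = 0.01

p_full = sigmoid(logit_full)
p_sampled = sampled_positive_prob(p_full, pi)
logit_sampled = math.log(p_sampled / (1.0 - p_sampled))

# The two routes to the sampled log odds should agree.
print(logit_sampled, corrected_logit(logit_full, pi))
```

In practice this means a model fitted on the subsample can carry the term -log π(x) as a fixed offset, so that the fitted parameters still describe the full-data log odds.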


Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a Chemogenomics Approach

Ohue, Masahito, Yamazaki, Takuro, Ban, Tomohiro, Akiyama, Yutaka

arXiv.org Machine Learning

Virtual screening (VS) is widely used during computational drug discovery to reduce costs. Chemogenomics-based virtual screening (CGBVS) can be used to predict new compound-protein interactions (CPIs) from known CPI network data using several methods, including machine learning and data mining. Although CGBVS facilitates highly efficient and accurate CPI prediction, it performs poorly when predicting for new compounds whose CPIs are unknown. The pairwise kernel method (PKM) is a state-of-the-art CGBVS method and shows high accuracy for prediction of new compounds. In this study, on the basis of link mining, we improved the PKM by combining the link indicator kernel (LIK) with chemical similarity and evaluated the accuracy of these methods. The proposed method obtained an average area under the precision-recall curve (AUPR) value of 0.562, which was higher than that achieved by the conventional Gaussian interaction profile (GIP) method (0.425), and the calculation time increased by only a few percent.
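
As a rough illustration of what a pairwise kernel over compound-protein pairs can look like, the sketch below builds one as the Kronecker product of a compound kernel and a protein kernel. The abstract does not state the exact form used, so the product form, the function name `pairwise_kernel`, and the toy matrices are all assumptions made for illustration; the link indicator kernel itself is not reproduced here.

```python
import numpy as np


def pairwise_kernel(K_compound, K_protein):
    """Kernel between (compound, protein) pairs, assumed here to be the
    product K_compound[i, j] * K_protein[a, b]. The pair (i, a) is
    stored at row/column index i * n_proteins + a."""
    return np.kron(K_compound, K_protein)


# Toy similarity matrices (hypothetical values): 2 compounds, 3 proteins.
K_c = np.array([[1.0, 0.5],
                [0.5, 1.0]])
K_p = np.array([[1.0, 0.2, 0.0],
                [0.2, 1.0, 0.3],
                [0.0, 0.3, 1.0]])

K_pair = pairwise_kernel(K_c, K_p)

# Similarity between pair (compound 0, protein 1) and pair (compound 1, protein 2).
n_p = K_p.shape[0]
print(K_pair.shape, K_pair[0 * n_p + 1, 1 * n_p + 2])
```

The pair-level matrix grows as (compounds × proteins) squared, which is one reason calculation time is reported alongside accuracy for methods of this kind.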