refactor
CODECLEANER: Elevating Standards with A Robust Data Contamination Mitigation Toolkit
Cao, Jialun, Chen, Songqiang, Zhang, Wuqi, Lo, Hau Ching, Cheung, Shing-Chi
Data contamination presents a critical barrier preventing widespread industrial adoption of advanced software engineering techniques that leverage code language models (CLMs). This phenomenon occurs when evaluation data inadvertently overlaps with the public code repositories used to train CLMs, severely undermining the credibility of performance evaluations. For software companies considering the integration of CLM-based techniques into their development pipeline, this uncertainty about true performance metrics poses an unacceptable business risk. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. It provides a practical alternative to the resource-intensive process of building contamination-free evaluation datasets, which would require companies to collect, clean, and label code created after the CLMs' training cutoff dates. However, the lack of automated code refactoring tools and scientifically validated refactoring techniques has hampered widespread industrial implementation. To bridge the gap, this paper presents the first systematic study to examine the efficacy of code refactoring operators at multiple scales (method-level, class-level, and cross-class level) and in different programming languages. In particular, we develop an open-source toolkit, CODECLEANER, which includes 11 operators for Python: nine method-level, one class-level, and one cross-class-level operator. Applying all operators in CODECLEANER reduces the overlap ratio by 65%, demonstrating their effectiveness in addressing data contamination. Additionally, we migrate four operators to Java, showing their generalizability to another language. We make CODECLEANER available online to facilitate further studies on mitigating CLM data contamination.
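The abstract above names variable renaming as one kind of refactoring used against contamination. A minimal sketch of that idea, assuming Python 3.9+ and using only the standard `ast` module, might look like the following. The class `RenameIdentifiers`, the function `rename_identifiers`, and the sample mapping are hypothetical illustrations, not part of CODECLEANER and not its actual operators.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Toy renaming pass (hypothetical; not a CODECLEANER operator)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside the function body.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def rename_identifiers(source, mapping):
    """Parse source, rename identifiers per mapping, and return new source."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


original = "def add(a, b):\n    total = a + b\n    return total\n"
refactored = rename_identifiers(
    original, {"a": "x", "b": "y", "total": "result"}
)
```

The rewritten function behaves the same as the original while sharing fewer surface tokens with it. This sketch ignores scoping, so a real operator would need to avoid capturing globals, attributes, or names that collide with the new identifiers.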
REFACTOR: Learning to Extract Theorems from Proofs
Zhou, Jin Peng, Wu, Yuhuai, Li, Qiyang, Grosse, Roger
Human mathematicians are often good at recognizing modular and reusable theorems that bring complex mathematical results within reach. In this paper, we propose a novel method called theoREm-from-prooF extrACTOR (REFACTOR) for training neural networks to mimic this ability in formal mathematical theorem proving. We show that, on a set of unseen proofs, REFACTOR is able to extract 19.6% of the theorems that humans would use to write the proofs. When applying the model to the existing Metamath library, REFACTOR extracted 16 new theorems. With newly extracted theorems, we show that the existing proofs in the Metamath database can be refactored. The new theorems are used very frequently after refactoring, with an average usage of 733.5 times, and help shorten the proof lengths. Lastly, we demonstrate that the prover trained on the new-theorem refactored dataset proves more test theorems and outperforms state-of-the-art baselines by frequently leveraging a diverse set of newly extracted theorems. Code can be found at https://github.com/jinpz/refactor.
Adaptive Reconvergence-driven AIG Rewriting via Strategy Learning
Ni, Liwei, Yang, Zonglin, Zhang, Jiaxi, Liu, Junfeng, Li, Huawei, Xie, Biwei, Li, Xinquan
Rewriting is a common procedure in logic synthesis aimed at improving the performance, power, and area (PPA) of circuits. The traditional reconvergence-driven And-Inverter Graph (AIG) rewriting method focuses solely on optimizing the reconvergence cone through Boolean algebra minimization. However, there exist opportunities to incorporate other node-rewriting algorithms that are better suited for specific cones. In this paper, we propose an adaptive reconvergence-driven AIG rewriting algorithm that combines two key techniques: multi-strategy-based AIG rewriting and strategy learning-based algorithm selection. The multi-strategy-based rewriting method expands upon the traditional approach by incorporating support for multi-node-rewriting algorithms, thus expanding the optimization space. Additionally, the strategy learning-based algorithm selection method determines the most suitable node-rewriting algorithm for a given cone. Experimental results demonstrate that our proposed method yields a significant average improvement of 5.567% in size and 5.327% in depth.
Why AI is your friend when it comes to cloud migration
But the drawbacks of not making the investment to rebuild your legacy apps for the cloud include technological debt, competitive disadvantages in agility, and frustrated customers left suffering poor user experiences. Organisations need to decide which applications to move to the cloud and which to keep on-premise. Then, they must decide how to refactor those apps with cloud-native technologies or create a hybrid-cloud setup - it's a complicated process. Successful cloud migrations and transformation rely on automating continuous builds, integration and delivery as well as automating performance monitoring, root-cause analysis and remediation. This 'automate everything' approach goes hand in hand with leveraging AI.
Coding habits for data scientists
While this may be fine for notebooks targeted at teaching people about the machine learning process, in real projects it's a recipe for an unmaintainable mess. The lack of good coding habits makes code hard to understand, and consequently modifying code becomes painful and error-prone. This makes it increasingly difficult for data scientists and developers to evolve their ML solutions. In this article, we'll share techniques for identifying bad habits that add to complexity in code as well as habits that can help us partition complexity.
Technical Debt in Data Science Series -- Part 1 – Acing AI – Medium
A Data Science Interview involves different challenges for a potential data scientist. As much as the interview is for the company to decide if the person is a fit, it is also for the person to decide if the company is a fit. Understanding whether a company is a fit requires one to ask the interviewers some important questions and to understand how the data team functions in different areas. Technical Debt in Data Science is one such area. My AI Interview Questions articles for Microsoft, Google, Amazon, Netflix, LinkedIn, eBay, Twitter, Walmart, Apple, Facebook, Salesforce and Uber have been very helpful to the readers.
Barriers to Refactoring
Refactoring is something software developers like to do. But do they refactor as much as they would like? Are there barriers that prevent them from doing so? Refactoring is an important tool for improving quality. Many development methodologies rely on refactoring, especially agile methodologies but also those used in more plan-driven organizations. If barriers exist, they would undermine the effectiveness of many product-development organizations. We conducted a large-scale survey in 2009 of 3,785 practitioners' use of object-oriented concepts, including questions as to whether they would refactor to deal with certain design problems. We expected either that practitioners would tell us our choice of design principles was inappropriate for basing a refactoring decision or that refactoring is the right decision to take when designs were believed to have quality problems. However, we were told the decision of whether or not to refactor was due to non-design considerations. It is now eight years since the survey, but little has changed in integrated development environment (IDE) support for refactoring, and what has changed has done little to address the barriers we identified.
ReFACTor: Practical Low-Rank Matrix Estimation Under Column-Sparsity
Gavish, Matan, Schweiger, Regev, Rahmani, Elior, Halperin, Eran
Various problems in data analysis and statistical genetics call for recovery of a column-sparse, low-rank matrix from noisy observations. We propose ReFACTor, a simple variation of the classical Truncated Singular Value Decomposition (TSVD) algorithm. In contrast to previous sparse principal component analysis (PCA) algorithms, our algorithm can provably reveal a low-rank signal matrix better, and often significantly better, than the widely used TSVD, making it the algorithm of choice whenever column-sparsity is suspected. Empirically, we observe that ReFACTor consistently outperforms TSVD even when the underlying signal is not sparse, suggesting that it is generally safe to use ReFACTor instead of TSVD and PCA. The algorithm is extremely simple to implement and its running time is dominated by the runtime of PCA, making it as practical as standard principal component analysis.