
Collaborating Authors

 Dong, Ming


Rich Semantic Knowledge Enhanced Large Language Models for Few-shot Chinese Spell Checking

arXiv.org Artificial Intelligence

Chinese Spell Checking (CSC) is a widely used technology that plays a vital role in speech-to-text (STT) and optical character recognition (OCR). Most existing CSC approaches rely on the BERT architecture and achieve excellent performance. However, limited by the scale of the foundation model, BERT-based methods do not work well in few-shot scenarios, which limits their practical applications. In this paper, we explore an in-context learning method named RS-LLM (Rich Semantic based LLMs) that introduces large language models (LLMs) as the foundation model. In addition, we study the impact of introducing various kinds of Chinese rich semantic information into our framework. We found that by introducing a small number of specific Chinese rich semantic structures, LLMs achieve better performance than the BERT-based model on the few-shot CSC task. Furthermore, we conduct experiments on multiple datasets, and the experimental results verify the superiority of our proposed framework.
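As a hedged illustration of the in-context learning idea described in this abstract, the sketch below assembles a few-shot correction prompt that attaches phonetic and glyph annotations to each example. The prompt layout, the get_pinyin and get_radicals placeholders, and the generic llm callable are hypothetical assumptions for illustration, not the paper's actual RS-LLM implementation.

```python
# Hypothetical sketch of rich-semantic few-shot prompting for CSC.
# The prompt layout, the get_pinyin/get_radicals helpers, and the llm
# callable are illustrative assumptions, not the paper's implementation.

def get_pinyin(sentence):
    # Placeholder: return per-character pinyin, e.g. via the pypinyin package.
    return "..."

def get_radicals(sentence):
    # Placeholder: return per-character radical/glyph-structure information.
    return "..."

def build_csc_prompt(examples, query):
    """Assemble a few-shot prompt that pairs each sentence with its
    phonetic (pinyin) and glyph (radical) annotations before the correction."""
    parts = ["Correct the spelling errors in the Chinese sentence."]
    for wrong, right in examples:
        parts.append(f"Input: {wrong}\nPinyin: {get_pinyin(wrong)}\n"
                     f"Radicals: {get_radicals(wrong)}\nCorrected: {right}")
    parts.append(f"Input: {query}\nPinyin: {get_pinyin(query)}\n"
                 f"Radicals: {get_radicals(query)}\nCorrected:")
    return "\n\n".join(parts)

def correct(llm, examples, query):
    # llm is any callable that maps a prompt string to a completion string.
    return llm(build_csc_prompt(examples, query))
```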


Data-oriented Dynamic Fine-tuning Parameter Selection Strategy for FISH Mask based Efficient Fine-tuning

arXiv.org Artificial Intelligence

Given the huge number of parameters in large language models (LLMs), tuning all parameters is very costly, so fine-tuning only specific parameters is more sensible. Most parameter-efficient fine-tuning (PEFT) approaches concentrate on parameter selection strategies, such as additive, selective, and reparametrization-based methods. However, few methods consider the impact of data samples on parameter selection; the FISH Mask based method is one of them. FISH Mask randomly chooses a subset of data samples and treats them equally during parameter selection, so it cannot dynamically select optimal parameters for varying data distributions. In this work, we adopt a data-oriented perspective and propose an IRD (Iterative sample-parameter Range Decreasing) algorithm to search for the best sample-parameter pair for FISH Mask. In each iteration, by searching for the set of samples and parameters with larger Fisher information, IRD can find a better sample-parameter pair at most scales. We demonstrate the effectiveness and rationality of the proposed strategy by conducting experiments on the GLUE benchmark. Experimental results show that our strategy optimizes parameter selection and achieves preferable performance.
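To make the Fisher-information-based selection concrete, here is a minimal PyTorch sketch of a FISH-Mask-style mask computation with an iterative sample-range reduction loosely in the spirit of IRD. The fisher_scores, fish_mask, and iterative_range_decreasing functions, the loss_fn signature, the keep_ratio value, and the halving schedule are all assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative FISH-Mask-style parameter selection with an iterative
# sample-range reduction; function names and hyperparameters are assumptions.
import torch

def fisher_scores(model, loss_fn, batch):
    """Approximate diagonal Fisher information as squared gradients on one batch.
    loss_fn is assumed to return a scalar loss given (model, batch)."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return {n: p.grad.detach() ** 2 for n, p in model.named_parameters()
            if p.grad is not None}

def fish_mask(model, loss_fn, samples, keep_ratio=0.02):
    """Keep the top keep_ratio fraction of parameters by accumulated Fisher score."""
    totals = None
    for batch in samples:
        scores = fisher_scores(model, loss_fn, batch)
        totals = scores if totals is None else {n: totals[n] + scores[n] for n in totals}
    flat = torch.cat([v.flatten() for v in totals.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return {n: (v >= threshold).float() for n, v in totals.items()}

def iterative_range_decreasing(model, loss_fn, samples, steps=3):
    """Halve the candidate sample set each step, keeping the samples that
    carry the largest total Fisher information, then rebuild the mask."""
    current = list(samples)
    for _ in range(steps):
        ranked = sorted(
            current,
            key=lambda b: float(sum(v.sum() for v in fisher_scores(model, loss_fn, b).values())),
            reverse=True)
        current = ranked[: max(1, len(ranked) // 2)]
    return fish_mask(model, loss_fn, current)
```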


Automated Identification of Toxic Code Reviews Using ToxiCR

arXiv.org Artificial Intelligence

Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, become demotivated, and eventually leave the project. Automated filtering of toxic conversations may help a FOSS community maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) datasets, such as one curated from code review comments. To address this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR offers a choice of ten supervised learning algorithms, a selection of text vectorization techniques, eight preprocessing steps, and a large-scale labeled dataset of 19,571 code review comments. Two of those eight preprocessing steps are SE domain specific. Through a rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset, which achieves 95.8% accuracy and an 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have publicly released our dataset, pre-trained models, evaluation results, and source code at: https://github.com/WSU-SEAL/ToxiCR
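As a minimal sketch of the general pipeline this abstract describes (supervised classification of code review comments with text vectorization and SE-flavored preprocessing), the example below combines an identifier-splitting preprocessor, TF-IDF vectorization, and logistic regression. The split_identifiers step, the tiny toy dataset, and the model choice are illustrative assumptions and not the released ToxiCR pipeline.

```python
# Minimal toxicity-classifier sketch for code review comments; the preprocessing
# step and toy data are assumptions, not the ToxiCR implementation.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def split_identifiers(text):
    """SE-flavored preprocessing: split camelCase and snake_case identifiers."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return text.replace("_", " ").lower()

def build_classifier():
    return Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=split_identifiers, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Usage on a tiny toy dataset (real training would use the labeled comments).
comments = ["this patch is garbage, learn to code", "LGTM, nice refactor of parseUrl"]
labels = [1, 0]
model = build_classifier()
model.fit(comments, labels)
print(model.predict(["great cleanup of the snake_case helpers"]))
```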


A Pattern Recognition Method for Partial Discharge Detection on Insulated Overhead Conductors

arXiv.org Artificial Intelligence

Today, insulated overhead conductors are increasingly used in many parts of the world due to their higher operational reliability, elimination of phase-to-phase contact, closer distances between phases, and stronger protection for animals. However, standard protection devices are often unable to detect conductor phase-to-ground faults and the more frequent events of trees or tree branches hitting conductors, as these events only lead to partial discharge (PD) activity instead of the overcurrent seen on bare conductors. To solve this problem, in recent years the Technical University of Ostrava (VSB) devised a special meter to measure the voltage signal of the stray electrical field along insulated overhead conductors, hoping to detect these hazardous PD activities. In 2018, VSB published a large amount of waveform data recorded by their meter on Kaggle, the world's largest data science collaboration platform, looking for promising pattern recognition methods for this application. To tackle this challenge, we developed a method based on Seasonal and Trend decomposition using Loess (STL) and a Support Vector Machine (SVM) to recognize PD activities on insulated overhead conductors. Different SVM kernels were tested and compared. Satisfactory classification rates on the VSB dataset were achieved with the Gaussian radial basis kernel.
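The following sketch illustrates the STL-plus-SVM idea at a high level: decompose each waveform, summarize the residual (where PD pulses would appear as heavy-tailed noise), and classify with an RBF-kernel SVM. The stl_features summary statistics, the synthetic signals, and the chosen STL period are assumptions for illustration, not the exact features used on the VSB dataset.

```python
# Hedged sketch of PD recognition via STL decomposition followed by an RBF SVM;
# the residual-based features and synthetic signals are illustrative assumptions.
import numpy as np
from statsmodels.tsa.seasonal import STL
from sklearn.svm import SVC

def stl_features(signal, period=20):
    """Decompose the waveform and summarize the residual, where PD pulses
    are expected to show up as high-variance, heavy-tailed noise."""
    res = STL(signal, period=period).fit()
    r = res.resid
    return [np.std(r), np.max(np.abs(r)), np.mean(np.abs(r))]

rng = np.random.default_rng(0)
t = np.linspace(0, 10 * 2 * np.pi, 200)  # 10 cycles, 20 samples per cycle

def make_signal(with_pd):
    base = np.sin(t) + 0.05 * rng.standard_normal(t.size)
    if with_pd:  # inject a few sharp pulses to mimic PD activity
        idx = rng.choice(t.size, 5, replace=False)
        base[idx] += rng.uniform(0.5, 1.0, 5)
    return base

X = np.array([stl_features(make_signal(i % 2 == 1)) for i in range(40)])
y = np.array([i % 2 for i in range(40)])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.score(X, y))
```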


Combining Unsupervised and Supervised Learning for Asset Class Failure Prediction in Power Systems

arXiv.org Machine Learning

In power systems, an asset class is a group of power equipment that has the same function and shares similar electrical or mechanical characteristics. Predicting failures for different asset classes is critical for electric utilities in developing cost-effective asset management strategies. Previously, the physical-age-based Weibull distribution has been widely used for failure prediction. However, this mathematical model cannot incorporate asset condition data such as inspection or testing results, so its predictions cannot be very specific or accurate for individual assets. To solve this important problem, this paper proposes a novel and comprehensive data-driven approach based on asset condition data: K-means clustering, as an unsupervised learning method, is used to analyze the inner structure of historical asset condition data and produce asset conditional ages; logistic regression, as a supervised learning method, takes in both asset physical ages and conditional ages to classify and predict asset statuses. Furthermore, an index called the average aging rate is defined to quantify, track, and estimate the relationship between asset physical age and conditional age. This approach was applied to an urban distribution system in West Canada to predict medium-voltage cable failures. Case studies and a comparison with the standard Weibull distribution are provided. The proposed approach demonstrates superior performance and practicality for predicting asset class failures in power systems.

I. INTRODUCTION

Today, more and more electric utilities are mandated by regulators to develop cost-effective long-term asset management strategies to reduce overall cost while maintaining system reliability [1-2]. Sophisticated and optimal asset management strategies can only be established based on the accurate prediction of asset failures in the future.
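To make the two-stage idea concrete, here is an illustrative sketch: K-means clusters the asset condition data to derive a "conditional age" per asset, and logistic regression then predicts failure status from physical age plus conditional age. The synthetic data, the cluster-to-conditional-age mapping, and the feature choices are assumptions for illustration only, not the paper's actual data or model.

```python
# Illustrative two-stage sketch: K-means on condition data -> conditional age,
# then logistic regression on [physical age, conditional age]. Synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 300
physical_age = rng.uniform(1, 40, n)                           # years in service
condition = physical_age[:, None] + rng.normal(0, 5, (n, 3))   # e.g. test scores
failed = (physical_age + rng.normal(0, 5, n) > 30).astype(int)

# Stage 1: cluster the condition data and rank clusters by mean condition,
# so worse-condition clusters map to larger conditional ages.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(condition)
order = np.argsort(km.cluster_centers_.mean(axis=1))
rank = {c: i for i, c in enumerate(order)}
conditional_age = np.array([rank[c] for c in km.labels_], dtype=float)

# Stage 2: supervised failure classification on both ages.
X = np.column_stack([physical_age, conditional_age])
clf = LogisticRegression().fit(X, failed)
print("training accuracy:", clf.score(X, failed))
```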


A Hybrid Long-Term Load Forecasting Model for Distribution Feeder Peak Demand using LSTM Neural Network

arXiv.org Machine Learning

The Long Short-Term Memory (LSTM) neural network is an enhanced recurrent neural network (RNN) that has gained significant attention in recent years. It solves the vanishing and exploding gradient problems of a standard RNN and has been successfully applied to a variety of time-series forecasting problems. In power systems, long-term load forecasting for distribution feeders is a critical task that many electric utility companies perform on an annual basis. The goal of this task is to forecast the load change on existing distribution feeders for the next few years. The forecasted results are used as input to long-term system planning studies to determine necessary system upgrades so that the distribution system can continue to operate reliably during normal operation and contingencies. This research proposes a comprehensive hybrid model based on an LSTM neural network for this classic and important forecasting task. It not only combines the advantages of top-down and bottom-up forecasting models but also leverages the time-series characteristics of multi-year data. This paper first explains the concept of the LSTM neural network and then discusses feature selection, feature engineering, and model establishment in detail. Finally, a real-world application example for a large urban grid in West Canada is provided. The results are compared to other models such as bottom-up, ARIMA, and ANN. The proposed model demonstrates superior performance and great practicality for forecasting long-term peak demand for distribution feeders.
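Below is a minimal PyTorch sketch of the core modeling step: an LSTM regressor that maps a feeder's multi-year feature history to a next-year peak demand value. The FeederLSTM class, the feature count, the network size, and the random toy data are illustrative assumptions and not the paper's exact hybrid model.

```python
# Minimal LSTM regressor for feeder peak-demand forecasting; network size,
# feature set, and toy data are assumptions for illustration.
import torch
import torch.nn as nn

class FeederLSTM(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):               # x: (batch, years, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # predict from the last time step

# Toy training loop on random data standing in for per-feeder yearly features.
torch.manual_seed(0)
x = torch.randn(64, 5, 4)               # 64 feeders, 5 years, 4 features
y = torch.randn(64, 1)                  # next-year peak demand (normalized)

model = FeederLSTM(n_features=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```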


Residential Transformer Overloading Risk Assessment Using Clustering Analysis

arXiv.org Artificial Intelligence

The residential transformer population is a critical asset class that many electric utility companies have been attempting to manage proactively and effectively to reduce the unexpected failures and life losses that are often caused by transformer overloading. Within the typical power asset portfolio, the residential transformer fleet is often large in population, has the lowest reliability design, lacks transformer loading data, and is susceptible to customer loading behaviors such as the adoption of distributed energy resources and electric vehicles. On the bright side, the availability of more residential operation data, along with the advancement of data analytics techniques, has provided a new path toward a statistical understanding of local residential transformer overloading risks. This research developed a new data-driven method that combines clustering analysis with simulation of transformer temperature rise and insulation life loss to quantitatively and statistically assess the overloading risk of the residential transformer population in an area and to suggest proper risk management measures based on the assessment results. Case studies from an actual Canadian utility company are presented and discussed in detail to demonstrate the applicability and usefulness of the proposed method.
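As a hedged sketch of the assessment idea, the example below clusters synthetic daily residential load profiles and then runs a simplified hot-spot temperature and relative-aging calculation on each cluster center. The synthetic profiles, the aging_factor thermal constants, and the single-step temperature model are rough stand-ins for a full IEEE C57.91-style simulation and are not the paper's actual method or data.

```python
# Hedged sketch: cluster daily load profiles, then apply a simplified hot-spot
# temperature / relative-aging calculation per cluster center. Synthetic data;
# thermal constants are rough illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
hours = np.arange(24)
# Synthetic per-transformer daily loading in per-unit of nameplate rating.
profiles = (0.6 + 0.4 * np.sin((hours - 18) / 24 * 2 * np.pi)
            + 0.1 * rng.standard_normal((200, 24)))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)

def aging_factor(load_pu, ambient=30.0, oil_rise=50.0, hotspot_rise=30.0):
    """Simplified hot-spot temperature (load-squared rises over ambient) and
    relative aging in Arrhenius form, per IEEE C57.91's aging acceleration factor."""
    hot_spot = ambient + (oil_rise + hotspot_rise) * load_pu ** 2
    return np.exp((15000 / 383.0) - (15000 / (hot_spot + 273.0)))

for i, centre in enumerate(km.cluster_centers_):
    daily_aging = aging_factor(centre).mean()   # average relative aging rate
    share = np.mean(km.labels_ == i)
    print(f"cluster {i}: {share:.0%} of units, relative aging {daily_aging:.2f}")
```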