Goto

Collaborating Authors

 Accuracy


NLPeer: A Unified Resource for the Computational Study of Peer Review

arXiv.org Artificial Intelligence

Peer review constitutes a core component of scholarly publishing; yet it demands substantial expertise and training, and is susceptible to errors and biases. Various applications of NLP for peer reviewing assistance aim to support reviewers in this complex process, but the lack of clearly licensed datasets and multi-domain corpora prevent the systematic study of NLP for peer review. To remedy this, we introduce NLPeer -- the first ethically sourced multidomain corpus of more than 5k papers and 11k review reports from five different venues. In addition to the new datasets of paper drafts, camera-ready versions and peer reviews from the NLP community, we establish a unified data representation and augment previous peer review datasets to include parsed and structured paper representations, rich metadata and versioning information. We complement our resource with implementations and analysis of three reviewing assistance tasks, including a novel guided skimming task. Our work paves the path towards systematic, multi-faceted, evidence-based study of peer review in NLP and beyond. The data and code are publicly available.


A benchmark for computational analysis of animal behavior, using animal-borne tags

arXiv.org Artificial Intelligence

Animal-borne sensors ('bio-loggers') can record a suite of kinematic and environmental data, which can elucidate animal ecophysiology and improve conservation efforts. Machine learning techniques are useful for interpreting the large amounts of data recorded by bio-loggers, but there exists no standard for comparing the different machine learning techniques in this domain. To address this, we present the Bio-logger Ethogram Benchmark (BEBE), a collection of datasets with behavioral annotations, standardized modeling tasks, and evaluation metrics. BEBE is to date the largest, most taxonomically diverse, publicly available benchmark of this type, and includes 1654 hours of data collected from 149 individuals across nine taxa. We evaluate the performance of ten different machine learning methods on BEBE, and identify key challenges to be addressed in future work. Datasets, models, and evaluation code are made publicly available at https://github.com/earthspecies/BEBE, to enable community use of BEBE as a point of comparison in methods development.


Neural Network Entropy (NNetEn): Entropy-Based EEG Signal and Chaotic Time Series Classification, Python Package for NNetEn Calculation

arXiv.org Artificial Intelligence

Entropy measures are effective features for time series classification problems. Traditional entropy measures, such as Shannon entropy, use probability distribution function. However, for the effective separation of time series, new entropy estimation methods are required to characterize the chaotic dynamic of the system. Our concept of Neural Network Entropy (NNetEn) is based on the classification of special datasets in relation to the entropy of the time series recorded in the reservoir of the neural network. NNetEn estimates the chaotic dynamics of time series in an original way and does not take into account probability distribution functions. We propose two new classification metrics: R2 Efficiency and Pearson Efficiency. The efficiency of NNetEn is verified on separation of two chaotic time series of sine mapping using dispersion analysis. For two close dynamic time series (r = 1.1918 and r = 1.2243), the F-ratio has reached the value of 124 and reflects high efficiency of the introduced method in classification problems. The electroenceph-alography signal classification for healthy persons and patients with Alzheimer disease illustrates the practical application of the NNetEn features. Our computations demonstrate the synergistic effect of increasing classification accuracy when applying traditional entropy measures and the NNetEn concept conjointly. An implementation of the algorithms in Python is presented.


Efficient Fraud Detection Using Deep Boosting Decision Trees

arXiv.org Artificial Intelligence

Fraud detection is to identify, monitor, and prevent potentially fraudulent activities from complex data. The recent development and success in AI, especially machine learning, provides a new data-driven way to deal with fraud. From a methodological point of view, machine learning based fraud detection can be divided into two categories, i.e., conventional methods (decision tree, tree boosting methods...) and deep learning, both of which have significant limitations in terms of the lack of representation learning ability for the former and interpretability for the latter. Furthermore, due to the rarity of detected fraud cases, the associated data is usually imbalanced, which seriously degrades the performance of classification algorithms. In this paper, we propose deep boosting decision trees (DBDT), a novel approach for fraud detection based on gradient boosting and neural networks. In order to combine the advantages of both conventional methods and deep learning, we first construct soft decision tree (SDT), a decision tree structured model with neural networks as its nodes, and then ensemble SDTs using the idea of gradient boosting. In this way we embed neural networks into gradient boosting to improve its representation learning capability and meanwhile maintain the interpretability. Furthermore, aiming at the rarity of detected fraud cases, in the model training phase we propose a compositional AUC maximization approach to deal with data imbalances at algorithm level. Extensive experiments on several real-life fraud detection datasets show that DBDT can significantly improve the performance and meanwhile maintain good interpretability.


Solar Active Region Magnetogram Image Dataset for Studies of Space Weather

arXiv.org Artificial Intelligence

In this dataset we provide a comprehensive collection of magnetograms (images quantifying the strength of the magnetic field) from the National Aeronautics and Space Administration's (NASA's) Solar Dynamics Observatory (SDO). The dataset incorporates data from three sources and provides SDO Helioseismic and Magnetic Imager (HMI) magnetograms of solar active regions (regions of large magnetic flux, generally the source of eruptive events) as well as labels of corresponding flaring activity. This dataset will be useful for image analysis or solar physics research related to magnetic structure, its evolution over time, and its relation to solar flares. The dataset will be of interest to those researchers investigating automated solar flare prediction methods, including supervised and unsupervised machine learning (classical and deep), binary and multi-class classification, and regression. This dataset is a minimally processed, user configurable dataset of consistently sized images of solar active regions that can serve as a benchmark dataset for solar flare prediction research.


Assessing the predicting power of GPS data for aftershocks forecasting

arXiv.org Artificial Intelligence

Forecasting large aftershocks is a challenge of great importance for human security. Today we dispose of statistical predictive models called Epidemic Type Aftershock Sequence (ETAS) tuned on the earthquake catalogue of the past seismicity. This catalogues contains basic information such as the location, the time and the magnitude of an earthquake. However we dispose of much richer data set about the crust dynamics, such as the daily displacement of the ground surface, that is nowadays measured by numerous GPS stations, devices that send their absolute position everyday to sattellites, thus telling us about how the ground deforms. In this study, we propose to forecast the Japanese aftershocks by means of a machine learning study of the GPS data alone. Our results show that this method is very promising and relies on the quality and the quantity of the available data.


Improving Link Prediction in Social Networks Using Local and Global Features: A Clustering-based Approach

arXiv.org Artificial Intelligence

Link prediction problem has increasingly become prominent in many domains such as social network analyses, bioinformatics experiments, transportation networks, criminal investigations and so forth. A variety of techniques has been developed for link prediction problem, categorized into 1) similarity based approaches which study a set of features to extract similar nodes; 2) learning based approaches which extract patterns from the input data; 3) probabilistic statistical approaches which optimize a set of parameters to establish a model which can best compute formation probability. However, existing literatures lack approaches which utilize strength of each approach by integrating them to achieve a much more productive one. To tackle the link prediction problem, we propose an approach based on the combination of first and second group methods; the existing studied works use just one of these categories. Our two-phase developed method firstly determines new features related to the position and dynamic behavior of nodes, which enforce the approach more efficiency compared to approaches using mere measures. Then, a subspace clustering algorithm is applied to group social objects based on the computed similarity measures which differentiate the strength of clusters; basically, the usage of local and global indices and the clustering information plays an imperative role in our link prediction process. Some extensive experiments held on real datasets including Facebook, Brightkite and HepTh indicate good performances of our proposal method. Besides, we have experimentally verified our approach with some previous techniques in the area to prove the supremacy of ours.


Risk Assessment of Lymph Node Metastases in Endometrial Cancer Patients: A Causal Approach

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) has found many applications in medicine [15] and, more specifically, in cancer research [32] in the form of predictive models for diagnosis [14], prognosis [6] and therapy planning [12]. As a subfield of AI, Machine Learning (ML) and in particular Deep Learning (DL) has achieved significant results, especially in image processing [3]. Nonetheless, ML and DL models have limited explainability [13] because of their black-box design, which limits their adoption in the clinical field: clinicians and physicians are reluctant to include models that are not transparent in their decision process [24]. While recent research on Explainable AI (XAI) [11] has attacked this problem, DL models are still opaque and difficult to interpret. In contrast, in Probabilistic Graphical Models (PGMs) the interactions between different variables are encoded explicitly: the joint probability distribution P of the variables of interest factorizes according to a graph G, hence the "graphical" connotation. Bayesian Networks (BNs) [23], which we will describe in Section 3.1, are an instance of PGMs that can be used as causal models. In turn, this makes them ideal to use as decision support systems and overcome the limitations of the predictions based on probabilistic associations produced by other ML models [1, 19].


Solitary pulmonary nodules prediction for lung cancer patients using nomogram and machine learning

arXiv.org Artificial Intelligence

Lung cancer(LC) is a type of malignant neoplasm that originates in the bronchial mucosa or glands.As a clinically common nodule,solitary pulmonary nodules(SPNs) have a significantly higher probability of malignancy when they are larger than 8 mm in diameter.But there is also a risk of lung cancer when the diameter is less than 8mm,the purpose of this study was to create a nomogram for estimating the likelihood of lung cancer in patients with SPNs of 8 mm or smaller using computed tomography(CT) scans and biomarker information.Use CT scans and various biomarkers as input to build predictive models for the likelihood of lung cancer in patients with SPNs of 8 mm or less.The age,precursor gastrin-releasing peptide (ProGRP),gender,Carcinoembryonic Antigen(CEA),and stress corrosion cracking(SCC) were independent key tumor markers and were entered into the nomogram.The developed nomogram demonstrated strong accuracy in predicting lung cancer risk,with an internal validation area under the receiver operating characteristics curve(ROC) of 0.8474.The calibration curves plotted showed that the nomogram predicted the probability of lung cancer with good agreement with the actual probability.In this study,we finally succeeded in constructing a suitable nomogram that could predict the risk of lung cancer in patients with SPNs<=8 mm in diameter.The model has a high level of accuracy and is able to accurately distinguish between different patients,allowing clinicians to develop personalized treatment plans for individuals with SPNs.


Multimodal Short Video Rumor Detection System Based on Contrastive Learning

arXiv.org Artificial Intelligence

With the rise of short video platforms as prominent channels for news dissemination, major platforms in China have gradually evolved into fertile grounds for the proliferation of fake news. However, distinguishing short video rumors poses a significant challenge due to the substantial amount of information and shared features among videos, resulting in homogeneity. To address the dissemination of short video rumors effectively, our research group proposes a methodology encompassing multimodal feature fusion and the integration of external knowledge, considering the merits and drawbacks of each algorithm. The proposed detection approach entails the following steps: (1) creation of a comprehensive dataset comprising multiple features extracted from short videos; (2) development of a multimodal rumor detection model: first, we employ the Temporal Segment Networks (TSN) video coding model to extract video features, followed by the utilization of Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) to extract textual features. Subsequently, the BERT model is employed to fuse textual and video features; (3) distinction is achieved through contrast learning: we acquire external knowledge by crawling relevant sources and leverage a vector database to incorporate this knowledge into the classification output. Our research process is driven by practical considerations, and the knowledge derived from this study will hold significant value in practical scenarios, such as short video rumor identification and the management of social opinions.