 preprocessing


Supplementary Materials for MEQA: A Benchmark for Multi-hop Event-centric Question Answering with Explanations

Neural Information Processing Systems

We utilize an open and widely used data format, JSON, for the MEQA dataset. An example entry contains "context": "Roadside IED kills Russian major general [...]" (the context of the question) and "question": "Who died before Al-Monitor reported it online?", followed by the strings "What event contains Al-Monitor is the communicator?", "What event is after #1 has a victim?", "Who died in the #2?", and "major general,local commander,lieutenant general". We present a list of Datasheets [Gebru et al., 2021] for the MEQA dataset, synthesizing many of the [...] For what purpose was the dataset created?


Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing

Neural Information Processing Systems

Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-the-art deep neural networks are becoming larger in size every year to deliver increasing model accuracy, and as a result, model training consumes substantial computing resources and will only consume more in the future. Using current training methods, in each iteration, to process a data point $x \in \mathbb{R}^d$ in a layer, we need to spend $\Theta(md)$ time to evaluate all the $m$ neurons in the layer. This means processing the entire layer takes $\Theta(nmd)$ time for $n$ data points. Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires exponential time to preprocess either the data or the neural network weights, making it unlikely to be practical. In this work, we present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly and dynamically detect which neurons fire at each iteration. Our method requires only $O(nmd)$ time in preprocessing and still achieves $o(nmd)$ time per iteration. We complement our new algorithm with a lower bound, proving that, assuming a popular conjecture from complexity theory, one could not substantially speed up our algorithm for dynamic detection of firing neurons.
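
The abstract only states that weight-data correlations are stored in a tree; the following is a minimal sketch of one way such a structure could work, not the authors' implementation. It assumes a neuron "fires" on a data point when its correlation $\langle w_r, x \rangle$ exceeds a threshold, and the names `CorrelationTree`, `update`, and `firing` are hypothetical. Each internal node stores the maximum of its children, so a query skips every subtree that contains no firing neuron.

```python
from typing import List


class CorrelationTree:
    """Binary max-tree over the correlations <w_r, x> for one data point x.

    Assumption (not from the abstract): neuron r fires when <w_r, x> exceeds
    a threshold. Leaves hold one correlation per neuron; internal nodes hold
    the maximum of their two children.
    """

    def __init__(self, values: List[float]) -> None:
        self.m = len(values)
        # Pad the number of leaves to a power of two with -inf placeholders.
        self.size = 1
        while self.size < self.m:
            self.size *= 2
        self.tree = [float("-inf")] * (2 * self.size)
        for r, v in enumerate(values):
            self.tree[self.size + r] = v
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = max(self.tree[2 * node], self.tree[2 * node + 1])

    def update(self, r: int, value: float) -> None:
        """Replace neuron r's correlation and repair the maxima up to the root."""
        node = self.size + r
        self.tree[node] = value
        node //= 2
        while node >= 1:
            self.tree[node] = max(self.tree[2 * node], self.tree[2 * node + 1])
            node //= 2

    def firing(self, threshold: float) -> List[int]:
        """Return the indices r with <w_r, x> > threshold, in increasing order."""
        found: List[int] = []
        stack = [1]
        while stack:
            node = stack.pop()
            if self.tree[node] <= threshold:
                continue  # no leaf below this node can fire
            if node >= self.size:
                found.append(node - self.size)
            else:
                # Push the right child first so the left child is visited first.
                stack.append(2 * node + 1)
                stack.append(2 * node)
        return found


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


# Toy example: m = 4 neurons, d = 2, one data point.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]
x = [2.0, 1.0]
tree = CorrelationTree([dot(w, x) for w in weights])
fired_before = tree.firing(1.2)

# After a weight update, only the changed leaf and its ancestors are touched.
weights[1] = [1.0, 1.0]
tree.update(1, dot(weights[1], x))
fired_after = tree.firing(1.2)
```

In this sketch, a query that reports $k$ firing neurons visits $O(k \log m)$ nodes, and an update costs $O(\log m)$ per tree after the new correlation has been computed; one such tree per data point would be needed for a batch of $n$ points.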


Scaling Legal AI: Benchmarking Mamba and Transformers for Statutory Classification and Case Law Retrieval

Maurya, Anuraj

arXiv.org Artificial Intelligence

The rapid growth of statutory corpora and judicial decisions requires scalable legal AI systems capable of classification and retrieval over extremely long contexts. Transformer-based architectures (e.g., Longformer, DeBERTa) dominate current legal NLP benchmarks but struggle with quadratic attention costs, limiting efficiency and scalability. In this work, we present the first comprehensive benchmarking of Mamba, a state-space model (SSM) with linear-time selective mechanisms, against leading transformer models for statutory classification and case law retrieval. We evaluate models on open-source legal corpora including LexGLUE, EUR-Lex, and ILDC, covering statutory tagging, judicial outcome prediction, and case retrieval tasks. Metrics include accuracy, recall at k, mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG), alongside throughput measured in tokens per second and maximum context length. Results show that Mamba's linear scaling enables processing of legal documents several times longer than transformers, while maintaining or surpassing retrieval and classification performance. This study introduces a new legal NLP benchmark suite for long-context modeling, along with open-source code and datasets to support reproducibility. Our findings highlight trade-offs between state-space models and transformers, providing guidance for deploying scalable legal AI in statutory analysis, judicial decision support, and policy research.
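
The retrieval metrics named in the abstract can be illustrated with a short sketch. It assumes binary relevance labels and uses made-up case identifiers; the paper's exact evaluation protocol (cutoffs, graded relevance, averaging) is not given here, so treat the function names and inputs as hypothetical.

```python
import math
from typing import List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], labels: List[Set[str]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(runs, labels)]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains: DCG divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: four retrieved cases, two of which are relevant.
ranked_cases = ["c3", "c1", "c7", "c2"]
relevant_cases = {"c1", "c2"}
r_at_3 = recall_at_k(ranked_cases, relevant_cases, 3)
rr = reciprocal_rank(ranked_cases, relevant_cases)
ndcg_at_4 = ndcg_at_k(ranked_cases, relevant_cases, 4)
```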


Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP

Sinha, Abhirup, Saha, Pritilata, Saha, Tithi

arXiv.org Artificial Intelligence

Privacy is a fundamental human right. Data privacy is protected by different regulations, such as GDPR. However, modern large language models require a huge amount of data to learn linguistic variations, and the data often contains private information. Research has shown that it is possible to extract private information from such language models. Thus, anonymizing such private and sensitive information is of utmost importance. While complete anonymization may not be possible, a number of different pre-processing approaches exist for masking or pseudonymizing private information in textual data. This report focuses on a few such approaches for domain-agnostic NLP tasks.
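
To make the distinction between masking and pseudonymizing concrete, here is a minimal sketch restricted to e-mail addresses. It is not one of the approaches surveyed in the report: the regular expression is a deliberately simple assumption that will miss some valid addresses, and the function names are hypothetical.

```python
import re
from typing import Dict

# Simplified e-mail pattern (an assumption; real addresses can be more varied).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> str:
    """Masking: every address becomes the same placeholder."""
    return EMAIL.sub("[EMAIL]", text)


def pseudonymize_emails(text: str) -> str:
    """Pseudonymizing: each distinct address gets its own stable placeholder,
    so repeated mentions of one address stay linked within the text."""
    mapping: Dict[str, str] = {}

    def replace(match: "re.Match[str]") -> str:
        value = match.group(0)
        if value not in mapping:
            mapping[value] = f"[EMAIL_{len(mapping) + 1}]"
        return mapping[value]

    return EMAIL.sub(replace, text)


sample = "Mail ann@example.com or bob@example.org; ann@example.com replies fastest."
masked = mask_emails(sample)
pseudonymized = pseudonymize_emails(sample)
```

Masking discards the information that the first and third mentions refer to the same address, while pseudonymizing keeps it.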


Multi-Modal Sensor Fusion for Proactive Blockage Prediction in mmWave Vehicular Networks

Nazar, Ahmad M., Celik, Abdulkadir, Selim, Mohamed Y., Abdallah, Asmaa, Qiao, Daji, Eltawil, Ahmed M.

arXiv.org Artificial Intelligence

Vehicular communication systems operating in the millimeter wave (mmWave) band are highly susceptible to signal blockage from dynamic obstacles such as vehicles, pedestrians, and infrastructure. To address this challenge, we propose a proactive blockage prediction framework that utilizes multi-modal sensing, including camera, GPS, LiDAR, and radar inputs in an infrastructure-to-vehicle (I2V) setting. This approach uses modality-specific deep learning models to process each sensor stream independently and fuses their outputs using a softmax-weighted ensemble strategy based on validation performance. Our evaluations, for predictions up to 1.5 s in advance, show that the camera-only model achieves the best standalone trade-off with an F1-score of 97.1% and an inference time of 89.8 ms. A camera+radar configuration further improves accuracy to 97.2% F1 at 95.7 ms. Our results demonstrate the effectiveness and efficiency of multi-modal sensing for mmWave blockage prediction and provide a pathway for proactive wireless communication in dynamic environments.
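
A softmax-weighted ensemble of the kind described can be sketched as follows. The abstract does not give the exact weighting formula, so this assumes that each modality's weight is the softmax of its validation score and that the fused output is the weighted sum of per-modality blockage probabilities. The `temperature` parameter and all numeric values below are illustrative placeholders, not results from the paper.

```python
import math
from typing import Dict


def softmax_weights(val_scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Turn per-modality validation scores into ensemble weights that sum to 1."""
    peak = max(val_scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - peak) / temperature) for name, s in val_scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def fuse(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-modality blockage probabilities (same keys as weights)."""
    return sum(weights[name] * probs[name] for name in weights)


# Placeholder validation scores and per-modality predictions for one sample.
val_scores = {"camera": 0.95, "radar": 0.90}
weights = softmax_weights(val_scores, temperature=0.05)
fused = fuse({"camera": 0.80, "radar": 0.60}, weights)
```

A lower temperature concentrates the weight on the best-validated modality, while a higher one moves the ensemble toward a plain average.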


How We Won the ISLES'24 Challenge by Preprocessing

Ren, Tianyi, Rivera, Juampablo E. Heras, Oswal, Hitender, Pan, Yutong, Henry, William, Walters, Sophie, Kurt, Mehmet

arXiv.org Artificial Intelligence

Stroke is among the top three causes of death worldwide, and accurate identification of stroke lesion boundaries is critical for diagnosis and treatment. Supervised deep learning methods have emerged as the leading solution for stroke lesion segmentation but require large, diverse, and annotated datasets. The ISLES'24 challenge addresses this need by providing longitudinal stroke imaging data, including CT scans taken on arrival to the hospital and follow-up MRI taken 2-9 days from initial arrival, with annotations derived from follow-up MRI. Importantly, models submitted to the ISLES'24 challenge are evaluated using only CT inputs, requiring prediction of lesion progression that may not be visible in CT scans for segmentation. Our winning solution shows that a carefully designed preprocessing pipeline including deep-learning-based skull stripping and custom intensity windowing is beneficial for accurate segmentation. Combined with a standard large residual nnU-Net architecture for segmentation, this approach achieves a mean test Dice of 28.5 with a standard deviation of 21.27.
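
Two building blocks mentioned in the abstract, intensity windowing and the Dice score, can be sketched in a few lines. The authors' custom window is not stated here, so the window bounds below are hypothetical, and the sketch operates on flat lists rather than 3D volumes.

```python
from typing import List, Sequence


def window_intensity(values: Sequence[float], low: float, high: float) -> List[float]:
    """Clip intensities to [low, high] and rescale them to [0, 1]."""
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in values]


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient of two binary masks: 2|P ∩ T| / (|P| + |T|)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical window of [0, 80]; the bounds are an assumption for illustration.
windowed = window_intensity([-1000.0, 0.0, 40.0, 80.0, 300.0], low=0.0, high=80.0)
score = dice([1, 1, 0, 0], [1, 0, 1, 0])
```

Windowing discards contrast outside the chosen range so that the remaining dynamic range is spent on the tissue of interest.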


Worth Their Weight: Randomized and Regularized Block Kaczmarz Algorithms without Preprocessing

Goldshlager, Gil, Hu, Jiang, Lin, Lin

arXiv.org Machine Learning

Due to the ever-growing amounts of data leveraged for machine learning and scientific computing, it is increasingly important to develop algorithms that sample only a small portion of the data at a time. In the case of linear least-squares, the randomized block Kaczmarz method (RBK) is an appealing example of such an algorithm, but its convergence is only understood under sampling distributions that require potentially prohibitively expensive preprocessing steps. To address this limitation, we analyze RBK when the data is sampled uniformly, showing that its iterates converge in a Monte Carlo sense to a $\textit{weighted}$ least-squares solution. Unfortunately, for general problems the condition number of the weight matrix and the variance of the iterates can become arbitrarily large. We resolve these issues by incorporating regularization into the RBK iterations. Numerical experiments, including examples arising from natural gradient optimization, suggest that the regularized algorithm, ReBlocK, outperforms minibatch stochastic gradient descent for realistic problems that exhibit fast singular value decay.
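
The idea of a regularized block Kaczmarz step with uniform sampling can be sketched as below. The abstract does not spell out the ReBlocK update, so the form used here, $x \leftarrow x + A_S^\top (A_S A_S^\top + \lambda I)^{-1} (b_S - A_S x)$ with a uniformly sampled row block $S$, is an assumption, as are the function names and the toy system. The sketch uses only the standard library, so the small block system is solved with a hand-written Gaussian elimination.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def solve(M: Matrix, rhs: Vector) -> Vector:
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = s / aug[i][i]
    return y


def regularized_block_kaczmarz(
    A: Matrix, b: Vector, block_size: int, reg: float, iters: int, seed: int = 0
) -> Vector:
    """Assumed update: x += A_S^T (A_S A_S^T + reg * I)^{-1} (b_S - A_S x)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(iters):
        # Uniform sampling of rows: no preprocessing of A is required.
        rows = rng.sample(range(n_rows), block_size)
        A_S = [A[i] for i in rows]
        resid = [b[i] - sum(a * xj for a, xj in zip(A[i], x)) for i in rows]
        # Regularized Gram matrix of the sampled block.
        gram = [
            [
                sum(p * q for p, q in zip(A_S[i], A_S[j])) + (reg if i == j else 0.0)
                for j in range(block_size)
            ]
            for i in range(block_size)
        ]
        y = solve(gram, resid)
        for j in range(n_cols):
            x[j] += sum(A_S[i][j] * y[i] for i in range(block_size))
    return x


# Toy consistent system with exact solution x* = [1, -2].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0], [1.0, 2.0]]
b = [1.0, -2.0, -1.0, 3.0, 0.0, -3.0]
x_hat = regularized_block_kaczmarz(A, b, block_size=2, reg=1e-3, iters=50)
```

The toy system is consistent, so every weighted least-squares solution coincides with the exact one; for inconsistent systems the abstract indicates that the limit is a weighted solution, which this example does not exercise.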


Diabetic Retinopathy Classification from Retinal Images using Machine Learning Approaches

Bhattacharjee, Indronil, Al-Mahmud, Mahmud, Tareq

arXiv.org Artificial Intelligence

Diabetic retinopathy (DR) is a well-known complication of diabetes that affects the eyes. Initially, it may cause no symptoms or only mild vision problems; eventually, it can cause blindness, so early detection of symptoms could help to avoid blindness. In this paper, we present experiments on features of diabetic retinopathy, such as properties of exudates, blood vessels, and microaneurysms. Using these features, we classify the healthy, mild non-proliferative, moderate non-proliferative, severe non-proliferative, and proliferative stages of DR. Support Vector Machine, Random Forest, and Naive Bayes classifiers are used to classify the stages. Random Forest performs best, with accuracy, sensitivity, and specificity of 76.5%, 77.2%, and 93.3%, respectively.