AITopics

2511.00489

Country:

Asia > China (0.47)
North America > United States (0.46)
Europe > Austria > Vienna (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-3-2025, 02:56:40 GMT

Export Reviews, Discussions, Author Feedback and Meta-Reviews

I recommend moving the background section in a summarized form to the main paper while moving some of the proofs in the appendix. Unfortunately it was a delicate balancing act to fit both background material as well as our contribution within 8 pages. However, your point is well taken and we will try to fix this in the camera ready.

algorithm, iteration, tensor factorization, (11 more...)

Country: North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.33)

Neural Information Processing SystemsSep-30-2025, 09:27:09 GMT

Large-scale L-BFGS using MapReduce

L-BFGS has been applied as an effective parameter estimation method for various machine learning algorithms since 1980s. With an increasing demand to deal with massive instances and variables, it is important to scale up and parallelize L-BFGS effectively in a distributed system. In this paper, we study the problem of parallelizing the L-BFGS algorithm in large clusters of tens of thousands of shared-nothing commodity machines. First, we show that a naive implementation of L-BFGS using Map-Reduce requires either a significant amount of memory or a large number of map-reduce steps with negative performance impact. Second, we propose a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two loop recursion and greatly improves computation efficiency with a great degree of parallelism. The algorithm scales very well and enables a variety of machine learning algorithms to handle a massive number of variables over large datasets. We prove the mathematical equivalence of the new Vector-free L-BFGS and demonstrate its excellent performance and scalability using real-world machine learning problems with billions of variables in production clusters.

l-bfg, large-scale l-bfg, name change, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsJan-26-2025, 11:13:18 GMT

Reviews: Fast Parallel Algorithms for Statistical Subset Selection Problems

The authors propose a relaxation of submodularity, called differential submodularity, where the marginal gains can be bounded by two submodular functions. They use this concept to provide approximation guarantees for a parallel algorithm, namely adaptive sampling, for maximizing weak submodular functions, and show its applicability to parallel feature selection and experimental design. Overall the paper is well written and the problem is well motivated. The main motivation for parallelized algorithms is their applicability to large datasets. Although we see some speedup for relatively small datasets in the experiments, my main concern is that due to the large number of rounds in the worst case and large sample complexity, the algorithm may not scale to large datasets, especially in the actual distributed setting, (e.g.

algorithm, dataset, statistical subset selection problem, (11 more...)

Technology:

Information Technology > Architecture > Distributed Systems (0.62)
Information Technology > Artificial Intelligence (0.38)

Neural Information Processing SystemsMar-14-2024, 23:39:06 GMT

!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking.

ogwild, processor, speedup, (14 more...)

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > California (0.04)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Artificial IntelligenceFeb-10-2024

Subgroup Discovery in MOOCs: A Big Data Application for Describing Different Types of Learners

Luna, J. M., Fardoun, H. M., Padillo, F., Romero, C., Ventura, S.

The aim of this paper is to categorize and describe different types of learners in massive open online courses (MOOCs) by means of a subgroup discovery approach based on MapReduce. The final objective is to discover IF-THEN rules that appear in different MOOCs. The proposed subgroup discovery approach, which is an extension of the well-known FP-Growth algorithm, considers emerging parallel methodologies like MapReduce to be able to cope with extremely large datasets. As an additional feature, the proposal includes a threshold value to denote the number of courses that each discovered rule should satisfy. A post-processing step is also included so redundant subgroups can be removed. The experimental stage is carried out by considering de-identified data from the first year of 16 MITx and HarvardX courses on the edX platform. Experimental results demonstrate that the proposed MapReduce approach outperforms traditional sequential subgroup discovery approaches, achieving a runtime that is almost constant for different courses. Additionally, thanks to the final post-processing step, only interesting and not-redundant rules are discovered, hence reducing the number of subgroups in one or two orders of magnitude. Finally, the discovered subgroups are easily used by courses' instructors not only for descriptive purposes but also for additional tasks such as recommendation or personalization.

learner, student, subgroup, (16 more...)

doi: 10.1007/s10115-022-01674-9

2403.05555

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Spain > Andalusia > Córdoba Province > Córdoba (0.04)
Asia > Middle East > Saudi Arabia (0.04)

Genre:

Research Report (1.00)
Instructional Material > Online (1.00)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

arXiv.org Artificial IntelligenceJan-30-2024

Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model

Lin, Yu-Chen, Kumar, Akhilesh, Chang, Norman, Zhang, Wenliang, Zakir, Muhammad, Apte, Rucha, He, Haiyang, Wang, Chao, Jang, Jyh-Shing Roger

We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of embeddings' space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique; and (iv) effectively refactoring existing scripts to generate new and high-quality scripts with LLMs. By using engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques enhance the Retrieval-Augmented Generation (RAG) method in retrieving more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" for code generation problems in MapReduce applications.

information, llm, rag method, (14 more...)

2311.16267

Country:

Asia > Taiwan (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
Europe > Monaco (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (1.00)
Workflow (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

#artificialintelligenceNov-3-2022, 04:25:55 GMT

Machine Learning: Clustering & Retrieval

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to a retrieve a new document, do you need to search through all other documents? How do you group similar documents together?

clustering & retrieval, latent dirichlet allocation, machine learning, (3 more...)

#artificialintelligence

Genre: Instructional Material (0.60)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (0.40)
Education > Educational Setting > Online (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.40)

Tekdogan, Taha, Cakmak, Ali

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

arXiv.org Artificial IntelligenceSep-21-2022

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term Big Data Mining. Shifting the scope of data from small-size, structured, and stable data to huge volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way due to their architectural limitations. There are numerous parameters to take into consideration when choosing the right data management framework based on the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure the performance for classification task. To the best of our knowledge, there is no previous study in the literature that employs all these metrics while taking into consideration task-specific concerns. We show that Spark is 5 times faster than MapReduce on training the model. Nevertheless, the performance of Spark degrades when the input workload gets larger. Scaling the environment by additional clusters significantly improves the performance of Spark. However, similar enhancement is not observed in Hadoop. Machine learning utility of MapReduce tend to have better accuracy scores than that of Spark, like around 3%, even in small size data sets.

data mining, machine learning, mapreduce, (15 more...)

doi: 10.1145/3481646.3481649

2209.10637

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
(4 more...)

Genre: Research Report > New Finding (0.69)

Industry:

Health & Medicine (0.68)
Materials > Metals & Mining (0.34)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.31)

Azhir, Elham, Hosseinzadeh, Mehdi, Khan, Faheem, Mosavi, Amir

Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark

arXiv.org Artificial IntelligenceSep-17-2022

Access plan recommendation is a query optimization approach that executes new queries using prior created query execution plans (QEPs). The query optimizer divides the query space into clusters in the mentioned method. However, traditional clustering algorithms take a significant amount of execution time for clustering such large datasets. The MapReduce distributed computing model provides efficient solutions for storing and processing vast quantities of data. Apache Spark and Apache Hadoop frameworks are used in the present investigation to cluster different sizes of query datasets in the MapReduce-based access plan recommendation method. The performance evaluation is performed based on execution time. The results of the experiments demonstrated the effectiveness of parallel query clustering in achieving high scalability. Furthermore, Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.

data mining, machine learning, natural language, (18 more...)

2210.07143

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
South America > Peru > Cusco Department > Cusco Province > Cusco (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Industry: Telecommunications (0.46)

Technology:

Information Technology > Databases (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)