Faili, Heshaam
Matina: A Large-Scale 73B Token Persian Text Corpus
Hosseinbeigi, Sara Bourbour, Taherinezhad, Fatemeh, Faili, Heshaam, Baghbani, Hamed, Nadi, Fatemeh, Amiri, Mostafa
Text corpora are essential for training models used in tasks such as summarization, translation, and large language models (LLMs). While various efforts have been made to collect monolingual and multilingual datasets in many languages, Persian has often been underrepresented due to limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models on key NLP tasks. Both the dataset and the preprocessing code are publicly available, enabling researchers to build on and improve this resource for future advances in Persian NLP.
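The abstract does not detail the preprocessing pipeline. As a rough illustration of one step commonly used when building such corpora, the sketch below performs exact document-level deduplication by hashing normalized documents; the normalization rules and function names are hypothetical and not taken from the paper.

```python
import hashlib

def normalize(text: str) -> str:
    # Hypothetical normalization: collapse whitespace so near-identical
    # copies of a document hash to the same value.
    return " ".join(text.split())

def deduplicate(documents):
    """Keep the first occurrence of each distinct (normalized) document."""
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

if __name__ == "__main__":
    corpus = ["این یک سند است.", "این  یک سند است.", "سند دیگری."]
    print(len(deduplicate(corpus)))  # -> 2 (the first two collapse to one)
```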
Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings
Dehghani, Mohammad, Faili, Heshaam
Spelling correction is a notable challenge in natural language processing: the task is to recognize and rectify spelling errors automatically. Developing applications that can effectively diagnose and correct Persian spelling and grammatical errors has become increasingly important for improving the quality of Persian text. Typographical error type detection in Persian is a relatively understudied area, and this paper presents an approach for detecting typographical errors in Persian texts. Our work includes a publicly available dataset called FarsTypo, which comprises 3.4 million words arranged in chronological order and tagged with their corresponding parts of speech; these words cover a wide range of topics and linguistic styles. We develop an algorithm that applies Persian-specific errors to a scalable portion of these words, resulting in a parallel dataset of correct and incorrect words. Leveraging FarsTypo, we establish a strong foundation and conduct a thorough comparison of methodologies employing different architectures. We also introduce a deep sequential neural network that uses both word and character embeddings, along with bidirectional LSTM layers, for token classification aimed at detecting typographical errors across 51 distinct classes. Our approach is contrasted with highly advanced industrial systems that, unlike this study, have been developed using a diverse range of resources. Our final method proved highly competitive, achieving 97.62% accuracy, 98.83% precision, and 98.61% recall, while surpassing the alternatives in speed.
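As a rough sketch of the kind of architecture the abstract describes (word and character embeddings feeding bidirectional LSTM layers for 51-class token classification), the PyTorch snippet below shows one plausible layout; all hyperparameters and vocabulary sizes are illustrative placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

class TypoTagger(nn.Module):
    """Sketch of a word + character BiLSTM token classifier.

    Embedding sizes, hidden sizes, and vocabulary sizes are illustrative
    placeholders, not the configuration reported in the paper.
    """

    def __init__(self, word_vocab=50_000, char_vocab=200, n_classes=51):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, 128, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, 32, padding_idx=0)
        # Character-level BiLSTM summarizes each word's spelling.
        self.char_lstm = nn.LSTM(32, 32, bidirectional=True, batch_first=True)
        # Word-level BiLSTM over [word embedding ; character summary].
        self.word_lstm = nn.LSTM(128 + 64, 128, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, w))
        _, (h, _) = self.char_lstm(chars)                 # h: (2, b*s, 32)
        char_feat = torch.cat([h[0], h[1]], dim=-1).view(b, s, 64)
        x = torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
        out, _ = self.word_lstm(x)
        return self.classifier(out)                       # (batch, seq_len, 51)

# Example: a random batch of 2 sentences, 10 words each, 15 chars per word.
logits = TypoTagger()(torch.randint(1, 50_000, (2, 10)),
                      torch.randint(1, 200, (2, 10, 15)))
print(logits.shape)  # torch.Size([2, 10, 51])
```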
PerSHOP -- A Persian dataset for shopping dialogue systems modeling
Mahmoudi, Keyvan, Faili, Heshaam
Dialogue systems are now used in many fields of industry and research, with successful instances such as Apple Siri, Google Assistant, and IBM Watson. Task-oriented dialogue systems are a category of these systems designed for specific tasks, such as booking plane tickets or making restaurant reservations. Shopping is one of the most popular areas for such systems: a bot replaces the human salesperson and interacts with customers through conversation. Training the models behind these systems requires annotated data. In this paper, we develop a dataset of Persian-language dialogues through crowd-sourcing and annotate it for model training. The dataset contains nearly 22k utterances across 1,061 dialogues in 15 different domains, making it the largest Persian dataset in this field, and it is provided freely for future researchers. We also propose baseline models for natural language understanding (NLU) covering two tasks: intent classification and entity extraction. The F1 scores are around 91% for intent classification and around 93% for entity extraction, which can serve as baselines for future research.
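The abstract does not specify the baseline architectures. A minimal, hypothetical intent classification baseline on toy Persian utterances could look like the scikit-learn sketch below; the intent labels and model choice are assumptions, not the PerSHOP annotation scheme or the paper's actual models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy Persian utterances with made-up intent labels; the real PerSHOP
# labels and baseline models may differ.
utterances = [
    "قیمت این گوشی چند است؟",     # "How much is this phone?"
    "یک کیلو سیب می‌خواهم",        # "I want a kilo of apples"
    "سفارش من کی می‌رسد؟",         # "When will my order arrive?"
]
intents = ["ask_price", "request_item", "ask_delivery"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(utterances, intents)
print(clf.predict(["این لپ‌تاپ چند است؟"]))  # likely -> ['ask_price']
```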
Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages
Azadi, Fatemeh, Faili, Heshaam, Dousti, Mohammad Javad
Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. It has gained increasing attention as an important component in practical applications of MT. In this paper, we first propose XLMRScore, a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, but applying it directly raises two issues: first, untranslated tokens lead to unexpectedly high translation scores, and second, greedy matching in XLMRScore produces mismatching errors between source and hypothesis tokens. To mitigate these issues, we respectively replace untranslated words with the unknown token and cross-lingually align the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English-Farsi test dataset introduced in this paper. Experiments show that our method achieves results comparable to the supervised baseline in two zero-shot scenarios, i.e., less than 0.01 difference in Pearson correlation, while outperforming unsupervised rivals by more than 8% on average across all the low-resource language pairs.
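To make the core idea concrete, the sketch below computes an XLMRScore-style greedy-matching score between a source sentence and an MT hypothesis using XLM-RoBERTa embeddings. The layer choice and pooling are assumptions, and the paper's two fixes (unknown-token replacement and cross-lingual alignment of the model) are not implemented here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"
tok, model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

def embed(text):
    # Token embeddings from the last hidden layer, special tokens dropped.
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0, 1:-1]   # drop <s>, </s>
    return torch.nn.functional.normalize(hidden, dim=-1)

def xlmr_score(source, hypothesis):
    src, hyp = embed(source), embed(hypothesis)
    sim = src @ hyp.T                        # cosine similarities (unit vectors)
    recall = sim.max(dim=1).values.mean()    # best match for each source token
    precision = sim.max(dim=0).values.mean() # best match for each hypothesis token
    return (2 * precision * recall / (precision + recall)).item()

print(xlmr_score("The weather is nice today.", "هوا امروز خوب است."))
```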
A New Sentence Ordering Method Using BERT Pretrained Model
Golestani, Melika, Razavi, Seyedeh Zahra, Faili, Heshaam
Building systems capable of natural language understanding (NLU) is one of the oldest areas of AI, and an essential component of NLU is detecting the logical succession of events in a text. The sentence ordering task was proposed to learn this succession of events, with applications across AI tasks. Previous works employing statistical methods perform poorly, while neural network-based approaches require large corpora for training. In this paper, we propose a sentence ordering method that needs neither a training phase nor, consequently, a large corpus. To this end, we generate sentence embeddings with the pre-trained BERT model and measure sentence similarity using the cosine similarity score, which we take as an indicator of how coherent a sequence of events is. We then order the sentences through brute-force search to maximize the overall similarity of the sequenced sentences. Our method outperformed other baselines on ROCStories, a corpus of 5-sentence human-written stories, and is especially useful compared with neural network-based methods when no huge corpus is available. Its other advantages include interpretability and independence from linguistic knowledge.
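A minimal sketch of this training-free procedure is given below: embed each sentence with a pre-trained BERT model, then brute-force all orderings and keep the one whose adjacent sentences are most similar. Mean pooling and scoring only adjacent pairs are assumptions about the exact setup, and this simple search cannot distinguish an ordering from its reverse.

```python
from itertools import permutations

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok, model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

def embed(sentence):
    # Mean-pooled BERT token embeddings as a sentence vector (an assumption).
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        vec = model(**enc).last_hidden_state.mean(dim=1)[0]
    return torch.nn.functional.normalize(vec, dim=0)

def order(sentences):
    vecs = [embed(s) for s in sentences]

    def score(perm):
        # Sum of cosine similarities between adjacent sentences in the ordering.
        return sum(float(vecs[a] @ vecs[b]) for a, b in zip(perm, perm[1:]))

    best = max(permutations(range(len(sentences))), key=score)
    return [sentences[i] for i in best]

shuffled = ["She ate it happily.", "Anna baked a cake.", "Then she sliced a piece."]
print(order(shuffled))
```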
Persian Wordnet Construction using Supervised Learning
Mousavi, Zahra, Faili, Heshaam
This paper presents an automated supervised method for Persian wordnet construction. Using a Persian corpus and a bilingual dictionary, initial links between Persian words and Princeton WordNet synsets are generated. These links are then classified as correct or incorrect by a trained classification system that employs seven features. The whole method is thus a classification system, trained on a training set that uses FarsNet as the set of correct instances. State-of-the-art results are achieved on the automatically derived Persian wordnet: the resulting wordnet includes more than 16,000 words and 22,000 synsets, with a precision of 91.18%.
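As a purely structural illustration of scoring candidate word-synset links with a seven-feature classifier, the sketch below trains a generic scikit-learn model on random placeholder features; the actual features, classifier, and training data in the paper may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 7-dimensional feature vectors for candidate (word, synset)
# links and random correct/incorrect labels, standing in for features such
# as translation overlap or gloss similarity and for FarsNet-derived labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 7))
y_train = rng.integers(0, 2, 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

candidate_links = rng.random((5, 7))
print(clf.predict_proba(candidate_links)[:, 1])  # estimated P(link is correct)
```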
Dimension Projection among Languages based on Pseudo-relevant Documents for Query Translation
Dadashkarimi, Javid, Shahshahani, Mahsa S., Tebbifakhr, Amirhossein, Faili, Heshaam, Shakery, Azadeh
Using top-ranked documents retrieved in response to a query has been shown to be an effective approach for improving the quality of query translation in dictionary-based cross-language information retrieval. In this paper, we propose a new method for dictionary-based query translation based on dimension projection of embedded vectors from the pseudo-relevant documents in the source language to their equivalents in the target language. To this end, we first learn low-dimensional vectors of the words in each pseudo-relevant collection separately and then find a query-dependent transformation matrix between the vectors of translation pairs that appear in the collections. Next, the representation of each query term is projected into the target language, and a query-dependent translation model is built by applying a softmax function. Finally, the model is used for query translation. Our experiments on four CLEF collections in French, Spanish, German, and Italian demonstrate that the proposed method outperforms a word embedding baseline based on bilingual shuffling as well as a number of other competitive baselines. The proposed method reaches up to 87% of machine translation (MT) performance on short queries and achieves considerable improvements on verbose queries.
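The sketch below illustrates the projection-and-softmax idea on synthetic vectors: fit a transformation matrix between source and target vectors of translation pairs, project a query term into the target space, and turn its similarities to target-language words into a translation distribution. The least-squares fit and toy dimensions are assumptions; the paper learns a query-dependent matrix from the pseudo-relevant documents themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs, vocab_tgt = 50, 300, 1000

S = rng.normal(size=(n_pairs, dim))        # source-side vectors of translation pairs
T = rng.normal(size=(n_pairs, dim))        # their target-language counterparts
E_tgt = rng.normal(size=(vocab_tgt, dim))  # target-language vocabulary vectors

# Solve min_W ||S W - T||_F^2 for the projection matrix W (an assumption
# about how the transformation is fit).
W, *_ = np.linalg.lstsq(S, T, rcond=None)

def translation_model(query_vec):
    projected = query_vec @ W          # map the query term into the target space
    scores = E_tgt @ projected         # similarity to every target-language word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()         # softmax translation probabilities

q = rng.normal(size=dim)
p = translation_model(q)
print(p.shape, round(p.sum(), 6))      # (1000,) 1.0
```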