Singh, Mayank
Unveiling the Multi-Annotation Process: Examining the Influence of Annotation Quantity and Instance Difficulty on Model Performance
Kadasi, Pritam, Singh, Mayank
The NLP community has long advocated for the construction of multi-annotator datasets to better capture the nuances of language interpretation, subjectivity, and ambiguity. This paper conducts a retrospective study to show how performance scores can vary when a dataset expands from a single annotation per instance to multiple annotations. We propose a novel multi-annotator simulation process to generate datasets with varying annotation budgets. We show that similar datasets with the same annotation budget can lead to varying performance gains. Our findings challenge the popular belief that models trained on multi-annotation examples always lead to better performance than models trained on single or few-annotation examples.
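As a rough illustration (not the authors' implementation), a multi-annotator simulation of this kind can be sketched as follows: distribute a fixed annotation budget over instances, collect extra labels from simulated annotators of varying quality, and aggregate by majority vote. All names and the noisy-annotator model below are illustrative assumptions.

    import random
    from collections import Counter

    def noisy_annotator(flip_prob):
        """Simulated annotator that flips the reference label with probability flip_prob."""
        def annotate(instance):
            return instance["label"] if random.random() > flip_prob else 1 - instance["label"]
        return annotate

    def simulate_annotations(instances, annotators, budget, seed=0):
        """Distribute `budget` extra annotations over instances and aggregate by majority vote."""
        rng = random.Random(seed)
        votes = {i: [inst["label"]] for i, inst in enumerate(instances)}  # one seed annotation each
        for _ in range(budget):
            i = rng.randrange(len(instances))      # instance chosen for re-annotation
            votes[i].append(rng.choice(annotators)(instances[i]))
        return [Counter(v).most_common(1)[0][0] for v in votes.values()]

    # Toy usage: 100 binary instances, two annotators of differing quality, 200 extra labels
    data = [{"label": random.randint(0, 1)} for _ in range(100)]
    labels = simulate_annotations(data, [noisy_annotator(0.1), noisy_annotator(0.3)], budget=200)

Varying `budget` while holding the data fixed gives the family of "similar datasets with the same annotation budget" that the study compares.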
Unlocking Model Insights: A Dataset for Automated Model Card Generation
Singh, Shruti, Lodwal, Hitesh, Malwat, Husain, Thakur, Rakesh, Singh, Mayank
Language models (LMs) are no longer restricted to the ML community, and instruction-tuned LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that our understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a popular practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 ML models that cover crucial aspects of each model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original papers. Further, we explore the capabilities of LMs in generating model cards by answering these questions. Our initial experiments with ChatGPT-3.5, LLaMa, and Galactica reveal a significant gap in these LMs' ability to understand research papers and to generate factual textual responses. We posit that our dataset can be used to train models to automate the generation of model cards from paper text and thus reduce human effort in the model card curation process. The complete dataset is available at https://osf.io/hqt7p/?view_only=3b9114e3904c4443bcd9f5c270158d37
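For intuition only, one record of such a question-answer dataset and the prompting setup it implies might look like the sketch below; the field names and the `build_prompt` helper are hypothetical, not the released schema.

    # Hypothetical record layout for one model-card question-answer pair.
    record = {
        "model": "BERT-base",
        "aspect": "training resources",
        "question": "What hardware was the model trained on?",
        "answer_span": "4 Cloud TPUs in Pod configuration",
        "evidence_section": "Pre-training details",
    }

    def build_prompt(paper_text: str, question: str) -> str:
        """Frame model-card generation as extractive QA over the paper text."""
        return (
            "Answer the question using only the paper excerpt below.\n\n"
            f"Paper excerpt:\n{paper_text}\n\nQuestion: {question}\nAnswer:"
        )

    prompt = build_prompt("We pre-trained the model on 4 Cloud TPUs ...", record["question"])
    print(prompt)  # this prompt would then be sent to an LM such as LLaMa or Galactica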
Modeling interdisciplinary interactions among Physics, Mathematics & Computer Science
Hazra, Rima, Singh, Mayank, Goyal, Pawan, Adhikari, Bibhas, Mukherjee, Animesh
Interdisciplinarity has gained tremendous importance in recent years and has become one of the key ways of doing cutting-edge research. In this paper, we attempt to model the citation flow across three different fields: Physics (PHY), Mathematics (MA), and Computer Science (CS). For instance, is there a specific pattern in which these fields cite one another? We carry out experiments on a dataset comprising more than 1.2 million articles taken from these three fields. We quantify the citation interactions among the three fields through temporal bucket signatures. We present numerical models based on variants of the recently proposed relay-linking framework to explain the citation dynamics across the three disciplines. These models make a modest attempt to unfold the underlying principles of how citation links could have formed across the three fields over time.
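A minimal sketch of one way a temporal bucket signature could be computed, assuming each citation edge carries a source field, a target field, and a year; the bucketing scheme and variable names are illustrative rather than taken from the paper.

    from collections import defaultdict

    def temporal_bucket_signature(citations, bucket_size=5):
        """Fraction of citations flowing from each source field to each target field,
        computed separately per time bucket.

        citations : iterable of (source_field, target_field, year) tuples
        """
        counts = defaultdict(lambda: defaultdict(int))   # bucket -> (src, dst) -> count
        totals = defaultdict(int)                        # bucket -> total citations
        for src, dst, year in citations:
            bucket = (year // bucket_size) * bucket_size
            counts[bucket][(src, dst)] += 1
            totals[bucket] += 1
        return {
            bucket: {pair: n / totals[bucket] for pair, n in pairs.items()}
            for bucket, pairs in counts.items()
        }

    # Toy usage with three synthetic citation edges among PHY, MA, and CS
    edges = [("CS", "MA", 1998), ("CS", "PHY", 2003), ("PHY", "PHY", 2004)]
    print(temporal_bucket_signature(edges))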
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Workshop, BigScience, Scao, Teven Le, Fan, Angela, Akiki, Christopher, Pavlick, Ellie, Ilić, Suzana, Hesslow, Daniel, Castagné, Roman, Luccioni, Alexandra Sasha, Yvon, François, Gallé, Matthias, Tow, Jonathan, Rush, Alexander M., Biderman, Stella, Webson, Albert, Ammanamanchi, Pawan Sasanka, Wang, Thomas, Sagot, Benoît, Muennighoff, Niklas, del Moral, Albert Villanova, Ruwase, Olatunji, Bawden, Rachel, Bekman, Stas, McMillan-Major, Angelina, Beltagy, Iz, Nguyen, Huu, Saulnier, Lucile, Tan, Samson, Suarez, Pedro Ortiz, Sanh, Victor, Laurençon, Hugo, Jernite, Yacine, Launay, Julien, Mitchell, Margaret, Raffel, Colin, Gokaslan, Aaron, Simhi, Adi, Soroa, Aitor, Aji, Alham Fikri, Alfassy, Amit, Rogers, Anna, Nitzav, Ariel Kreisberg, Xu, Canwen, Mou, Chenghao, Emezue, Chris, Klamm, Christopher, Leong, Colin, van Strien, Daniel, Adelani, David Ifeoluwa, Radev, Dragomir, Ponferrada, Eduardo González, Levkovizh, Efrat, Kim, Ethan, Natan, Eyal Bar, De Toni, Francesco, Dupont, Gérard, Kruszewski, Germán, Pistilli, Giada, Elsahar, Hady, Benyamina, Hamza, Tran, Hieu, Yu, Ian, Abdulmumin, Idris, Johnson, Isaac, Gonzalez-Dios, Itziar, de la Rosa, Javier, Chim, Jenny, Dodge, Jesse, Zhu, Jian, Chang, Jonathan, Frohberg, Jörg, Tobing, Joseph, Bhattacharjee, Joydeep, Almubarak, Khalid, Chen, Kimbo, Lo, Kyle, Von Werra, Leandro, Weber, Leon, Phan, Long, allal, Loubna Ben, Tanguy, Ludovic, Dey, Manan, Muñoz, Manuel Romero, Masoud, Maraim, Grandury, María, Šaško, Mario, Huang, Max, Coavoux, Maximin, Singh, Mayank, Jiang, Mike Tian-Jian, Vu, Minh Chien, Jauhar, Mohammad A., Ghaleb, Mustafa, Subramani, Nishant, Kassner, Nora, Khamis, Nurulaqilla, Nguyen, Olivier, Espejel, Omar, de Gibert, Ona, Villegas, Paulo, Henderson, Peter, Colombo, Pierre, Amuok, Priscilla, Lhoest, Quentin, Harliman, Rheza, Bommasani, Rishi, López, Roberto Luis, Ribeiro, Rui, Osei, Salomey, Pyysalo, Sampo, Nagel, Sebastian, Bose, Shamik, Muhammad, Shamsuddeen Hassan, Sharma, Shanya, Longpre, Shayne, Nikpoor, Somaieh, Silberberg, Stanislav, Pai, Suhas, Zink, Sydney, Torrent, Tiago Timponi, Schick, Timo, Thrush, Tristan, Danchev, Valentin, Nikoulina, Vassilina, Laippala, Veronika, Lepercq, Violette, Prabhu, Vrinda, Alyafeai, Zaid, Talat, Zeerak, Raja, Arun, Heinzerling, Benjamin, Si, Chenglei, Taşar, Davut Emre, Salesky, Elizabeth, Mielke, Sabrina J., Lee, Wilson Y., Sharma, Abheesht, Santilli, Andrea, Chaffin, Antoine, Stiegler, Arnaud, Datta, Debajyoti, Szczechla, Eliza, Chhablani, Gunjan, Wang, Han, Pandey, Harshit, Strobelt, Hendrik, Fries, Jason Alan, Rozen, Jos, Gao, Leo, Sutawika, Lintang, Bari, M Saiful, Al-shaibani, Maged S., Manica, Matteo, Nayak, Nihal, Teehan, Ryan, Albanie, Samuel, Shen, Sheng, Ben-David, Srulik, Bach, Stephen H., Kim, Taewoon, Bers, Tali, Fevry, Thibault, Neeraj, Trishala, Thakker, Urmish, Raunak, Vikas, Tang, Xiangru, Yong, Zheng-Xin, Sun, Zhiqing, Brody, Shaked, Uri, Yallow, Tojarieh, Hadar, Roberts, Adam, Chung, Hyung Won, Tae, Jaesung, Phang, Jason, Press, Ofir, Li, Conglong, Narayanan, Deepak, Bourfoune, Hatim, Casper, Jared, Rasley, Jeff, Ryabinin, Max, Mishra, Mayank, Zhang, Minjia, Shoeybi, Mohammad, Peyrounette, Myriam, Patry, Nicolas, Tazi, Nouamane, Sanseviero, Omar, von Platen, Patrick, Cornette, Pierre, Lavallée, Pierre François, Lacroix, Rémi, Rajbhandari, Samyam, Gandhi, Sanchit, Smith, Shaden, Requena, Stéphane, Patil, Suraj, Dettmers, Tim, Baruwa, Ahmed, Singh, Amanpreet, Cheveleva, Anastasia, Ligozat, Anne-Laure, Subramonian, Arjun, Névéol, Aurélie, Lovering,
Charles, Garrette, Dan, Tunuguntla, Deepak, Reiter, Ehud, Taktasheva, Ekaterina, Voloshina, Ekaterina, Bogdanov, Eli, Winata, Genta Indra, Schoelkopf, Hailey, Kalo, Jan-Christoph, Novikova, Jekaterina, Forde, Jessica Zosa, Clive, Jordan, Kasai, Jungo, Kawamura, Ken, Hazan, Liam, Carpuat, Marine, Clinciu, Miruna, Kim, Najoung, Cheng, Newton, Serikov, Oleg, Antverg, Omer, van der Wal, Oskar, Zhang, Rui, Zhang, Ruochen, Gehrmann, Sebastian, Mirkin, Shachar, Pais, Shani, Shavrina, Tatiana, Scialom, Thomas, Yun, Tian, Limisiewicz, Tomasz, Rieser, Verena, Protasov, Vitaly, Mikhailov, Vladislav, Pruksachatkun, Yada, Belinkov, Yonatan, Bamberger, Zachary, Kasner, Zdeněk, Rueda, Alice, Pestana, Amanda, Feizpour, Amir, Khan, Ammar, Faranak, Amy, Santos, Ana, Hevia, Anthony, Unldreaj, Antigona, Aghagol, Arash, Abdollahi, Arezoo, Tammour, Aycha, HajiHosseini, Azadeh, Behroozi, Bahareh, Ajibade, Benjamin, Saxena, Bharat, Ferrandis, Carlos Muñoz, McDuff, Daniel, Contractor, Danish, Lansky, David, David, Davis, Kiela, Douwe, Nguyen, Duong A., Tan, Edward, Baylor, Emi, Ozoani, Ezinwanne, Mirza, Fatima, Ononiwu, Frankline, Rezanejad, Habib, Jones, Hessie, Bhattacharya, Indrani, Solaiman, Irene, Sedenko, Irina, Nejadgholi, Isar, Passmore, Jesse, Seltzer, Josh, Sanz, Julio Bonis, Dutra, Livia, Samagaio, Mairon, Elbadri, Maraim, Mieskes, Margot, Gerchick, Marissa, Akinlolu, Martha, McKenna, Michael, Qiu, Mike, Ghauri, Muhammed, Burynok, Mykola, Abrar, Nafis, Rajani, Nazneen, Elkott, Nour, Fahmy, Nour, Samuel, Olanrewaju, An, Ran, Kromann, Rasmus, Hao, Ryan, Alizadeh, Samira, Shubber, Sarmad, Wang, Silas, Roy, Sourav, Viguier, Sylvain, Le, Thanh, Oyebade, Tobi, Le, Trieu, Yang, Yoyo, Nguyen, Zach, Kashyap, Abhinav Ramesh, Palasciano, Alfredo, Callahan, Alison, Shukla, Anima, Miranda-Escalada, Antonio, Singh, Ayush, Beilharz, Benjamin, Wang, Bo, Brito, Caio, Zhou, Chenxi, Jain, Chirag, Xu, Chuxin, Fourrier, Clémentine, Periñán, Daniel León, Molano, Daniel, Yu, Dian, Manjavacas, Enrique, Barth, Fabio, Fuhrimann, Florian, Altay, Gabriel, Bayrak, Giyaseddin, Burns, Gully, Vrabec, Helena U., Bello, Imane, Dash, Ishani, Kang, Jihyun, Giorgi, John, Golde, Jonas, Posada, Jose David, Sivaraman, Karthik Rangasai, Bulchandani, Lokesh, Liu, Lu, Shinzato, Luisa, de Bykhovetz, Madeleine Hahn, Takeuchi, Maiko, Pàmies, Marc, Castillo, Maria A, Nezhurina, Marianna, Sänger, Mario, Samwald, Matthias, Cullan, Michael, Weinberg, Michael, De Wolf, Michiel, Mihaljcic, Mina, Liu, Minna, Freidank, Moritz, Kang, Myungsun, Seelam, Natasha, Dahlberg, Nathan, Broad, Nicholas Michio, Muellner, Nikolaus, Fung, Pascale, Haller, Patrick, Chandrasekhar, Ramya, Eisenberg, Renata, Martin, Robert, Canalli, Rodrigo, Su, Rosaline, Su, Ruisi, Cahyawijaya, Samuel, Garda, Samuele, Deshmukh, Shlok S, Mishra, Shubhanshu, Kiblawi, Sid, Ott, Simon, Sang-aroonsiri, Sinee, Kumar, Srishti, Schweter, Stefan, Bharati, Sushil, Laud, Tanmay, Gigant, Théo, Kainuma, Tomoya, Kusa, Wojciech, Labrak, Yanis, Bajaj, Yash Shailesh, Venkatraman, Yash, Xu, Yifan, Xu, Yingxin, Xu, Yu, Tan, Zhe, Xie, Zhongli, Ye, Zifan, Bras, Mathilde, Belkada, Younes, Wolf, Thomas
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
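Because the checkpoints are released openly, a few lines with the Hugging Face transformers library are enough to query a BLOOM model; the sketch below uses the small bloom-560m variant so it runs on modest hardware, and the prompt and generation settings are only illustrative.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Any checkpoint from the BLOOM family works; the 560M variant keeps memory needs small.
    checkpoint = "bigscience/bloom-560m"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer("BLOOM is a multilingual language model that", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))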
Analogy-Forming Transformers for Few-Shot 3D Parsing
Gkanatsios, Nikolaos, Singh, Mayank, Fang, Zhaoyuan, Tulsiani, Shubham, Fragkiadaki, Katerina
We present Analogical Networks, a model that encodes domain knowledge explicitly, in a collection of structured labelled 3D scenes, in addition to implicitly, as model parameters, and segments 3D object scenes with analogical reasoning: instead of mapping a scene to part segments directly, our model first retrieves related scenes from memory along with their corresponding part structures, and then predicts analogous part structures for the input scene via an end-to-end learnable modulation mechanism. By conditioning on more than one retrieved memory, the model predicts compositions of structures that mix and match parts across the retrieved memories. One-shot, few-shot, and many-shot learning are treated uniformly in Analogical Networks by conditioning on the appropriate set of memories, whether taken from a single exemplar, a few exemplars, or many, and inferring analogous parses. We show that Analogical Networks are competitive with state-of-the-art 3D segmentation transformers in many-shot settings and outperform them, as well as existing meta-learning and few-shot learning paradigms, in few-shot settings. Analogical Networks successfully segment instances of novel object categories simply by expanding their memory, without any weight updates. Our code and models are publicly available on the project webpage: http://analogicalnets.github.io/.
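The retrieve-then-predict idea can be sketched independently of the actual architecture; the toy code below is a schematic stand-in (nearest-memory retrieval followed by proposing the retrieved part structures), not the released model, and every function and embedding in it is a placeholder assumption.

    import numpy as np

    def retrieve(memory_embeddings, query_embedding, k=2):
        """Return indices of the k most similar memory scenes (cosine similarity)."""
        sims = memory_embeddings @ query_embedding / (
            np.linalg.norm(memory_embeddings, axis=1) * np.linalg.norm(query_embedding) + 1e-8
        )
        return np.argsort(-sims)[:k]

    def predict_parts(query_embedding, memory_embeddings, memory_part_labels, k=2):
        """Analogy-style prediction: propose the part structures of the retrieved memories
        as candidate parses for the query scene (the real model refines these end-to-end)."""
        idx = retrieve(memory_embeddings, query_embedding, k)
        return [memory_part_labels[i] for i in idx]

    # Toy usage: three labelled memory scenes and one query scene, all with random embeddings
    mem_emb = np.random.rand(3, 8)
    mem_parts = [["seat", "back", "legs"], ["wings", "body", "tail"], ["blade", "handle"]]
    print(predict_parts(np.random.rand(8), mem_emb, mem_parts))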
MMT: A Multilingual and Multi-Topic Indian Social Media Dataset
Dalal, Dwip, Srivastava, Vivek, Singh, Mayank
Social media plays a significant role in cross-cultural communication. A vast amount of this occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for processing such information, like language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale multilingual, and multi-topic dataset (MMT) collected from Twitter (1.7 million Tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. Also, we demonstrate that the currently existing tools fail to capture the linguistic diversity in MMT on two downstream tasks, i.e., topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.
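One way the language-identification gap can be probed is to run an off-the-shelf identifier over code-mixed tweets and compare its single-language predictions against code-mixed gold labels; the sketch below uses the langdetect package as a stand-in for "currently existing tools", and the two example tweets are synthetic rather than drawn from MMT.

    from langdetect import detect  # off-the-shelf monolingual language identifier

    # Synthetic code-mixed (Hinglish) examples with gold labels; real tweets come from MMT.
    tweets = [
        ("yeh match toh bahut exciting tha, what a finish!", "hi-en"),
        ("traffic aaj bohot zyada hai on the highway", "hi-en"),
    ]

    correct = 0
    for text, gold in tweets:
        pred = detect(text)            # monolingual tools return a single tag, e.g. 'en'
        correct += int(pred == gold)   # so code-mixed gold labels are rarely matched
    print(f"accuracy on code-mixed tweets: {correct}/{len(tweets)}")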
MUTANT: A Multi-sentential Code-mixed Hinglish Dataset
Gupta, Rahul, Srivastava, Vivek, Singh, Mayank
Multi-sentential, long-sequence textual data opens up several interesting research directions in natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there is no significant effort in building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, i.e., MUTANT. We propose a token-level language-aware pipeline, extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. (Figure 1 of the paper shows an example MCT and the corresponding article's title from two multilingual data sources: (A) a Dainik Jagran news article and (B) a Man-ki-baat speech transcript, with tokens color-coded as English, Hindi, and language independent.)
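The degree of code-mixing is commonly quantified with metrics such as the Code-Mixing Index (CMI); a minimal, token-level sketch of extending such a metric to a multi-sentential span is shown below, assuming each token already carries a language tag (hi, en, or 'univ' for language-independent). This is an illustration of the idea, not the authors' pipeline, and the averaging scheme is an assumption.

    from collections import Counter

    def cmi(tagged_tokens):
        """Code-Mixing Index for one tagged span.
        tagged_tokens: list of (token, tag) pairs where tag is a language code or 'univ'."""
        tags = [t for _, t in tagged_tokens]
        n = len(tags)
        u = tags.count("univ")                       # language-independent tokens
        if n == u:
            return 0.0
        lang_counts = Counter(t for t in tags if t != "univ")
        return 100.0 * (1 - lang_counts.most_common(1)[0][1] / (n - u))

    def multi_sentential_cmi(sentences):
        """Average sentence-level CMI over a multi-sentential candidate MCT."""
        return sum(cmi(s) for s in sentences) / len(sentences)

    # Toy usage: two short Hinglish sentences with token-level language tags
    s1 = [("yeh", "hi"), ("movie", "en"), ("bahut", "hi"), ("acchi", "hi"), ("hai", "hi")]
    s2 = [("please", "en"), ("ticket", "en"), ("book", "en"), ("kar", "hi"), ("do", "hi")]
    print(multi_sentential_cmi([s1, s2]))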
ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX
Kayal, Pratik, Anand, Mrinal, Desai, Harsh, Singh, Mayank
Tables present important information concisely in many scientific documents. Visual features like mathematical symbols, equations, and spanning cells make structure and content extraction from tables embedded in research documents difficult. This paper discusses the dataset, tasks, participants' methods, and results of the ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX. Specifically, the task of the competition is to convert a tabular image to its corresponding LaTeX source code. We proposed two subtasks. In Subtask 1, we ask the participants to reconstruct the LaTeX structure code from an image. In Subtask 2, we ask the participants to reconstruct the LaTeX content code from an image. This report describes the datasets and ground truth specification, details the performance evaluation metrics used, presents the final results, and summarizes the participating methods. The submission by team VCGroup achieved the highest Exact Match accuracy scores of 74% for Subtask 1 and 55% for Subtask 2, beating previous baselines by 5% and 12%, respectively. Although improvements can still be made to the recognition capabilities of models, this competition contributes to the development of fully automated table recognition systems by challenging practitioners to solve problems under specific constraints and to share their approaches; the platform will remain available for post-challenge submissions at https://competitions.codalab.org/competitions/26979.
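Exact Match here compares a predicted LaTeX sequence against the ground truth; a hedged sketch of such a scorer is below, with a simple whitespace-plus-brace tokenizer that only illustrates how outputs might be normalized and is not the competition's official scoring script.

    import re

    def tokenize_latex(code: str):
        """Split LaTeX into coarse tokens: commands, escaped characters, braces, ampersands, words."""
        return re.findall(r"\\[a-zA-Z]+|\\.|[{}&]|[^\s{}&\\]+", code)

    def exact_match(predictions, references):
        """Fraction of table images whose predicted token sequence equals the reference exactly."""
        hits = sum(tokenize_latex(p) == tokenize_latex(r) for p, r in zip(predictions, references))
        return hits / len(references)

    pred = [r"\begin{tabular}{ll} a & b \\ \end{tabular}"]
    gold = [r"\begin{tabular}{ll} a & b \\ \end{tabular}"]
    print(exact_match(pred, gold))  # 1.0 for an exact reconstruction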
Weakly-Supervised Deep Learning for Domain Invariant Sentiment Classification
Kayal, Pratik, Singh, Mayank, Goyal, Pawan
The task of learning a sentiment classification model that adapts well to any target domain, different from the source domain, is a challenging problem. The majority of existing approaches focus on learning a common representation by leveraging both source and target data during training. In this paper, we introduce a two-stage training procedure that leverages weakly supervised datasets to develop simple lift-and-shift-based predictive models without being exposed to the target domain during the training phase. Experimental results show that transfer with weak supervision from a source domain to various target domains provides performance very close to that obtained via supervised training on the target domain itself.
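A compact sketch of the lift-and-shift idea: fit a classifier on weakly labelled source-domain text only, then apply it unchanged to a different target domain. Scikit-learn components stand in for whichever models the paper actually uses, the two training stages are collapsed into a single supervised fit, and all data here is synthetic.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Stage 1: train only on weakly labelled source-domain reviews (labels from, e.g., star ratings).
    source_texts = ["great battery and screen", "terrible camera, broke in a week",
                    "love this phone", "awful build quality"]
    weak_labels = [1, 0, 1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(source_texts, weak_labels)

    # Stage 2: lift-and-shift — evaluate on a different target domain with no target-side training.
    target_texts = ["the hotel room was spotless and quiet", "rude staff and a dirty lobby"]
    print(clf.predict(target_texts))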
Charting the Right Manifold: Manifold Mixup for Few-shot Learning
Mangla, Puneet, Singh, Mayank, Sinha, Abhishek, Kumari, Nupur, Balasubramanian, Vineeth N, Krishnamurthy, Balaji
Few-shot learning algorithms aim to learn model parameters capable of adapting to unseen classes with the help of only a few labeled examples. A recent regularization technique, Manifold Mixup, focuses on learning a general-purpose representation that is robust to small changes in the data distribution. Since the goal of few-shot learning is closely linked to robust representation learning, we study Manifold Mixup in this problem setting. Self-supervised learning is another technique that learns semantically meaningful features using only the inherent structure of the data. This work investigates the role of learning a relevant feature manifold for few-shot tasks using self-supervision and regularization techniques. We observe that regularizing the feature manifold, enriched via self-supervised techniques, with Manifold Mixup significantly improves few-shot learning performance. We show that our proposed method, S2M2, beats the current state-of-the-art accuracy on standard few-shot learning datasets like CIFAR-FS, CUB, and mini-ImageNet by 3-8%. Through extensive experimentation, we show that the features learned using our approach generalize to complex few-shot evaluation tasks and cross-domain scenarios, and are robust against slight changes to the data distribution.
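The Manifold Mixup operation itself is simple to write down: interpolate hidden representations and labels of randomly paired examples with a Beta-distributed coefficient. The sketch below shows only that core operation on NumPy arrays; it is not the full S2M2 training recipe, and the alpha value and shapes are illustrative.

    import numpy as np

    def manifold_mixup(hidden, labels_onehot, alpha=2.0, seed=None):
        """Mix hidden representations (and their one-hot labels) of randomly paired examples.

        hidden        : (batch, dim) activations from some intermediate layer
        labels_onehot : (batch, num_classes) targets
        """
        rng = np.random.default_rng(seed)
        lam = rng.beta(alpha, alpha)               # interpolation coefficient
        perm = rng.permutation(len(hidden))        # random pairing within the batch
        mixed_hidden = lam * hidden + (1 - lam) * hidden[perm]
        mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
        return mixed_hidden, mixed_labels

    # Toy usage on a batch of 4 feature vectors with 3 classes
    h = np.random.rand(4, 16)
    y = np.eye(3)[[0, 1, 2, 1]]
    mh, my = manifold_mixup(h, y)
    print(mh.shape, my.shape)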