 xml file


Blocks Architecture (BloArk): Efficient, Cost-Effective, and Incremental Dataset Architecture for Wikipedia Revision History

arXiv.org Artificial Intelligence

Wikipedia (Wiki) is one of the most widely used and publicly available resources for natural language processing (NLP) applications. Wikipedia Revision History (WikiRevHist) shows the order in which edits were made to any Wiki page since its first modification. While the most up-to-date Wiki has been widely used as a training source, WikiRevHist can also be a valuable resource for NLP applications. However, there are insufficient tools available to process WikiRevHist without having substantial computing resources, making additional customizations, and spending extra time adapting others' work. Therefore, we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated work in processing the WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and warehouses. On top of that, we build the core data processing pipeline: builder and modifier. The BloArk builder transforms the original WikiRevHist dataset from XML syntax into JSON Lines (JSONL) format to improve concurrency and storage efficiency. The BloArk modifier takes previously built warehouses and applies incremental modifications, improving the utilization of existing databases and reducing the cost of reusing others' work. In the end, BloArk can scale up easily both in processing Wikipedia Revision History and in incrementally modifying existing datasets for downstream NLP use cases. The source code, documentation, and example usages are publicly available online and open-sourced under the GPL-2.0 license.
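
The XML-to-JSONL step described in the abstract can be illustrated with a minimal sketch using only the Python standard library. This is not BloArk's builder or its API: the function name revisions_to_jsonl, the record fields, and the namespace-free sample markup are assumptions made for illustration, and real Wikipedia dumps use namespaced tags and carry more metadata.

```python
import io
import json
import xml.etree.ElementTree as ET


def revisions_to_jsonl(xml_stream, jsonl_stream):
    """Stream <revision> elements and write one JSON object per line.

    Hypothetical helper: the tag names assume a simplified,
    namespace-free dump layout.
    """
    count = 0
    title = None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            record = {
                "title": title,
                "id": elem.findtext("id"),
                "timestamp": elem.findtext("timestamp"),
                "text": elem.findtext("text") or "",
            }
            jsonl_stream.write(json.dumps(record) + "\n")
            count += 1
            elem.clear()  # drop the revision's children once written
        elif elem.tag == "page":
            title = None
            elem.clear()
    return count


# Tiny illustrative input; not taken from a real dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><timestamp>2001-01-01T00:00:00Z</timestamp><text>first</text></revision>
    <revision><id>2</id><timestamp>2001-01-02T00:00:00Z</timestamp><text>second</text></revision>
  </page>
</mediawiki>"""

out = io.StringIO()
n = revisions_to_jsonl(io.BytesIO(SAMPLE), out)
lines = out.getvalue().splitlines()
```

Because each revision becomes an independent line, the output can be split into chunks and processed by separate workers, which is the kind of concurrency benefit the abstract attributes to the JSONL format.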


Android Malware Detection Based on RGB Images and Multi-feature Fusion

arXiv.org Artificial Intelligence

With the widespread adoption of smartphones, Android malware has become a significant challenge in the field of mobile device security. Current Android malware detection methods often rely on feature engineering to construct dynamic or static features, which are then used for learning. However, static feature-based methods struggle to counter code obfuscation, packing, and signing techniques, while dynamic feature-based methods involve time-consuming feature extraction. Image-based methods for Android malware detection offer better resilience against malware variants and polymorphic malware. This paper proposes an end-to-end Android malware detection technique based on RGB images and multi-feature fusion. The approach involves extracting Dalvik Executable (DEX) files, AndroidManifest.xml files, and API calls from APK files, converting them into grayscale images, and enhancing their texture features using Canny edge detection, histogram equalization, and adaptive thresholding techniques. These grayscale images are then combined into an RGB image containing multi-feature fusion information, which is analyzed using mainstream image classification models for Android malware detection. Extensive experiments demonstrate that the proposed method effectively captures Android malware characteristics, achieving an accuracy of up to 97.25%, outperforming existing detection methods that rely solely on DEX files as classification features. Additionally, ablation experiments confirm the effectiveness of using the three key files for feature representation in the proposed approach.
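
The fusion idea in the abstract (three grayscale images combined into one RGB image) can be sketched in plain Python. This is an illustrative assumption about the general technique, not the paper's pipeline: the helper names bytes_to_gray and fuse_rgb, the fixed image size, and the sample byte strings are hypothetical, and the texture-enhancement steps (Canny edge detection, histogram equalization, adaptive thresholding) are omitted.

```python
def bytes_to_gray(data, width, height):
    """Lay raw bytes out row by row as a width x height grayscale image.

    Each byte becomes one pixel intensity (0-255). Input is truncated
    or zero-padded to fit; a real pipeline would likely resize instead.
    """
    size = width * height
    fitted = data[:size].ljust(size, b"\0")
    return [list(fitted[r * width:(r + 1) * width]) for r in range(height)]


def fuse_rgb(red, green, blue):
    """Stack three equally sized grayscale images into one RGB image."""
    if not (len(red) == len(green) == len(blue)):
        raise ValueError("all channels must have the same height")
    return [
        list(zip(r_row, g_row, b_row))
        for r_row, g_row, b_row in zip(red, green, blue)
    ]


# Hypothetical stand-ins for the three feature sources named in the abstract.
dex_bytes = b"dex\n035"
manifest_bytes = b"<manifest/>"
api_bytes = b"getDeviceId"

rgb = fuse_rgb(
    bytes_to_gray(dex_bytes, 4, 4),
    bytes_to_gray(manifest_bytes, 4, 4),
    bytes_to_gray(api_bytes, 4, 4),
)
```

Each pixel of the fused image then carries one value from each source, so a standard three-channel image classifier sees all three feature types at once.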


Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark

arXiv.org Artificial Intelligence

Artificial intelligence (AI) systems possess significant potential to drive societal progress. However, their deployment often faces obstacles due to substantial safety concerns. Safe reinforcement learning (SafeRL) emerges as a solution to optimize policies while simultaneously adhering to multiple constraints, thereby addressing the challenge of integrating reinforcement learning in safety-critical scenarios. In this paper, we present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single and multi-agent scenarios, accepting vector and vision-only input. Additionally, we offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms. This comprehensive library can serve as a validation tool for the research community. By introducing this benchmark, we aim to facilitate the evaluation and comparison of safety performance, thus fostering the development of reinforcement learning for safer, more reliable, and responsible real-world applications. The website of this project can be accessed at https://sites.google.com/view/safety-gymnasium.


SANGEET: A XML based Open Dataset for Research in Hindustani Sangeet

arXiv.org Artificial Intelligence

It is very important to have access to a rich music dataset that is useful in a wide variety of applications. Currently available datasets mostly focus on storing vocal or instrumental recording data and ignore the need for its visual representation and retrieval. This paper attempts to build an XML-based public dataset, called SANGEET, that stores comprehensive information on Hindustani Sangeet (North Indian Classical Music) compositions written by the famous musicologist Pt. Vishnu Narayan Bhatkhande. SANGEET preserves all the required information of any given composition, including metadata and structural, notational, rhythmic, and melodic information, in a standardized way for easy and efficient storage and extraction of musical information. The dataset is intended to provide ground-truth information for music information research tasks, thereby supporting several data-driven analyses from a machine learning perspective. We present the usefulness of the dataset by demonstrating its application to music information retrieval using XQuery and to visualization through the Omenad rendering system. Finally, we propose approaches to transform the dataset for performing statistical and machine learning tasks for a better understanding of Hindustani Sangeet. The dataset can be found at https://github.com/cmisra/Sangeet.


Toward a Generic Mapping Language for Transformations between RDF and Data Interchange Formats

arXiv.org Artificial Intelligence

While there exist approaches to integrate heterogeneous data using semantic models, such semantic models typically cannot be used by existing software tools. Many software tools, especially in engineering, only have options to import and export data in more established data interchange formats such as XML or JSON. Thus, if information that is included in a semantic model needs to be used in such a software tool, automatic approaches for mapping semantic information into an interchange format are needed. We aim to develop a generic mapping approach that allows users to create transformations of semantic information into a data interchange format with an arbitrary, user-defined structure. This mapping approach is currently being elaborated. In this contribution, we report our initial steps, which target transformations from RDF into XML. First, a mapping language is introduced that allows automated mappings from ontologies to XML to be defined. Furthermore, a mapping algorithm capable of executing mappings defined in this language is presented. An evaluation is done with a use case in which engineering information needs to be used in a 3D modeling tool.
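
To make the idea of a declarative RDF-to-XML mapping concrete, here is a toy sketch in Python. It is not the mapping language or algorithm from the paper: the triples, the ex: prefix, the MAPPING dictionary layout, and the apply_mapping function are all hypothetical and only show how a user-defined mapping could drive the shape of the generated XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF triples as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:pump1", "rdf:type", "ex:Pump"),
    ("ex:pump1", "ex:hasName", "Feed pump"),
    ("ex:pump1", "ex:hasPower", "5.5"),
]

# Hypothetical mapping: instances of one class become one XML element;
# selected predicates become attributes or child elements.
MAPPING = {
    "class": "ex:Pump",
    "element": "Pump",
    "attributes": {"ex:hasName": "name"},
    "children": {"ex:hasPower": "Power"},
}


def apply_mapping(triples, mapping, root_tag="Model"):
    """Build an XML tree from triples according to the mapping."""
    root = ET.Element(root_tag)
    subjects = [
        s for s, p, o in triples
        if p == "rdf:type" and o == mapping["class"]
    ]
    for subject in subjects:
        node = ET.SubElement(root, mapping["element"], {"id": subject})
        for s, p, o in triples:
            if s != subject:
                continue
            if p in mapping["attributes"]:
                node.set(mapping["attributes"][p], o)
            elif p in mapping["children"]:
                ET.SubElement(node, mapping["children"][p]).text = o
    return root


root = apply_mapping(TRIPLES, MAPPING)
xml_text = ET.tostring(root, encoding="unicode")
```

Changing only the MAPPING dictionary changes the target XML structure, which is the separation between mapping definition and mapping execution that the abstract describes.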


GitHub - deepmind/mujoco_menagerie: A collection of high-quality models for the MuJoCo physics engine, curated by DeepMind.

#artificialintelligence

Menagerie is a collection of high-quality models for the MuJoCo physics engine, curated by DeepMind. A physics simulator is only as good as the model it is simulating, and in a powerful simulator like MuJoCo with many modeling options, it is easy to create "bad" models which do not behave as expected. The goal of this collection is to provide the community with a curated library of well-designed models that work well right out of the gate. Menagerie's only requirement is MuJoCo version 2.2.2 or higher. You can download prebuilt binaries from the GitHub releases page, or if you are working with Python, you can install the native bindings from PyPI via pip install "mujoco>=2.2.2".


Stream Output When Parsing Big Xml With Elixir

#artificialintelligence

There are two big players in Elixir's XML parsing ecosystem: Saxy and SweetXml. I want to read a huge XML file in which some elements are repeated many times, and produce some kind of "iterator" from it that, when iterated, yields those repeated items one by one. Saxy is incredibly fast, but it's based on the concept that, as you read the XML file, you "fill" some state object (with whatever you want, and as much as you want, but you fill it nevertheless). In this scenario, I could "fill" the state with the list of items. That, of course, takes a lot less memory than holding the entire XML structure in memory. But it still ties the size of the in-memory list to the size of the XML file, which I don't like, because with a big enough file I can consume more memory than I'm allowed to. SweetXml provides a function called stream_tags, and from its description it seems to hit the spot: it parses the XML and, as it finds certain tags, streams the SweetXml representation of them, without building any structure representing the XML in memory.


Using Deep Learning for Object Detection in Aerial Images

#artificialintelligence

Machine Learning, Deep Learning, Data science… We have been hearing these terms for several years now, and they don't seem to fade away any time soon. I was 14 years old the first time I heard the pair of words "Machine Learning", back in 2018. A year later, I implemented a neural network for a basic image classification using TensorFlow and Keras as part of my high school Machine Learning class. Fast-forward 2 more years of studying Machine Learning theory and hands-on practice, I graduated high school with a major in Software Engineering with a focus on Deep Learning. My final project's theme is "System for dealing with illegal construction using object detection in aerial images".


Using Deep Learning for Object Detection in Aerial Images

#artificialintelligence

Originally published on Towards AI. Machine Learning, Deep Learning, Data science… We have been hearing these terms for several years now, and they don't seem to fade away any time soon.